From b01f426f56c83815d3664f3ba69ff758fcdc8772 Mon Sep 17 00:00:00 2001
From: Maysam Yabandeh
Date: Fri, 25 Aug 2017 16:09:51 -0700
Subject: [PATCH] Blog post for FlushWAL

Summary: Closes https://github.com/facebook/rocksdb/pull/2790
Differential Revision: D5711609
Pulled By: maysamyabandeh
fbshipit-source-id: ea103dac013c0a6a031834541ad67e7d95a80fe8
---
 docs/_posts/2017-08-24-pinnableslice.markdown |  4 +--
 docs/_posts/2017-08-25-flushwal.markdown      | 26 +++++++++++++++++++
 2 files changed, 28 insertions(+), 2 deletions(-)
 create mode 100644 docs/_posts/2017-08-25-flushwal.markdown

diff --git a/docs/_posts/2017-08-24-pinnableslice.markdown b/docs/_posts/2017-08-24-pinnableslice.markdown
index a5026d5c4..7ac2fec34 100644
--- a/docs/_posts/2017-08-24-pinnableslice.markdown
+++ b/docs/_posts/2017-08-24-pinnableslice.markdown
@@ -1,5 +1,5 @@
 ---
-title: PinnableSlice: less memcpy with point lookups
+title: PinnableSlice; less memcpy with point lookups
 layout: post
 author: maysamyabandeh
 category: blog
@@ -11,7 +11,7 @@ The classic API for [DB::Get](https://github.com/facebook/rocksdb/blob/9e5837111
 
 Similarly to Slice, PinnableSlice refers to some in-memory data so it does not incur the memcpy cost. To ensure that the data will not be erased while it is being processed by the user, PinnableSlice, as its name suggests, has the data pinned in memory. The pinned data are released when PinnableSlice object is destructed or when ::Reset is invoked explicitly on it.
 
-### How good it is?
+### How good is it?
 
 Here are the improvements in throughput for an [in-memory benchmark](https://github.com/facebook/rocksdb/pull/1756#issuecomment-286201693):
 * value 1k byte: 14%
diff --git a/docs/_posts/2017-08-25-flushwal.markdown b/docs/_posts/2017-08-25-flushwal.markdown
new file mode 100644
index 000000000..01f878e87
--- /dev/null
+++ b/docs/_posts/2017-08-25-flushwal.markdown
@@ -0,0 +1,26 @@
+---
+title: FlushWAL; less fwrite, faster writes
+layout: post
+author: maysamyabandeh
+category: blog
+---
+
+When `DB::Put` is called, the data is written to both the memtable (to be flushed to SST files later) and, if it is enabled, the WAL (write-ahead log). In the case of a crash, RocksDB can recover as much of the memtable state as is reflected in the WAL. By default, RocksDB automatically flushes the WAL from the application memory to the OS buffer after each `::Put`. It can, however, be configured to flush only when `::FlushWAL` is called explicitly. Skipping the fwrite call after each `::Put` trades reliability for write latency in the general case. As we explain below, some applications such as MyRocks use this API to gain higher write throughput with no compromise in reliability.
+
+### How much is the gain?
+
+Using the `::FlushWAL` API along with setting `DBOptions::concurrent_prepare`, MyRocks achieves 40% higher throughput in Sysbench's [update-nonindex](https://github.com/akopytov/sysbench/blob/master/src/lua/oltp_update_non_index.lua) benchmark.
+
+### Write, Flush, and Sync
+
+A write to the WAL first goes to the application memory buffer. In the next step, the buffer is "flushed" to the OS buffer by calling fwrite. The OS buffer is later "synced" to persistent storage. The data in the OS buffer, although not yet persisted, will survive an application crash. By default, the flush occurs automatically upon each call to `DB::Put` or `DB::Write`. The user can additionally request a sync after each write by setting `WriteOptions::sync`.
+
+### FlushWAL API
+
+The user can turn off the automatic flush of the WAL by setting `DBOptions::manual_wal_flush`. In that case, the WAL buffer is flushed either when it is full or when `DB::FlushWAL` is called by the user. The API also accepts a boolean argument for requesting a sync right after the flush: `::FlushWAL(true)`.
+
+### Success story: MyRocks
+
+Some applications that use RocksDB already have other mechanisms in place to provide reliability. MySQL, for example, uses 2PC (two-phase commit) to write to both the binlog and the storage engine, such as InnoDB or MyRocks. The group commit logic in MySQL allows the first phase (Prepare) to run in parallel, but after a commit group is formed it performs the second phase (Commit) serially. This makes low commit latency in the storage engine essential for achieving high throughput. The commit in MyRocks includes writing to the RocksDB WAL, which, as explained above, by default incurs the latency of flushing new WAL appends to the OS buffer.
+
+Since a storage engine commit is not visible to users until the group commit finishes, and because the binlog helps in recovering from some failure scenarios, MySQL can provide reliability without needing a storage WAL flush after each individual commit. MyRocks benefits from this property: it disables the automatic WAL flush in RocksDB and manually calls `::FlushWAL` when requested by MySQL.
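
The write, flush, and sync stages described in the post can be illustrated with a small toy model. This is only a sketch under stated assumptions, not RocksDB code: the `ToyWal` class, its methods, and `DemoManualFlush` are hypothetical names invented for illustration, the three buffers are plain strings, and the "flush when the buffer is full" case is left out. The `manual_flush` constructor flag plays the role of `DBOptions::manual_wal_flush`, and `Flush(sync)` plays the role of `DB::FlushWAL(sync)`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model (hypothetical, not RocksDB code) of the three places a WAL
// record can live: application buffer, OS buffer, persistent storage.
class ToyWal {
 public:
  explicit ToyWal(bool manual_flush) : manual_flush_(manual_flush) {}

  // Append a record to the application buffer. Unless manual flushing is
  // enabled, flush immediately, mirroring the default behaviour.
  void Append(const std::string& record) {
    app_buffer_ += record;
    if (!manual_flush_) Flush(/*sync=*/false);
  }

  // Move the application buffer to the simulated OS buffer; optionally sync.
  void Flush(bool sync) {
    os_buffer_ += app_buffer_;
    app_buffer_.clear();
    ++flush_calls_;
    if (sync) Sync();
  }

  // Move the simulated OS buffer to the simulated persistent storage.
  void Sync() {
    disk_ += os_buffer_;
    os_buffer_.clear();
  }

  // Application crash: process memory is lost, the OS buffer survives.
  void SimulateAppCrash() { app_buffer_.clear(); }

  // Machine crash: only synced data survives.
  void SimulateMachineCrash() {
    app_buffer_.clear();
    os_buffer_.clear();
  }

  // Bytes that a recovery pass could still read back.
  std::string Recoverable() const { return disk_ + os_buffer_; }

  std::size_t flush_calls() const { return flush_calls_; }

 private:
  bool manual_flush_;
  std::string app_buffer_;
  std::string os_buffer_;
  std::string disk_;
  std::size_t flush_calls_ = 0;
};

// Two writes are buffered, flushed once by hand, and a third write is lost
// in an application crash because it never left the application buffer.
std::string DemoManualFlush() {
  ToyWal wal(/*manual_flush=*/true);
  wal.Append("put1;");
  wal.Append("put2;");
  assert(wal.flush_calls() == 0);  // nothing has left the app buffer yet
  wal.Flush(/*sync=*/false);
  wal.Append("put3;");
  wal.SimulateAppCrash();
  return wal.Recoverable();
}
```

In this model, manual flushing replaces one flush per write with one flush per batch, at the cost of losing unflushed writes on an application crash. That is the tradeoff the post describes, and it is acceptable for MyRocks because MySQL's group commit and binlog already cover that window.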