Summary:
During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
This diff simply asks the OS to sync data incrementally as it is written, in the background. The hope is that, by the final sync, most of the data is already on disk and we block less on the sync call. Thus, each compaction runs faster and we can use fewer compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
Some quick tests show a 10-20% improvement in per-thread compaction throughput. Combined with posix advice on compaction reads, just 5 threads are enough to almost saturate the udb flash bandwidth for the 800-byte write-only benchmark.
What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
For the 800-byte write-only test:
Before the change: await oscillates between 3ms and 10ms
After the change: await ranges from 1ms to 3ms
Will test against a read-modify-write workload too, to see if the high read latency P99 can be resolved.
Will introduce a parameter to control the sync interval in a follow-up diff after cleaning up EnvOptions.
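A minimal sketch of the sync cadence described above, in Python for brevity (not the code in this diff). The function name and the sync_interval_bytes parameter are hypothetical. Note that os.fsync blocks, whereas the diff asks the OS to sync in the background; the sketch only illustrates that intermediate syncs leave little data for the final one.

```python
import os


def write_with_incremental_sync(path, chunks, sync_interval_bytes):
    """Write chunks to path, syncing roughly every sync_interval_bytes.

    Returns the number of intermediate syncs issued before the final sync.
    Hypothetical helper: the real diff requests background syncs instead
    of the blocking os.fsync used here.
    """
    intermediate_syncs = 0
    unsynced = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            while len(view) > 0:
                written = os.write(fd, view)
                view = view[written:]
                unsynced += written
            if unsynced >= sync_interval_bytes:
                os.fsync(fd)  # intermediate sync: keeps dirty data small
                intermediate_syncs += 1
                unsynced = 0
        os.fsync(fd)  # final sync: most data should already be on disk
    finally:
        os.close(fd)
    return intermediate_syncs
```

With eight 1 KB chunks and a 4 KB interval, two intermediate syncs are issued before the final one.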
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11115
Summary:
Preliminary! Introduced the --use_multiget=1 and --keys_per_multiget=n
flags for db_bench. Also updated and tested the ReadRandom() method
to include an option to use multiget. By default,
keys_per_multiget=100.
Preliminary tests suggest that multiget is at least 1.25x faster per
key than regular get.
Will continue adding Multiget for ReadMissing, ReadHot,
RandomWithVerify, and ReadRandomWriteRandom soon. Will also think
about ways to better verify benchmarks.
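A rough sketch of how a read benchmark might group keys when multiget is enabled, in Python for brevity (not the db_bench code). The names batch_keys and read_random, and the dict standing in for the database, are assumptions for illustration; only the flag names and the default of 100 come from the summary above.

```python
def batch_keys(keys, keys_per_multiget=100):
    """Yield consecutive groups of at most keys_per_multiget keys."""
    if keys_per_multiget <= 0:
        raise ValueError("keys_per_multiget must be positive")
    for start in range(0, len(keys), keys_per_multiget):
        yield keys[start:start + keys_per_multiget]


def read_random(store, keys, use_multiget=False, keys_per_multiget=100):
    """Return how many of keys are found in store (a dict stand-in)."""
    found = 0
    if use_multiget:
        for group in batch_keys(keys, keys_per_multiget):
            # Stand-in for a single MultiGet call on the whole group.
            values = [store.get(k) for k in group]
            found += sum(v is not None for v in values)
    else:
        for k in keys:
            if store.get(k) is not None:
                found += 1
    return found
```

Both paths should report the same number of keys found, which is what step 5 of the test plan checks.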
Test Plan:
1. make db_bench
2. ./db_bench --benchmarks=fillrandom
3. ./db_bench --benchmarks=readrandom --use_existing_db=1
--use_multiget=1 --threads=4 --keys_per_multiget=100
4. ./db_bench --benchmarks=readrandom --use_existing_db=1
--threads=4
5. Verify ops/sec (and 1000000 of 1000000 keys found)
Reviewers: haobo, MarkCallaghan, dhruba
Reviewed By: MarkCallaghan
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11127
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
Test Plan: make check; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11175
Summary:
Completed the implementation for the Redis API for Lists.
The Redis API uses rocksdb as a backend to persistently
store maps from key->list. It supports basic operations
for appending, inserting, pushing, popping, and accessing
a list, given its key.
Test Plan:
- Compile with: make redis_test
- Test with: ./redis_test
- Run all unit tests (for all rocksdb) with: make all check
- To use an interactive REDIS client use: ./redis_test -m
- To clean the database before use: ./redis_test -m -d
Reviewers: haobo, dhruba, zshao
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10833
Summary:
If both multi-threaded compactions and seek compaction are
enabled, the code submits multiple compaction requests for
the same range of keys. This causes extraneous sst-files to
accumulate at various levels.
Test Plan:
I am not able to write a very good unit test for this one,
but I can easily reproduce this bug with 'db_stress' using the
following options.
batch=1;maxk=100000000;ops=100000000;ro=0;fm=2;bpl=10485760;of=500000; wbn=3; mbc=20; mb=2097152; wbs=4194304; dds=1; sync=0; t=32; bs=16384; cs=1048576; of=500000; ./db_stress --disable_seek_compaction=0 --mmap_read=0 --threads=$t --block_size=$bs --cache_size=$cs --open_files=$of --verify_checksum=1 --db=/data/mysql/leveldb/dbstress.dir --sync=$sync --disable_wal=1 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --target_file_size_multiplier=$fm --max_write_buffer_number=$wbn --max_background_compactions=$mbc --max_bytes_for_level_base=$bpl --reopen=$ro --ops_per_thread=$ops --max_key=$maxk --test_batches_snapshots=$batch
Reviewers: leveldb, emayanke
Reviewed By: emayanke
Differential Revision: https://reviews.facebook.net/D11055
Summary: Added logic to make another WriteBatch with Timestamps during the Write function execution in TTL class. Also expanded the ttl_test to test for it. Have done nothing for Merge for now.
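A sketch of the idea of building a second batch whose values carry a timestamp, in Python for brevity. The encoding (a 4-byte little-endian suffix) and the names stamp_batch and strip_timestamp are assumptions for illustration; the summary does not specify the format used by the TTL class.

```python
import struct
import time

TS_FORMAT = "<i"  # assumed encoding: 4-byte little-endian suffix
TS_LEN = struct.calcsize(TS_FORMAT)


def stamp_batch(batch, now=None):
    """Return a new batch whose values carry a timestamp suffix.

    batch is a list of (key, value) byte pairs; the input is not modified.
    """
    ts = int(time.time()) if now is None else now
    suffix = struct.pack(TS_FORMAT, ts)
    return [(key, value + suffix) for key, value in batch]


def strip_timestamp(stored):
    """Split a stored value into (user_value, timestamp)."""
    return stored[:-TS_LEN], struct.unpack(TS_FORMAT, stored[-TS_LEN:])[0]
```

Building a new batch rather than mutating the caller's batch mirrors the "make another WriteBatch" wording above.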
Test Plan: make ttl_test;./ttl_test
Reviewers: haobo, vamsi, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10827
Summary: the assert was wrong in D11145. Broke build
Test Plan: make db_bench; run it
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11187
Summary:
The current code prints the name of the InternalKeyComparator
in the log file. We would also like to print the name of the
user-specified comparator for easier debugging.
Test Plan: make check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11181
Summary: Provide maps from histograms and tickers to their string names. Fb303 libraries can use these to provide the mapping, so we will not have to duplicate the code during release.
Test Plan: db_bench with statistics=1
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11145
Summary:
Without this fix, files could be written out to a level greater than the maximum level possible; this is the source of the segfaults that wormhole was getting. The sequence of steps was:
1. WriteLevel0Table was called when memtable was to be flushed for a file.
2. PickLevelForMemTableOutput was called to determine the level to which this file should be pushed.
3. PickLevelForMemTableOutput returned a wrong result because max_mem_compaction_level was equal to 2 even when num_levels was equal to 0.
The fix to re-initialize max_mem_compaction_level based on num_levels passed seems correct.
Test Plan: make all check; Also made a dummy file to mimic the wormhole-file behaviour which was causing the segfaults and found that the same segfault occurs without this change and not with this.
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11157
Summary: These extra options caught some bugs. Will be run via Jenkins now with the crash_test
Test Plan: ./make crashtest
Reviewers: dhruba, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11151
Summary:
I think the check for "error" that I added had caused
false alarm. Fixed that.
Test Plan:
Revert Plan: OK
Task ID: #
Reviewers: emayanke, dhruba
Reviewed By: emayanke
Differential Revision: https://reviews.facebook.net/D11139
Summary: Added a target to the Makefile called 'tags' that runs ctags and cscope on all *.cc and *.h files
Test Plan:
Run 'make tags'. Then start vim and do
:set tags=./tags
:cs add cscope.out
These commands should give you no error messages. You should then be able to access cscope db and ctags as normal in vim.
Reviewers: dhruba
Differential Revision: https://reviews.facebook.net/D11103
Summary:
Implemented the MultiGet operator which takes in a list of keys
and returns their associated values. Currently uses std::vector as its
container data structure. Otherwise, it works identically to "Get".
Test Plan:
1. make db_test ; compile it
2. ./db_test ; test it
3. make all check ; regress / run all tests
4. make release ; (optional) compile with release settings
Reviewers: haobo, MarkCallaghan, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10875
Summary: Added the 'score' column to the compaction stats output, which shows the level total size divided by the level target size. Could be useful when monitoring compaction decisions.
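As a worked example of the score described above, sketched in Python (the function name is hypothetical; only the ratio itself comes from the summary):

```python
def compaction_score(level_total_bytes, level_target_bytes):
    """Score = level total size / level target size."""
    if level_target_bytes <= 0:
        raise ValueError("level_target_bytes must be positive")
    return level_total_bytes / level_target_bytes
```

For instance, a level holding 15 MB against a 10 MB target has a score of 1.5; a score above 1 means the level is over its target.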
Test Plan: make check; db_bench
Reviewers: dhruba
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D11025
Summary:
This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also has this option now.
Quickly tested 8 thread cpu bound 100 byte random read.
No fast mutex: ~750k/s ops
With fast mutex: ~880k/s ops
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D11031
Summary:
The current posix advice implementation ties the access pattern hint to the creation of a file.
It is not possible to apply different advice for different access patterns (random get vs compaction read)
without keeping two open files for the same table. This patch extends the RandomAccessFile interface
to accept a new access hint at any time. In particular, we are able to set different access hints on the same
table file based on when/how the file is used.
Two options are added to set the access hint: one applied after the file is first opened and one applied when
the file is being compacted.
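A small sketch of changing the hint on an already-open descriptor, in Python for brevity (not the RandomAccessFile code). os.posix_fadvise is the standard-library wrapper for the POSIX call; the helper name set_access_hint is hypothetical.

```python
import os


def set_access_hint(fd, advice_name):
    """Apply a posix_fadvise hint to the whole file behind fd.

    advice_name is e.g. "POSIX_FADV_RANDOM" or "POSIX_FADV_SEQUENTIAL".
    Returns False on platforms that lack posix_fadvise or the constant.
    """
    advice = getattr(os, advice_name, None)
    if advice is None or not hasattr(os, "posix_fadvise"):
        return False
    # offset 0 with length 0 covers the file from start to end.
    os.posix_fadvise(fd, 0, 0, advice)
    return True
```

The same descriptor can be given a random-access hint while serving gets and a sequential hint once it is picked for compaction, without opening the file twice.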
Test Plan: make check; db_stress; db_bench
Reviewers: dhruba
Reviewed By: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D10905
Summary: Enhance the statistics to report the number of open db iterators.
Test Plan: make check
Reviewers: haobo, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10983
Summary: Overriding block_size_deviation to zero, so that CorruptionTest can pass.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D10977
Summary: a new option block_size_deviation is added.
Test Plan: run db_test and db_bench
Reviewers: dhruba, haobo
Reviewed By: haobo
Differential Revision: https://reviews.facebook.net/D10821
Summary: MaybeDumpStats was causing lock problem
Test Plan: make check; db_stress
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D10935
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By default, it's set to a very big number, 3600 (1 hour).
Test Plan: make check;
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10761
Summary: There was an artificial limit on the write buffer size.
Test Plan: make check
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10911
Summary:
The valgrind errors were in the unit tests where we change the
number of levels of a database using internal methods.
Test Plan:
valgrind ./reduce_levels_test
valgrind ./db_test
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10893
Summary: Make Statistics usable by client
Test Plan: make check; db_bench
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D10899
Summary:
This is an initial version. A few ways in which this could
be extended in the future are:
(a) Killing from more places in source code
(b) Hashing stack and using that hash in determining whether to crash.
This is to avoid crashing more often at source lines that are executed
more often.
(c) Raising exceptions or returning errors instead of killing
Test Plan:
This whole thing is for testing.
Here is part of output:
python2.7 tools/db_crashtest2.py -d 600
Running db_stress
db_stress retncode -15 output LevelDB version : 1.5
Number of threads : 32
Ops per thread : 10000000
Read percentage : 50
Write-buffer-size : 4194304
Delete percentage : 30
Max key : 1000
Ratio #ops/#keys : 320000
Num times DB reopens: 0
Batches/snapshots : 1
Purge redundant % : 50
Num keys per lock : 4
Compression : snappy
------------------------------------------------
No lock creation because test_batches_snapshots set
2013/04/26-17:55:17 Starting database operations
Created bg thread 0x7fc1f07ff700
... finished 60000 ops
Running db_stress
db_stress retncode -15 output LevelDB version : 1.5
Number of threads : 32
Ops per thread : 10000000
Read percentage : 50
Write-buffer-size : 4194304
Delete percentage : 30
Max key : 1000
Ratio #ops/#keys : 320000
Num times DB reopens: 0
Batches/snapshots : 1
Purge redundant % : 50
Num keys per lock : 4
Compression : snappy
------------------------------------------------
Created bg thread 0x7ff0137ff700
No lock creation because test_batches_snapshots set
2013/04/26-17:56:15 Starting database operations
... finished 90000 ops
Revert Plan: OK
Task ID: #2252691
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D10581
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If the client is OK with losing the most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.
Test Plan: make check; db_stress
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, emayanke
Differential Revision: https://reviews.facebook.net/D10869
Summary:
There is an existing field Options.max_bytes_for_level_multiplier that
sets the multiplier for the size of each level in the database.
This patch introduces the ability to set different multipliers
for every level in the database. The size of a level is determined
by using both max_bytes_for_level_multiplier as well as the
per-level fanout.
size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier
* fanout[i-1]
The default value of fanout is 1, so that it is backward compatible.
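The formula above can be sketched as follows, in Python for brevity. The function name and the choice to treat the first element as the starting level are assumptions for illustration.

```python
def level_sizes(base_bytes, multiplier, fanout):
    """Return level sizes given a starting size and per-level fanout.

    sizes[i] = sizes[i-1] * multiplier * fanout[i-1], matching the
    formula in the summary. A fanout of all 1s reproduces the old
    behaviour of a single multiplier.
    """
    sizes = [base_bytes]
    for f in fanout:
        sizes.append(sizes[-1] * multiplier * f)
    return sizes
```

With a base of 10, a multiplier of 10 and fanout [1, 1], the sizes are 10, 100, 1000; changing the fanout to [1, 2] doubles only the last level to 2000.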
Test Plan: make check
Reviewers: haobo, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10863
Summary:
The merge operator diff introduced a performance problem in MemTable::Get.
An exit condition is missed when the current key does not match the user key.
This could lead to full memtable scan if the user key is not found.
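A toy illustration of the missing exit condition, in Python for brevity (not MemTable::Get itself). The tuple layout (user_key, kind, payload) and the "put"/"merge" kinds are hypothetical; the point is the break when the current key no longer matches the user key.

```python
import bisect


def memtable_get(entries, user_key):
    """Look up user_key in entries sorted by key.

    entries is a sorted list of (user_key, kind, payload) tuples, where
    kind is "put" or "merge" (hypothetical layout for this sketch).
    Returns (value_or_None, merge_operands, entries_examined).
    """
    # Seek to the first entry whose key is >= user_key.
    i = bisect.bisect_left(entries, (user_key,))
    operands = []
    examined = 0
    while i < len(entries):
        key, kind, payload = entries[i]
        examined += 1
        if key != user_key:
            break  # the exit condition: we are past user_key, stop scanning
        if kind == "put":
            return payload, operands, examined
        operands.append(payload)  # merge operand: keep looking for a base value
        i += 1
    return None, operands, examined
```

Without the break, a lookup for a missing key would keep walking until the end of the list, which is the full scan described above.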
Test Plan: make check; db_bench
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10851
Summary: db can't reopen safely with disable_wal set!
Test Plan: make db_stress; run db_stress with disable_wal and reopens set and see error
Reviewers: dhruba, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10857
Summary:
PosixLogger and AutoRollLogger do not seem to be thread safe.
For PosixLogger, log_size_ is not atomically updated.
For AutoRollLogger, the underlying logger_ might be deleted by
one thread while still being accessed by another.
Test Plan: make check
Reviewers: kailiu, dhruba, heyongqiang
Reviewed By: kailiu
CC: leveldb, zshao, sheki
Differential Revision: https://reviews.facebook.net/D9699
Summary: Will help while debugging if the generated value is truncated at proper length.
Test Plan: make db_stress;/db_stress --max_key=10000 --db=/tmp/mcr --threads=1 --ops_per_thread=10000
Reviewers: dhruba, vamsi
Reviewed By: vamsi
Differential Revision: https://reviews.facebook.net/D10845
Summary:
Make stop watch a simple implementation, instead of a subclass of a virtual class.
Allocate stop watches on the stack instead of the heap.
Code is more terse now.
Test Plan: make all check, db_bench with --statistics=1
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10809
Summary: Statistics.h and histogram.h had double-based APIs to record values. Remove them as they are not used anywhere.
Test Plan: make all check
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10815
Summary: ldb works with raw data from the database and needs to be aware of a ttl-database to work with it meaningfully. The '-ttl' option now tells it that. Also added this to the ldb_test.py test. This option may be specified along with put, get, scan or dump. There is no support for providing a ttl-value; it uses the default (forever) because there is no use-case for this currently.
Test Plan: make ldb_test; python tools/ldb_test.py
Reviewers: dhruba, sheki, haobo, vamsi
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10797
Summary: This caused a crash when the jenkins machine changed because the PATH variable is different.
Test Plan: make
Reviewers: dhruba, sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10803
Summary:
Implemented the StringAppendOperator class (subclass of MergeOperator).
Found in utilities/merge_operators/string_append/stringappend.{h,cc}
It is a rocksdb Merge Operator that supports string/list concatenation
with a configurable delimiter.
The tests are found in .../stringappend_test.cc. It implements a
map : key -> (list of strings), with core operations Append(list_key,val)
and Get(list_key).
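A minimal sketch of the append semantics described above, in Python for brevity (not the StringAppendOperator code). The class name and the in-memory dict backing are assumptions; only Append, Get and the configurable delimiter come from the summary.

```python
class StringAppendStore:
    """Toy map: key -> delimiter-joined list of strings."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter
        self._data = {}

    def append(self, list_key, value):
        """Concatenate value onto the existing string, if any."""
        existing = self._data.get(list_key)
        if existing is None:
            self._data[list_key] = value
        else:
            self._data[list_key] = existing + self.delimiter + value

    def get(self, list_key):
        """Return the joined string, or None if the key is absent."""
        return self._data.get(list_key)
```

Appending "a" and then "b" to the same key with a "," delimiter yields "a,b"; the first append does not add a leading delimiter.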
Test Plan:
1. Navigate to your rocksdb repository
2. Execute: make stringappend_test (to compile)
3. Execute: ./stringappend_test (to run the tests)
4. Execute: make all check (to test the ENTIRE rocksdb codebase / regression)
Reviewers: haobo, dhruba, zshao
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10737
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOperator, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10773