Summary:
Create a new type of file on startup if it doesn't already exist called DBID.
This will store a unique number generated from boost library's uuid header file.
The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery)
the key point to note is that DBID is not stored in a backup or db snapshot
It's preferable to use Boost for uuid because:
1) A non-standard way of generating uuid is not good
2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean
3) c++ doesn't have any direct way to get a uuid
4) Boost is a very good library that was already having linkage in rocksdb from third-party
Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug.
I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it.
@kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify.
Test Plan:
Expand db_test to test 2 cases
1) Restarting db with Id file present - verify that no change to Id
2)Restarting db with Id file deleted - verify that a different Id is there after reopen
Also run make all check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13587
Summary: Per @haobo's request, rephrasing the comment for allocate
Test Plan: It's a comment!
Reviewers: haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13575
Summary: This caused Siying's unit test to fail.
Test Plan: Unittest
Reviewers: dhruba, kailiu, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13539
Summary: Allow ttl flag
Test Plan:
tested on my database that has merge operations and ttl
Revert Plan: OK
Task ID: #3038186
Reviewers: emayanke, dhruba, haobo
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13503
Summary:
Developing a capability for storing values on external backing file(s).
This is just a highly unoptimized first pass - supports:
1) Allocating some portion of external file to be used to store value
2) Freeing the range, enabling it to be reused by other values
As next steps, I plan to:
1) Create some kind of stress testing. Once I can measure stuff, I can focus on optimizing.
2) Optimize locking.
3) Optimize freelist data structure. Currently we have O(n) for both freeing and allocation.
4) Figure out how to do recovery.
Test Plan: Created a unit test.
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13389
Summary:
tmpfs might not support fallocate(). Fix unit test so that this
does not cause a unit test to fail.
Test Plan: ./env_test
Reviewers: emayanke, igor, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13455
Summary:
This is needed to make existing dbs be able to open and also because BytewiseComparator was not changed since leveldb.
The inverted order in the error message caused confusion prebiously
Test Plan: make; open existing db
Reviewers: leveldb, dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D13449
Summary:
With this patch, when LRUCache.Insert() is called and the cache is full, it will first try to free up entries whose reference counter is 1 (would become 0 after remo\
ving from the cache). We do it in two passes, in the first pass, we only try to release those unreferenced entries. If we cannot free enough space after traversing t\
he first remove_scan_cnt_ entries, we start from the beginning again and remove those entries being used.
Test Plan: add two unit tests to cover the codes
Reviewers: dhruba, haobo, emayanke
Reviewed By: emayanke
CC: leveldb, emayanke, xjin
Differential Revision: https://reviews.facebook.net/D13377
Summary:
As title. Fix an lint error:
Lint: CppLint Error
Single-argument constructor 'Value(int v)' may inadvertently be used as a type conversion constructor. Prefix the function with the 'explicit' keyword to avoid this, or add an /* implicit */ comment to suppress this warning.
Test Plan: N/A
Reviewers: emayanke, haobo, dhruba
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13401
Summary: virtual NewRandomRWFile is not implemented on EnvHdfs, causing build failure.
Test Plan: make clean; make all check
Reviewers: dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13383
Summary: I have implemented basic simple use case that I need for External Value Store I'm working on. There is a potential for making this prettier by refactoring/combining WritableFile and RandomAccessFile, avoiding some copypasta. However, I decided to implement just the basic functionality, so I can continue working on the other diff.
Test Plan: Added a unittest
Reviewers: dhruba, haobo, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13365
Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory.
Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases.
Reviewers: dhruba, haobo, leveldb
CC: xinyaohu, sumeet
Differential Revision: https://reviews.facebook.net/D13239
Summary: Split Unref into two parts -> cheap and expensive. Try to call expensive Unref outside of critical section to decrease lock contention.
Test Plan: unittests
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, kailiu
Differential Revision: https://reviews.facebook.net/D13299
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.
Test Plan: compile rocksdb
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13287
Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow .
Test Plan: Try it in shadow test.
Reviewers: dhruba, xjin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12933
Summary: as title. unit test not polished. this is for a quick live test
Test Plan: live
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13221
Summary: as title
Test Plan: make check
Reviewers: dhruba, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13179
Summary:
The constructor for Vector memtable has a parameter called 'count'
that specifies the capacity of the vector to be reserved at allocation
time. It was incorrectly used to initialize the size of the vector.
Test Plan: Enhanced db_test.
Reviewers: haobo, xjin, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13083
Summary:
Added a new api to the Environment that allows clearing out not-needed
pages from the OS cache. This will be helpful when the compressed
block cache replaces the OS cache.
Test Plan: EnvPosixTest.InvalidateCache
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13041
Summary:
There is a use-case where we want to insert data into rocksdb as
fast as possible. Vector rep is used for this purpose.
The background flush thread needs to flush the vectorrep to
storage. It acquires the dblock then sorts the vector, releases
the dblock and then writes the sorted vector to storage. This is
suboptimal because the lock is held during the sort, which
prevents new writes for occuring.
This patch moves the sorting of the vector rep to outside the
db mutex. Performance is now as fastas the underlying storage
system. If you are doing buffered writes to rocksdb files, then
you can observe throughput upwards of 200 MB/sec writes.
This is an early draft and not yet ready to be reviewed.
Test Plan:
make check
Task ID: #
Blame Rev:
Reviewers: haobo
Reviewed By: haobo
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D12987
Summary: move the TwoPools test to the end of thread related tests. Otherwise, the SetBackgroundThreads call would increase the Low pool size and affect the result of other tests.
Test Plan: make env_test; ./env_test
Reviewers: dhruba, emayanke, xjin
Reviewed By: xjin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12939
Summary:
Added a new field called max_size_amplification_ratio in the
CompactionOptionsUniversal structure. This determines the maximum
percentage overhead of space amplification.
The size amplification is defined to be the ratio between the size of
the oldest file to the sum of the sizes of all other files. If the
size amplification exceeds the specified value, then min_merge_width
and max_merge_width are ignored and a full compaction of all files is done.
A value of 10 means that the size a database that stores 100 bytes
of user data could occupy 110 bytes of physical storage.
Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added.
Reviewers: haobo, emayanke, xjin
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12825
Summary:
this is the ground work for separating memtable flush jobs to their own thread pool.
Both SetBackgroundThreads and Schedule take a third parameter Priority to indicate which thread pool they are working on. The names LOW and HIGH are just identifiers for two different thread pools, and does not indicate real difference in 'priority'. We can set number of threads in the pools independently.
The thread pool implementation is refactored.
Test Plan: make check
Reviewers: dhruba, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12885
Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later.
Test Plan: make check; db_bench
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11709
Summary: The pupose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or sth else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transfering the bytes from os, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nano second stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call.
Test Plan: perf_context_test, also testing with viewstate shadow traffic.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D12351
Summary:
An iterator invokes reseek if the number of sequential skips over the
same userkey exceeds a configured number. This makes iter->Next()
faster (bacause of fewer key compares) if a large number of
adjacent internal keys in a table (sst or memtable) have the
same userkey.
Test Plan: Unit test DBTest.IterReseek.
Reviewers: emayanke, haobo, xjin
Reviewed By: xjin
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D11865
Summary:
Add new command "change_compaction_style" to ldb tool. For
universal->level, it shows "nothing to do". For level->universal, it
compacts all files into a single one and moves the file to level 0.
Also add check for number of files at level 1+ when opening db with
universal compaction style.
Test Plan:
'make all check'. New unit test for internal convertion function. Also manully test various
cmd like:
./ldb change_compaction_style --old_compaction_style=0
--new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: vamsi, emayanke
Differential Revision: https://reviews.facebook.net/D12603
Summary: There is a memory leak because TransformRepFactory does not delete its SliceTransform pointer. This patch adds a delete to the destructor.
Test Plan:
make check
make valgrind_check
Reviewers: dhruba, emayanke, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12513
Summary: Replace include/leveldb with include/rocksdb.
Test Plan:
make clean; make check
make clean; make release
Differential Revision: https://reviews.facebook.net/D12489
Summary:
This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.
UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.
VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.
PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.
I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).
Test Plan:
make -j32 check
./db_stress --memtablerep=vector
./db_stress --memtablerep=unsorted
./db_stress --memtablerep=prefixhash --prefix_size=10
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12117
Summary: added options to Dump() I missed in D12027. I also ran a script to look for other missing options and found a couple which I added. Should we also print anything for "PrepareForBulkLoad", "memtable_factory", and "statistics"? Or should we leave those alone since it's not easy to print useful info for those?
Test Plan: run anything and look at LOG file to make sure these are printed now.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12219
Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3. Also, make the CreateFilter code faster and cleaner.
Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test
Reviewers: dhruba
Reviewed By: dhruba
CC: haobo, emayanke
Differential Revision: https://reviews.facebook.net/D12027
Summary:
If we have same compaction filter for each compaction,
application cannot know about the different compaction processes.
Later on, we can put in more details in compaction filter for the
application to consume and use it according to its needs. For e.g. In
the universal compaction, we have a compaction process involving all the
files while others don't involve all the files. Applications may want to
collect some stats only when during full compaction.
Test Plan: run existing unit tests
Reviewers: haobo, dhruba
Reviewed By: dhruba
CC: xinyaohu, leveldb
Differential Revision: https://reviews.facebook.net/D12057
Summary: Currently, VersionEdit::DebugString always display internal keys in the original ascii format. This could cause manifest dump to be truncated if internal keys contain special charactors (like null). Also added an option --input_key_hex for ldb idump to indicate that the passed in user keys are in hex.
Test Plan: run ldb manifest_dump
Reviewers: dhruba, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12111
Summary:
This is the first step to fix unit tests and bugs for universal
compactiion. I added universal compaction option to ChangeOptions(), and
fixed all unit tests calling ChangeOptions(). Some of these tests
obviously assume more than 1 level and check file number/values in level
1 or above levels. I set kSkipUniversalCompaction for these tests.
The major bug I found is manual compaction with universal compaction never stops. I have put a fix for
it.
I have also set universal compaction as the default compaction and found
at least 20+ unit tests failing. I haven't looked into the details. The
next step is to check all unit tests without calling ChangeOptions().
Test Plan: make all check
Reviewers: dhruba, haobo
Differential Revision: https://reviews.facebook.net/D12051
Summary: rocksdb replicaiton will need this when writing value+TS from master to slave 'as is'
Test Plan: make
Reviewers: dhruba, vamsi, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11919
Summary:
This diff adds support for both soft and hard rate limiting. The following changes are included:
1) Options.rate_limit is renamed to Options.hard_rate_limit.
2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
3) Options.soft_rate_limit is added.
4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.
Test Plan:
make -j32 check
./db_stress
Reviewers: dhruba, haobo, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12003
Summary:
Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size.
I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer.
Test Plan:
new unit test, I am running db_test.
For passing paramter from configured option to Arena, I tried tests like:
TEST(DBTest, Arena_Option) {
std::string dbname = test::TmpDir() + "/db_arena_option_test";
DestroyDB(dbname, Options());
DB* db = nullptr;
Options opts;
opts.create_if_missing = true;
opts.arena_block_size = 1000000; // tested 99, 999999
Status s = DB::Open(opts, dbname, &db);
db->Put(WriteOptions(), "a", "123");
}
and printed some debug info. The results look good. Any suggestion for such a unit-test?
Reviewers: haobo, dhruba, emayanke, jpaton
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D11799
Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations.
Test Plan:
make clean
make -j32 check
./db_stress
Reviewers: dhruba, emayanke, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11739
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.
Test Plan: db_stress;make check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D11745
Summary:
Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
1. Put of delete type
2. Space in the db,and
3. Compaction time
Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved
Reviewers: dhruba, haobo, vamsi
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11607
Summary: This diff added a command 'idump' to ldb tool, which dumps the internal key/value pairs. It could be useful for diagnosis and estimating the per user key 'overhead'. Also cleaned up the ldb code a bit where I touched.
Test Plan: make check; ldb idump
Reviewers: emayanke, sheki, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11517
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions. Files from L0 are
compacted back into L0. This meat of this compaction algorithm
is in PickCompactionHybrid().
All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted. Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.
The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.
The above logic should reduce write amplification to a
large extent... will publish numbers shortly.
Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
Differential Revision: https://reviews.facebook.net/D11289
Summary: [start_time, end_time) is waht I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff.
Test Plan: python ldb_test.py, make ttl_test, Run mcrocksdb-backup tool, Run the ldb tool on 2 mcrocksdb production backups form sigmafio033.prn1
Reviewers: vamsi, haobo
Reviewed By: vamsi
Differential Revision: https://reviews.facebook.net/D11433
Summary:
Scan and Dump commands in ldb use iterator. We need to also print timestamp for ttl databases for debugging. For this I create a TtlIterator class pointer in these functions and assign it the value of Iterator pointer which actually points to t TtlIterator object, and access the new function ValueWithTS which can return TS also. Buckets feature for dump command: gives a count of different key-values in the specified time-range distributed across the time-range partitioned according to bucket-size. start_time and end_time are specified in unixtimestamp and bucket in seconds on the user-commandline
Have commented out 3 ines from ldb_test.py so that the test does not break right now. It breaks because timestamp is also printed now and I have to look at wildcards in python to compare properly.
Test Plan: python tools/ldb_test.py
Reviewers: vamsi, dhruba, haobo, sheki
Reviewed By: vamsi
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11403
Summary: as title, also removed an incorrect assertion
Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0
Reviewers: dhruba, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11367
Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise.
Test Plan: make check; db_bench
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11295
Summary:
Merge multiple multiple memtables in memory before writing it
out to a file in L0.
There is a new config parameter min_write_buffer_number_to_merge
that specifies the number of write buffers that should be merged
together to a single file in storage. The system will not flush
wrte buffers to storage unless at least these many buffers have
accumulated in memory.
The default value of this new parameter is 1, which means that
a write buffer will be immediately flushed to disk as soon it is
ready.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D11241
Summary:
Use a bit set to keep track of which random number is generated.
Currently only supports single-threaded. All our perf tests are run with threads=1
Copied over bitset implementation from common/datastructures
Test Plan: printed the generated keys, and verified all keys were present.
Reviewers: MarkCallaghan, haobo, dhruba
Reviewed By: MarkCallaghan
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11247
Summary:
During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
This diff simply asks the OS to sync data incrementally as they are written, on the background. The hope is that, at the final sync, most of the data are already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer number of compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
For the write only test 800bytes test:
Before the change: await occillate between 10ms and 3ms
After the change: await ranges 1-3ms
Will test against read-modify-write workload too, see if high read latency P99 could be resolved.
Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11115
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
Test Plan: make check; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11175
Summary:
This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now.
Quickly tested 8 thread cpu bound 100 byte random read.
No fast mutex: ~750k/s ops
With fast mutex: ~880k/s ops
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D11031
Summary:
Current posix advice implementation ties up the access pattern hint with the creation of a file.
It is not possible to apply different advice for different access (random get vs compaction read),
without keeping two open files for the same table. This patch extended the RandomeAccessFile interface
to accept new access hint at anytime. Particularly, we are able to set different access hint on the same
table file based on when/how the file is used.
Two options are added to set the access hint, after the file is first opened and after the file is being
compacted.
Test Plan: make check; db_stress; db_bench
Reviewers: dhruba
Reviewed By: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D10905
Summary: a new option block_size_deviation is added.
Test Plan: run db_test and db_bench
Reviewers: dhruba, haobo
Reviewed By: haobo
Differential Revision: https://reviews.facebook.net/D10821
Summary: a new option block_size_deviation is added.
Test Plan: run db_test and db_bench
Reviewers: dhruba, haobo
Reviewed By: haobo
Differential Revision: https://reviews.facebook.net/D10821
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By defauly, it's set to a very big number 3600 (1 hour).
Test Plan: make check;
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10761
Summary:
The valgrind errors were in the unit tests where we change the
number of levels of a database using internal methods.
Test Plan:
valgrind ./reduce_levels_test
valgrind ./db_test
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10893
Summary:
This is initial version. A few ways in which this could
be extended in the future are:
(a) Killing from more places in source code
(b) Hashing stack and using that hash in determining whether to crash.
This is to avoid crashing more often at source lines that are executed
more often.
(c) Raising exceptions or returning errors instead of killing
Test Plan:
This whole thing is for testing.
Here is part of output:
python2.7 tools/db_crashtest2.py -d 600
Running db_stress
db_stress retncode -15 output LevelDB version : 1.5
Number of threads : 32
Ops per thread : 10000000
Read percentage : 50
Write-buffer-size : 4194304
Delete percentage : 30
Max key : 1000
Ratio #ops/#keys : 320000
Num times DB reopens: 0
Batches/snapshots : 1
Purge redundant % : 50
Num keys per lock : 4
Compression : snappy
------------------------------------------------
No lock creation because test_batches_snapshots set
2013/04/26-17:55:17 Starting database operations
Created bg thread 0x7fc1f07ff700
... finished 60000 ops
Running db_stress
db_stress retncode -15 output LevelDB version : 1.5
Number of threads : 32
Ops per thread : 10000000
Read percentage : 50
Write-buffer-size : 4194304
Delete percentage : 30
Max key : 1000
Ratio #ops/#keys : 320000
Num times DB reopens: 0
Batches/snapshots : 1
Purge redundant % : 50
Num keys per lock : 4
Compression : snappy
------------------------------------------------
Created bg thread 0x7ff0137ff700
No lock creation because test_batches_snapshots set
2013/04/26-17:56:15 Starting database operations
... finished 90000 ops
Revert Plan: OK
Task ID: #2252691
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D10581
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If client is ok with losing most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.
Test Plan: make check; db_stress
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, emayanke
Differential Revision: https://reviews.facebook.net/D10869
Summary:
There is an existing field Options.max_bytes_for_level_multiplier that
sets the multiplier for the size of each level in the database.
This patch introduces the ability to set different multipliers
for every level in the database. The size of a level is determined
by using both max_bytes_for_level_multiplier as well as the
per-level fanout.
size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier
* fanout[i-1]
The default value of fanout is 1, so that it is backward compatible.
Test Plan: make check
Reviewers: haobo, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10863
Summary:
PosixLogger and AutoRollLogger do not seem to be thread safe.
For PosixLogger, log_size_ is not atomically updated.
For AutoRollLogger, the underlying logger_ might be deleted by
one thread while still being accessed by another.
Test Plan: make check
Reviewers: kailiu, dhruba, heyongqiang
Reviewed By: kailiu
CC: leveldb, zshao, sheki
Differential Revision: https://reviews.facebook.net/D9699
Summary:
Make stop watch a simple implementation, instead of subclass of a virtual class
Allocate stop watches off the stack instead of heap.
Code is more terse now.
Test Plan: make all check, db_bench with --statistics=1
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10809
Summary: Statistics.h and histogram.h had double based api's to record values. Remove them as they are not used anywhere
Test Plan: make all check
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10815
Summary: ldb works with raw data from the database and needs to be aware of ttl-database to work with it meaningfully. '-ttl' option now tells it that. Also added onto the ldb_test.py test. This option may be specified alongwith put, get, scan or dump. There is no support to provide a ttl-value and it uses default forever because there is no use-case for this currently.
Test Plan: make ldb_test; python tools/ldb_test.py
Reviewers: dhruba, sheki, haobo, vamsi
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10797
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOpertor, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10773
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.
Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h
Please go over local/my_test.cc carefully, as it is a concerete use case.
Please also review the impelmentation files to see if the straw man implementation makes sense.
Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.
Future work:
- Integration with compaction
- A raw Get implementation
I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.
Test Plan: run all local tests
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb, zshao, sheki, emayanke, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D9651
Summary:
Mark's task description from #2316777
Env::Default() comes from util/env_posix.cc
This is a static global.
static PosixEnv default_env;
Env* Env::Default() {
return &default_env;
}
-----
These globals assume default_env was initialized first. I don't think that is safe or correct to do (http://stackoverflow.com/questions/1005685/c-static-initialization-order)
const string AutoRollLoggerTest::kTestDir(
test::TmpDir() + "/db_log_test");
const string AutoRollLoggerTest::kLogFile(
test::TmpDir() + "/db_log_test/LOG");
Env* AutoRollLoggerTest::env = Env::Default();
Test Plan:
run make clean && make && make check
But how can I know if it works in Ubuntu?
Reviewers: MarkCallaghan, chip
Reviewed By: chip
CC: leveldb, dhruba, haobo
Differential Revision: https://reviews.facebook.net/D10491
Summary:
RocksDB doesn't build on Ubuntu VM .. shoudl be fixed with this patch.
g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
util/env_posix.cc:68:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:68:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer’
util/env_posix.cc:113:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:113:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer
Test Plan: make check
Reviewers: sheki, leveldb
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D10461
Summary:
Adds the --writes_per_second rate limit for the readwhilewriting test.
The purpose is to optionally avoid saturating storage with writes & compaction
and test read response time when some writes are being done.
Changes the histogram code to also print the p99.99 value
Task ID: #
Blame Rev:
Test Plan:
make check, ran db_bench with it
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10305
Summary: forgot to include signal_test.cc
Test Plan: make check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10281
Summary:
This diff provides the ability to print out a stacktrace when the process receives certain signals.
Currently, we enable this for the following signals (program error related):
SIGILL SIGSEGV SIGBUS SIGABRT
Application simply #include "util/stack_trace.h" and call leveldb::InstallStackTraceHandler() during initialization, if signal handler is needed. It's not done automatically when openning db, because it's the application(process)'s responsibility to install signal handler and some applications might already have their own (like fbcode).
Sample output:
Received signal 11 (Segmentation fault)
#0 0x408ff0 ./signal_test() [0x408ff0] /home/haobo/rocksdb/util/signal_test.cc:4
#1 0x40827d ./signal_test() [0x40827d] /home/haobo/rocksdb/util/signal_test.cc:24
#2 0x7f8bb183172e /usr/local/fbcode/gcc-4.7.1-glibc-2.14.1/lib/libc.so.6(__libc_start_main+0x10e) [0x7f8bb183172e] ??:0
#3 0x408ebc ./signal_test() [0x408ebc] /home/engshare/third-party/src/glibc/glibc-2.14.1/glibc-2.14.1/csu/../sysdeps/x86_64/elf/start.S:113
Segmentation fault (core dumped)
For each frame, we print the raw pointer, the symbol provided by backtrace_symbols (still not good enough), and the source file/line. Note that address translation is done by directly shell out to addr2line. ??:0 means addr2line fails to do the translation. Hacky, but I think it's good for now.
Test Plan: signal_test.cc
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10173
Summary: As title. Found out this when testing stack_trace.cc portability.
Test Plan: make check; manual test 'non-linux' build by forcing OS_LINUX2
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10263
Summary: Primarily a refactor. Introduced LDBTool interface to which customers can plug in their options and this will create their own version of ldb tool.
Test Plan: made ldb tool and tried it.
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10191
Summary:
The background compaction threads are never exitted and therefore caused
memory-leaks while running rpcksdb tests. Have changed the PosixEnv destructor to exit and join them and changed the tests likewise
The memory leaked has reduced from 320 bytes to 64 bytes in all the tests. The 64
bytes is relating to
pthread_exit, but still have to figure out why. The stack-trace right now with
table_test.cc = 64 bytes in 1 blocks are possibly lost in loss record 4 of 5
at 0x475D8C: malloc (jemalloc.c:914)
by 0x400D69E: _dl_map_object_deps (dl-deps.c:505)
by 0x4013393: dl_open_worker (dl-open.c:263)
by 0x400F015: _dl_catch_error (dl-error.c:178)
by 0x4013B2B: _dl_open (dl-open.c:569)
by 0x5D3E913: do_dlopen (dl-libc.c:86)
by 0x400F015: _dl_catch_error (dl-error.c:178)
by 0x5D3E9D6: __libc_dlopen_mode (dl-libc.c:47)
by 0x5048BF3: pthread_cancel_init (unwind-forcedunwind.c:53)
by 0x5048DC9: _Unwind_ForcedUnwind (unwind-forcedunwind.c:126)
by 0x5046D9F: __pthread_unwind (unwind.c:130)
by 0x50413A4: pthread_exit (pthreadP.h:289)
Test Plan: make all check
Reviewers: dhruba, sheki, haobo
Reviewed By: dhruba
CC: leveldb, chip
Differential Revision: https://reviews.facebook.net/D9573
Summary: as subject. This is causing problem in adsconv. Ideally, this flags should be set in open. But that is only supported in Linux kernel ≥2.6.23 and glibc ≥2.7.
Test Plan:
db_test
run db_test
Reviewers: dhruba, MarkCallaghan, haobo
Reviewed By: dhruba
CC: leveldb, chip
Differential Revision: https://reviews.facebook.net/D10089
Summary:
1. The stock LRUCache nukes itself whenever the working set (the total number of entries not released by client at a certain time) is bigger than the cache capacity.
See https://our.dev.facebook.com/intern/tasks/?t=2252281
2. There's a bug in shard calculation leading to segmentation fault when only one shard is needed.
Test Plan: make check
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb, zshao, sheki
Differential Revision: https://reviews.facebook.net/D9927
Summary:
1. SetBackgroundThreads was not thread safe
2. queue_size_ does not seem necessary
3. moved condition signal after shared state change. Even though the original
order is in practice ok (because the mutex is still held), it looks fishy
and non-intuitive.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D9825
Summary:
Use non mmapd files for Write-Ahead log.
Earlier use of MMaped files. made the log iterator read ahead and miss records.
Now the reader and writer will point to the same physical location.
There is no perf regression :
./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
with This diff :
fillseq : 10.756 micros/op 185281 ops/sec; 20.5 MB/s
without this dif :
fillseq : 11.085 micros/op 179676 ops/sec; 19.9 MB/s
Test Plan: unit test included
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9741
Summary:
Earlier Statistics object was a raw pointer. This meant the user had to clear up
the Statistics object after creating the database. In most use cases the database is created in a function and the statistics pointer is out of scope. Hence the statistics object would never be deleted.
Now Using a shared_ptr to manage this.
Want this in before the next release.
Test Plan: make all check.
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9735
Summary: This caused compilation problems on some gcc platforms during the third-partyrelease
Test Plan: make
Reviewers: sheki
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9627
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.
The default setting remains the same (and is backward compatible):
1. use bufferedio
2. do not use mmaps for reads
3. use mmap for writes
4. use readaheads for reads needed for compaction
I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.
Test Plan: make check
Reviewers: sheki, heyongqiang, MarkCallaghan
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9429
Summary: Getting rid of boost in our github codebase which caused problems on third-party
Test Plan: make ldb; python tools/ldb_test.py
Reviewers: sheki, dhruba
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9543