Summary: In our project, when writing to the database, we want to form the value as the concatenation of a small header and a larger payload. It's a shame to have to copy the payload just so we can give RocksDB API a linear view of the value. Since RocksDB makes a copy internally, it's easy to support gather writes.
Test Plan: write_batch_test, new test case
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13947
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.
It might speed things up a bit, but makes code uglier. I don't really like it.
Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.
Let's discuss offline today.
Test Plan: Ran ./db_stress, verified that files are getting deleted
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D13827
Summary: Changed the name and interface for creating HashSkipListRep. Forgot to change it in db_test.
Test Plan: make db_test
Reviewers: haobo
Reviewed By: haobo
Differential Revision: https://reviews.facebook.net/D13965
Summary: What @haobo done with TransformRep, now in TransformRepNoLock. Similar implementation, except that I made DynamicIterator a subclass of Iterator which makes me have less iterator initializations.
Test Plan: ./prefix_test. Seeing huge savings vs. TransformRep again!
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: haobo
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D13953
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.
Test Plan: make check
Reviewers: dhruba, haobo, sdong, igor
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13875
Summary: I this bug from valgrind report and found a place that may potentially leak memory.
Test Plan: re-ran the valgrind and no error any more
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13959
Summary:
Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held.
We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem.
Test Plan: db_test
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13935
Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too.
Test Plan: test it on Leaf
Reviewers: dhruba, kailiu, sdong, igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13929
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.
Test Plan:
1. Unit tests pass.
2. Benchmark.
Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13869
Summary: We have to be able to catch last few log outputs before a crash
Test Plan: no
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13917
Summary:
This fixes#3130525. Dhruba's suggestion and Tnovak's implementation :)
The issue was with SkipEmptyDataBlocksForward(), but I also changed SkipEmptyDataBlocksBackward(). Is that OK?
Test Plan: Run the logdevice test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13911
Summary:
I'm sending this diff together with https://reviews.facebook.net/D13881 because it didn't allow me to send only the array one.
Here I also replaced unordered_map with just an array of shared_ptrs. This elminated all the locks.
I will run the new benchmark and post the results here.
Test Plan: db_test
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13893
Summary: as title, half baked test for prefixhash memtable. Also contains deadlock test option
Test Plan: run it
Reviewers: igor, dhruba
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13887
Summary:
The problem was that there was only a single key-value in a block
and its compressibility was less than 88%. Rocksdb refuses to
compress a block unless its compresses to lesser than 88% of its
original size. If a block is not compressed, it does nto get inserted
into the compressed block cache.
Create the test data so that multiple records fit into the same
data block. This increases the compressibility of these data block.
Test Plan: ./db_test
Reviewers: kailiu, haobo
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13905
Summary:
Fixed valgrind error in DBTest.CompressedCache.
This fixes the valgrind error (thanks to Haobo). I am still trying to reproduce the test-failure case deterministically.
Test Plan: db_test
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13899
Summary:
strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files
Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application.
Set NotFoudn::gaps status to give an indication of gaps happeneing.
Polished the inline documentation at various places
Test Plan:
* db_repl_stress test
* db_test relating to transaction log iterator
* fbcode/wormhole/rocksdb/rocks_log_iterator
* sigma production machine sigmafio032.prn1
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13689
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.
It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.
Test Plan: Unit test case attached.
Reviewers: kailiu, sdong, haobo, leveldb
Reviewed By: haobo
CC: xjin, haobo
Differential Revision: https://reviews.facebook.net/D12675
Summary: Added an option --count_delim=<char> which takes the given character as delimiter ('.' by default) and reports count of each row type found in the db
Test Plan:
1. Created test in file (for DBDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",".
2. Created test in file (for InternalDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",".
3. Manually created a database with several keys of several type and verified by running the command
./ldb db=<path> dump --count_delim="<char>"
./ldb db=<path> idump --count_delim="<char>"
Reviewers: vamsi, dhruba, emayanke, kailiu
Reviewed By: vamsi
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13815
Summary:
I'm figuring out how Version[Set, Edit, ] classes work and I stumbled on this.
It doesn't seem that the comment is accurate anymore. What I read is when the manifest grows too big, create a new file (and not only when we call LogAndApply for the first time).
Test Plan: make check (currently running)
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13839
Summary: This might help with p99 performance, but does not solve the real problem. More discussion on #2947135
Test Plan: make check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13809
Summary: Iterator benchmark case is timed incorrectly. Fix it
Test Plan: Run the benchmark
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13845
Summary: It is a very simple benchmark to measure a Table implementation's Get() and iterator performance if all the data is in memory.
Test Plan: N/A
Reviewers: dhruba, haobo, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13743
Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently.
Test Plan: make check
Reviewers: dhruba, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13683
Summary: Don't define if already defined.
Test Plan: Running make release in parallel, will not commit if it fails.
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13833
Summary:
Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case:
1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot
2. Latest value type is a 'put' ie kTypeValue
3. New value size is less than existing value, to avoid reallocating memory
TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge.
TODO: Update the transaction log, to allow consistent reload of the memtable.
Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next.
Reviewers: xinyaohu, sumeet, haobo, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12423
Automatic commit by arc
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test
Test Plan: pass test suites.
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13785
Summary: Two memory issues caused valgrind_check to fail on simple_table_db_test. Fix it
Test Plan: make -j32 valgrind_check
Reviewers: kailiu, haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13773
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.
Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.
Reviewers: dhruba, haobo, kailiu, emayanke, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13521
Summary: assert(Overlap) significantly slows down the benchmark. Ignore assertions when executing blob_store_bench.
Test Plan: Ran the benchmark
Reviewers: dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13737
Summary: Apparently C++ doesn't like it if you copy around its atomic<> variables. When running a benchmark for a longer time, benchmark used to stall. Changed WorkerThread in config to WorkerThread*. It works now.
Test Plan: Ran benchmark
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13731
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`
Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13491
Summary:
A latest valgrind test found a recently added unit test has memory leak,
which is because DB is not closed at the end of the test.
Test Plan: re-run the valgrind locally and make sure there's no memory leakage any more.
Reviewers: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13725
Summary:
When a Put fails, it can leave database in a messy state. We don't want to pretend that everything is OK when it may not be. We fail every write following the failed one.
I added checks for corruption to DBImpl::Write(). Is there anywhere else I need to add them?
Test Plan: Corruption unit test.
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13671
Summary:
This bug may affect the seek performance.
Test Plan:
make
make check
Also gdb into some index block builder to make sure the restart_block_interval is `1`.
Summary: Fix the unused variable warning for `first` when running `make release`
Test Plan:
make
make check
Reviewers: dhruba, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13695
Summary:
This is to simplify rocksdb public APIs and improve the code quality.
Created an additional parameter to ParseFileName for log sub type and improved the code for deleting a wal file.
Wrote exhaustive unit-tests in delete_file_test
Unification of other redundant APIs can be taken up in a separate diff
Test Plan: Expanded delete_file test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13647
Summary:
Crash may occur during the flushes of more than two mem tables.
As the info log suggested, even when both were successfully flushed,
the recovery process still pick up one of the memtable's log for recovery.
This diff fix the problem by setting the correct "log number" in MANIFEST.
Test Plan: make test; deployed to leaf4 and make sure it doesn't result in crashes of this type.
Reviewers: haobo, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13659
Summary: Converted db_stress, db_repl_stress and db_bench to use gflags
Test Plan: I tested by printing out all the flags from old and new versions. Tried defaults, + various combinations with "interesting flags". Also, tested by running db_crashtest.py and db_crashtest2.py.
Reviewers: emayanke, dhruba, haobo, kailiu, sdong
Reviewed By: emayanke
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D13581
Summary:
Finally, arc diff works again! This has been sitting in my repo for a while.
I would like some comments on my BlobStore benchmark. We don't have to check this in.
Also, I don't do any fsync in the BlobStore, so this is all extremely fast. I'm not sure what durability guarantees we need from the BlobStore.
Test Plan: Nope
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13527
Summary:
I added max_size option in blobstore. Since we now know the maximum number of buckets we'll ever use, we can allocate an array of buckets and access its elements without use of any locks! Common case Get doesn't lock anything now.
Benchmarks on 16KB block size show no impact on speed, though.
Test Plan: unittests + benchmark
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13641
Summary: as title
Test Plan: make check; ./perf_context_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13629
Summary: as per Dhruba's suggestion
Test Plan: make all check; Seen the Id getting generated properly in db_repl_stress
Reviewers: dhruba, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13635