Commit Graph

837 Commits

Author SHA1 Message Date
lovro
8a46ecd357 WriteBatch::Put() overload that gathers key and value from arrays of slices
Summary: In our project, when writing to the database, we want to form the value as the concatenation of a small header and a larger payload.  It's a shame to have to copy the payload just so we can give RocksDB API a linear view of the value.  Since RocksDB makes a copy internally, it's easy to support gather writes.

Test Plan: write_batch_test, new test case

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13947
2013-11-08 16:34:32 -08:00
Igor Canadi
1510339e52 Speed up FindObsoleteFiles
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.

It might speed things up a bit, but makes code uglier. I don't really like it.

Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.

Let's discuss offline today.

Test Plan: Ran ./db_stress, verified that files are getting deleted

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D13827
2013-11-08 15:23:46 -08:00
Igor Canadi
dd218bbc88 Forgot to change interface everywhere
Summary: Changed the name and interface for creating HashSkipListRep. Forgot to change it in db_test.

Test Plan: make db_test

Reviewers: haobo

Reviewed By: haobo

Differential Revision: https://reviews.facebook.net/D13965
2013-11-08 12:23:12 -08:00
Igor Canadi
8b3379dc0a Implementing DynamicIterator for TransformRepNoLock
Summary: What @haobo done with TransformRep, now in TransformRepNoLock. Similar implementation, except that I made DynamicIterator a subclass of Iterator which makes me have less iterator initializations.

Test Plan: ./prefix_test. Seeing huge savings vs. TransformRep again!

Reviewers: dhruba, haobo, sdong, kailiu

Reviewed By: haobo

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D13953
2013-11-08 00:31:09 -08:00
Kai Liu
fd075d6edd Provide mechanism to configure when to flush the block
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.

Test Plan: make check

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13875
2013-11-07 21:27:21 -08:00
Kai Liu
bba6595b1f Fix the valgrind error
Summary: I this bug from valgrind report and found a place that may potentially leak memory.

Test Plan: re-ran the valgrind and no error any more

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13959
2013-11-07 15:46:48 -08:00
Igor Canadi
444cf88a56 Flush the log outside of lock
Summary:
Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held.

We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem.

Test Plan: db_test

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13935
2013-11-07 11:31:56 -08:00
Haobo Xu
fd2044883a [RocksDB] Generalize prefix-aware iterator to be used for more than one Seek
Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too.

Test Plan: test it on Leaf

Reviewers: dhruba, kailiu, sdong, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13929
2013-11-06 20:45:49 -08:00
shamdor
c2be2cba04 WAL log retention policy based on archive size.
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.

Test Plan:
1. Unit tests pass.
2. Benchmark.

Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13869
2013-11-06 18:46:28 -08:00
Dhruba Borthakur
292c2b3357 Fix stress test failure when using mmap-reads.
Summary:
The mmap-read file->Read() does not use the scratch buffer to
read in file-contents.

Test Plan: ./db_stress --test_batches_snapshots=1 --ops_per_thread=100000000 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=0 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/dhruba --max_key=100000000 --disable_seek_compaction=0 --mmap_read=1 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=1 --disable_wal=0 --disable_data_sync=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: leveldb, kailiu, emayanke

Differential Revision: https://reviews.facebook.net/D13923
2013-11-06 15:40:26 -08:00
Igor Canadi
95a8213f75 Log flush every 0 seconds
Summary: We have to be able to catch last few log outputs before a crash

Test Plan: no

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13917
2013-11-06 14:19:46 -08:00
Igor Canadi
36409e0016 Fix slow no-io iterator
Summary:
This fixes #3130525. Dhruba's suggestion and Tnovak's implementation :)

The issue was with SkipEmptyDataBlocksForward(), but I also changed SkipEmptyDataBlocksBackward(). Is that OK?

Test Plan: Run the logdevice test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13911
2013-11-06 14:11:52 -08:00
Igor Canadi
be96f2498e TransformRep - use array instead of unordered_map
Summary:
I'm sending this diff together with https://reviews.facebook.net/D13881 because it didn't allow me to send only the array one.

Here I also replaced unordered_map with just an array of shared_ptrs. This elminated all the locks.

I will run the new benchmark and post the results here.

Test Plan: db_test

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13893
2013-11-06 11:55:43 -08:00
Haobo Xu
fe4a449472 [RocksDB] prefixhash memtable test
Summary: as title, half baked test for prefixhash memtable. Also contains deadlock test option

Test Plan: run it

Reviewers: igor, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13887
2013-11-05 23:20:10 -08:00
Kai Liu
f0b0b28f9a Remove invalid items in .gitignore 2013-11-05 21:04:22 -08:00
Dhruba Borthakur
39190588e1 Fix failure in rocksdb unit test CompressedCache
Summary:
The problem was that there was only a single key-value in a block
and its compressibility was less than 88%. Rocksdb refuses to
compress a block unless its compresses to lesser than 88% of its
original size. If a block is not compressed, it does nto get inserted
into the compressed block cache.

Create the test data so that multiple records fit into the same
data block. This increases the compressibility of these data block.

Test Plan: ./db_test

Reviewers: kailiu, haobo

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13905
2013-11-05 16:11:34 -08:00
Dhruba Borthakur
7845fd9db9 Fixed valgrind error in DBTest.CompressedCache
Summary:
Fixed valgrind error in DBTest.CompressedCache.
This fixes the valgrind error (thanks to Haobo). I am still trying to reproduce the test-failure case deterministically.

Test Plan: db_test

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13899
2013-11-05 11:12:39 -08:00
Mayank Agarwal
f837f5b1c9 Making the transaction log iterator more robust
Summary:
strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files
Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application.
Set NotFoudn::gaps status to give an indication of gaps happeneing.
Polished the inline documentation at various places

Test Plan:
* db_repl_stress test
* db_test relating to transaction log iterator
* fbcode/wormhole/rocksdb/rocks_log_iterator
* sigma production machine sigmafio032.prn1

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13689
2013-11-04 20:49:03 -08:00
Dhruba Borthakur
b4ad5e89ae Implement a compressed block cache.
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.

It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.

Test Plan: Unit test case attached.

Reviewers: kailiu, sdong, haobo, leveldb

Reviewed By: haobo

CC: xjin, haobo

Differential Revision: https://reviews.facebook.net/D12675
2013-11-01 14:31:35 -07:00
Piyush Garg
1e4375d2ef Task #3071144 Enhance ldb (db dump tool for leveldb) to report row counters for each row type
Summary: Added an option --count_delim=<char> which takes the given character as delimiter ('.' by default) and reports count of each row type found in the db

Test Plan:
1. Created test in file (for DBDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",".
2. Created test in file (for InternalDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",".
3. Manually created a database with several keys of several type and verified by running the command
   ./ldb db=<path> dump --count_delim="<char>"
   ./ldb db=<path> idump --count_delim="<char>"

Reviewers: vamsi, dhruba, emayanke, kailiu

Reviewed By: vamsi

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13815
2013-11-01 13:59:14 -07:00
Igor Canadi
beeb74be6f Move I/O outside of lock
Summary:
I'm figuring out how Version[Set, Edit, ] classes work and I stumbled on this.

It doesn't seem that the comment is accurate anymore. What I read is when the manifest grows too big, create a new file (and not only when we call LogAndApply for the first time).

Test Plan: make check (currently running)

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13839
2013-11-01 12:32:27 -07:00
Igor Canadi
b572e81f94 Flush Log every 5 seconds
Summary: This might help with p99 performance, but does not solve the real problem. More discussion on #2947135

Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13809
2013-10-31 15:36:40 -07:00
Siying Dong
82b7e37f6e Fix a bug of table_reader_bench
Summary: Iterator benchmark case is timed incorrectly. Fix it

Test Plan: Run the benchmark

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13845
2013-10-31 15:26:06 -07:00
Siying Dong
7caadf2e52 A very simple benchmark to measure Table implemenation's Get() And Iterator performance
Summary: It is a very simple benchmark to measure a Table implementation's Get() and iterator performance if all the data is in memory.

Test Plan: N/A

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13743
2013-10-31 13:38:54 -07:00
Haobo Xu
8cbe5bb56b [RocksDB] Add OnCompactionStart to CompactionFilter class
Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently.

Test Plan: make check

Reviewers: dhruba, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13683
2013-10-31 13:36:43 -07:00
Naman Gupta
b4fab3be2a Merge branch 'master' of github.com:facebook/rocksdb into inplace 2013-10-31 11:51:03 -07:00
Igor Canadi
138a8eee8e Fix make release
Summary: Don't define if already defined.

Test Plan: Running make release in parallel, will not commit if it fails.

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13833
2013-10-31 11:47:22 -07:00
Naman Gupta
fe25070242 In-place updates for equal keys and similar sized values
Summary:
Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case:
1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot
2. Latest value type is a 'put' ie kTypeValue
3. New value size is less than existing value, to avoid reallocating memory

TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge.
TODO: Update the transaction log, to allow consistent reload of the memtable.

Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next.

Reviewers: xinyaohu, sumeet, haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12423

Automatic commit by arc
2013-10-31 11:27:12 -07:00
Siying Dong
f03b2df010 Follow-up Cleaning-up After D13521
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test

Test Plan: pass test suites.

Reviewers: haobo, dhruba, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13785
2013-10-30 10:52:33 -07:00
Siying Dong
068a819ac9 Fix valgrind_check problem of simple_table_db_test.cc
Summary: Two memory issues caused valgrind_check to fail on simple_table_db_test. Fix it

Test Plan: make -j32 valgrind_check

Reviewers: kailiu, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13773
2013-10-29 14:29:03 -07:00
Kai Liu
79d8dad331 Change a typo in method signature 2013-10-28 21:23:17 -07:00
Siying Dong
d4eec30ed0 Make "Table" pluggable
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.

Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.

Reviewers: dhruba, haobo, kailiu, emayanke, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13521
2013-10-28 17:54:09 -07:00
Igor Canadi
8ace6b0f91 Run benchmark with no debug
Summary: assert(Overlap) significantly slows down the benchmark. Ignore assertions when executing blob_store_bench.

Test Plan: Ran the benchmark

Reviewers: dhruba, haobo, kailiu

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13737
2013-10-28 17:42:33 -07:00
Igor Canadi
17991cd5a0 Fix data race in BlobStore benchmark
Summary: Apparently C++ doesn't like it if you copy around its atomic<> variables. When running a benchmark for a longer time, benchmark used to stall. Changed WorkerThread in config to WorkerThread*. It works now.

Test Plan: Ran benchmark

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13731
2013-10-28 16:31:44 -07:00
Kai Liu
994575c134 Support user-defined table stats collector
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`

Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13491
2013-10-28 15:45:14 -07:00
Kai Liu
7e91b86f4d Fix a valgrind warning
Summary:
A latest valgrind test found a recently added unit test has memory leak,
which is because DB is not closed at the end of the test.

Test Plan: re-run the valgrind locally and make sure there's no memory leakage any more.

Reviewers: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13725
2013-10-28 14:34:27 -07:00
Igor Canadi
100fa8e013 If a Put fails, fail all other puts
Summary:
When a Put fails, it can leave database in a messy state. We don't want to pretend that everything is OK when it may not be. We fail every write following the failed one.

I added checks for corruption to DBImpl::Write(). Is there anywhere else I need to add them?

Test Plan: Corruption unit test.

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13671
2013-10-28 12:36:02 -07:00
Kai Liu
1ca86f0391 Fix a bug that index block's restart_block_interval is not 1
Summary:

This bug may affect the seek performance.

Test Plan:

make
make check

Also gdb into some index block builder to make sure the restart_block_interval is `1`.
2013-10-28 10:51:34 -07:00
Kai Liu
a1d38a41fd fix the error message in debug mode
Summary:

my fix patch introduced a new error in debug mode.

Test Plan:

`make` and `make release`
2013-10-27 23:11:13 -07:00
Kai Liu
39c14891b6 Fix the gcc warning for unused variable
Summary: Fix the unused variable warning for `first` when running `make release`

Test Plan:
make
make check

Reviewers: dhruba, igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13695
2013-10-27 22:57:45 -07:00
Mayank Agarwal
56305221c4 Unify DeleteFile and DeleteWalFiles
Summary:
This is to simplify rocksdb public APIs and improve the code quality.
Created an additional parameter to ParseFileName for log sub type and improved the code for deleting a wal file.
Wrote exhaustive unit-tests in delete_file_test
Unification of other redundant APIs can be taken up in a separate diff

Test Plan: Expanded delete_file test

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13647
2013-10-25 08:32:14 -07:00
Kai Liu
c17607a251 Fix the log number bug when updating MANIFEST file
Summary:
Crash may occur during the flushes of more than two mem tables.

As the info log suggested, even when both were successfully flushed,
the recovery process still pick up one of the memtable's log for recovery.

This diff fix the problem by setting the correct "log number" in MANIFEST.

Test Plan: make test; deployed to leaf4 and make sure it doesn't result in crashes of this type.

Reviewers: haobo, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13659
2013-10-24 21:05:33 -07:00
Slobodan Predolac
e44976b199 Conversion of db_bench, db_stress and db_repl_stress to use gflags
Summary: Converted db_stress, db_repl_stress and db_bench to use gflags

Test Plan: I tested by printing out all the flags from old and new versions. Tried defaults, + various combinations with "interesting flags". Also, tested by running db_crashtest.py and db_crashtest2.py.

Reviewers: emayanke, dhruba, haobo, kailiu, sdong

Reviewed By: emayanke

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D13581
2013-10-24 07:43:14 -07:00
Igor Canadi
7e2c1ba173 BlobStore Benchmark
Summary:
Finally, arc diff works again! This has been sitting in my repo for a while.

I would like some comments on my BlobStore benchmark. We don't have to check this in.

Also, I don't do any fsync in the BlobStore, so this is all extremely fast. I'm not sure what durability guarantees we need from the BlobStore.

Test Plan: Nope

Reviewers: dhruba, haobo, kailiu, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13527
2013-10-23 17:31:12 -07:00
Igor Canadi
cb8a7302e4 Implement max_size in BlobStore
Summary:
I added max_size option in blobstore. Since we now know the maximum number of buckets we'll ever use, we can allocate an array of buckets and access its elements without use of any locks! Common case Get doesn't lock anything now.

Benchmarks on 16KB block size show no impact on speed, though.

Test Plan: unittests + benchmark

Reviewers: dhruba, haobo, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13641
2013-10-23 14:38:52 -07:00
Haobo Xu
2fb361ad98 [RocksDB] Add perf_context.wal_write_time to track time spent on writing the recovery log.
Summary: as title

Test Plan: make check; ./perf_context_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13629
2013-10-23 13:38:39 -07:00
Mayank Agarwal
e56ce03691 Hardcoding temp file name for Identity file to 000000.dbtmp just like it's done for CURRENT file
Summary: as per Dhruba's suggestion

Test Plan: make all check; Seen the Id getting generated properly in db_repl_stress

Reviewers: dhruba, kailiu

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13635
2013-10-23 11:34:22 -07:00
Kai Liu
b37fda842d Improve the comment for the shared library in Make file 2013-10-22 22:40:16 -07:00
Igor Canadi
30f1b97a06 Enable blobs to be fragmented
Summary:
I have implemented a FreeList version that supports fragmented blob chunks. Each block gets allocated and freed in FIFO order. Since the idea for the blocks to be big, we will not take a big hit of non-sequential IO. Free list is also faster, taking only O(k) size in both free and allocate instead of O(N) as before.

See more info on the task: https://our.intern.facebook.com/intern/tasks/?t=2990558

Also, I'm taking Slice instead of const char * and size in Put function.

Test Plan: unittests

Reviewers: haobo, kailiu, dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13569
2013-10-22 17:44:00 -07:00
Kai Liu
70e87f7866 Update the latest rocksdb version 2013-10-22 14:49:44 -07:00