Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.
Test Plan: ./db_test
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14535
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.
There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions
Here is the directory structure I use:
backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files
All the files are ref counted and deleted immediatelly when they get out of scope.
Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.
Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.
Also, `make asan_check`
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D14295
Summary: This would enable rocksdb users to get the db identity without depending on implementation details(storing that in IDENTITY file)
Test Plan: db/db_test (has identity checks)
Reviewers: dhruba, haobo, igor, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14463
Summary:
Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around.
This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release.
I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14397
Summary:
I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins.
The ones that are left are accessed infrequently and I think we're fine with keeping them.
Test Plan: make asan_check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14427
Summary:
The commit at 27bbef1180 had a memory leak
that was detected by valgrind. The memtable that has a refcount decrement
in MemTableList::InstallMemtableFlushResults was not freed.
Test Plan: valgrind ./db_test --leak-check=full
Reviewers: igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14391
Summary:
Large memory allocations and frees are costly and best done outside the
db-mutex. The memtables are already allocated outside the db-mutex but
they were being freed while holding the db-mutex.
This patch frees obsolete memtables outside the db-mutex.
Test Plan:
make check
db_stress
Unit tests pass, I am in the process of running stress tests.
Reviewers: haobo, igor, emayanke
Reviewed By: haobo
CC: reconnect.grayhat, leveldb
Differential Revision: https://reviews.facebook.net/D14319
Summary: We need access to options for BackupableDB
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14331
Summary: This is part of https://reviews.facebook.net/D14295 -- smaller diff that is easier to review
Test Plan: make asan_check
Reviewers: dhruba, haobo, emayanke
Reviewed By: emayanke
CC: leveldb, kailiu, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14301
Summary:
All filesystem Io should be done outside the dbmutex. There was one place
when we have to roll the transaction log that we were creating the new log file
while holding the dbmutex.
I rearranged this code so that the act of creating the new transaction log
file is done without holding the dbmutex. I also allocate the new memtable
outside the dbmutex, this is important because creating the memtable
could be heavyweight.
Test Plan: make check and dbstress
Reviewers: haobo, igor
Reviewed By: haobo
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14283
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.
Test Plan: make check
Reviewers: dhruba, MarkCallaghan, igor
Reviewed By: dhruba
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14289
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.
Test Plan: Enable this profiling in seveal existing tests.
Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14001
Summary:
Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables.
It will make more sense to move it to block based table's own factory class.
Test Plan: make check to pass existing tests
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14211
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.
* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.
Test Plan:
Passed all existing test. and the sample output of table stats:
==================================================================
Basic Properties
------------------------------------------------------------------
# data blocks: 1
# entries: 1
raw key size: 9
raw average key size: 9
raw value size: 9
raw average value size: 0
data block size: 25
index block size: 27
filter block size: 18
(estimated) table size: 70
filter policy: rocksdb.BuiltinBloomFilter
==================================================================
User collected properties: InternalKeyPropertiesCollector
------------------------------------------------------------------
kDeletedKeys: 1
==================================================================
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14187
Summary:
Previously in KeyMayExist(), if DB::Get() returns non-Status::OK(), we assumes key may not exist.
However, as if index block is not in block cache, Status::Incomplete() will return. Worse still, if
options::filter_delete is enabled, we may falsely ignore the "delete" operation:
https://github.com/facebook/rocksdb/blob/master/db/write_batch.cc#L217-L220
This diff fixes this bug and will let crash-test pass.
Test Plan:
Ran:
./db_stress --test_batches_snapshots=1 --ops_per_thread=1000000 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=1 --reopen=0 --readpercent=5 --prefixpercent=45 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/home/kailiu/local/newer --max_key=100000000 --disable_seek_compaction=0 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --disable_wal=0 --disable_data_sync=0 --target_file_size_base=2097152
--target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=1
Previously we'll see crash happens very soon.
Reviewers: igor, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14115
Summary:
Created a unittest that verifies that automatic deletion performed by PurgeObsoleteFiles() works correctly.
Also, few small fixes on the logic part -- call version_set_->GetObsoleteFiles() in FindObsoleteFiles() instead of on some arbitrary positions.
Test Plan: Created a unit test
Reviewers: dhruba, haobo, nkg-
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14079
Summary:
mac and our dev server has totally differnt definition of uint64_t, therefore fixing the warning in mac has actually made code in linux uncompileable.
Test Plan:
make clean && make -j32
Summary: One more fix! In some cases, our filenames start with "/". Apparently, env_ can't handle filenames with double //
Test Plan:
deletefile_test does not include this line in the LOG anymore:
2013/11/12-18:11:43.150149 7fe4a6fff700 RenameFile logfile #3 FAILED -- IO error: /tmp/rocksdbtest-3574/deletefile_test//000003.log: No such file or directory
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14055
Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os..
Test Plan: ran make in mac os
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14049
Summary:
@haobo's suggestions from https://reviews.facebook.net/D13827
Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop.
I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though.
Test Plan: Ran deletefile_test
Reviewers: haobo
Reviewed By: haobo
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D14025
Summary: FindObsoleteFiles() has to be called before PurgeObsoleteFiles() because FindObsoleteFiles() sets manifest_file_number, log_number and prev_log_number to valid values.
Test Plan: deletefile_test now works
Reviewers: dhruba, emayanke, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13995
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.
It might speed things up a bit, but makes code uglier. I don't really like it.
Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.
Let's discuss offline today.
Test Plan: Ran ./db_stress, verified that files are getting deleted
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D13827
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.
Test Plan: make check
Reviewers: dhruba, haobo, sdong, igor
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13875
Summary:
Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held.
We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem.
Test Plan: db_test
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13935
Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too.
Test Plan: test it on Leaf
Reviewers: dhruba, kailiu, sdong, igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13929
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.
Test Plan:
1. Unit tests pass.
2. Benchmark.
Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13869
Summary:
strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files
Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application.
Set NotFoudn::gaps status to give an indication of gaps happeneing.
Polished the inline documentation at various places
Test Plan:
* db_repl_stress test
* db_test relating to transaction log iterator
* fbcode/wormhole/rocksdb/rocks_log_iterator
* sigma production machine sigmafio032.prn1
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13689
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.
It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.
Test Plan: Unit test case attached.
Reviewers: kailiu, sdong, haobo, leveldb
Reviewed By: haobo
CC: xjin, haobo
Differential Revision: https://reviews.facebook.net/D12675
Summary: This is to give application compaction filter a chance to access context information of a specific compaction run. For example, depending on whether a compaction goes through all data files, the application could do things differently.
Test Plan: make check
Reviewers: dhruba, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13683
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test
Test Plan: pass test suites.
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13785
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.
Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.
Reviewers: dhruba, haobo, kailiu, emayanke, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13521
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`
Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13491
Summary:
When a Put fails, it can leave database in a messy state. We don't want to pretend that everything is OK when it may not be. We fail every write following the failed one.
I added checks for corruption to DBImpl::Write(). Is there anywhere else I need to add them?
Test Plan: Corruption unit test.
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13671
Summary:
This is to simplify rocksdb public APIs and improve the code quality.
Created an additional parameter to ParseFileName for log sub type and improved the code for deleting a wal file.
Wrote exhaustive unit-tests in delete_file_test
Unification of other redundant APIs can be taken up in a separate diff
Test Plan: Expanded delete_file test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13647
Summary:
Crash may occur during the flushes of more than two mem tables.
As the info log suggested, even when both were successfully flushed,
the recovery process still pick up one of the memtable's log for recovery.
This diff fix the problem by setting the correct "log number" in MANIFEST.
Test Plan: make test; deployed to leaf4 and make sure it doesn't result in crashes of this type.
Reviewers: haobo, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13659
Summary: as title
Test Plan: make check; ./perf_context_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13629
Summary:
Create a new type of file on startup if it doesn't already exist called DBID.
This will store a unique number generated from boost library's uuid header file.
The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery)
the key point to note is that DBID is not stored in a backup or db snapshot
It's preferable to use Boost for uuid because:
1) A non-standard way of generating uuid is not good
2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean
3) c++ doesn't have any direct way to get a uuid
4) Boost is a very good library that was already having linkage in rocksdb from third-party
Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug.
I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it.
@kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify.
Test Plan:
Expand db_test to test 2 cases
1) Restarting db with Id file present - verify that no change to Id
2)Restarting db with Id file deleted - verify that a different Id is there after reopen
Also run make all check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13587
Summary:
This patch adds a option for universal compaction to allow us to only compress output files if the files compacted previously did not yet reach a specified ratio, to save CPU costs in some cases.
Compression is always skipped for flushing. This is because the size information is not easy to evaluate for flushing case. We can improve it later.
Test Plan:
add test
DBTest.UniversalCompactionCompressRatio1 and DBTest.UniversalCompactionCompressRatio12
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13467
Summary:
Enable background flush thread in this patch and fix unit tests with:
(1) After background flush, schedule a background compaction if condition satisfied;
(2) Fix a bug that if universal compaction is enabled and number of levels are set to be 0, compaction will not be automatically triggered
(3) Fix unit tests to wait for compaction to finish instead of flush, before checking the compaction results.
Test Plan: pass all unit tests
Reviewers: haobo, xjin, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13461
Summary:
* Logstore requests a valid change of reutrning an empty iterator and not an error in case of no log files.
* Changed the code to return the writebatch containing the sequence number requested from GetupdatesSince even if it lies in the middle. Earlier we used to return the next writebatch,. This also allows me oto guarantee that no files played upon by the iterator are redundant. I mean the starting log file has at least a sequence number >= the sequence number requested form GetupdatesSince.
* Cleaned up redundant logic in Iterator::Next and made a new function SeekToStartSequence for greater readability and maintainibilty.
* Modified a test in db_test accordingly
Please check the logic carefully and suggest improvements. I have a separate patch out for more improvements like restricting reader to read till written sequences.
Test Plan:
* transaction log iterator tests in db_test,
* db_repl_stress.
* rocks_log_iterator_test in fbcode/wormhole/rocksdb/test - 2 tests thriving on hacks till now can get simplified
* testing on the shadow setup for sigma with replication
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13437
Summary:
So far we only have key/value pairs as well as bloom filter stored in the
sst file. It will be great if we are able to store more metadata about
this table itself, for example, the entry size, bloom filter name, etc.
This diff is the first step of this effort. It allows table to keep the
basic statistics mentioned in http://fburl.com/14995441, as well as
allowing writing user-collected stats to stats block.
After this diff, we will figure out the interface of how to allow user to collect their interested statistics.
Test Plan:
1. Added several unit tests.
2. Ran `make check` to ensure it doesn't break other tests.
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13419
Summary: When I debug the unit test failures when enabling background flush thread, I feel the function names can be made clearer for people to understand. Also, if the names are fixed, in many places, some tests' bugs are obvious (and some of those tests are failing). This patch is to clean it up for future maintenance.
Test Plan: Run test suites.
Reviewers: haobo, dhruba, xjin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13431
Summary:
This careless error was causing ASSERT_OK(DestroyDB) to fail in db_test.
Basically .. was being returned as a child of db/archive and ParseFileName returned false on that,
but 'type' was set to LogFile from earlier and not reset. The return of ParseFileName was not being checked to delete the log file or not.
Test Plan: make all check
Reviewers: dhruba, haobo, xjin, kailiu, nkg-
Reviewed By: nkg-
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13413
Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory.
Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases.
Reviewers: dhruba, haobo, leveldb
CC: xinyaohu, sumeet
Differential Revision: https://reviews.facebook.net/D13239
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.
Test Plan: compile rocksdb
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13287
Summary: This is useful to keep track of refreshes in transaction log iterator
Test Plan: make; db_stress --statistics=1 shows it
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13281
Summary: as title
Test Plan: make check
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13263
Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow .
Test Plan: Try it in shadow test.
Reviewers: dhruba, xjin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12933
Summary:
We saw SIGSEGV when set options.num_levels=1 in universal compaction
style. Dug into this issue for a while, and finally found the root cause (thank Haobo for discussion).
Test Plan: Add new unit test. It throws SIGSEGV without this change. Also run "make all check".
Reviewers: haobo, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13251
Summary: A previous diff moved these outside of lock protected area. Moved back in now. Also moved tmp_batch_ update outside of lock protected area, as only the single write thread can access it.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13137
Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later.
Test Plan: make check; db_bench
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11709
Summary:
There is an config option called Options.min_write_buffer_number_to_merge
that specifies the minimum number of write buffers to merge in memory
before flushing to a file in L0. But in the the case when the db is
being closed, we should not be using this config, instead we should
flush whatever write buffers were available at that time.
Test Plan: Unit test attached.
Reviewers: haobo, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12717
Summary: So that replication can just download from wherever LogFile.Pathname is pointing them.
Test Plan: make all check;./db_repl_stress
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12609
Summary:
Add new command "change_compaction_style" to ldb tool. For
universal->level, it shows "nothing to do". For level->universal, it
compacts all files into a single one and moves the file to level 0.
Also add check for number of files at level 1+ when opening db with
universal compaction style.
Test Plan:
'make all check'. New unit test for internal convertion function. Also manully test various
cmd like:
./ldb change_compaction_style --old_compaction_style=0
--new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: vamsi, emayanke
Differential Revision: https://reviews.facebook.net/D12603
Summary:
The way counters/statistics are implemented in rocksdb demands that enum Tickers and TickerNameMap follow the same order, otherwise statistics exposed from fbcode/rocks get out-of-sync. 2 counters for prefix had violated this order and when I built counters for fbcode/mcrocksdb, statistics for sequence number were appearing out-of-sync.
The other change is to record sequence-number using setTickerCount only and not recordTick. This is because of difference in statistics as understood by rocks/utils which uses ServiceData::statistics function and rocksdb statistics. In rocksdb there is just 1 counter for a countername. But in ServiceData there are 4 independent buckets for every countername-Count, Sum, Average and Rate. SetTickerCount and RecordTick update the same variable in rocksdb but different buckets in ServiceData. Therefore, I had to choose one consistent function from RecordTick or SetTickerCount for sequence number in rocksdb. I chose SetTickerCount because the statistics object in options passed during rocksdb-open is user-dependent and SetTickerCount makes sense there.
There will be a corresponding diff to mcorcksdb in fbcode shortly.
Test Plan: make all check; check ticker value using fprintfs
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12669
Summary: db->DeleteFile calls ParseFileName to check name that was returned for sst file. Now, sst filename is returned using TableFileName which uses MakeFileName. This puts a / at the front of the name and ParseFileName doesn't like that. Changed ParseFileName to tolerate /s at the beginning. The test delet_file_test used to pass earlier because this behaviour of MakeFileName had been changed a while back to not return a / during which delete_file_test was checked in. But MakeFileName had to be reverted to add / at the front because GetLiveFiles used at many places outside rocksdb used the previous behaviour of MakeFileName.
Test Plan: make;./delete_filetest;make all check
Reviewers: dhruba, haobo, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12663
Summary:
The DeleteFile API was removing files inside the db-lock. This
is now changed to remove files outside the db-lock.
The GetLiveFilesMetadata() returns the smallest and largest
seqnuence number of each file as well.
Test Plan: deletefile_test
Reviewers: emayanke, haobo
Reviewed By: haobo
CC: leveldb
Maniphest Tasks: T63
Differential Revision: https://reviews.facebook.net/D12567
Summary: Let TransformRepFactory own the passed in transform. Also make it better encapsulated.
Test Plan: make valgrind_check;
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12591
Summary:
If ReadOptions.non_blocking_io is set to true, then KeyMayExists
and Iterators will return data that is cached in RAM.
If the Iterator needs to do IO from storage to serve the data,
then the Iterator.status() will return Status::IsRetry().
Test Plan:
Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs
issues from storage. Added DBTest.NonBlockingIteration to verify
nonblocking Iterations.
Reviewers: emayanke, haobo
Reviewed By: haobo
CC: leveldb
Maniphest Tasks: T63
Differential Revision: https://reviews.facebook.net/D12531
Summary:
As title. This is possible as tickers are atomic now.
db_bench on high qps in-memory muti-thread random get workload, showed ~5% throughput improvement.
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12555
Summary: Replace include/leveldb with include/rocksdb.
Test Plan:
make clean; make check
make clean; make release
Differential Revision: https://reviews.facebook.net/D12489
Summary:
This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.
UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.
VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.
PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.
I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).
Test Plan:
make -j32 check
./db_stress --memtablerep=vector
./db_stress --memtablerep=unsorted
./db_stress --memtablerep=prefixhash --prefix_size=10
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12117
Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures.
Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api.
I have used std::unique_ptr - will switch to boost:: since this is external. thoughts?
Unit test is fragile right now as it expects the compaction at certain levels.
Test Plan: unittest
Reviewers: dhruba, vamsi, emayanke
CC: zshao, leveldb, haobo
Task ID: #
Blame Rev:
Test Plan:
- make all check;
- make release;
- make stringappend_test; ./stringappend_test
Reviewers: haobo, emayanke
Reviewed By: haobo
CC: leveldb, kailiu
Differential Revision: https://reviews.facebook.net/D12381
Summary: Also expanded class LogFile to have startSequene and FileSize and exposed it publicly
Test Plan: make all check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12087
Summary: statistic for sequence number is needed by wormhole. setTickerCount is demanded for this statistic. I can't simply recordTick(max_sequence) when db recovers because the statistic iobject is owned by client and may/may not be reset during reopen. Eg. statistic is reset in mcrocksdb whereas it is not in db_stress. Therefore it is best to go with setTickerCount
Test Plan: ./db_stress ... --statistics=1 and observed expected sequence number
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12327
Summary:
Minor fix to current codes, including: coding style, output format,
comments. No major logic change. There are only 2 real changes, please see my inline comments.
Test Plan: make all check
Reviewers: haobo, dhruba, emayanke
Differential Revision: https://reviews.facebook.net/D12297
Summary:
With Merge returning bool, it can keep failing silently(eg. While faling to fetch timestamp in TTL). We need to detect this through a rocksdb counter which can get bumped whenever Merge returns false. This will also be super-useful for the mcrocksdb-counter service where Merge may fail.
Added a counter NUMBER_MERGE_FAILURES and appropriately updated db/merge_helper.cc
I felt that it would be better to directly add counter-bumping in Merge as a default function of MergeOperator class but user should not be aware of this, so this approach seems better to me.
Test Plan: make all check
Reviewers: dnicholas, haobo, dhruba, vamsi
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12129
Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3. Also, make the CreateFilter code faster and cleaner.
Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test
Reviewers: dhruba
Reviewed By: dhruba
CC: haobo, emayanke
Differential Revision: https://reviews.facebook.net/D12027
Summary:
If we have same compaction filter for each compaction,
application cannot know about the different compaction processes.
Later on, we can put in more details in compaction filter for the
application to consume and use it according to its needs. For e.g. In
the universal compaction, we have a compaction process involving all the
files while others don't involve all the files. Applications may want to
collect some stats only when during full compaction.
Test Plan: run existing unit tests
Reviewers: haobo, dhruba
Reviewed By: dhruba
CC: xinyaohu, leveldb
Differential Revision: https://reviews.facebook.net/D12057
Summary:
The pre-existing code was purging a DeleteMarker if thay key did not
exist in deeper levels. But in the Universal Compaction Style, all
files are in Level0. For compaction runs that did not include the
earliest file, we were erroneously purging the DeleteMarkers.
The fix is to purge DeleteMarkers only if the compaction includes
the earlist file.
Test Plan: DBTest.Randomized triggers this code path.
Differential Revision: https://reviews.facebook.net/D12081
Summary:
Continue fixing existing unit tests for universal compaction. I have
tried to apply universal compaction to all unit tests those haven't
called ChangeOptions(). I left a few which are either apparently not
applicable to universal compaction (because they check files/keys/values
at level 1 or above levels), or apparently not related to compaction
(e.g., open a file, open a db).
I also add a new unit test for universal compaction.
Good news is I didn't see any bugs during this round.
Test Plan: Ran "make all check" yesterday. Has rebased and is rerunning
Reviewers: haobo, dhruba
Differential Revision: https://reviews.facebook.net/D12135
Summary:
This is the first step to fix unit tests and bugs for universal
compactiion. I added universal compaction option to ChangeOptions(), and
fixed all unit tests calling ChangeOptions(). Some of these tests
obviously assume more than 1 level and check file number/values in level
1 or above levels. I set kSkipUniversalCompaction for these tests.
The major bug I found is manual compaction with universal compaction never stops. I have put a fix for
it.
I have also set universal compaction as the default compaction and found
at least 20+ unit tests failing. I haven't looked into the details. The
next step is to check all unit tests without calling ChangeOptions().
Test Plan: make all check
Reviewers: dhruba, haobo
Differential Revision: https://reviews.facebook.net/D12051
Summary:
Here are the major changes to the Merge Interface. It has been expanded
to handle cases where the MergeOperator is not associative. It does so by stacking
up merge operations while scanning through the key history (i.e.: during Get() or
Compaction), until a valid Put/Delete/end-of-history is encountered; it then
applies all of the merge operations in the correct sequence starting with the
base/sentinel value.
I have also introduced an "AssociativeMerge" function which allows the user to
take advantage of associative merge operations (such as in the case of counters).
The implementation will always attempt to merge the operations/operands themselves
together when they are encountered, and will resort to the "stacking" method if
and only if the "associative-merge" fails.
This implementation is conjectured to allow MergeOperator to handle the general
case, while still providing the user with the ability to take advantage of certain
efficiencies in their own merge-operator / data-structure.
NOTE: This is a preliminary diff. This must still go through a lot of review,
revision, and testing. Feedback welcome!
Test Plan:
-This is a preliminary diff. I have only just begun testing/debugging it.
-I will be testing this with the existing MergeOperator use-cases and unit-tests
(counters, string-append, and redis-lists)
-I will be "desk-checking" and walking through the code with the help gdb.
-I will find a way of stress-testing the new interface / implementation using
db_bench, db_test, merge_test, and/or db_stress.
-I will ensure that my tests cover all cases: Get-Memtable,
Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0,
Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found,
end-of-history, end-of-file, etc.
-A lot of feedback from the reviewers.
Reviewers: haobo, dhruba, zshao, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11499
Summary: This diff adds histogram stats for soft_rate_limit stalls. It also renames the old rate_limit stats to hard_rate_limit.
Test Plan: make -j32 check
Reviewers: dhruba, haobo, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12021
Summary:
This diff adds support for both soft and hard rate limiting. The following changes are included:
1) Options.rate_limit is renamed to Options.hard_rate_limit.
2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
3) Options.soft_rate_limit is added.
4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.
Test Plan:
make -j32 check
./db_stress
Reviewers: dhruba, haobo, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12003
Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test
Test Plan: make all check;db_stress for 1 hour
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11853
Summary:
Currently, when a certain number of level0 files (level0_slowdown_writes_trigger) are present, RocksDB will slow down each write by 1ms. There is a second limit of level0 files at which RocksDB will stop writes altogether (level0_stop_writes_trigger).
This patch enables the user to supply a third parameter specifying the number of files at which Rocks will start slowing down writes (level0_start_slowdown_writes). When this number is reached, Rocks will slow down writes as a quadratic function of level0_slowdown_writes_trigger - num_level0_files.
For some workloads, this improves latency and throughput. I will post some stats momentarily in https://our.intern.facebook.com/intern/tasks/?t=2613384.
Test Plan:
make -j32 check
./db_stress
./db_bench
Reviewers: dhruba, haobo, MarkCallaghan, xjin
Reviewed By: xjin
CC: leveldb, xjin, zshao
Differential Revision: https://reviews.facebook.net/D11859
Summary:
Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size.
I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer.
Test Plan:
new unit test, I am running db_test.
For passing paramter from configured option to Arena, I tried tests like:
TEST(DBTest, Arena_Option) {
std::string dbname = test::TmpDir() + "/db_arena_option_test";
DestroyDB(dbname, Options());
DB* db = nullptr;
Options opts;
opts.create_if_missing = true;
opts.arena_block_size = 1000000; // tested 99, 999999
Status s = DB::Open(opts, dbname, &db);
db->Put(WriteOptions(), "a", "123");
}
and printed some debug info. The results look good. Any suggestion for such a unit-test?
Reviewers: haobo, dhruba, emayanke, jpaton
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D11799
Summary: After my patch for stall histograms, there are redundant calls to NowMicros() by both the stop watches and DBImpl::MakeRoomForWrites. So I removed the redundant calls such that the information is gotten from the stopwatch.
Test Plan:
make clean
make -j32 check
Reviewers: dhruba, haobo, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11883
Summary: Previously, statistics are kept on how much time is spent on stalls of different types. This patch adds support for keeping number of stalls of each type. For example, instead of just reporting how many microseconds are spent waiting for memtables to be compacted, it will also report how many times a write stalled for that to occur.
Test Plan:
make -j32 check
./db_stress
# Not really sure what else should be done...
Reviewers: dhruba, MarkCallaghan, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11841
Summary:
Revert "If disable wal is set, then batch commits are avoided" because
keeping the mutex while inserting into the skiplist means that readers
and writes are all serialized on the mutex.
Test Plan:
Reviewers:
CC:
Task ID: #
Blame Rev:
Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations.
Test Plan:
make clean
make -j32 check
./db_stress
Reviewers: dhruba, emayanke, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11739
Summary:
rocksdb uses batch commit to write to transaction log. But if
disable wal is set, then writes to transaction log are anyways
avoided. In this case, there is not much value-add to batch things,
batching can cause unnecessary delays to Puts().
This patch avoids batching when disableWal is set.
Test Plan:
make check.
I am running db_stress now.
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11763