Summary:
Two issues:
* the input keys to the compaction don't include sequence number.
* sequence number is set to max(seq_num), but it should be set to max(seq_num)+1, because the condition here is strictly-larger (i.e. we will only zero-out sequence number if the DB's sequence number is strictly greater than the key's sequence number): https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L830
Test Plan: make compaction_job_test && ./compaction_job_test
Reviewers: sdong, lovro
Reviewed By: lovro
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41247
Summary:
While profiling compaction in our service I noticed a lot of CPU (~15% of compaction) being spent in MergingIterator and key comparison. Looking at the code I found MergingIterator was (understandably) using std::priority_queue for the multiway merge.
Keys in our dataset include sequence numbers that increase with time. Adjacent keys in an L0 file are very likely to be adjacent in the full database. Consequently, compaction will often pick a chunk of rows from the same L0 file before switching to another one. It would be great to avoid the O(log K) operation per row while compacting.
This diff replaces std::priority_queue with a custom binary heap implementation. It has a "replace top" operation that is cheap when the new top is the same as the old one (i.e. the priority of the top entry is decreased but it still stays on top).
Test Plan:
make check
To test the effect on performance, I generated databases with data patterns that mimic what I describe in the summary (rows have a mostly increasing sequence number). I see a 10-15% CPU decrease for compaction (and a matching throughput improvement on tmpfs). The exact improvement depends on the number of L0 files and the amount of locality. Performance on randomly distributed keys seems on par with the old code.
Reviewers: kailiu, sdong, igor
Reviewed By: igor
Subscribers: yoshinorim, dhruba, tnovak
Differential Revision: https://reviews.facebook.net/D29133
Summary:
Introduced a new category in the enum InfoLogLevel in env.h.
Modifed Log() in env.cc to use the Header()
when the InfoLogLevel == HEADER_LEVEL.
Updated tests in auto_roll_logger_test to ensure
the header is handled properly in these cases.
Test Plan: Augment existing tests in auto_roll_logger_test
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41067
Summary:
This fixes the following scenario we've hit:
- we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one,
- before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_,
- hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in,
- a few more hours pass and we run out disk space because of one huge .log file.
Test Plan: `make check`; backported the new test, checked that it fails without this diff
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40893
Summary: see title
Test Plan: run 'make unity'
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D41079
Summary: About to cut release
Test Plan: none
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41061
Summary:
Add a new field: BackupableDBOptions.max_background_copies.
CreateNewBackup() and RestoreDBFromBackup() will use this number of threads to perform copies.
If there is a backup rate limit, then max_background_copies must be 1.
Update backupable_db_test.cc to test multi-threaded backup and restore.
Update backupable_db_test.cc to test backups when the backup environment is not the same as the database environment.
Test Plan:
Run ./backupable_db_test
Run valgrind ./backupable_db_test
Run with TSAN and ASAN
Reviewers: yhchiang, rven, anthony, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, anthony, sdong, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40725
Summary:
The option bottommost_level_compaction was introduced lately.
This option breaks the Java API behavior. To prevent the library
from doing so we set that option to a fixed value in Java.
In future we are going to remove that portion and replace the
hardcoded options using a more flexible way.
Fixed bug introduced by WriteBatchWithIndex Patch
Lately icanadi changed the behavior of WriteBatchWithIndex.
See commit: 821cff114e
This commit solves problems introduced by above mentioned commit.
Test Plan:
make rocksdbjava
make jtest
Reviewers: adamretter, ankgup87, yhchiang
Reviewed By: yhchiang
Subscribers: igor, dhruba
Differential Revision: https://reviews.facebook.net/D40647
Summary:
Rewrite Java tests compactRangeToLevel and compactRangeToLevelColumnFamily
to make them more deterministic and robust.
Test Plan:
make rocksdbjava
make jtest
Reviewers: anthony, fyrz, adamretter, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40941
Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free.
Test Plan: valgrind check
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40929
Summary: Copy change from D37533 to gcc 4.8.1 config
Test Plan: make db_bench, `ldd db_bench`, try running it
Reviewers: MarkCallaghan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40845
Summary: We have a race in the way test works. We avoided the race by adding the
wait to the counter. I thought 1s was eternity, but that is not true in some
scenarios. Increasing the timeout to 10s and adding warnings.
Also, adding nosleep to avoid the case where the wakeup thread is waiting behind
the sleeping thread for scheduling.
Test Plan: Run make check
Reviewers: siying igorcanadi
CC: leveldb@
Task ID: #7312624
Blame Rev:
Summary:
When seek target is a merge key (`kTypeMerge`), `DBIter::FindNextUserEntry()`
advances the underlying iterator _past_ the current key (`saved_key_`); see
`MergeValuesNewToOld()`. However, `FindPrevUserKey()` assumes that `iter_`
points to an entry with the same user key as `saved_key_`. As a result,
`it->Seek(key) && it->Prev()` can cause the iterator to be positioned at the
_next_, instead of the previous, entry (new test, written by @lovro, reproduces
the bug).
This diff changes `FindPrevUserKey()` to also skip keys that are _greater_ than
`saved_key_`.
Test Plan: db_test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba, lovro
Differential Revision: https://reviews.facebook.net/D40791
Summary: Make column_family_test runnable in ROCKSDB_LITE.
Test Plan: column_family_test
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40251
Summary: Based on @anthony's feedback, we want to fail early if our static linking fails.
Test Plan: none
Reviewers: anthony
Reviewed By: anthony
Subscribers: dhruba, anthony, leveldb
Differential Revision: https://reviews.facebook.net/D40839
Summary:
Currently, when we insert something into block cache, we say that the block cache capacity decreased by the size of the block. However, size of the block might be less than the actual memory used by this object. For example, 4.5KB block will actually use 8KB of memory. So even if we configure block cache to 10GB, our actually memory usage of block cache will be 20GB!
This problem showed up a lot in testing and just recently also showed up in MongoRocks production where we were using 30GB more memory than expected.
This diff will fix the problem. Instead of counting the block size, we will count memory used by the block. That way, a block cache configured to be 10GB will actually use only 10GB of memory.
I'm using non-portable function and I couldn't find info on portability on Google. However, it seems to work on Linux, which will cover majority of our use-cases.
Test Plan:
1. fill up mongo instance with 80GB of data
2. restart mongo with block cache size configured to 10GB
3. do a table scan in mongo
4. memory usage before the diff: 12GB. memory usage after the diff: 10.5GB
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40635
Summary: It's not really nice to call user's API with garbage data in new_value. This diff makes sure that new_value is empty before calling the merge operator.
Test Plan: Added assert to Merge operator in merge_test
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40773
Summary: If we create a new temp directory for each build, scons will recompile everything because we have different parameters. Instead, let's set up a constant path to our static lib. That way we won't have to recompile.
Test Plan: Run fb_compile_mongo.sh twice -- second time it didn't recompile everything
Reviewers: MarkCallaghan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40707
Summary:
Fixes task 7156865 where a compaction causes a hang in flush
memtable if CancelAllBackgroundWork was called prior to it.
Stack trace is in : https://phabricator.fb.com/P19848829
We end up waiting for a flush which will never happen because there are no background threads.
Test Plan: PreShutdownFlush
Reviewers: sdong, igor
Reviewed By: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40617
Summary: #7124486: RocksDB's Iterator.SeekToLast should seek to the last key before iterate_upper_bound if presents
Test Plan: ./db_iter_test run successfully with the new testcase
Reviewers: rven, yhchiang, igor, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40425
Summary:
Added a script that will compile MongoRocks with the same flags as RocksDB binary. On FB infra, we can now do:
cd ~/rocksdb; make static_lib
cd ~/mongo; ~/rocksdb/build_tools/fb_compile_mongo.sh
No need to upgrade the g++ on the devbox (like Aaron and I did) or maintain a separate script to compile (like Mark did)
fb_compile_mongo.sh gets the settings from fbcode_config.sh, so it also makes it easier to upgrade the environment one day.
Test Plan: Compiled mongod with new script. Also, ldd output looks good: https://phabricator.fb.com/P19891602
Reviewers: AaronFeldman, MarkCallaghan, anthony
Reviewed By: anthony
Subscribers: anthony, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40659
Summary: Make stringappend_test runnable in ROCKSDB_LITE
Test Plan: stringappend_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40593
Summary:
BYTES_READ only count the number of logical bytes read from
the DB::Get() function. It neither includes all logical bytes read
nor indicates IO read bytes.
This patch improves the comment for BYTES_READ.
Test Plan: Only change comment.
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40599
Summary: Make table_properties_collector_test runnable in ROCKSDB_LITE
Test Plan: table_properties_collector_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40581
Summary:
Remove -Wl,--no-as-needed flag when making shared_lib in OSX and IOS as
those environment doe not have compile option --no-as-needed
ld: unknown option: --no-as-needed
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Test Plan: make shared_lib
Reviewers: meyering, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40353
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.
Test Plan: make check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40527
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.
Supports snapshots and merge operations.
Test Plan: Ran `make valgrind_check commit-prereq`
Reviewers: igor, philipp, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39849
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.
1. RecoverAfterRestart (kTolerateCorruptedTailRecords)
This mocks the current recovery mode.
2. RecoverAfterCleanShutdown (kAbsoluteConsistency)
This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.
3. RecoverPointInTime (kPointInTimeRecovery)
This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.
4. RecoverAfterDisaster (kSkipAnyCorruptRecord)
This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.
Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.
Reviewers: leveldb, sdong, igor
Subscribers: yoshinorim, dhruba
Differential Revision: https://reviews.facebook.net/D38487
Summary: Fixing bad merge
Test Plan: make -j64 check (this is not enough to verify the fix)
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40521
Summary: MyRocks need a mechanism to track read outliers. We need to expose this
stat.
Test Plan: None
Reviewers: sdong
CC: leveldb
Task ID: #7152512
Blame Rev:
Summary: Fix broken gflags link
Test Plan: Follow the link
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40503
Summary: Fixed a valgrind issue in checkpoint_test
Test Plan: valgrind on checkpoint_test
Reviewers: igor, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40455
Summary: Recent checkin added ldb_test.py to the make check target but the test fails. Remove it again for now and make task.
Test Plan: No more ldb_tests.py running
Reviewers: igor, anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40449
Summary: Hack up rocksdb_dump and rocksdb_undump utilities to get this task rolling/promote discussion.
Test Plan: Dump/undump databases recursively to see if nothing is lost.
Reviewers: sdong, yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37269
Summary:
When there are multiple column families, the flush in
GetLiveFiles is not atomic, so that there are entries in the wal files
which are needed to get a consisten RocksDB. We now add the log files to
the checkpoint.
Test Plan:
CheckpointCF - This test forces more data to be written to
the other column families after the flush of the first column family but
before the second.
Reviewers: igor, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40323
Summary: See title
Test Plan: Run valgrind ./cache_test
Reviewers: igor
Reviewed By: igor
Subscribers: anthony, dhruba
Differential Revision: https://reviews.facebook.net/D40419
Summary: CompressLevelCompaction() depends on Zlib. We should skip it when zlib is not present.
Test Plan: `make check` without zlib
Reviewers: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40401
Summary: Make autovector_test runnable in ROCKSDB_LITE
Test Plan: autovector_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40245
Summary:
Block geodb_test in ROCKSDB_LITE as geodb is not supported
in ROCKSDB_LITE
Test Plan: geodb_test
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40335
Summary:
Remove compactor_test, which depends on a directory not exist
in our code base.
make compactor_test
GEN util/build_version.cc
GEN util/build_version.cc
make: *** No rule to make target `utilities/compaction/compactor_test.o', needed by `compactor_test'. Stop.
Test Plan: verify the output message of make compactor_test
Reviewers: rven, anthony, kradhakrishnan, igor, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40341
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.
This patch fails DB::Open() if we don't support the compression that is specified in the options.
Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39687
Summary:
Add the funcion Cache.GetPinnedUsage() to return the memory size of entries
that are in use by the system (that is, all the entries not in the LRU list).
Test Plan:
Run ./cache_test and examine PinnedUsageTest.
Reviewers: tnovak, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40305
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level
Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set
Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way
Test Plan:
make check
adding new tests
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40059
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.
Test Plan: Check out the LOG
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39729