Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60513
Summary: We may wrongly drop delete operation if we pick a file with the entry to be delete, the put entry of the same user key is in the next file in the level, and the next file is not picked. We expand compaction inputs for output level too.
Test Plan: Add unit tests that reproduct the bug of dropping delete entry. Change compaction_picker_test to assert the new behavior.
Reviewers: IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61173
Summary: std::make_unique is not standard and not always available, remove it
Test Plan: Run "make clean jclean rocksdbjava jtest -j8" on my mac
Reviewers: yhchiang, yiwu, sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61143
Summary: old typos with FILTER/INDEX_CACHE
Test Plan: still pass this unit test
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61185
Summary:
- Added a new subcommand, 'ldb restore', that restores from backup
- Made backup_env_uri optional (also for 'ldb backup') because it can use db_env when backup_env isn't provided
Test Plan:
verify backup and restore commands work:
$ ./ldb backup --db=/data/users/andrewkr/rocksdb/db_bench-out/ --num_threads 1 --backup_dir=/data/users/andrewkr/rocksdb/db_bench-out/backup/
$ ./ldb restore --db=/data/users/andrewkr/rocksdb/db_bench-out/restore/ --num_threads 1 --backup_dir=/data/users/andrewkr/rocksdb/db_bench-out/backup/
Reviewers: sdong, wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60849
Summary:
this script creates a summary message based on the test type and
output. Previously it only ran during contbuild. This diff makes it also
run on commit-triggered diffs.
Test Plan: will see if it works on diff's sandcastle tests
Reviewers: kradhakrishnan, lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60879
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)
Test Plan: added new unit test
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61083
Summary:
MergeContext::copied_operands contain strings that MergeContext::operand_list_ Slices point to
It's possible that when MergeContext::copied_operands grow, these strings are moved and there place in memory is changed, this will cause MergeContext::operand_list_ to point to invalid memory.
fix this problem by using unique_ptr<string> instead of string
Test Plan: run tests under mac/clang
Reviewers: sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61023
Summary: The test is flaky on Travis in osx environment. The background flush the test wanting to block can run behind the L2 manual compaction, making the test actually blocking the L2 compaction and won't able to proceed.
Test Plan: Test run on travis
Reviewers: kradhakrishnan, sdong, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61101
Summary: RocksDB lite don't support dynamic options. Disable the two test from lite build, and assert `SetOptions` should return `status::OK`.
Test Plan: Run the db_options test under lite build and normal build.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61119
Summary:
Stale log files can be deleted out of order. This can happen for various reasons. One of the reason is that no data is ever inserted to a column family and we have an optimization to update its log number, but not all the old log files are cleaned up (the case shown in the unit tests added). It can also happen when we simply delete multiple log files out of order.
This causes data corruption because we simply increase seqID after processing the next row and we may end up with writing data with smaller seqID than what is already flushed to memtables.
In DB recovery, for the oldest files we are replaying, if there it contains no data for any column family, we ignore the sequence IDs in the file.
Test Plan: Add two unit tests that fail without the fix.
Reviewers: IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: hermanlee4, yoshinorim, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60891
Summary:
Tsan crash white-box test was disabled because it never ends. Re-enable it with reduced killing odds.
Add a parameter in crash test script to allow we pass it through an environment variable.
Test Plan: Run it manually.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61053
getline on std::cin can be very inefficient when ldb is loading large values, with high CPU usage in libc _IO_(un)getc, this is because of the performance penalty that comes from synchronizing stdio and iostream buffers.
See the reproducers and tests in #1133 .
If an ifstream on /dev/stdin is used (when available) then using ldb to load large values can be much more efficient.
I thought for ldb load, that this approach is preferable to using <cstdio> or std::ios_base::sync_with_stdio(false).
I couldn't think of a use case where ldb load would need to support reading unbuffered input, an alternative approach would be to add support for passing --input_file=/dev/stdin.
I have a CLA in place, thanks.
The CI tests were failing at the time of https://github.com/facebook/rocksdb/pull/1156, so this change and PR will supersede it.
* fixes 1220: rocksjni build fails on Windows due to variable-size array declaration
using new keyword to create variable-size arrays in order to satisfy most of the compilers
* fixes 1220: rocksjni build fails on Windows due to variable-size array declaration
using unique_ptr keyword to create variable-size arrays in order to satisfy most of the compilers
Summary: Multiput atomiciy is broken across multiple column families if we don't sync WAL before flushing one column family. The WAL file may contain a write batch containing writes to a key to the CF to be flushed and a key to other CF. If we don't sync WAL before flushing, if machine crashes after flushing, the write batch will only be partial recovered. Data to other CFs are lost.
Test Plan: Add a new unit test which will fail without the diff.
Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: yiwu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60915
Summary: Comment out assertion of number of table files from lite build.
Test Plan:
OPT=-DROCKSDB_LITE make check
Reviewers: lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60999
Summary: RocksDB behavior is slightly different between data on tmpfs and normal file systems. Add a test case to run RocksDB on normal file system.
Test Plan: See the tests launched by Phabricator
Reviewers: kradhakrishnan, IslamAbdelRahman, gunnarku
Reviewed By: gunnarku
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60963
Summary: Bump master version to the next potential release version of 4.11
Test Plan: Compile
Reviewers: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60993
EnvLibrados is a customized RocksDB Env to use RADOS as the backend file system of RocksDB. It overrides all file system related API of default Env. The easiest way to use it is just like following:
std::string db_name = "test_db";
std::string config_path = "path/to/ceph/config";
DB* db;
Options options;
options.env = EnvLibrados(db_name, config_path);
Status s = DB::Open(options, kDBPath, &db);
Then EnvLibrados will forward all file read/write operation to the RADOS cluster assigned by config_path. Default pool is db_name+"_pool".
There are some options that users could set for EnvLibrados.
- write_buffer_size. This variable is the max buffer size for WritableFile. After reaching the buffer_max_size, EnvLibrados will sync buffer content to RADOS, then clear buffer.
- db_pool. Rather than using default pool, users could set their own db pool name
- wal_dir. The dir for WAL files. Because RocksDB only has 2-level structure (dir_name/file_name), the format of wal_dir is "/dir_name"(CAN'T be "/dir1/dir2"). Default wal_dir is "/wal".
- wal_pool. Corresponding pool name for WAL files. Default value is db_name+"_wal_pool"
The example of setting options looks like following:
db_name = "test_db";
db_pool = db_name+"_pool";
wal_dir = "/wal";
wal_pool = db_name+"_wal_pool";
write_buffer_size = 1 << 20;
env_ = new EnvLibrados(db_name, config, db_pool, wal_dir, wal_pool, write_buffer_size);
DB* db;
Options options;
options.env = env_;
// The last level dir name should match the dir name in prefix_pool_map
options.wal_dir = "/tmp/wal";
// open DB
Status s = DB::Open(options, kDBPath, &db);
Librados is required to compile EnvLibrados. Then use "$make LIBRADOS=1" to compile RocksDB. If you want to only compile EnvLibrados test, just run "$ make env_librados_test LIBRADOS=1". To run env_librados_test, you need to have a running RADOS cluster with the configure file located in "../ceph/src/ceph.conf" related to "rocksdb/".
Summary:
Fix flush not being commit while writing manifest, which is a recent bug introduced by D60075.
The issue:
# Options.max_background_flushes > 1
# Background thread A pick up a flush job, flush, then commit to manifest. (Note that mutex is released before writing manifest.)
# Background thread B pick up another flush job, flush. When it gets to `MemTableList::InstallMemtableFlushResults`, it notices another thread is commiting, so it quit.
# After the first commit, thread A doesn't double check if there are more flush result need to commit, leaving the second flush uncommitted.
Test Plan: run the test. Also verify the new test hit deadlock without the fix.
Reviewers: sdong, igor, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, omegaga, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60969
* Replace %zu format specifier with Windows-compatible macro 'ROCKSDB_PRIszt'
* Added "port/port.h" include to sim_cache.cc for call to snprintf().
* Applied cleaner fix to windows build, reverting part of 7bedd94
The tests run by `make check` require Bash. On Debian you'd need to run
the test as `make SHELL=/bin/bash check`. This commit makes it work on
all POSIX compatible shells (tested on Debian with `dash`).
Summary: I tried on my host and TSAN black box test runs well. I didn't see any problem with white box crash test too but it chance of hitting crash point is too low that it may run almost forever. First re-enable black box crash test to unblock the job.
Test Plan: Run it locally.
Reviewers: kradhakrishnan, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60951
Summary: As Title.
Test Plan: See how the diff works.
Reviewers: kradhakrishnan, andrewkr, gunnarku
Reviewed By: gunnarku
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60933
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).
Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59913
Summary:
(1) Integer size correction (mac build break)
(2) snprint usage in Windows (windows build break)
Test Plan: Build in windows and mac
Reviewers: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60927
Summary: In many use cases there is no deletes. No need to pay the overhead of atomically updating num_deletes.
Test Plan: Run existing test.
Reviewers: ngbronson, yiwu, andrewkr, igor
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60555
Summary:
Summary
There is a possibility that there is no L0 file after writing the data. Generate an L0 file to make it work.
Test Plan: Run the test many times.
Reviewers: andrewkr, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60825
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.
Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.
We have specifically tested and adapted the IO path for NVM and SSD.
Persistent cache tier consists of there parts :
1) File layout
Provides the implementation for handling IO path for reading and writing data
(key/value pair).
2) Meta-data
Provides the implementation for handling the index for persistent read cache.
3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface
This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.
Test Plan: Compile and run check
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57117
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build
Summary: Each SST's file size increases after we add more table properties. Threshold in DBTest.DynamicLevelCompressionPerLevel need to adjust accordingly to avoid occasional failures.
Test Plan: Run the test
Reviewers: andrewkr, yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60819
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.
Test Plan:
make check -j64
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59655
Summary: add backup support for ldb tool, and use it to run load test for backup on two HDFS envs
Test Plan: first generate some db, then compile against load test in fbcode, run load_test --db=<db path> backup --backup_env_uri=<URI of backup env> --backup_dir=<backup directory> --num_threads=<number of thread>
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60633
Summary: If options.write_buffer_size is not set, nor options.write_buffer_manager, no need to update the bytes allocated counter in MemTableAllocator, which is expensive in parallel memtable insert case. Remove it can improve parallel memtable insert throughput by 10% with write batch size 128.
Test Plan:
Run benchmarks
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=32 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield
The throughput grows 10% with the benchmark.
Reviewers: andrewkr, yiwu, IslamAbdelRahman, igor, ngbronson
Reviewed By: ngbronson
Subscribers: ngbronson, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60465
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60765
Summary: In Zlib_Compress and BZip2_Compress the calculation for size was slightly off when using compression_foramt_version 2 (which includes the decompressed size in the output). Also there were unnecessary loops around the deflate/BZ2_bzCompress calls. In Zlib_Compress there was also a possible exit from the function after calling deflateInit2 that didn't call deflateEnd.
Test Plan: Standard tests
Reviewers: sdong, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: sdong, IslamAbdelRahman, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60537