Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.
Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
Differential Revision: D19171461
Pulled By: zhichao-cao
fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
Summary:
Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372
Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace.
Differential Revision: D19727739
fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583
Summary:
We see some odd errors complaining math. However, it doesn't seem that it is needed to be included. Remove the include of math.h. Just removing it from db_bench doesn't seem to break anything. Replacing sqrt from std::sqrt seems to work for histogram.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6373
Test Plan: Watch Travis and appveyor to run.
Differential Revision: D19730068
fbshipit-source-id: d3ad41defcdd9f51c2da1a3673fb258f5dfacf47
Summary:
This reverts commit 8e309b35bb.
The stress tests are failing . Revert it until we figure the root cause.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6327
Differential Revision: D19537657
Pulled By: maysamyabandeh
fbshipit-source-id: bf34a5dd720825957729e136e9a5a729a240e61a
Summary:
kHashSearch is incompatible with larger than 1 values for index_block_restart_interval. Setting it to 1 in stress tests would avoid confusion about the test parameters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6324
Differential Revision: D19525669
Pulled By: maysamyabandeh
fbshipit-source-id: fbf3a797e0ebcebb4d32eba3728cf3583906fc8a
Summary:
Block-based table has index has been disabled in crash test due to bugs. We fixed a bug and re-enable it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6310
Test Plan: Finish one round of "crash_test_with_atomic_flush" test successfully while exclusively running has index. Another run also ran for several hours without failure.
Differential Revision: D19455856
fbshipit-source-id: 1192752d2c1e81ed7e5c5c7a9481c841582d5274
Summary:
A previous change meant to make db_stress to run on sync=1 mode for 1/20 of the time in crash_test, but a bug caused to to always run on sync=1 mode. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6304
Test Plan: Start and kill "python -u tools/db_crashtest.py --simple whitebox" multiple times and observe that most times sync=0 is used while some times sync=1 is used.
Differential Revision: D19433000
fbshipit-source-id: 7a0adba39b17a1b3acbbd791bb0cdb743b91fa95
Summary:
This commit is suspected in some crash test failures such as
Verification failed for column family 0 key 78438077: Value not found: NotFound:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6243
Test Plan: 'make check' and start 'make crash_test'
Differential Revision: D19220495
Pulled By: pdillinger
fbshipit-source-id: 6c4709cee80ab4344e06ce360f51e947d79fb3fa
Summary:
Currently, db_stress generates fixed length keys of 8 bytes. This patch adds the ability to generate variable length keys. Most of the db_stress code continues to work with a numeric key randomly generated, and the numeric key also acts as an index into the values_ array. The numeric key is mapped to a variable length string key in a deterministic way. Furthermore, the ordering is preserved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6165
Test Plan: run make crash_test
Differential Revision: D19204646
Pulled By: anand1976
fbshipit-source-id: d2d46a96615b4832a8be2a981f5913905f0e1ca7
Summary:
Several improvements to crash_test/stress_test:
(1) Stress_test to support an parameter of bottommost compression
(2) Rename those FLAGS_* variables that are not gflags to avoid confusion
(3) Crash_test to randomly generate compression type for bottommost compression with half the chance.
(4) Stress_test to sanitize unsupported compression type to snappy, so that crash_test to cover all possible compression types and people don't need to worry about they don't support all comrpession types in their environment.
(5) In crash_test, when generating db_stress command, sort arguments in alphabeta order, so that it is easier to find value for a specific argument.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6215
Test Plan: Run "make crash_test" for a while and see the botommost option shown in LOG files.
Differential Revision: D19171255
fbshipit-source-id: d7001e246c4ff9ee5760776eea0be97738650735
Summary:
Add the verification in operateDB to verify GetLiveFiles, GetSortedWalFiles and GetCurrentWalFile. The test will be called every 1 out of N, N is decided by get_live_files_and_wal_files_one_i, whose default is 1000000.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6224
Test Plan: pass db_stress default run.
Differential Revision: D19183358
Pulled By: zhichao-cao
fbshipit-source-id: 20073cf72ede77a3e0d3cf5f28304f1f605d2b1a
Summary:
Currently, db_stress performs verification by calling `VerifyDb()` at the end of test and optionally before tests start. In case of corruption or incorrect result, it will be too late. This PR adds more verification in two ways.
1. For cf consistency test, each test thread takes a snapshot and verifies every N ops. N is configurable via `-verify_db_one_in`. This option is not supported in other stress tests.
2. For cf consistency test, we use another background thread in which a secondary instance periodically tails the primary (interval is configurable). We verify the secondary. Once an error is detected, we terminate the test and report. This does not affect other stress tests.
Test plan (devserver)
```
$./db_stress -test_cf_consistency -verify_db_one_in=0 -ops_per_thread=100000 -continuous_verification_interval=100
$./db_stress -test_cf_consistency -verify_db_one_in=1000 -ops_per_thread=10000 -continuous_verification_interval=0
$make crash_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6173
Differential Revision: D19047367
Pulled By: riversand963
fbshipit-source-id: aeed584ad71f9310c111445f34975e5ab47a0615
Summary:
The new Python syntax check could fail if external entities
were cloned or symlinked to a subdir in a rocksdb git clone. (E.g.
Facebook internal LITE build.) Only look for Python files in specific
subdirs
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6225
Test Plan: python tools/check_all_python.py (still 34 files checked)
Reviewed By: gfosco
Differential Revision: D19186110
Pulled By: pdillinger
fbshipit-source-id: 1fefa54e36b32cd5d96d3d1a43e8a2a694c22ea5
Summary:
This reverts commit 54f9092b0c.
It making our daily stress tests fail. Revert it until the issues are fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6220
Differential Revision: D19179881
Pulled By: maysamyabandeh
fbshipit-source-id: 99de0eaf776567fa81110b9ad2608234a16083ce
Summary:
We're seeing assertion violations like this in crash test:
db_stress: table/block_based/block_based_table_reader.cc:4129: virtual uint64_t rocksdb::BlockBasedTable::ApproximateSize(const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::TableReaderCaller): Assertion `end_offset >= start_offset' failed.***
And ApproximateSize appears only to be called with the level_compaction_dynamic_level_bytes option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6217
Test Plan:
temporarily put an assert(false) in ApproximateSize and
briefly run 'make crash_test'
Differential Revision: D19179174
Pulled By: pdillinger
fbshipit-source-id: 506e6549aea0da19b363a1a6da04373c364d92e4
Summary:
Adds a python script to syntax check all python files in the
repository and report any errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6209
Test Plan:
'make check' with and without seeded syntax errors. Also look
for "No syntax errors in 34 .py files" on success, and in java_test CI output
Differential Revision: D19166756
Pulled By: pdillinger
fbshipit-source-id: 537df464b767260d66810b4cf4c9808a026c58a4
Summary:
Beside extending index_type to kHashSearch, it clarifies in the code base that this feature is incompatible with index_block_restart_interval > 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6210
Test Plan:
```
make -j32 crash_test
Differential Revision: D19166567
Pulled By: maysamyabandeh
fbshipit-source-id: 3aaf75a70a8b462d372d43aac69dbd10df303ec7
Summary:
The patch makes it possible to set the BlobDB configuration option
`garbage_collection_cutoff` on the command line. In addition, it changes
the `db_bench` code so that the default values of BlobDB related
parameters are taken from the defaults of the actual BlobDB
configuration options (note: this changes the the default of
`blob_db_bytes_per_sync`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6211
Test Plan: Ran `db_bench` with various values of the new parameter.
Differential Revision: D19166895
Pulled By: ltamasi
fbshipit-source-id: 305ccdf0123b9db032b744715810babdc3e3b7d5
Summary:
Add an option to db_stress, verify_checksum_one_in, to call DB::VerifyChecksum() once every N ops.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6203
Differential Revision: D19145753
Pulled By: anand1976
fbshipit-source-id: d09edf21f309ad53aa40dd25b7a563d50665fd8b
Summary:
Fix two crash test issues:
1. sync mode should not run with disable_wal=true
2. disable "compaction_readahead_size" for now. With it on, some block checksum verification failure will happen in compaction paths. Not sure why, but disable it for now to keep the test clean.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6200
Test Plan: Run "make crash_test" and "make crash_test_with_atomic_flush" and see it runs way longer than before the fix without failing.
Differential Revision: D19143493
fbshipit-source-id: 438fad52fbda60aafd142e1b65578addbe7d72b1
Summary:
Several options are trivially added to crash test and random values are picked.
Made simple test run non-dynamic level and normal test run dynamic level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6176
Test Plan: Run crash_test and watch the printing
Differential Revision: D19053955
fbshipit-source-id: 958cb43c968541ebd87ed4d91e778bd1d40e7502
Summary:
Current implementation holds on to 10% of snapshots for 10x longer, and 1% of snapshots 100x longer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6171
Test Plan:
```
make -j32 crash_test
Differential Revision: D19038399
Pulled By: maysamyabandeh
fbshipit-source-id: 75da2dbb5c47a0b3f37d299b8719e392b73b42c0
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
With WritePrepared transactions configured with two_write_queues, unordered_write will offer the same guarantees as vanilla rocksdb and thus can be enabled in stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6164
Test Plan:
```
make -j32 crash_test_with_txn
Differential Revision: D18991899
Pulled By: maysamyabandeh
fbshipit-source-id: eece5e96b4169b67d7931e5c0afca88540a113e1
Summary:
Currently the default txn write policy in crash tests is WRITE_PREPARED. The patch randomly picks the write policy at the start of the crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6158
Test Plan:
```
make -j32 crash_test_with_txn
```
Differential Revision: D18946307
Pulled By: maysamyabandeh
fbshipit-source-id: f77d7a94f99a08791ef9626da153d284bf521950
Summary:
Add an option to explicitly disable building shared versions of the
RocksDB libraries. The shared libraries cannot be built in cases where
some dependencies are only available as static libraries. This allows
still building RocksDB in these situations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6122
Differential Revision: D18920740
fbshipit-source-id: d24f66d93c68a1e65635e6e0b663bae62c903bca
Summary:
Especially with non-integral bits/key now supported,
db_crashtest should vary the bloom_bits configuration. The probabilities
look like this:
1/2 chance of a uniform int from 0 to 19. This includes overall 1/40
chance of 0 which disables the bloom filter.
1/2 chance of a float from a lognormal distribution with a median of 10.
This always produces positive values but with a decent chance of < 1
(overall ~1/40) or > 100 (overall ~1/40), the enforced/coerced
implementation limits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6103
Test Plan:
start 'make blackbox_crash_test' several times and look at
configuration output
Differential Revision: D18734877
Pulled By: pdillinger
fbshipit-source-id: 4a38cb057d3b3fc1327f93199f65b9a9ffbd7316
Summary:
db_stress_tool.cc now is a giant file. In order to main it easier to improve and maintain, break it down to multiple source files.
Most classes are turned into their own files. Separate .h and .cc files are created for gflag definiations. Another .h and .cc files are created for some common functions. Some test execution logic that is only loosely related to class StressTest is moved to db_stress_driver.h and db_stress_driver.cc. All the files are located under db_stress_tool/. The directory name is created as such because if we end it with either stress or test, .gitignore will ignore any file under it and makes it prone to issues in developements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6134
Test Plan: Build under GCC7 with and without LITE on using GNU Make. Build with GCC 4.8. Build with cmake with -DWITH_TOOL=1
Differential Revision: D18876064
fbshipit-source-id: b25d0a7451840f31ac0f5ebb0068785f783fdf7d
Summary:
PR https://github.com/facebook/rocksdb/issues/5937 changed the db_stress tool to also require db_stress_tool.cc,
and updated the Makefile but not the CMakeLists.txt file. This updates
the CMakeLists.txt file so that the CMake build succeeds again.
PR https://github.com/facebook/rocksdb/issues/5950 updated the Makefile build to package db_stress_tool.cc into
its own librocksdb_stress.a library. I haven't done that here since
there didn't really seem to be much benefit: the Makefile-based build
does not install this library.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6117
Test Plan: Confirmed the CMake build succeeds on an Ubuntu 18.04 system.
Differential Revision: D18835053
Pulled By: riversand963
fbshipit-source-id: 6e2a66834716e73b1eb736d9b7159870defffec5
Summary:
```
In file included from /usr/include/c++/4.8.2/algorithm:62:0,
from ./db/merge_context.h:7,
from ./db/dbformat.h:16,
from ./tools/block_cache_analyzer/block_cache_trace_analyzer.h:12,
from tools/block_cache_analyzer/block_cache_trace_analyzer.cc:8:
/usr/include/c++/4.8.2/bits/stl_algo.h: In instantiation of ‘_RandomAccessIterator std::__unguarded_partition(_RandomAccessIterator, _RandomAccessIterator, const _Tp&, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator<std::pair<std::basic_string<char>, long unsigned int>*, std::vector<std::pair<std::basic_string<char>, long unsigned int> > >; _Tp = std::pair<std::basic_string<char>, long unsigned int>; _Compare = rocksdb::BlockCacheTraceAnalyzer::WriteSkewness(const string&, const std::vector<long unsigned int>&, rocksdb::TraceType) const::__lambda1]’:
/usr/include/c++/4.8.2/bits/stl_algo.h:2296:78: required from ‘_RandomAccessIterator std::__unguarded_partition_pivot(_RandomAccessIterator, _RandomAccessIterator, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator<std::pair<std::basic_string<char>, long unsigned int>*, std::vector<std::pair<std::basic_string<char>, long unsigned int> > >; _Compare = rocksdb::BlockCacheTraceAnalyzer::WriteSkewness(const string&, const std::vector<long unsigned int>&, rocksdb::TraceType) const::__lambda1]’
/usr/include/c++/4.8.2/bits/stl_algo.h:2337:62: required from ‘void std::__introsort_loop(_RandomAccessIterator, _RandomAccessIterator, _Size, _Compare) [with _RandomAccessIterator = __gnu_cxx::__normal_iterator<std::pair<std::basic_string<char>, long unsigned int>*, std::vector<std::pair<std::basic_string<char>, long unsigned int> > >; _Size = long int; _Compare = rocksdb::BlockCacheTraceAnalyzer::WriteSkewness(const string&, const std::vector<long unsigned int>&, rocksdb::TraceType) const::__lambda1]’
/usr/include/c++/4.8.2/bits/stl_algo.h:5499:44: required from ‘void std::sort(_RAIter, _RAIter, _Compare) [with _RAIter = __gnu_cxx::__normal_iterator<std::pair<std::basic_string<char>, long unsigned int>*, std::vector<std::pair<std::basic_string<char>, long unsigned int> > >; _Compare = rocksdb::BlockCacheTraceAnalyzer::WriteSkewness(const string&, const std::vector<long unsigned int>&, rocksdb::TraceType) const::__lambda1’
tools/block_cache_analyzer/block_cache_trace_analyzer.cc:583:79: required from here
/usr/include/c++/4.8.2/bits/stl_algo.h:2263:35: error: no match for call to ‘(rocksdb::BlockCacheTraceAnalyzer::WriteSkewness(const string&, const std::vector<long unsigned int>&, rocksdb::TraceType) const::__lambda1) (std::pair<std::basic_string<char>, long unsigned int>&, const std::pair<std::basic_string<char>, long unsigned int>&)’
while (__comp(*__first, __pivot))
^
tools/block_cache_analyzer/block_cache_trace_analyzer.cc:582:9: note: candidates are:
[=](std::pair<std::string, uint64_t>& a,
^
In file included from /usr/include/c++/4.8.2/algorithm:62:0,
from ./db/merge_context.h:7,
from ./db/dbformat.h:16,
from ./tools/block_cache_analyzer/block_cache_trace_analyzer.h:12,
from tools/block_cache_analyzer/block_cache_trace_analyzer.cc:8:
/usr/include/c++/4.8.2/bits/stl_algo.h:2263:35: note: bool (*)(std::pair<std::basic_string<char>, long unsigned int>&, std::pair<std::basic_string<char>, long unsigned int>&) <conversion>
while (__comp(*__first, __pivot))
^
/usr/include/c++/4.8.2/bits/stl_algo.h:2263:35: note: candidate expects 3 arguments, 3 provided
tools/block_cache_analyzer/block_cache_trace_analyzer.cc:583:46: note: rocksdb::BlockCacheTraceAnalyzer::WriteSkewness(const string&, const std::vector<long unsigned int>&, rocksdb::TraceType) const::__lambda1
std::pair<std::string, uint64_t>& b) { return b.second < a.second; });
^
tools/block_cache_analyzer/block_cache_trace_analyzer.cc:583:46: note: no known conversion for argument 2 from ‘const std::pair<std::basic_string<char>, long unsigned int>’ to ‘std::pair<std::basic_string<char>, long unsigned int>&’
In file included from /usr/include/c++/4.8.2/algorithm:62:0,
from ./db/merge_context.h:7,
from ./db/dbformat.h:16,
from ./tools/block_cache_analyzer/block_cache_trace_analyzer.h:12,
from tools/block_cache_analyzer/block_cache_trace_analyzer.cc:8:
/usr/include/c++/4.8.2/bits/stl_algo.h:2266:34: error: no match for call to ‘(rocksdb::BlockCacheTraceAnalyzer::WriteSkewness(const string&, const std::vector<long unsigned int>&, rocksdb::TraceType) const::__lambda1) (const std::pair<std::basic_string<char>, long unsigned int>&, std::pair<std::basic_string<char>, long unsigned int>&)’
while (__comp(__pivot, *__last))
^
tools/block_cache_analyzer/block_cache_trace_analyzer.cc:582:9: note: candidates are:
[=](std::pair<std::string, uint64_t>& a,
^
In file included from /usr/include/c++/4.8.2/algorithm:62:0,
from ./db/merge_context.h:7,
from ./db/dbformat.h:16,
from ./tools/block_cache_analyzer/block_cache_trace_analyzer.h:12,
from tools/block_cache_analyzer/block_cache_trace_analyzer.cc:8:
/usr/include/c++/4.8.2/bits/stl_algo.h:2266:34: note: bool (*)(std::pair<std::basic_string<char>, long unsigned int>&, std::pair<std::basic_string<char>, long unsigned int>&) <conversion>
while (__comp(__pivot, *__last))
^
/usr/include/c++/4.8.2/bits/stl_algo.h:2266:34: note: candidate expects 3 arguments, 3 provided
tools/block_cache_analyzer/block_cache_trace_analyzer.cc:583:46: note: rocksdb::BlockCacheTraceAnalyzer::WriteSkewness(const string&, const std::vector<long unsigned int>&, rocksdb::TraceType) const::__lambda1
std::pair<std::string, uint64_t>& b) { return b.second < a.second; });
^
tools/block_cache_analyzer/block_cache_trace_analyzer.cc:583:46: note: no known conversion for argument 1 from ‘const std::pair<std::basic_string<char>, long unsigned int>’ to ‘std::pair<std::basic_string<char>, long unsigned int>&’
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6106
Differential Revision: D18783943
Pulled By: riversand963
fbshipit-source-id: cc7fc10565f0210b9eebf46b95cb4950ec0b15fa
Summary:
format_version=5 enables new Bloom filter. Using 2/5
probability for "latest and greatest" rather than naive 1/4.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6102
Test Plan: start 'make blackbox_crash_test'
Differential Revision: D18735685
Pulled By: pdillinger
fbshipit-source-id: e81529c8a3f53560d246086ee5f92ee7d79a2eab
Summary:
**NOTE**: this also needs to be back-ported to 6.4.6 and possibly older branches if further releases from them is envisaged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6081
Differential Revision: D18710107
Pulled By: zhichao-cao
fbshipit-source-id: 03260f9316566e2bfc12c7d702d6338bb7941e01
Summary:
Recently, a bug was found related to a seek key that is close to SST file boundary. However, it only occurs in a very small chance in db_stress, because the chance that a random key hits SST file boundaries is small. To boost the chance, with 1/16 chance, we pick keys that are close to SST file boundaries.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6037
Test Plan: Did some manual printing out, and hack to cover the key generation logic to be correct.
Differential Revision: D18598476
fbshipit-source-id: 13b76687d106c5be4e3e02a0c77fa5578105a071
Summary:
Right now, in db_stress, as long as prefix extractor is defined, TestIterator always uses. There is value of cover total_order_seek = true when prefix extractor is define. Add a small chance that this flag is turned on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6039
Test Plan: Run the test for a while.
Differential Revision: D18539689
fbshipit-source-id: 568790dd7789c9986b83764b870df0423a122d99
Summary:
Right now, crash_test always uses 16KB max_manifest_file_size value. It is good to cover logic of manifest file switch. However, information stored in manifest files might be useful in debugging failures. Switch to only use small manifest file size in 1/15 of the time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6034
Test Plan: Observe command generated by db_crash_test.py multiple times and see the --max_manifest_file_size value distribution.
Differential Revision: D18513824
fbshipit-source-id: 7b3ae6dbe521a0918df41064e3fa5ecbf2466e04
Summary:
Right now, db_stress doesn't cover SeekForPrev(). Add the coverage, which mirrors what we do for Seek().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6022
Test Plan: Run "make crash_test". Do some manual source code hack to simular iterator wrong results and see it caught.
Differential Revision: D18442193
fbshipit-source-id: 879b79000d5e33c625c7e970636de191ccd7776c
Summary:
In stress test, all iterator verification is turned off is lower bound is enabled. This might be stricter than needed. This PR relaxes the condition and include the case where lower bound is lower than both of seek key and upper bound. It seems to work mostly fine when I run crash test locally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5869
Test Plan: Run crash_test
Differential Revision: D18363578
fbshipit-source-id: 23d57e11ea507949b8100f4190ddfbe8db052d5a
Summary:
Right now, in db_stress's CF consistency test's TestGet case, if failure happens, we do normal string printing, rather than hex printing, so that some text is not printed out, which makes debugging harder. Fix it by printing hex instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5989
Test Plan: Build db_stress and see t passes.
Differential Revision: D18363552
fbshipit-source-id: 09d1b8f6fbff37441cbe7e63a1aef27551226cec
Summary:
In the previous PR https://github.com/facebook/rocksdb/issues/4788, user can use db_bench mix_graph option to generate the workload that is from the social graph. The key is generated based on the key access hotness. In this PR, user can further model the key-range hotness and fit those to two-term-exponential distribution. First, user cuts the whole key space into small key ranges (e.g., key-ranges are the same size and the key-range number is the number of SST files). Then, user calculates the average access count per key of each key-range as the key-range hotness. Next, user fits the key-range hotness to two-term-exponential distribution (f(x) = f(x) = a*exp(b*x) + c*exp(d*x)) and generate the value of a, b, c, and d. They are the parameters in db_bench: prefix_dist_a, prefix_dist_b, prefix_dist_c, and prefix_dist_d. Finally, user can run db_bench by specify the parameters.
For example:
`./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=268435456 -key_dist_a=0.002312 -key_dist_b=0.3467 -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=350 -sine_b=0.0105 -sine_d=50000 --perf_level=2 -reads=1000000 -num=5000000 -key_size=48`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5953
Test Plan: run db_bench with different parameters and checked the results.
Differential Revision: D18053527
Pulled By: zhichao-cao
fbshipit-source-id: 171f8b3142bd76462f1967c58345ad7e4f84bab7
Summary:
DBImpl extends the public GetSnapshot() with GetSnapshotForWriteConflictBoundary() method that takes snapshots specially for write-write conflict checking. Compaction treats such snapshots differently to avoid GCing a value written after that, so that the write conflict remains visible even after the compaction. The patch extends stress tests with such snapshots.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5897
Differential Revision: D17937476
Pulled By: maysamyabandeh
fbshipit-source-id: bd8b0c578827990302194f63ae0181e15752951d
Summary:
This reverts commit 351e25401b.
All branches have been fixed to buildable on FB environments, so we can revert it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5999
Differential Revision: D18281947
fbshipit-source-id: 6deaaf1b5df2349eee5d6ed9b91208cd7e23ec8e
Summary:
A recent commit make periodic compaction option valid in FIFO, which means TTL. But we fail to disable it in crash test, causing assert failure. Fix it by having it disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5993
Test Plan: Restart "make crash_test" many times and make sure --periodic_compaction_seconds=0 is always the case when --compaction_style=2
Differential Revision: D18263223
fbshipit-source-id: c91a802017d83ae89ac43827d1b0012861933814
Summary:
We have updated earlier release branches going back to 5.5 so they are
built using gcc7 by default. Disabling ancient versions before that
until we figure out a plan for them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5990
Test Plan: Ran the script locally.
Differential Revision: D18252386
Pulled By: ltamasi
fbshipit-source-id: a7bbb30dc52ff2eaaf31a29ecc79f7cf4e2834dc
Summary:
Recently, pipelined write is enabled even if atomic flush is enabled, which causing sanitizing failure in db_stress. Revert this change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5986
Test Plan: Run "make crash_test_with_atomic_flush" and see it to run for some while so that the old sanitizing error (which showed up quickly) doesn't show up.
Differential Revision: D18228278
fbshipit-source-id: 27fdf2f8e3e77068c9725a838b9bef4ab25a2553
Summary:
More release branches are created. We should include them in continuous format compatibility checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5985
Test Plan: Let's see whether it is passes.
Differential Revision: D18226532
fbshipit-source-id: 75d8cad5b03ccea4ce16f00cea1f8b7893b0c0c8
Summary:
In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush().
This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716
Test Plan: Run all tests with ASAN and TSAN.
Differential Revision: D18217658
fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3
Summary:
Right now, in db_stress's iterator tests, we always use the same CF to validate iterator results. This commit changes it so that a randomized CF is used in Cf consistency test, where every CF should have exactly the same data. This would help catch more bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5983
Test Plan: Run "make crash_test_with_atomic_flush".
Differential Revision: D18217643
fbshipit-source-id: 3ac998852a0378bb59790b20c5f236f6a5d681fe
Summary:
Right now in CF consitency stres test's TestGet(), keys are just fetched without validation. With this change, in 1/2 the time, compare all the CFs share the same value with the same key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5863
Test Plan: Run "make crash_test_with_atomic_flush" and see tests pass. Hack the code to generate some inconsistency and observe the test fails as expected.
Differential Revision: D17934206
fbshipit-source-id: 00ba1a130391f28785737b677f80f366fb83cced
Summary:
Since we already parse env_uri from command line and creates custom Env
accordingly, we should invoke the methods of such Envs instead of using
Env::Default().
Test Plan (on devserver):
```
$make db_bench db_stress
$./db_bench -benchmarks=fillseq
./db_stress
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5943
Differential Revision: D18018550
Pulled By: riversand963
fbshipit-source-id: 03b61329aaae0dfd914a0b902cc677f570f102e3
Summary:
Since SeekForPrev (used by Prev) is not supported by HashSkipList when prefix is used, we disable it when stress testing HashSkipList.
- Change the default memtablerep to skip list.
- Avoid Prev() when memtablerep is HashSkipList and prefix is used.
Test Plan (on devserver):
```
$make db_stress
$./db_stress -ops_per_thread=10000 -reopen=1 -destroy_db_initially=true -column_families=1 -threads=1 -column_families=1 -memtablerep=prefix_hash
$# or simply
$./db_stress
$./db_stress -memtablerep=prefix_hash
```
Results must print "Verification successful".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5942
Differential Revision: D18017062
Pulled By: riversand963
fbshipit-source-id: af867e59aa9e6f533143c984d7d529febf232fd7
Summary:
Plain table SSTs could crash sst_dump because of a bug in
PlainTableReader that can leave table_properties_ as null. Even if it
was intended not to keep the table properties in some cases, they were
leaked on the offending code path.
Steps to reproduce:
$ db_bench --benchmarks=fillrandom --num=2000000 --use_plain_table --prefix-size=12
$ sst_dump --file=0000xx.sst --show_properties
from [] to []
Process /dev/shm/dbbench/000014.sst
Sst file format: plain table
Raw user collected properties
------------------------------
Segmentation fault (core dumped)
Also added missing unit testing of plain table full_scan_mode, and
an assertion in NewIterator to check for regression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5940
Test Plan: new unit test, manual, make check
Differential Revision: D18018145
Pulled By: pdillinger
fbshipit-source-id: 4310c755e824c4cd6f3f86a3abc20dfa417c5e07
Summary:
In the current trace replay, all the queries are serialized and called by single threads. It may not simulate the original application query situations closely. The multi-threads replay is implemented in this PR. Users can set the number of threads to replay the trace. The queries generated according to the trace records are scheduled in the thread pool job queue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5934
Test Plan: test with make check and real trace replay.
Differential Revision: D17998098
Pulled By: zhichao-cao
fbshipit-source-id: 87eecf6f7c17a9dc9d7ab29dd2af74f6f60212c8
Summary:
expose db stress test by providing db_stress_tool.h in public header.
This PR does the following:
- adds a new header, db_stress_tool.h, in include/rocksdb/
- renames db_stress.cc to db_stress_tool.cc
- adds a db_stress.cc which simply invokes a test function.
- update Makefile accordingly.
Test Plan (dev server):
```
make db_stress
./db_stress
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5937
Differential Revision: D17997647
Pulled By: riversand963
fbshipit-source-id: 1a8d9994f89ce198935566756947c518f0052410
Summary:
The patch adds a new command line parameter --decode_blob_index to sst_dump.
If this switch is specified, sst_dump prints blob indexes in a human readable format,
printing the blob file number, offset, size, and expiration (if applicable) for blob
references, and the blob value (and expiration) for inlined blobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5926
Test Plan:
Used db_bench's BlobDB mode to generate SST files containing blob references with
and without expiration, as well as inlined blobs with and without expiration (note: the
latter are stored as plain values), and confirmed sst_dump correctly prints all four types
of records.
Differential Revision: D17939077
Pulled By: ltamasi
fbshipit-source-id: edc5f58fee94ba35f6699c6a042d5758f5b3963d
Summary:
Currently, db_bench only supports PutWithTTL operations for BlobDB but
not regular Puts. The patch adds support for regular (non-TTL) Puts and also
changes the default for blob_db_max_ttl_range to zero, which corresponds
to no TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5921
Test Plan:
make check
./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1
-duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000
-target_file_size_base=1000000 (issues Put operations with no TTL)
./db_bench -benchmarks=fillrandom -statistics -stats_interval_seconds=1
-duration=90 -num=500000 -use_blob_db=1 -blob_db_file_size=1000000
-target_file_size_base=1000000 -blob_db_max_ttl_range=86400 (issues
PutWithTTL operations with random TTLs in the [0, blob_db_max_ttl_range)
interval, as before)
Differential Revision: D17919798
Pulled By: ltamasi
fbshipit-source-id: b946c3522b836b92b4c157ffbad24f92ba2b0a16
Summary:
The loop in OperateDb() is getting quite complicated with the introduction of multiple key operations such as MultiGet and Reseeks. This is resulting in a number of corner cases that hangs db_stress due to synchronization problems during reopen (i.e when -reopen=<> option is specified). This PR makes it more robust by ensuring all db_stress threads vote to reopen the DB the exact same number of times.
Most of the changes in this diff are due to indentation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5893
Test Plan: Run crash test
Differential Revision: D17823827
Pulled By: anand1976
fbshipit-source-id: ec893829f611ac7cac4057c0d3d99f9ffb6a6dd9
Summary:
This PR allows for the creation of custom env when using sst_dump. If
the user does not set options.env or set options.env to nullptr, then sst_dump
will automatically try to create a custom env depending on the path to the sst
file or db directory. In order to use this feature, the user must call
ObjectRegistry::Register() beforehand.
Test Plan (on devserver):
```
$make all && make check
```
All tests must pass to ensure this change does not break anything.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5845
Differential Revision: D17678038
Pulled By: riversand963
fbshipit-source-id: 58ecb4b3f75246d52b07c4c924a63ee61c1ee626
Summary:
This is the 2nd attempt after the revert of https://github.com/facebook/rocksdb/pull/4020
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5895
Test Plan:
```
./tools/db_crashtest.py blackbox --simple --interval=10 --max_key=10000000
```
Differential Revision: D17822137
Pulled By: maysamyabandeh
fbshipit-source-id: 3d148c0d8cc129080410ff859c04b544223c8ea3
Summary:
When multiple operations are performed in a db_stress thread in one loop
iteration, the reopen voting logic needs to take that into account. It
was doing that for MultiGet, but a new option was introduced recently to
do multiple iterator seeks per iteration, which broke it again. Fix the
logic to be more robust and agnostic of the type of operation performed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5876
Test Plan: Run db_stress
Differential Revision: D17733590
Pulled By: anand1976
fbshipit-source-id: 787f01abefa1e83bba43e0b4f4abb26699b2089e
Summary:
Two more bug fixes in db_stress:
1. this is to complete the fix of the regression bug causing overflowing when supporting FLAGS_prefix_size = -1.
2. Fix regression bug in compare iterator itself:
(1) when creating control iterator, which used the same read option as the normal iterator by mistake; (2) the logic of comparing has some problems. Fix them.
(3) disable validation for lower bound now, which generated some wildly different results. Disabling it to make normal tests pass while investigating it.
3. Cleaning up snapshots in verification failure cases. Memory is leaked otherwise.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5867
Test Plan: Run "make crash_test" for a while and see at least 1 is fixed.
Differential Revision: D17671712
fbshipit-source-id: 011f98ea1a72aef23e19ff28656830c78699b402
Summary:
When prefix_size = -1, stress test crashes with run time error because of overflow. Fix it by not using -1 but 7 in prefix scan mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5862
Test Plan:
Run
python -u tools/db_crashtest.py --simple whitebox --random_kill_odd \
888887 --compression_type=zstd
and see it doesn't crash.
Differential Revision: D17642313
fbshipit-source-id: f029e7651498c905af1b1bee6d310ae50cdcda41
Summary:
For now, crash_test is not able to report any failure for the logic related to iterator upper, lower bounds or iterators, or reseek. These are features prone to errors. Improve db_stress in several ways:
(1) For each iterator run, reseek up to 3 times.
(2) For every iterator, create control iterator with upper or lower bound, with total order seek. Compare the results with the iterator.
(3) Make simple crash test to avoid prefix size to have more coverage.
(4) make prefix_size = 0 a valid size and -1 to indicate disabling prefix extractor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5846
Test Plan: Manually hack the code to create wrong results and see they are caught by the tool.
Differential Revision: D17631760
fbshipit-source-id: acd460a177bd2124a5ffd7fff490702dba63030b
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830
Test Plan: Run all existing tests.
Differential Revision: D17488031
fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
Summary:
Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827
Test Plan: Run all existing tests.
Differential Revision: D17483727
fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f
Summary:
PR https://github.com/facebook/rocksdb/issues/4020 enabled partitioned indexes/filters in stress tests; however,
this causes assertion failures in BatchedOpsStressTest. This patch
disables them until we can root cause the failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5811
Test Plan: Ran the script and made sure it only uses the binary search index.
Differential Revision: D17399366
Pulled By: ltamasi
fbshipit-source-id: adb116e6297f9c6ccd7ac15b6a16c9aa91f21ac5
Summary:
file_reader_writer.h and .cc contain several files and helper function, and it's hard to navigate. Separate it to multiple files and put them under file/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803
Test Plan: Build whole project using make and cmake.
Differential Revision: D17374550
fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
Summary:
PR https://github.com/facebook/rocksdb/issues/4020 implicitly enabled the hash index as well in stress/crash
tests, resulting in assertion failures in Block. This patch disables
the hash index until we can pinpoint the root cause of these issues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5792
Test Plan:
Ran tools/db_crashtest.py and made sure it only uses index types 0 and 2
(binary search and partitioned index).
Differential Revision: D17346777
Pulled By: ltamasi
fbshipit-source-id: b4318f37f1fda3ee1bbff4ef2c2f556ca9e6b551
Summary:
This is required to compile on Windows with Visual Studio 2015.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5786
Differential Revision: D17335994
fbshipit-source-id: 8f9568310bc6f697e312b5e24ad465e9084f0011
Summary:
- In `db_stress`, support choosing index type and whether to enable filter partitioning, and randomly set those options in crash test
- When partitioned filter is enabled by crash test, force partitioned index to also be enabled since it's a prerequisite
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4020
Test Plan:
currently this is blocked on fixing the bug that crash test caught:
```
$ TEST_TMPDIR=/data/compaction_bench python ./tools/db_crashtest.py blackbox --simple --interval=10 --max_key=10000000
...
Verification failed for column family 0 key 937501: Value not found: NotFound:
Crash-recovery verification failed :(
```
Differential Revision: D8508683
Pulled By: maysamyabandeh
fbshipit-source-id: 0337e5d0558bcef26b1f3699f47265a2c1e99629
Summary:
https://github.com/facebook/rocksdb/pull/5741 added compaction TTL to crash test, but it causes assertion fails for FIFO compaction. Disable this combination for now while we debug the assertion failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5749
Test Plan: Run crash test and observe that when compaction_style=2, compaction_ttl is always 0.
Differential Revision: D17078292
fbshipit-source-id: 446821a3b9739956094d5e4f9be1251a15b57f5d
Summary:
Covering periodic compaction and compaction TTL can help us expose potential issues. Add it there.
Randomly select value for these two options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5741
Test Plan: Run crash_test and see the perameters generated.
Differential Revision: D17059515
fbshipit-source-id: 8213974846a0b6a22fc13be705825c9054d1d097
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
Differential Revision: D14394062
Pulled By: miasantreble
fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
Summary:
AtomicFlushStressTest is a powerful test, but right now we only run it for atomic_flush=true + disable_wal=true. We further extend it to the case where atomic_flush=false + disable_wal = false. All the workload generation and validation can stay the same.
Atomic flush crash test is also changed to switch between the two test scenarios. It makes the name "atomic flush crash test" out of sync from what it really does. We leave it as it is to avoid troubles with continous test set-up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5729
Test Plan: Run "CRASH_TEST_KILL_ODD=188 TEST_TMPDIR=/dev/shm/ USE_CLANG=1 make whitebox_crash_test_with_atomic_flush", observe the settings used and see it passed.
Differential Revision: D16969791
fbshipit-source-id: 56e37487000ae631e31b0100acd7bdc441c04163
Summary:
Atomic white box test's kill odd is the same as normal test. However, in the scenario that only WritableFileWriter::Append() is blacklisted, WritableFileWriter::Flush() dominates the killing odds. Normally, most of WritableFileWriter::Flush() are called in WAL writes, where every write triggers a WAL flush. In atomic test, WAL is disabled, so the kill happens less frequently than we antipated. In some rare cases, the kill didn't end up with happening (for reasons I still don't fully understand) and cause the stress test timeout.
If WAL is disabled, make the odds 5x likely to trigger.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5717
Test Plan: Run whitebox_crash_test_with_atomic_flush and whitebox_crash_test and observe the kill odds printed out.
Differential Revision: D16897237
fbshipit-source-id: cbf5d96f6fc0e980523d0f1f94bf4e72cdb82d1c
Summary:
Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713
Test Plan: Add a new unit test.
Differential Revision: D16860874
fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
Summary:
Add a command in ldb so that users can print out tombstones in SST files.
In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615
Test Plan: Add a new unit test
Differential Revision: D16550326
fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
Summary:
This PR adds support in block cache trace analyzer to read from human readable trace file. This is needed when a user does not have access to the binary trace file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5679
Test Plan: USE_CLANG=1 make check -j32
Differential Revision: D16697239
Pulled By: HaoyuHuang
fbshipit-source-id: f2e29d7995816c389b41458f234ec8e184a924db
Summary:
This PR adds four more eviction policies.
- OPT [1]
- Hyperbolic caching [2]
- ARC [3]
- GreedyDualSize [4]
[1] L. A. Belady. 1966. A Study of Replacement Algorithms for a Virtual-storage Computer. IBM Syst. J. 5, 2 (June 1966), 78-101. DOI=http://dx.doi.org/10.1147/sj.52.0078
[2] Aaron Blankstein, Siddhartha Sen, and Michael J. Freedman. 2017. Hyperbolic caching: flexible caching for web applications. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '17). USENIX Association, Berkeley, CA, USA, 499-511.
[3] Nimrod Megiddo and Dharmendra S. Modha. 2003. ARC: A Self-Tuning, Low Overhead Replacement Cache. In Proceedings of the 2nd USENIX Conference on File and Storage Technologies (FAST '03). USENIX Association, Berkeley, CA, USA, 115-130.
[4] N. Young. The k-server dual and loose competitiveness for paging. Algorithmica, June 1994, vol. 11,(no.6):525-41. Rewritten version of ''On-line caching as cache size varies'', in The 2nd Annual ACM-SIAM Symposium on Discrete Algorithms, 241-250, 1991.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5644
Differential Revision: D16548817
Pulled By: HaoyuHuang
fbshipit-source-id: 838f76db9179f07911abaab46c97e1c929cfcd63
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
Summary:
This PR updated the python script to plot graphs for stats output from block cache analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5673
Test Plan: Manually run the script to generate graphs.
Differential Revision: D16657145
Pulled By: HaoyuHuang
fbshipit-source-id: fd510b5fd4307835f9a986fac545734dbe003d28
Summary:
This PR implements cache eviction using reinforcement learning. It includes two implementations:
1. An implementation of Thompson Sampling for the Bernoulli Bandit [1].
2. An implementation of LinUCB with disjoint linear models [2].
The idea is that a cache uses multiple eviction policies, e.g., MRU, LRU, and LFU. The cache learns which eviction policy is the best and uses it upon a cache miss.
Thompson Sampling is contextless and does not include any features.
LinUCB includes features such as level, block type, caller, column family id to decide which eviction policy to use.
[1] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018. A Tutorial on Thompson Sampling. Found. Trends Mach. Learn. 11, 1 (July 2018), 1-96. DOI: https://doi.org/10.1561/2200000070
[2] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (WWW '10). ACM, New York, NY, USA, 661-670. DOI=http://dx.doi.org/10.1145/1772690.1772758
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5610
Differential Revision: D16435067
Pulled By: HaoyuHuang
fbshipit-source-id: 6549239ae14115c01cb1e70548af9e46d8dc21bb
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method).
This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not.
When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.
Test plan (on riversand963's devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293
Differential Revision: D16363396
Pulled By: riversand963
fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
Summary:
Right now, ldb cannot scan a DB with merge operands with default ldb. There is no hard to give a general merge operator so that it can at least print out something
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5607
Test Plan: Run ldb against a DB with merge operands and see the outputs.
Differential Revision: D16442634
fbshipit-source-id: c66c414ec07f219cfc6e6ec2cc14c783ee95df54
Summary:
- Compute correlation between a few features and predictions, e.g., number of accesses since the last access vs number of accesses till the next access on a block.
- Output human readable trace file so python can consume it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5596
Test Plan: make clean && USE_CLANG=1 make check -j32
Differential Revision: D16373200
Pulled By: HaoyuHuang
fbshipit-source-id: c848d26bc2e9210461f317d7dbee42d55be5a0cc
Summary:
Atomic flush test started to fail after https://github.com/facebook/rocksdb/issues/5099. Then https://github.com/facebook/rocksdb/issues/5278 provided a fix after
which the same error occurred much less frequently. However it still occur
occasionally. Not sure what the root cause is. This PR disables the feature of
snapshot list refresh, and we should keep an eye on the failure in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5581
Differential Revision: D16295985
Pulled By: riversand963
fbshipit-source-id: c9e62e65133c52c21b07097de359632ca62571e4
Summary:
ldb idump now only works for default column family. Extend it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5594
Test Plan: Compile and run the tool against a multiple CF DB.
Differential Revision: D16380684
fbshipit-source-id: bfb8af36fdad1806837c90aaaab492d71528aceb
Summary:
This PR traces the referenced key for Get for all types of blocks. This is useful when evaluating hybrid row-block caches.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5548
Test Plan: make clean && USE_CLANG=1 make check -j32
Differential Revision: D16157979
Pulled By: HaoyuHuang
fbshipit-source-id: f6327411c9deb74e35e22a35f66cdbae09ab9d87
Summary:
Currently, when the block cache is used for the filter block, it is not
really the block itself that is stored in the cache but a FilterBlockReader
object. Since this object is not pure data (it has, for instance, pointers that
might dangle, including in one case a back pointer to the TableReader), it's not
really sharable. To avoid the issues around this, the current code erases the
cache entries when the TableReader is closed (which, BTW, is not sufficient
since a concurrent TableReader might have picked up the object in the meantime).
Instead of doing this, the patch moves the FilterBlockReader out of the cache
altogether, and decouples the filter reader object from the filter block.
In particular, instead of the TableReader owning, or caching/pinning the
FilterBlockReader (based on the customer's settings), with the change the
TableReader unconditionally owns the FilterBlockReader, which in turn
owns/caches/pins the filter block. This change also enables us to reuse the code
paths historically used for data blocks for filters as well.
Note:
Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
Test Plan: make asan_check
Differential Revision: D16036974
Pulled By: ltamasi
fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5563
Test Plan: Manually run the script on files generated by block_cache_trace_analyzer.
Differential Revision: D16214400
Pulled By: HaoyuHuang
fbshipit-source-id: 94485eed995e9b2b63e197c5dfeb80129fa7897f
Summary:
This PR adds a ghost cache for admission control. Specifically, it admits an entry on its second access.
It also adds a hybrid row-block cache that caches the referenced key-value pairs of a Get/MultiGet request instead of its blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5534
Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32
Differential Revision: D16101124
Pulled By: HaoyuHuang
fbshipit-source-id: b99edda6418a888e94eb40f71ece45d375e234b1
Summary:
When atomic flush stress test fails, we print internal keys within the range with mismatched key/values for all column families.
Test plan (on devserver)
Manually hack the code to randomly insert wrong data. Run the test.
```
$make clean && COMPILE_WITH_TSAN=1 make -j32 db_stress
$./db_stress -test_atomic_flush=true -ops_per_thread=10000
```
Check that proper error messages are printed, as follows:
```
2019/07/08-17:40:14 Starting verification
Verification failed
Latest Sequence Number: 190903
[default] 000000000000050B => 56290000525350515E5F5C5D5A5B5859
[3] 0000000000000533 => EE100000EAEBE8E9E6E7E4E5E2E3E0E1FEFFFCFDFAFBF8F9
Internal keys in CF 'default', [000000000000050B, 0000000000000533] (max 8)
key 000000000000050B seq 139920 type 1
key 0000000000000533 seq 0 type 1
Internal keys in CF '3', [000000000000050B, 0000000000000533] (max 8)
key 0000000000000533 seq 0 type 1
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5549
Differential Revision: D16158709
Pulled By: riversand963
fbshipit-source-id: f07fa87763f87b3bd908da03c956709c6456bcab
Summary:
Right now ldb can open running DB through read-only DB. However, it might leave info logs files to the read-only DB directory. Add an option to open the DB as secondary to avoid it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5537
Test Plan:
Run
./ldb scan --max_keys=10 --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp --no_value --hex
and
./ldb get 0x00000000000000103030303030303030 --hex --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp
against a normal db_bench run and observe the output changes. Also observe that no new info logs files are created under /tmp/rocksdbtest-2491/dbbench.
Run without --secondary_path and observe that new info logs created under /tmp/rocksdbtest-2491/dbbench.
Differential Revision: D16113886
fbshipit-source-id: 4e09dec47c2528f6ca08a9e7a7894ba2d9daebbb
Summary:
Print out some more information when db_tress fails with verification failures to help debugging problems.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5543
Test Plan:
Manually ingest some failures and observe the outputs are like this:
Verification failed
[default] 0000000000199A5A => 7C3D000078797A7B74757677707172736C6D6E6F68696A6B
[6] 000000000019C8BD => 65380000616063626D6C6F6E69686B6A
internal keys in default CF [0000000000199A5A, 000000000019C8BD] (max 8)
key 0000000000199A5A seq 179246 type 1
key 000000000019C8BD seq 163970 type 1
Lastest Sequence Number: 292234
Differential Revision: D16153717
fbshipit-source-id: b33fa50a828c190cbf8249a37955432044f92daf
Summary:
Sometimes it is helpful to fetch the whole history of stats after benchmark runs. Add such an option
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5532
Test Plan: Run the benchmark manually and observe the output is as expected.
Differential Revision: D16097764
fbshipit-source-id: 10b5b735a22a18be198b8f348be11f11f8806904
Summary:
Formatting fixes in db_bench_tool that were accidentally omitted
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5525
Test Plan: Unit tests
Differential Revision: D16078516
Pulled By: elipoz
fbshipit-source-id: bf8df0e3f08092a91794ebf285396d9b8a335bb9
Summary:
This PR creates cache_simulator.h file. It contains a CacheSimulator that runs against a block cache trace record. We can add alternative cache simulators derived from CacheSimulator later. For example, this PR adds a PrioritizedCacheSimulator that inserts filter/index/uncompressed dictionary blocks with high priority.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5517
Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32
Differential Revision: D16043689
Pulled By: HaoyuHuang
fbshipit-source-id: 65f28ed52b866ffb0e6eceffd7f9ca7c45bb680d
Summary:
`create_column_family` cmd already exists but was somehow missed in the help message.
also add `drop_column_family` cmd which can drop a cf without opening db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5503
Test Plan: Updated existing ldb_test.py to test deleting a column family.
Differential Revision: D16018414
Pulled By: lightmark
fbshipit-source-id: 1fc33680b742104fea86b10efc8499f79e722301
Summary:
This PR adds a feature in block cache trace analysis tool to write statistics into csv files.
1. The analysis tool supports grouping the number of accesses per second by various labels, e.g., block, column family, block type, or a combination of them.
2. It also computes reuse distance and reuse interval.
Reuse distance: The cumulated size of unique blocks read between two consecutive accesses on the same block.
Reuse interval: The time between two consecutive accesses on the same block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5490
Differential Revision: D15901322
Pulled By: HaoyuHuang
fbshipit-source-id: b5454fea408a32757a80be63de6fe1c8149ca70e
Summary:
This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block.
1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.
This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454
Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.
Differential Revision: D15819451
Pulled By: HaoyuHuang
fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
Summary:
sprintf is unsafe and has buffer overrun risk. Replace it with the safer version snprintf where buffer size is supplied to avoid overrun.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5475
Differential Revision: D15879481
Pulled By: sagar0
fbshipit-source-id: 7ae1958ffc9727fa50261dfbb98ddd74e70a72d8
Summary:
This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G".
When we replay the trace, we also perform lookups and inserts on the simulated caches.
In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in a output file.
This PR also adds a main source block_cache_trace_analyzer so that we can run the analyzer in command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449
Test Plan:
Added tests for block_cache_trace_analyzer.
COMPILE_WITH_ASAN=1 make check -j32.
Differential Revision: D15797073
Pulled By: HaoyuHuang
fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046
Differential Revision: D15863138
Pulled By: miasantreble
fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
Summary:
This PR integrates the block cache tracing into db_bench. It adds three command line arguments.
-block_cache_trace_file (Block cache trace file path.) type: string default: ""
-block_cache_trace_max_trace_file_size_in_bytes (The maximum block cache
trace file size in bytes. Block cache accesses will not be logged if the
trace file size exceeds this threshold. Default is 64 GB.) type: int64
default: 68719476736
-block_cache_trace_sampling_frequency (Block cache trace sampling
frequency, termed s. It uses spatial downsampling and samples accesses to
one out of s blocks.) type: int32 default: 1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5459
Differential Revision: D15832031
Pulled By: HaoyuHuang
fbshipit-source-id: 0ecf2f2686557251fe741a2769b21170777efa3d
Summary:
This PR integrates the block cache tracer into block based table reader. The tracer will write the block cache accesses using the trace_writer. The tracer is null in this PR so that nothing will be logged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5441
Differential Revision: D15772029
Pulled By: HaoyuHuang
fbshipit-source-id: a64adb92642cd23222e0ba8b10d86bf522b42f9b
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433
Differential Revision: D15728016
Pulled By: HaoyuHuang
fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
Summary:
The tsan crash tests are failing with a data race compliant with pipelined write option. Temporarily disable it until its concurrency issue are fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5445
Differential Revision: D15783824
Pulled By: maysamyabandeh
fbshipit-source-id: 413a0c3230b86f524fc7eeea2cf8e8375406e65b
Summary:
This PR contains the first commit for block cache trace analyzer. It reads a block cache trace file and prints statistics of the traces.
We will extend this class to provide more functionalities.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5425
Differential Revision: D15709580
Pulled By: HaoyuHuang
fbshipit-source-id: 2f43bd2311f460ab569880819d95eeae217c20bb
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402
Differential Revision: D15701195
Pulled By: miasantreble
fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
Summary:
util/ means for lower level libraries. trace_replay is highly integrated to DB and sometimes call DB. Move it out to a separate directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5376
Differential Revision: D15550938
Pulled By: siying
fbshipit-source-id: f46dce5ceffdc05a73f26379c7bb1b79ebe6c207
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387
Differential Revision: D15579036
Pulled By: siying
fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
Summary:
1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables.
2. Add support for benchmarking secondary instance to db_bench_tool.
With changes made in this PR, we can start benchmarking secondary instance
using two processes. It is also possible to vary the frequency at which the
secondary instance tries to catch up with the primary. The info log of the
secondary can be found in a directory whose path can be specified with
'-secondary_path'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170
Differential Revision: D15564608
Pulled By: riversand963
fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8
Summary:
When the --reopen option is non-zero, the DB is reopened after every ops_per_thread/(reopen+1) ops, with the check being done after every op. With MultiGet, we might do multiple ops in one iteration, which broke the logic that checked when to synchronize among the threads and reopen the DB. This PR fixes that logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5374
Differential Revision: D15559780
Pulled By: anand1976
fbshipit-source-id: ee6563a68045df7f367eca3cbc2500d3e26359ef
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375
Differential Revision: D15550935
Pulled By: siying
fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330
Differential Revision: D15525932
Pulled By: maysamyabandeh
fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
Summary:
In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273
Differential Revision: D15389232
Pulled By: zhichao-cao
fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806
Summary:
This PR has two fixes for crash test failures -
1. Fix a bug in TestMultiGet() in db_stress that was passing list of key to MultiGet() in the wrong order, thus ensuring that actual values don't match expected values
2. Remove an incorrect assertion in FilePickerMultiGet::GetNextFileInLevelWithKeys() that checks that files in a level are in sorted order. This is not true with MultiGet(), especially if there are duplicate keys and we may have to go back one file for the next key. Furthermore, this assertion makes more sense when a new version is created, rather than at lookup time
Test -
asan_crash and ubsan_crash tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5301
Differential Revision: D15337383
Pulled By: anand1976
fbshipit-source-id: 35092cb15bbc1700e5e823cbe07bfa62f1e9e6c6
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.
Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
Differential Revision: D15219029
Pulled By: maysamyabandeh
fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
Summary:
This PR fixes a couple of bugs in FilePickerMultiGet that were causing db_stress test failures. The failures were caused by -
1. Improper handling of a key that matches the user key portion of an L0 file's largest key. In this case, the curr_index_in_curr_level file index in L0 for that key was getting incremented, but batch_iter_ was not advanced. By design, all keys in a batch are supposed to be checked against an L0 file before advancing to the next L0 file. Not advancing to the next key in the batch was causing a double increment of curr_index_in_curr_level due to the same key being processed again
2. Improper handling of a key that matches the user key portion of the largest key in the last file of L1 and higher. This was resulting in a premature end to the processing of the batch for that level when the next key in the batch is a duplicate. Typically, the keys in MultiGet will not be duplicates, but its good to handle that case correctly
Test -
asan_crash
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5292
Differential Revision: D15282530
Pulled By: anand1976
fbshipit-source-id: d1a6a86e0af273169c3632db22a44d79c66a581f
Summary:
Disable it for now until we can get stress tests to pass consistently.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5284
Differential Revision: D15230727
Pulled By: anand1976
fbshipit-source-id: 239baacdb3c4cd4fb7c4447f7582b9042501d752
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list.
This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
Differential Revision: D15203291
Pulled By: maysamyabandeh
fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
Summary:
This PR fixes three memory issues found by ASAN
* in db_stress, the key vector for MultiGet is created using `emplace_back` which could potentially invalidates references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance.
* Similar issue in construction of GetContext autovector in version_set.cc
* In multiget_context.h use T[] specialization for unique_ptr that holds a char array
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279
Differential Revision: D15202893
Pulled By: miasantreble
fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d
Summary:
The new option will pick a batch size randomly in the range 1-64. It will then space the keys in the batch by random intervals.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5264
Differential Revision: D15175522
Pulled By: anand1976
fbshipit-source-id: c16baa69d0f1ff4cf53c55c813ddd82c8aeb58fc
Summary:
At least one of the meta-block loading functions (`ReadRangeDelBlock`)
uses the same block reading function (`NewDataBlockIterator`) as data
block reads, which means it uses the dictionary handle. However, the
dictionary handle was uninitialized while reading meta-blocks, causing
readers to receive an error. This situation was only noticed when
`cache_index_and_filter_blocks=true`.
This PR initializes the handle to null while reading meta-blocks to
prevent the error. It also adds support to `db_stress` /
`db_crashtest.py` for `cache_index_and_filter_blocks`.
Fixes#5263.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5267
Differential Revision: D15149264
Pulled By: maysamyabandeh
fbshipit-source-id: 991d38a306c62db5976778bfb050fa3cd4a0671b
Summary:
Since currently pipelined write allows one thread to perform memtable writes
while another thread is traversing the `flush_scheduler_`, it will cause an
assertion failure in `FlushScheduler::Clear`. To unblock crash recoery tests,
we temporarily disable pipelined write when atomic flush is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5266
Differential Revision: D15142285
Pulled By: riversand963
fbshipit-source-id: a0c20fe4ac543e08feaed602414f982054df7831
Summary:
Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.
1. Stop creating new table readers when the user explicitly sets readahead size.
2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
3. Add `readahead_size` to db_bench.
**Benchmarks:**
https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
**Test Plan:**
Updated `DBIteratorTest.ReadAhead` test to make sure that:
- no new table readers are created for iterators on setting ReadOptions.readahead_size
- At least "readahead" number of bytes are actually getting read on each iterator read.
TODO later:
- Use similar logic for compactions as well.
- This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
Differential Revision: D15107946
Pulled By: sagar0
fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
Summary:
I needed this change to be able to build the v6.0.1 release on Windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5227
Differential Revision: D15033815
Pulled By: sagar0
fbshipit-source-id: 579f3b8e694c34c0d43527eb2fa37175e37f5911
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
Differential Revision: D14940106
Pulled By: miasantreble
fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
Summary:
mv port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5152
Differential Revision: D14779409
Pulled By: siying
fbshipit-source-id: d4162c47c979c6e8cc6a9e601802864ab3768ecb
Summary:
Fix some hdfs-related code so that it can compile and run 'db_stress'
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5122
Differential Revision: D14675495
Pulled By: riversand963
fbshipit-source-id: cac280479efcf5451982558947eac1732e8bc45a
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113
Differential Revision: D14633030
Pulled By: siying
fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.
This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.
2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue.
3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.
4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899
Differential Revision: D14510945
Pulled By: riversand963
fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
Summary:
Right now ldb command doesn't allow cases where option values contain equals sign. For example,
```
ldb --db=/tmp/test scan --from='q=3' --max_keys=1
```
after parsing, ldb will have one option 'db', 'max_keys' and one flag 'from'.
This PR updates the parsing logic so that it now supports the above mentioned cases
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5088
Differential Revision: D14600869
Pulled By: miasantreble
fbshipit-source-id: c6ef518c74a98d7b6675ea5954ae08b1bda5554e
Summary:
This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms:
1. lz4 or snappy for fast compression
2. zstd or zlib for slow but higher compression.
The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType.
The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842
Differential Revision: D13629011
Pulled By: shobhitdayal
fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa
Summary:
Since this feature affects the WAL behavior, it seems important our crash-recovery tests cover it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5070
Differential Revision: D14470085
Pulled By: miasantreble
fbshipit-source-id: 9b9682a718a926d57d055e0a5ec867efbd2eb9c1
Summary:
In the current trace_analyzer implementation, once the trace file has corrupted content, which can be caused by unexpected tracing operations or other reasons, trace_analyzer will print the error and stop analyzing.
By adding the -try_process_corrupted_trace option, user can try to process the corrupted trace file and get the analyzing results of the trace records from the beginning to the the first corrupted point in the trace file. Analyzing might fail even this option is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5067
Differential Revision: D14433037
Pulled By: zhichao-cao
fbshipit-source-id: d095233ba371726869af0def0cdee23b69896831
Summary:
Without `total_order_seek=true`, using this command with `prefix_extractor` set skips over lots of keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5066
Differential Revision: D14425967
Pulled By: sagar0
fbshipit-source-id: f6f142733258d92604f920615be9266e1fe797f8
Summary:
In the MixGraph benchmark of db_bench, The max buffer size used for value of KV-pair might be extremely large (64MB), which might cause function stack overflow in some platforms, reduced to 1MB.
Added the finished ops printing in MixGraph benchmark.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5051
Differential Revision: D14379571
Pulled By: zhichao-cao
fbshipit-source-id: 24084fbe38f60f2902d9a40f6bc9a25e4e2c9bb9
Summary:
Added an option, `-use_existing_keys`, which can be set to run
benchmarks against an arbitrary existing database. Now users can
benchmark against their actual database rather than synthetic data.
Before the run begins, it loads all the keys into memory, then uses that
set of keys rather than synthesizing new ones in `GenerateKeyFromInt`.
This is mainly intended for small-scale DBs where the memory consumption
is not a concern.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5017
Differential Revision: D14270303
Pulled By: riversand963
fbshipit-source-id: 6328df9dffb5e19170270dd00a69f4bbe424e5ed
Summary:
Right now, users can change statistics.stats_level while DB is running, but TSAN may report
data race. We make stats_level_ to be atomic, and access them using accessors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030
Differential Revision: D14267519
Pulled By: siying
fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73
Summary:
This the hackish script we used to find the root cause of failures in transaction stress tests. It is not well-written and does not require rigorous reviewing but it is better than starting from scratch each time we observe an issue. The stress tests would just say that at which snapshots the sum of all the keys in a set is inconsistent with another set. To help debugging one need to know which key exactly returned inconsistent results. The script looks at the transactions between two conflicting snapshots, and performs thee changes manually to see for which key the read value was inconsistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5033
Differential Revision: D14280362
Pulled By: maysamyabandeh
fbshipit-source-id: d5826055c46711460ba81480d96cb5ea082814a5
Summary:
Statistics cost too much CPU for some use cases. Add two stats levels
so that people can choose to skip two types of expensive stats, timers and
histograms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027
Differential Revision: D14252765
Pulled By: siying
fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
Summary:
This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0.
(This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748
Differential Revision: D13961471
Pulled By: miasantreble
fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04
Summary:
MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985
Differential Revision: D14118257
Pulled By: miasantreble
fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea
Summary:
LITE mode has EventListener to be an empty class. However in db_bench,
it is used. When "override" is added to the functions, the build breaks. Fix it
by keeping the listener empty in LITE mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4989
Differential Revision: D14108132
Pulled By: siying
fbshipit-source-id: 80121aab35b1120e502b37b782301dd700692697
Summary:
We introduced ttl option in CompactionOptionsFIFO when ttl-based file
deletion (compaction) was supported only as part of FIFO Compaction. But
with the extension of ttl semantics even to Level compaction,
CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start
using ColumnFamilyOptions.ttl for FIFO compaction as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965
Differential Revision: D14072960
Pulled By: sagar0
fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
Summary:
if an operation just involves a single column family, then we do
not have to set the kInAtomicGroup tag when writing to MANIFEST. This change
can fix a compatibility test failure, i.e. 5.15 and earlier cannot recognize
kInAtomicGroup tag.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4981
Differential Revision: D14072687
Pulled By: riversand963
fbshipit-source-id: 46b0c61e399f16c6b7169de0b33430d0ed90d6d4
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
Summary:
Previously `finalize_and_sanitize` function was always zeroing out `compression_zstd_max_train_bytes`. It was only supposed to do that when non-ZSTD compression was used. But since `--compression_type` was an unknown argument (i.e., one that `db_crashtest.py` does not recognize and blindly forwards to `db_stress`), `finalize_and_sanitize` could not tell whether ZSTD was used. This PR fixes it simply by making `--compression_type` a known argument with snappy as default (same as `db_stress`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4957
Differential Revision: D13994302
Pulled By: ajkr
fbshipit-source-id: 1b0baea7331397822830970d3698642eb7a7df65
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953
Differential Revision: D13979264
Pulled By: siying
fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
Summary:
4985a9f73b (diff-e5276985b26a0551957144f4420a594bR511)
changes the meaning of latency reporting from running time per query, to elapse_time / #ops, without providing a reason why.
Considering that this is a counter-intuitive reporting, Reverting the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4949
Differential Revision: D13964684
Pulled By: siying
fbshipit-source-id: d6304d3d4b5a802daa292302623c7dbca9a680bc
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889
Differential Revision: D13701276
Pulled By: zinoale
fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
Summary:
In the MixGraph benchmark of db_bench #4788 , the char array is initialized with an argument from user's input, which can cause build error on some platforms. Also, the msg char array size can be potentially smaller than the printed data, which should be extended from 100 to 256.
Tested with make check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4918
Differential Revision: D13844298
Pulled By: sagar0
fbshipit-source-id: 33c4809c5c4438f0a9f7b289d3f42e20c545bbab
Summary:
- When building with internal dependencies, specify this toolchain by setting `ROCKSDB_FBCODE_BUILD_WITH_PLATFORM007=1`
- It is not enabled by default. However, it is enabled for TSAN builds in CI since there is a known problem with TSAN in gcc-5: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71090
- I did not add support for Lua since (1) we agreed to deprecate it, and (2) we only have an internal build for v5.3 with this toolchain while that has breaking changes compared to our current version (v5.2).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4923
Differential Revision: D13827226
Pulled By: ajkr
fbshipit-source-id: 9aa3388ed3679777cfb15ef8cbcb83c07f62f947
Summary:
Fixed clang static analyzer warning about division by 0.
```
ar: creating librocksdb_debug.a
tools/db_bench_tool.cc:4650:43: warning: Division by zero
int pos = static_cast<int>(rand_num % range_);
~~~~~~~~~^~~~~~~~
1 warning generated.
make: *** [analyze] Error 1
```
This is from the new code I recently merged in ce8e88d2d7.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4910
Differential Revision: D13788037
Pulled By: sagar0
fbshipit-source-id: f48851dca85047c19fbb1a361e25ce643aa4c7ea
Summary:
Based on the specific workload models (key access distribution, value size distribution, and iterator scan length distribution, the QPS variation), the MixGraph benchmark generate the synthetic workload according to these distributions which can reflect the real-world workload characteristics.
After user enable the tracing function, they will get the trace file. By analyzing the trace file with the trace_analyzer tool, user can generate a set of statistic data files including. The *_accessed_key_stats.txt, *-accessed_value_size_distribution.txt, *-iterator_length_distribution.txt, and *-qps_stats.txt are mainly used to fit the Matlab model fitting. After that, user can get the parameters of the workload distributions (the modeling details are described: [here](https://github.com/facebook/rocksdb/wiki/RocksDB-Trace%2C-Replay%2C-and-Analyzer))
The key access distribution follows the The two-term power model. The probability density function is: `f(x) = ax^{b}+c`. The corresponding parameters are key_dist_a, key_dist_b, and key_dist_c in db_bench
For the value size distribution and iterator scan length distribution, they both follow the Generalized Pareto Distribution. The probability density function is `f(x) = (1/sigma)(1+k*(x-theta)/sigma))^{-1-1/k)`. The parameters are: value_k, value_theta, value_sigma and iter_k, iter_theta, iter_sigma. For more information about the Generalized Pareto Distribution, users can find the [wiki](https://en.wikipedia.org/wiki/Generalized_Pareto_distribution) and [Matalb page](https://www.mathworks.com/help/stats/generalized-pareto-distribution.html)
As for the QPS, it follows the diurnal pattern. So Sine is a good model to fit it. `F(x) = sine_a*sin(sine_b*x + sine_c) + sine_d`. The trace_will tell you the average QPS in the print out resutls, which is sine_d. After user fit the "*-qps_stats.txt" to the Matlab model, user can get the sine_a, sine_b, and sine_c. By using the 4 parameters, user can control the QPS variation including the period, average, changes.
To use the bench mark, user can indicate the following parameters as examples:
```
-benchmarks="mixgraph" -key_dist_a=0.002312 -key_dist_b=0.3467 -value_k=0.9233 -value_sigma=226.4092 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.7 -mix_put_ratio=0.25 -mix_seek_ratio=0.05 -sine_mix_rate_interval_milliseconds=500 -sine_a=15000 -sine_b=1 -sine_d=20000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4788
Differential Revision: D13573940
Pulled By: sagar0
fbshipit-source-id: e184c27e07b4f1bc0b436c2be36c5090c1fb0222
Summary:
This is essentially a re-submission of #4251 with a few improvements:
- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849
Differential Revision: D13606039
Pulled By: ajkr
fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
Summary:
LDB is frequently used to exam data copied. wal_dir in option file is not modified and it usually points to the path it copied from.
The user experience will be better if when ldb sees wal_dir pointed by the option file doesn't exist, rather than fail, just ignore it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4875
Differential Revision: D13643173
Pulled By: siying
fbshipit-source-id: 2e64d4ea2ec49a6794b9a706b7fc1ba901128bb8
Summary:
as titled.
We can remove the assersion because we do not perform verification in
AtomicFlushStressTest::TestCheckpoint for similar reasons to TestGet, TestPut,
etc.
Therefore, we override TestCheckpoint in AtomicFlushStressTest so that the
assertion `rand_column_families.size() == rand_keys.size()' is removed, and we
do not verify the DB in this function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4846
Differential Revision: D13583377
Pulled By: riversand963
fbshipit-source-id: 03647f3da67e27a397413fd666e3bb43003bf596
Summary:
The current implementation hardcode the default options in different
places, which makes it impossible to support other environments (like
encrypted environment).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4839
Differential Revision: D13573578
Pulled By: sagar0
fbshipit-source-id: 76b58b4b758902798d10ff2f52d9f39abff015e7
Summary:
Updated stress test will support testing of db in read-only mode.
The user has to make sure that only read/scan operations are enabled.
This PR relies on #4681.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4690
Differential Revision: D13102741
Pulled By: riversand963
fbshipit-source-id: f5a36b34db187fe12dd355f7eda161f99d6c75e4
Summary:
- To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`.
- Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841
Differential Revision: D13568424
Pulled By: ajkr
fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026
Summary:
Separate flag for enabling option from flag for enabling dedicated atomic stress test. I have found setting the former without setting the latter can detect different problems.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4781
Differential Revision: D13463211
Pulled By: ajkr
fbshipit-source-id: 054f777885b2dc7d5ea99faafa21d6537eee45fd
Summary:
**Summary:**
Simplified the code layout by moving FIFOCompactionPicker to a separate file.
**Why?:**
While trying to add ttl functionality to universal compaction, I found that `FIFOCompactionPicker` class and its impl methods to be interspersed between `LevelCompactionPicker` methods which kind-of made the code a little hard to traverse. So I moved `FIFOCompactionPicker` to a separate compaction_picker_fifo.h/cc file, similar to `UniversalCompactionPicker`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4724
Differential Revision: D13227914
Pulled By: sagar0
fbshipit-source-id: 89471766ea67fa4d87664a41c057dd7df4b3d4e3
Summary:
A user friendly sst file reader is useful when we want to access sst
files outside of RocksDB. For example, we can generate an sst file
with SstFileWriter and send it to other places, then use SstFileReader
to read the file and process the entries in other ways.
Also rename the original SstFileReader to SstFileDumper because of
name conflict, and seems SstFileDumper is more appropriate for tools.
TODO: there is only a very simple test now, because I want to get some feedback first.
If the changes look good, I will add more tests soon.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4717
Differential Revision: D13212686
Pulled By: ajkr
fbshipit-source-id: 737593383264c954b79e63edaf44aaae0d947e56
Summary:
The error message of databases/rocksdb-lite (FreeBSD port) is as follows:
```
tools/db_bench_tool.cc:1976:16: error: private field 'trace_options_' is not used [-Werror,-Wunused-private-field]
TraceOptions trace_options_;
^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4715
Differential Revision: D13207902
Pulled By: ajkr
fbshipit-source-id: be3c612eba656aeddb77e35e2f201dd25dc92f7e
Summary:
The new flag makes it possible to constrain iterator traversal
by the upper/lower bound the iterator is expected to pass. This allows
seekrandom results to be more easily comparable between DBs with and
without deletions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4660
Differential Revision: D13053111
Pulled By: abhimadan
fbshipit-source-id: 33e250f2e2d210b54c7726399da30a33f723c33c
Summary:
When iterator becomes invalid, there are two possibilities.
First, all data in the column family have been scanned and there is nothing
more to scan.
Second, an underlying error has occurred, causing `status()` to be !ok.
Therefore, we need to check for both cases when `!iter->Valid()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4648
Differential Revision: D12959601
Pulled By: riversand963
fbshipit-source-id: 49c9382c9ea9e78f2e2b6f3708f0670b822ca8dd
Summary:
Changes:
1. in current version, key size distribution is printed out as the result. In this change, the result will be output to a file to make further analyze easier
2. To understand how the unique keys are accessed over time, the total unique key number of each CF of each query type in each second over time is output to a file. In this way, user could know when the unique keys are accessed frequently or accessed rarely.
3. output the total QPS of each CF to a file
4. Add the print result of total queries of each CF of each query type.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4646
Differential Revision: D12968156
Pulled By: zhichao-cao
fbshipit-source-id: 6c411c7ec47c7843a70929136efd71a150db0e4c
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
Differential Revision: D12934992
Pulled By: sagar0
fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
Summary:
We already exercised backup functionality in `db_stress` according to the `-backup_one_in` flag. This PR verifies the backup can be restored/opened and sanity checks a few keys. Changes in this PR:
- Extracted existing backup-related logic to a helper function, `TestBackupRestore`
- Added restore logic, which targets a hidden directory named "./.restore\<thread number\>", similar to how backups target hidden directories named "./.backup\<thread number\>".
- After restore, check the existence/non-existence of a few keys.
- With this PR, backup is no longer compatible with clearing column families.
- Also included unrelated fixes to set `ReadOptions::total_order_seek=true` when using `-compare_full_db_state_snapshot`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4655
Differential Revision: D12972496
Pulled By: ajkr
fbshipit-source-id: 481a40052d9a38d1bd5c5159aa4d7c5a4b546b80
Summary:
Originally, the manual flush calls in db_stress flushes only a single column
family, which is not sufficient when atomic flush is enabled.
With atomic flush, we should call `Flush(flush_opts, cfhs)` to better test this
new feature. Specifically, we manuall flush all column families so that
database verification is easier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4608
Differential Revision: D12849160
Pulled By: riversand963
fbshipit-source-id: ae1f0dd825247b42c0aba520a5c967335102c876
Summary:
Since the number of range deletions are reported in
TableProperties, it is confusing to not report the number of merge
operands and point deletions as top-level properties; they are
accessible through the public API, but since they are not the "main"
properties, they do not appear in aggregated table properties, or the
string representation of table properties.
This change promotes those two property keys to
`rocksdb/table_properties.h`, adds corresponding uint64 members for
them, deprecates the old access methods `GetDeletedKeys()` and
`GetMergeOperands()` (though they are still usable for now), and removes
`InternalKeyPropertiesCollector`. The property key strings are the same
as before this change, so this should be able to read DBs written from older
versions (though I haven't tested this yet).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4594
Differential Revision: D12826893
Pulled By: abhimadan
fbshipit-source-id: 9e4e4fbdc5b0da161c89582566d184101ba8eb68
Summary:
Currently there are two contrun test failures:
* rocksdb-contrun-lite:
> tools/db_bench_tool.cc: In function ‘int rocksdb::db_bench_tool(int, char**)’:
tools/db_bench_tool.cc:5814:5: error: ‘DumpMallocStats’ is not a member of ‘rocksdb’
rocksdb::DumpMallocStats(&stats_string);
^
make: *** [tools/db_bench_tool.o] Error 1
* rocksdb-contrun-unity:
> In file included from unity.cc:44:0:
db/range_tombstone_fragmenter.cc: In member function ‘void rocksdb::FragmentedRangeTombstoneIterator::FragmentTombstones(std::unique_ptr<rocksdb::InternalIteratorBase<rocksdb::Slice> >, rocksdb::SequenceNumber)’:
db/range_tombstone_fragmenter.cc:90:14: error: reference to ‘ParsedInternalKeyComparator’ is ambiguous
auto cmp = ParsedInternalKeyComparator(icmp_);
This PR will fix them
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4587
Differential Revision: D10846554
Pulled By: miasantreble
fbshipit-source-id: 8d3358879e105060197b1379c84aecf51b352b93
Summary:
Option to print malloc stats to stdout at the end of db_bench. This is different from `--dump_malloc_stats`, which periodically print the same information to LOG file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4582
Differential Revision: D10520814
Pulled By: yiwu-arbug
fbshipit-source-id: beff5e514e414079d31092b630813f82939ffe5c
Summary:
On MacOS with clang the compilation of _tools/db_bench_tool.cc_ always fails because the format used in a `fprintf` call has the wrong type. This PR should hopefully fix this issue
```
tools/db_bench_tool.cc:4233:61: error: format specifies type 'unsigned long long' but the argument has type 'size_t' (aka 'unsigned long')
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4533
Differential Revision: D10471657
Pulled By: maysamyabandeh
fbshipit-source-id: f20f5f3756d3571b586c895c845d0d4d1e34a398
Summary:
Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.
Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394
Differential Revision: D9926508
Pulled By: riversand963
fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
Summary:
The new flag allows tombstones to be generated after enough
keys have been written to the database, which makes it easier to ensure
that tombstones cover a lot of keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4538
Differential Revision: D10455685
Pulled By: abhimadan
fbshipit-source-id: f25d5421745a353c830dea12b79784e852056551
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226
Differential Revision: D10369509
Pulled By: miasantreble
fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
Summary:
If the query types being analyzed do not appear in the trace, the current trace_analyzer will use 0 as the begin time, which create the time duration from 1970/01/01 to the now time. It will waste huge memory. Fixed by adding the trace_create_time to limit the duration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4473
Differential Revision: D10246204
Pulled By: zhichao-cao
fbshipit-source-id: 42850b080b2e62f586fe73afd7737c2246d1a8c8
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.
We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.
Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437
Differential Revision: D10132814
Pulled By: yiwu-arbug
fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564
Summary:
I guess we didn't update this script when `--allow_concurrent_memtable_write` became true by default.
Fixes#4413.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4428
Differential Revision: D10036452
Pulled By: ajkr
fbshipit-source-id: f464be0642bd096d9040f82cdc3eae614a902183
Summary:
If range tombstones are generated every few writes, the
KeyGenerator's limit is now extended to account for the additional
Next() calls. This is primarily important for `filluniquerandom`
benchmarks that enforce the call limit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4404
Differential Revision: D9949326
Pulled By: abhimadan
fbshipit-source-id: 0bdfeb2cad2098dc0b8b029236dab5e4bef25e38
Summary:
The default for index_block_restart_interval is 1 but some use 16 in production. The patch extends crash test to test both values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4383
Differential Revision: D9887304
Pulled By: maysamyabandeh
fbshipit-source-id: a8d00fea974a79ad563f9f4d9d7b069e9f746a8f
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
Summary:
The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362
Differential Revision: D9817829
Pulled By: ajkr
fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d
Summary:
TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346
Differential Revision: D9759149
Pulled By: maysamyabandeh
fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
Summary:
Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting.
Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347
Differential Revision: D9668365
Pulled By: ajkr
fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd
Summary:
This is a followup to #4311. Checking `!RangeDelAggregator::IsEmpty()` before opening a dedicated range tombstone SST did not properly prevent empty SSTs from being generated. That's because it relies on `CollapsedRangeDelMap::Size`, which had an underflow bug when the map was empty. This PR fixes that underflow bug.
Also fixed an uninitialized variable in db_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4336
Differential Revision: D9600080
Pulled By: ajkr
fbshipit-source-id: bc6980ca79d2cd01b825ebc9dbccd51c1a70cfc7
Summary:
Currently unity-test is failing because both trace_replay.cc and trace_analyzer_tool.cc defined `DecodeCFAndKey` under anonymous namespace. It is supposed to be fine except unity test will dump all source files together and now we have a conflict.
Another issue with trace_analyzer_tool.cc is that it is using some utility functions from ldb_cmd which is not included in Makefile for unity_test, I chose to update TESTHARNESS to include LIBOBJECTS. Feel free to comment if there is a less intrusive way to solve this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4323
Differential Revision: D9599170
Pulled By: miasantreble
fbshipit-source-id: 38765b11f8e7de92b43c63bdcf43ea914abdc029
Summary:
This PR fixes issue 3842. We drop deletion markers iff
1. We are the bottom most level AND
2. All other occurrences of the key are in the same snapshot range as the delete
I've also enhanced db_stress_test to add an option that does a full compare of the keys. This is done by a single thread (thread # 0). For tests I've run (so far)
make check -j64
db_stress
db_stress --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify that new code doesnt break existing tests */
./db_stress --compare_full_db_state_snapshot=true --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify new test code */
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4289
Differential Revision: D9491165
Pulled By: shrikanthshankar
fbshipit-source-id: ce144834f31736c189aaca81bed356ba990331e2
Summary:
In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file.
ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary.
There are a couple other changes included in this PR:
- Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it.
- Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251
Differential Revision: D9257078
Pulled By: ajkr
fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438
Summary:
The API comment on `OnTableFileCreationStarted` (b6280d01f9/include/rocksdb/listener.h (L331-L333)) led users to believe a call to `OnTableFileCreationStarted` will always be matched with a call to `OnTableFileCreated`. However, we were skipping the `OnTableFileCreated` call in one case: no error happens but also no file is generated since there's no data.
This PR adds the call to `OnTableFileCreated` for that case. The filename will be "(nil)" and the size will be zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4307
Differential Revision: D9485201
Pulled By: ajkr
fbshipit-source-id: 2f077ec7913f128487aae2624c69a50762394df6
Summary:
Add the unit test of Iterator (Seek and SeekForPrev) to trace_analyzer_test. The output files after analyzing the trace file are checked to make sure that analyzing results are correct.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4282
Differential Revision: D9436758
Pulled By: zhichao-cao
fbshipit-source-id: 88d471c9a69e07382d9c6a45eba72773b171e7c2
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039
Differential Revision: D8670178
Pulled By: riversand963
fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a