rocksdb

Author	SHA1	Message	Date
andrew	622683000c	Allow users to stop manual compactions (#3971 ) Summary: Manual compaction may bring in very high load because sometime the amount of data involved in a compaction could be large, which may affect online service. So it would be good if the running compaction making the server busy can be stopped immediately. In this implementation, stopping manual compaction condition is only checked in slow process. We let deletion compaction and trivial move go through. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971 Test Plan: add tests at more spots. Differential Revision: D17369043 fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c	2019-09-16 21:01:47 -07:00
Maysam Yabandeh	638d239507	Charge block cache for cache internal usage (#5797 ) Summary: For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor. This PR builds up on https://github.com/facebook/rocksdb/issues/4258 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797 Test Plan: - Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata. - The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata. Differential Revision: D17396833 Pulled By: maysamyabandeh fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f	2019-09-16 15:26:21 -07:00
Yanqin Jin	5d9a67e718	Support loading custom objects in unit tests (#5676 ) Summary: Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc. This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning. By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop. Test plan (on devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 all $make check ``` All unit tests must pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676 Differential Revision: D16679157 Pulled By: riversand963 fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d	2019-08-09 15:12:08 -07:00
Vijay Nadimpalli	d150e01474	New API to get all merge operands for a Key (#5604 ) Summary: This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases: 1. Update subset of columns and read subset of columns - Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU. 2. Updating very few attributes in a value which is a JSON-like document - Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge. ---------------------------------------------------------------------------------------------------- API : Status GetMergeOperands( const ReadOptions& options, ColumnFamilyHandle* column_family, const Slice& key, PinnableSlice* merge_operands, GetMergeOperandsOptions* get_merge_operands_options, int* number_of_operands) Example usage : int size = 100; int number_of_operands = 0; std::vector<PinnableSlice> values(size); GetMergeOperandsOptions merge_operands_info; db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands); Description : Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion. merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604 Test Plan: Added unit test and perf test in db_bench that can be run using the command: ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist Differential Revision: D16657366 Pulled By: vjnadimpalli fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf	2019-08-06 14:26:44 -07:00
sdong	66b5613d0c	row_cache to share entry for recent snapshots (#5600 ) Summary: Right now, users cannot take advantage of row cache, unless no snapshot is used, or Get() is repeated for the same snapshots. This limits the usage of row cache. This change eliminate this restriction in some cases. If the snapshot used is newer than the largest sequence number in the file, and write callback function is not registered, the same row cache key is used as no snapshot is given. We still need the callback function restriction for now because the callback function may filter out different keys for different snapshots even if the snapshots are new. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5600 Test Plan: Add a unit test. Differential Revision: D16386616 fbshipit-source-id: 6b7d214bd215d191b03ccf55926ad4b703ec2e53	2019-07-22 18:56:19 -07:00
Sagar Vemuri	1b59a490ef	Fix flaky DBTest2.PresetCompressionDict test (#5378 ) Summary: Fix flaky DBTest2.PresetCompressionDict test. This PR fixes two issues with the test: 1. Replaces `GetSstFiles` with `TotalSize`, which is based on `DB::GetColumnFamilyMetaData` so that only the size of the live SST files is taken into consideration when computing the total size of all sst files. Earlier, with `GetSstFiles`, even obsolete files were getting picked up. 1. In ZSTD compression, it is sometimes possible that using a trained dictionary is not better than using an untrained one. Using a trained dictionary performs well in 99% of the cases, but still in the remaining ~1% of the cases (out of 10000 runs) using an untrained dictionary gets better compression results. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5378 Differential Revision: D15559100 Pulled By: sagar0 fbshipit-source-id: c35adbf13871f520a2cec48f8bad9ff27ff7a0b4	2019-05-30 16:11:27 -07:00
Siying Dong	4e0f2aadb0	DB::Close() to fail when there are unreleased snapshots (#5272 ) Summary: Sometimes, users might make mistake of not releasing snapshots before closing the DB. This is undocumented use of RocksDB and the behavior is unknown. We return DB::Close() to provide a way to check it for the users. Aborted() will be returned to users when they call DB::Close(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5272 Differential Revision: D15159713 Pulled By: siying fbshipit-source-id: 39369def612398d9f239d83d396b5a28e5af65cd	2019-05-01 10:17:30 -07:00
Zhongyi Xie	baa5302447	Avoid double-compacting data in bottom level in manual compactions (#5138 ) Summary: Depending on the config, manual compaction (leveled compaction style) does following compactions: L0->L1 L1->L2 ... Ln-1 -> Ln Ln -> Ln The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only. Resolves issue https://github.com/facebook/rocksdb/issues/4995 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138 Differential Revision: D14940106 Pulled By: miasantreble fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5	2019-04-16 23:32:20 -07:00
Siying Dong	beb44ec3eb	WriteBufferManager's dummy entry size to block cache 1MB -> 256KB (#5175 ) Summary: Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget. Change it to 256KB accordingly. There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175 Differential Revision: D14954289 Pulled By: siying fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94	2019-04-16 12:03:07 -07:00
Siying Dong	85b2bde3dd	Still implement StatisticsImpl::measureTime() (#5181 ) Summary: Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() is not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented as forwarded to StatisticsImpl, and causes assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether we need to remove it after several releases. Also, add a test to make sure that a Statistics implementation using the old interface still works. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181 Differential Revision: D14907089 Pulled By: siying fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a	2019-04-12 11:00:35 -07:00
Siying Dong	ed9f5e21aa	Change OptimizeForPointLookup() and OptimizeForSmallDb() (#5165 ) Summary: Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory. Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165 Differential Revision: D14880709 Pulled By: siying fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d	2019-04-11 10:45:36 -07:00
Zhichao Cao	ebb9b2ed16	Fix the potential DB crash caused by call EndTrace before StartTrace (#5130 ) Summary: Although user should first call StartTrace to begin the RocksDB tracing function and call EndTrace to stop the tracing process, user can accidentally call EndTrace first. It will cause segment fault and crash the DB instance. The issue is fixed by checking the pointer first. Test case added in db_test2. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5130 Differential Revision: D14691420 Pulled By: zhichao-cao fbshipit-source-id: 3be13d2f944bc453728ef8eef67b68d7ad0939c8	2019-04-03 13:26:34 -07:00
Maysam Yabandeh	14b3f683a1	WriteUnPrepared: less virtual in iterator callback (#5049 ) Summary: WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads. The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call. Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots. The following benchmark shows +0.26% higher throughput in seekrandom benchmark. Benchmark: ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 seekrandom [AVG 10 runs] : 20355 ops/sec; 225.2 MB/sec seekrandom [MEDIAN 10 runs] : 20425 ops/sec; 225.9 MB/sec ./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100 seekrandom [AVG 10 runs] : 20409 ops/sec; 225.8 MB/sec seekrandom [MEDIAN 10 runs] : 20487 ops/sec; 226.6 MB/sec Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049 Differential Revision: D14366459 Pulled By: maysamyabandeh fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef	2019-04-02 14:47:16 -07:00
Shi Feng	01e6badbb6	Introduce CPU timers for iterator seek and next (#5076 ) Summary: Introduce CPU timers for iterator seek and next operations. Seek counter includes SeekToFirst, SeekToLast and SeekForPrev, w/ the caveat that SeekToLast timer doesn't include some post processing time if upper bound is defined. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5076 Differential Revision: D14525218 Pulled By: fredfsh fbshipit-source-id: 03ba25df3b22b06c072621e4de0eacfa1445f0d9	2019-03-26 16:32:13 -07:00
Wenjie Yang	36c2a7cfb1	Add an option to filter traces (#5082 ) Summary: Add an option to filter out READ or WRITE operations while tracing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082 Differential Revision: D14515083 Pulled By: mrmiywj fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610	2019-03-19 14:36:51 -07:00
Maysam Yabandeh	a661c0d208	WritePrepared: optimize read path by avoiding virtual (#5018 ) Summary: The read path includes a callback function, ReadCallback, which would eventually calls IsInSnapshot to figure if a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function call to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of majority of the cases. The patch moves those to a non-virtual function in the the parent class, ReadCallback, to lower the virtual callback cost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018 Differential Revision: D14226562 Pulled By: maysamyabandeh fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44	2019-02-26 16:56:19 -08:00
Siying Dong	93f7e7a450	Temporarily Disable DBTest2.PresetCompressionDict (#5003 ) Summary: DBTest2.PresetCompressionDict is flaky. Temparily disable it for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5003 Differential Revision: D14139505 Pulled By: siying fbshipit-source-id: ebf1872d364b76b2cb021b489ea2f17ee997116a	2019-02-19 14:44:12 -08:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Andrew Kryczka	62f70f6d14	Reduce scope of compression dictionary to single SST (#4952 ) Summary: Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio. So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include: - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called. - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up. - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952 Differential Revision: D13967980 Pulled By: ajkr fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f	2019-02-11 19:47:32 -08:00
tang-jianfeng	08809f5e6c	Implement trace sampling (#4963 ) Summary: Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963 Differential Revision: D14011190 Pulled By: tang-jianfeng fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a	2019-02-08 18:08:18 -08:00
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	2019-01-18 19:12:57 -08:00
Maysam Yabandeh	d56ac22b44	Remove duplicates from SnapshotList::GetAll (#4860 ) Summary: The vector returned by SnapshotList::GetAll could have duplicate entries if two separate snapshots have the same sequence number. However, when this vector is used in compaction the duplicate entires are of no use and could be safely ignored. Moreover not having duplicate entires simplifies reasoning in the compaction_iterator.cc code. For example when searching for the previous_snap we currently use the snapshot before the current one but the way the code uses that it expects it to be also less than the current snapshot, which would be simpler to read if there is no duplicate entry in the snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4860 Differential Revision: D13615502 Pulled By: maysamyabandeh fbshipit-source-id: d45bf01213ead5f39db811f951802da6fcc3332b	2019-01-09 16:25:42 -08:00
Siying Dong	8641e9adf7	Non-initial file preloading should always prefetch index and filter (#4852 ) Summary: https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1. It doesn't preload index and filter in non-initial file loading case. This is a little bit too complicated to understand. We observed in one MyRocks use case where the filter is expected to be preloaded but is not. To simplify the use case, we simply always prefetch the index and filter. They anyway is expected to be loaded in the file verification phase anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852 Differential Revision: D13595402 Pulled By: siying fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31	2019-01-08 12:47:34 -08:00
Siying Dong	f0dda35d7d	Preload some files even if options.max_open_files (#3340 ) Summary: Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340 Differential Revision: D6686945 Pulled By: siying fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f	2018-12-28 18:02:28 -08:00
Siying Dong	da1c64b6e7	Introduce a CPU time counter in perf_context (#4741 ) Summary: Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future. Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform. Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741 Differential Revision: D13287798 Pulled By: siying fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f	2018-12-20 12:03:44 -08:00
Zhichao Cao	7125e24619	Add the max trace file size limitation option to Tracing (#4610 ) Summary: If user do not end the trace manually, the tracing will continue which can potential use up all the storage space and cause problem. In this PR, the max trace file size is added to the TraceOptions and user can set the value if they need or the default is 64GB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610 Differential Revision: D12893400 Pulled By: zhichao-cao fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953	2018-11-27 14:27:05 -08:00
Siying Dong	13579e8c5a	WriteBufferManger doens't cost to cache if no limit is set (#4695 ) Summary: WriteBufferManger is not invoked when allocating memory for memtable if the limit is not set even if a cache is passed. It is inconsistent from the comment syas. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4695 Differential Revision: D13112722 Pulled By: siying fbshipit-source-id: 0b27eef63867f679cd06033ea56907c0569597f4	2018-11-18 16:55:43 -08:00
Siying Dong	abb1a8fc23	Add a unit test to assert number of preads (#4657 ) Summary: We used to have a bug, which caused every block to be read twice, and none of our tests caught it. Add a very simply unit test to make sure that when reading a data block, we only issue one pread against the SST file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4657 Differential Revision: D13005260 Pulled By: siying fbshipit-source-id: 03167b554ad2451192b1707415536d7d05e9026c	2018-11-13 12:52:19 -08:00
Sagar Vemuri	2993cd2002	Fix RocksDB Lite build (#4675 ) Summary: Our internal CI test caught RocksDB Lite build failures. The failures are due to a new test introduced in #4665 using `SSTFileWriter` and `IngestExternalFile`, but these is not exposed under lite mode. Fixed by #ifdef'ing out the test. ``` db/db_test2.cc: In member function ‘virtual void rocksdb::DBTest2_TestCompactFiles_Test::TestBody()’: db/db_test2.cc:2907:3: error: ‘SstFileWriter’ is not a member of ‘rocksdb’ rocksdb::SstFileWriter sst_file_writer{rocksdb::EnvOptions(), options}; ^ In file included from ./util/testharness.h:15:0, from ./table/mock_table.h:23, from ./db/db_test_util.h:44, from db/db_test2.cc:13: db/db_test2.cc:2912:13: error: ‘sst_file_writer’ was not declared in this scope ASSERT_OK(sst_file_writer.Open(external_file1)); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4675 Differential Revision: D13035984 Pulled By: sagar0 fbshipit-source-id: c1ceac550dfac1a85eeea436693dc7dd467519a6	2018-11-12 19:01:37 -08:00
DorianZheng	0f88160f67	Fix `CompactFiles` bug (#4665 ) Summary: `CompactFiles` gets `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names` The timeline of execution flow is as follow: Let's say that level N has two file [1,2] and [5,6] ``` timeline user_thread1 user_thread2 t0 \| CompactFiles([1, 2], [5, 6]) begin t1 \| GetReferencedSuperVersion() t2 \| IngestExternalFile([3,4]) to level N begin t3 \| CompactFiles resume V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665 Differential Revision: D13030674 Pulled By: ajkr fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51	2018-11-12 14:32:18 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Peter Pei	09814f2cfc	support OnCompactionBegin (#4431 ) Summary: fix #4288 Add `OnCompactionBegin` support to `rocksdb::EventListener`. Currently, we only have these three callbacks: - OnFlushBegin - OnFlushCompleted - OnCompactionCompleted As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`. This PR is a try to implement the support of `OnCompactionBegin`. Hope it is useful to you. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431 Differential Revision: D10055515 Pulled By: yiwu-arbug fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334	2018-10-10 17:32:27 -07:00
DorianZheng	27090ae8f6	Fix DBImpl::GetColumnFamilyHandleUnlocked race condition (#4391 ) Summary: - Fix DBImpl API race condition The timeline of execution flow is as follow: ``` timeline user_thread1 user_thread2 t1 \| cfh = GetColumnFamilyHandleUnlocked(0) t2 \| id1 = cfh->GetID() t3 \| GetColumnFamilyHandleUnlocked(1) t4 \| id2 = cfh->GetID() V ``` The original implementation return a pointer to a stateful variable, so that the return `ColumnFamilyHandle` will be changed when another thread calls `GetColumnFamilyHandleUnlocked` with different `column family id` - Expose ColumnFamily ID to compaction event listener - Fix the return status of `DBImpl::GetLatestSequenceForKey` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4391 Differential Revision: D10221243 Pulled By: yiwu-arbug fbshipit-source-id: dec60ee9ff0c8261a2f2413a8506ec1063991993	2018-10-08 14:24:16 -07:00
Sagar Vemuri	3db584059c	Remove sync point from Block destructor (#4370 ) Summary: AddressSanitizer: heap-use-after-free in std::__atomic_base<bool>::load(std::memory_order) const ==1798517==ABORTING ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4370 Differential Revision: D9844146 Pulled By: sagar0 fbshipit-source-id: 18a2970b1d504b4f6c8fb04857f26e0f32124dd1	2018-09-15 00:12:57 -07:00
Siying Dong	f3d91a0b57	Add a unit test to verify iterators release data blocks after using them (#4170 ) Summary: Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170 Differential Revision: D8962513 Pulled By: siying fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407	2018-08-13 17:43:14 -07:00
Zhichao Cao	6d75319d95	Add tracing function of Seek() and SeekForPrev() to trace_replay (#4228 ) Summary: In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file. Replay of Iterator is not supported in the current implementation. Tested with trace_analyzer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228 Differential Revision: D9201381 Pulled By: zhichao-cao fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab	2018-08-10 17:57:40 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Maysam Yabandeh	0a5b5d88b2	Remove ReadOnly part of PinnableSliceAndMmapReads from Lite (#4070 ) Summary: Lite does not support readonly DBs. Closes https://github.com/facebook/rocksdb/pull/4070 Differential Revision: D8677858 Pulled By: maysamyabandeh fbshipit-source-id: 536887d2363ee2f5d8e1ea9f1a511e643a1707fa	2018-06-28 08:42:17 -07:00
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	2018-06-27 17:13:34 -07:00
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	2018-06-27 12:23:07 -07:00
Andrew Kryczka	fea2b1dfb2	Copy Get() result when file reads use mmap Summary: For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region. For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`). This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied. This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime. Closes https://github.com/facebook/rocksdb/pull/3881 Differential Revision: D8076288 Pulled By: ajkr fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08	2018-06-01 16:57:58 -07:00
Andrew Kryczka	072ae671a7	Apply use_direct_io_for_flush_and_compaction to writes only Summary: Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used. This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously. Closes https://github.com/facebook/rocksdb/pull/3829 Differential Revision: D7915443 Pulled By: ajkr fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279	2018-05-09 19:42:58 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Phani Shekhar Mantripragada	446b32cfc3	Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d	2018-04-05 19:58:20 -07:00
Dmitri Smirnov	c364eb42b5	Windows cumulative patch Summary: This patch addressed several issues. Portability including db_test std::thread -> port::Thread Cc: @ and %z to ROCKSDB portable macro. Cc: maysamyabandeh Implement Env::AreFilesSame Make the implementation of file unique number more robust Get rid of C-runtime and go directly to Windows API when dealing with file primitives. Implement GetSectorSize() and aling unbuffered read on the value if available. Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976 Fix test running script issue where $status var was of incorrect scope so the failures were swallowed and not reported. DestroyDB() creates a logger and opens a LOG file in the directory being cleaned up. This holds a lock on the folder and the cleanup is prevented. This fails one of the checkpoin tests. We observe the same in production. We close the log file in this change. Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test attempts to open a directory with NewRandomAccessFile which does not work on Windows. Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug Closes https://github.com/facebook/rocksdb/pull/3552 Differential Revision: D7156304 Pulled By: siying fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b	2018-03-06 11:57:43 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Andrew Kryczka	ab5ab36ac2	fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose file ID support check Summary: Updated the test case to handle tmpfs mounted at directories different from "/dev/shm/". Closes https://github.com/facebook/rocksdb/pull/3440 Differential Revision: D6848213 Pulled By: ajkr fbshipit-source-id: 465e9dbf0921d0930161f732db6b3766bb030589	2018-01-30 16:50:42 -08:00
Andrew Kryczka	46e599fc6b	fix live WALs purged while file deletions disabled Summary: When calling `DisableFileDeletions` followed by `GetSortedWalFiles`, we guarantee the files returned by the latter call won't be deleted until after file deletions are re-enabled. However, `GetSortedWalFiles` didn't omit files already planned for deletion via `PurgeObsoleteFiles`, so the guarantee could be broken. We fix it by making `GetSortedWalFiles` wait for the number of pending purges to hit zero if file deletions are disabled. This condition is eventually met since `PurgeObsoleteFiles` is guaranteed to be called for the existing pending purges, and new purges cannot be scheduled while file deletions are disabled. Once the condition is met, `GetSortedWalFiles` simply returns the content of DB and archive directories, which nobody can delete (except for deletion scheduler, for which I plan to fix this bug later) until deletions are re-enabled. Closes https://github.com/facebook/rocksdb/pull/3341 Differential Revision: D6681131 Pulled By: ajkr fbshipit-source-id: 90b1e2f2362ea9ef715623841c0826611a817634	2018-01-17 17:42:04 -08:00
Andrew Kryczka	93f69cb93a	use bottommost compression when base level is bottommost Summary: The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen: - full compaction generates files with `bottommost_compression` type - now base level is bottommost level since all files are in the same level - any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level I changed the code to make bottommost compression apply to base level. Closes https://github.com/facebook/rocksdb/pull/3141 Differential Revision: D6264614 Pulled By: ajkr fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82	2017-11-09 17:42:00 -08:00
Andrew Kryczka	24ad430600	pass key/value samples through zstd compression dictionary generator Summary: Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before. Note this is the first step of #2987, extracted into a separate PR per reviewer request. Closes https://github.com/facebook/rocksdb/pull/3057 Differential Revision: D6116891 Pulled By: ajkr fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497	2017-11-02 22:56:36 -07:00
Yi Wu	8e63cad078	fix lite build Summary: * make `checksum_type_string_map` available for lite * comment out `FilesPerLevel` in lite mode. * travis and legocastle lite build also build `all` target and run tests Closes https://github.com/facebook/rocksdb/pull/3015 Differential Revision: D6069822 Pulled By: yiwu-arbug fbshipit-source-id: 9fe92ac220e711e9e6ed4e921bd25ef4314796a0	2017-10-17 08:57:09 -07:00
Maysam Yabandeh	f46464d383	write-prepared txn: call IsInSnapshot Summary: This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot. Closes https://github.com/facebook/rocksdb/pull/2850 Differential Revision: D5787375 Pulled By: maysamyabandeh fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c	2017-09-11 09:14:48 -07:00
Artem Danilov	8a6708f5f2	Extend property map with compaction stats Summary: This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch. Closes https://github.com/facebook/rocksdb/pull/2794 Differential Revision: D5717676 Pulled By: Tema fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a	2017-08-30 15:26:55 -07:00
Yi Wu	3c840d1a6d	Allow DB reopen with reduced options.num_levels Summary: Allow user to reduce number of levels in LSM by issue a full CompactRange() and put the result in a lower level, and then reopen DB with reduced options.num_levels. Previous this will fail on reopen on when recovery replaying the previous MANIFEST and found a historical file was on a higher level than the new options.num_levels. The workaround was after CompactRange(), reopen the DB with old num_levels, which will create a new MANIFEST, and then reopen the DB again with new num_levels. This patch relax the check of levels during recovery. It allows DB to open if there was a historical file on level > options.num_levels, but was also deleted. Closes https://github.com/facebook/rocksdb/pull/2740 Differential Revision: D5629354 Pulled By: yiwu-arbug fbshipit-source-id: 545903f6b36b6083e8cbaf777176aef2f488021d	2017-08-24 16:10:54 -07:00
Siying Dong	666a005f9b	Support prefetch last 512KB with direct I/O in block based file reader Summary: Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too. Closes https://github.com/facebook/rocksdb/pull/2708 Differential Revision: D5593091 Pulled By: siying fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7	2017-08-11 12:16:45 -07:00
Siying Dong	e7697b8ce8	Fix LITE unit tests Summary: Closes https://github.com/facebook/rocksdb/pull/2649 Differential Revision: D5505778 Pulled By: siying fbshipit-source-id: 7e935603ede3d958ea087ed6b8cfc4121e8797bc	2017-07-26 21:11:47 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Ewout Prangsma	51778612c9	Encryption at rest support Summary: This PR adds support for encrypting data stored by RocksDB when written to disk. It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files. The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done. Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!). The Counter operation mode uses an initial counter & random initialization vector (IV). Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize). The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there. To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests. Typically you would run it like this: ``` ENCRYPTED_ENV=1 make check_some ``` There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings. Closes https://github.com/facebook/rocksdb/pull/2424 Differential Revision: D5322178 Pulled By: sdwilsh fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f	2017-06-26 16:56:24 -07:00
Andrew Kryczka	c217e0b9c7	Call RateLimiter for compaction reads Summary: Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed. Closes https://github.com/facebook/rocksdb/pull/2433 Differential Revision: D5216946 Pulled By: ajkr fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e	2017-06-13 14:56:46 -07:00
Siying Dong	52a7f38b19	WriteOptions.low_pri which can throttle low pri writes if needed Summary: If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown. Closes https://github.com/facebook/rocksdb/pull/2369 Differential Revision: D5127619 Pulled By: siying fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0	2017-06-05 15:02:35 -07:00
Siying Dong	95b0e89b5d	Improve write buffer manager (and allow the size to be tracked in block cache) Summary: Improve write buffer manager in several ways: 1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower. 2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit. 3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable. Closes https://github.com/facebook/rocksdb/pull/2350 Differential Revision: D5110648 Pulled By: siying fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017	2017-06-02 14:26:56 -07:00
Andrew Kryczka	bb01c1880c	Introduce max_background_jobs mutable option Summary: - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure. - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions. - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled. Closes https://github.com/facebook/rocksdb/pull/2205 Differential Revision: D4937461 Pulled By: ajkr fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9	2017-05-24 11:29:08 -07:00
Aaron Gao	492fc49a86	fix readampbitmap tests Summary: fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose. test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check Closes https://github.com/facebook/rocksdb/pull/2271 Differential Revision: D5038133 Pulled By: lightmark fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8	2017-05-10 12:29:23 -07:00
Andrew Kryczka	be421b0b16	portable sched_getcpu calls Summary: - added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available. - refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred Closes https://github.com/facebook/rocksdb/pull/2272 Differential Revision: D5038093 Pulled By: ajkr fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7	2017-05-10 12:29:23 -07:00
Aaron Gao	259a00eaca	unbiase readamp bitmap Summary: Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a) and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31. An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block. It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i32, (i+1)32). This diff makes each bit represent a single byte: i32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte. But there is one exception: the last bit will always set with the old way.* (*) - assuming read_amp_bytes_per_bit = 32. Closes https://github.com/facebook/rocksdb/pull/2259 Differential Revision: D5035652 Pulled By: lightmark fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356	2017-05-10 01:49:52 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Tomas Kolda	04d58970cb	AIX and Solaris Sparc Support Summary: Replacement of #2147 The change was squashed due to a lot of conflicts. Closes https://github.com/facebook/rocksdb/pull/2194 Differential Revision: D4929799 Pulled By: siying fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581	2017-04-21 20:48:04 -07:00
Aaron Gao	44fa8ece9b	change use_direct_writes to use_direct_io_for_flush_and_compaction Summary: Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get(). Closes https://github.com/facebook/rocksdb/pull/2117 Differential Revision: D4860912 Pulled By: lightmark fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19	2017-04-13 16:12:04 -07:00
Sagar Vemuri	6a6723ee1e	Move MergeOperatorPinning tests to be with other merge operator tests Summary: Moved MergeOperatorPinning tests from db_test2.cc to db_merge_operator_test.cc. [This is the same code as PR #2104 , which has already been reviewed, but I am creating a new PR as I cannot import from #2104 onto phabricator anymore even after rebasing. I'll close and discard #2104.] Closes https://github.com/facebook/rocksdb/pull/2125 Differential Revision: D4863312 Pulled By: sagar0 fbshipit-source-id: 0f71a7690aa09c1d03ee85ce2bc1d2d89e4f4399	2017-04-11 16:15:06 -07:00
Maysam Yabandeh	8b0097b49b	Readers for partition filter Summary: This is the last split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes the reader part as well as the tests. Closes https://github.com/facebook/rocksdb/pull/1961 Differential Revision: D4672216 Pulled By: maysamyabandeh fbshipit-source-id: 6a2b829	2017-03-22 09:24:15 -07:00
Siying Dong	8f5bf04468	Flush triggered by DB write buffer size picks the oldest unflushed CF Summary: Previously, when DB write buffer size triggers, we always pick the CF with most data in its memtable to flush. This approach can minimize total flush happens. Change the behavior to always pick the oldest unflushed CF, which makes it the same behavior when max_total_wal_size hits. This approach will minimize size used by max_total_wal_size. Closes https://github.com/facebook/rocksdb/pull/1987 Differential Revision: D4703214 Pulled By: siying fbshipit-source-id: 9ff8b09	2017-03-21 11:09:10 -07:00
Sagar Vemuri	97edc72d39	Add a memtable-only iterator Summary: This PR is to support a way to iterate over all the keys that are just in memtables. Closes https://github.com/facebook/rocksdb/pull/1953 Differential Revision: D4663500 Pulled By: sagar0 fbshipit-source-id: 144e177	2017-03-07 11:54:10 -08:00
Aaron Gao	286a36db7f	posix writablefile truncate Summary: we occasionally missing this call so the file size will be wrong Closes https://github.com/facebook/rocksdb/pull/1894 Differential Revision: D4598446 Pulled By: lightmark fbshipit-source-id: 42b6ef5	2017-02-22 10:09:14 -08:00
Dmitri Smirnov	0a4cdde50a	Windows thread Summary: introduce new methods into a public threadpool interface, - allow submission of std::functions as they allow greater flexibility. - add Joining methods to the implementation to join scheduled and submitted jobs with an option to cancel jobs that did not start executing. - Remove ugly `#ifdefs` between pthread and std implementation, make it uniform. - introduce pimpl for a drop in replacement of the implementation - Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread. - Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation. - should be no functionality changes. Closes https://github.com/facebook/rocksdb/pull/1823 Differential Revision: D4492902 Pulled By: siying fbshipit-source-id: c74cb11	2017-02-06 14:54:18 -08:00
Siying Dong	036d668b19	Fix wrong result in data race case related to Get() Summary: In theory, Get() can get a wrong result, if it races in a special with with flush. The bug can be reproduced in DBTest2.GetRaceFlush. Fix this bug by getting snapshot after referencing the super version. Closes https://github.com/facebook/rocksdb/pull/1816 Differential Revision: D4475958 Pulled By: siying fbshipit-source-id: bd9e67a	2017-02-03 11:39:15 -08:00
Siying Dong	f25f1ec60b	Add test DBTest2.GetRaceFlush which can expose a data race bug Summary: A current data race issue in Get() and Flush() can cause a Get() to return wrong results when a flush happened in the middle. Disable the test for now. Closes https://github.com/facebook/rocksdb/pull/1813 Differential Revision: D4472310 Pulled By: siying fbshipit-source-id: 5755ebd	2017-01-26 16:39:14 -08:00
Siying Dong	0e8dfd6062	Fix OptimizeForPointLookup() Summary: If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloomo filter instead. Closes https://github.com/facebook/rocksdb/pull/1791 Differential Revision: D4442836 Pulled By: siying fbshipit-source-id: bf6c9cd	2017-01-20 10:54:12 -08:00
Yi Wu	5d1457dbbf	Dump persistent cache options Summary: Dump persistent cache options Closes https://github.com/facebook/rocksdb/pull/1679 Differential Revision: D4337019 Pulled By: yiwu-arbug fbshipit-source-id: 3812f8a	2016-12-19 14:09:12 -08:00
Karthikeyan Radhakrishnan	4118e13330	Persistent Cache: Expose stats to user via public API Summary: Exposing persistent cache stats (counters) to the user via public API. Closes https://github.com/facebook/rocksdb/pull/1485 Differential Revision: D4155274 Pulled By: siying fbshipit-source-id: 30a9f50	2016-11-21 17:39:13 -08:00
Siying Dong	972e3ff295	Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default Summary: Closes https://github.com/facebook/rocksdb/pull/1496 Differential Revision: D4168080 Pulled By: siying fbshipit-source-id: 056ae62	2016-11-16 09:39:09 -08:00
Maysam Yabandeh	361010d447	Exporting compaction stats in the form of a map Summary: Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present the stats in any format required by the them. Closes https://github.com/facebook/rocksdb/pull/1477 Differential Revision: D4149836 Pulled By: maysamyabandeh fbshipit-source-id: b3df19f	2016-11-11 20:54:14 -08:00
Islam AbdelRahman	5691a1d8a4	Fix compaction conflict with running compaction Summary: Issue scenario: (1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2 (2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files) (3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile() Test Plan: unit tests Reviewers: sdong Reviewed By: sdong Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D64947	2016-10-13 10:49:06 -07:00
Aaron Gao	715256338a	forbid merge during recovery Summary: Mitigate regression bug of options.max_successive_merges hit during DB Recovery For https://reviews.facebook.net/D62625 Test Plan: make all check Reviewers: horuff, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62655	2016-09-21 11:05:07 -07:00
sdong	b666f85445	Consider more factors when determining preallocation size of WAL files Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller. Test Plan: Add a unit test. Reviewers: andrewkr, yiwu Reviewed By: yiwu Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63957	2016-09-19 12:04:35 -07:00
sdong	607628d349	Support ZSTD with finalized format Summary: ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility. Still keep ZSTDNotFinal for compatibility reason. Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D63141	2016-09-06 12:22:16 -07:00
sdong	32149059f9	Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes. Test Plan: Add two new unit tests. Run all existing tests, including jtest. Reviewers: yhchiang, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59829	2016-09-01 14:33:24 -07:00
Islam AbdelRahman	b49b92cf28	Introduce Read amplification bitmap (read amp statistics) Summary: Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read the bitmap will contain (block_size / read_amp_bytes_per_bit) bits. We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification Test Plan: added new tests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: yiwu, leveldb, march, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58707	2016-08-26 18:55:58 -07:00
sdong	dade61ac26	Mitigate regression bug of options.max_successive_merges hit during DB Recovery Summary: After `1b8a2e8fdd`, DB Pointer is passed to WriteBatchInternal::InsertInto() while DB recovery. This can cause deadlock if options.max_successive_merges hits. In that case DB::Get() will be called. Get() will try to acquire the DB mutex, which is already held by the DB::Open(), causing a deadlock condition. This commit mitigates the problem by not passing the DB pointer unless 2PC is allowed. Test Plan: Add a new test and run it. Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, horuff Reviewed By: kradhakrishnan Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D62625	2016-08-25 17:30:34 -07:00
Islam AbdelRahman	d11c09d9e2	Eliminate memcpy from ForwardIterator Summary: This diff update ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators This diff is stacked on D61305 Test Plan: existing tests (updated them to test tailing iterator) new test Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60009	2016-08-11 19:10:16 -07:00
omegaga	e70020e4f6	Only cache level 0 indexes and filter when opening table reader Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1). Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?) Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59913	2016-07-20 11:23:31 -07:00
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice> This diff is stacked on top of D56493 and D56511 In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	2016-07-20 09:49:03 -07:00
omegaga	cd4178a015	Add a new feature to enforce a sync point only active on a thread Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point. Test Plan: Write a unit test to validate. Reviewers: sdong, IslamAbdelRahman, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60375	2016-07-07 11:29:14 -07:00
sdong	32df9733d1	Add options.write_buffer_manager: control total memtable size across DB instances Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances. Test Plan: Add a new unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59925	2016-07-05 18:11:25 -07:00

1 2 3 4

169 Commits