rocksdb

Author	SHA1	Message	Date
Vijay Nadimpalli	931c9df886	Use separate status code for column family drop and db shutdown in progress (#5275 ) Summary: Currently RocksDB uses Status::ShutdownInProgress to inform about column family drop. I would like to have a separate Status code for this event. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/status.h#L55 Comment on this: `abc4202e47/db/version_set.cc (L2742)`:L2743 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5275 Differential Revision: D15204583 Pulled By: vjnadimpalli fbshipit-source-id: 95e99e34b27bc165b554ecb8a48a7f8e60f21e2a	2019-05-20 10:47:32 -07:00
Zhichao Cao	a13026fb2f	Added trace replay fast forward function (#5273 ) Summary: In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273 Differential Revision: D15389232 Pulled By: zhichao-cao fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806	2019-05-16 20:21:18 -07:00
Maysam Yabandeh	f0e8216197	WritePrepared: Fix deadlock in WriteRecoverableState (#5306 ) Summary: The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to the commit table, which could result in evicting an existing commit entry, which could result in advancing max_evicted_seq_, which would need to get snapshots from the database, which requires obtaining the db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306 Differential Revision: D15341458 Pulled By: maysamyabandeh fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323	2019-05-15 13:53:54 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list. For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction is sharing the same snapshot list, ii) when Range Delete is used, in which case the range delete aggregator has its own copy of the snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Maysam Yabandeh	521d234bda	Revert snap_refresh_nanos feature (#5269 ) Summary: Our daily stress tests are failing after this feature. Reverting temporarily until we figure out the reason for the test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269 Differential Revision: D15151285 Pulled By: maysamyabandeh fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e	2019-05-01 10:07:30 -07:00
Yuqi Gu	03c7ae24c2	RocksDB CRC32c optimization with ARMv8 Intrinsic (#5221 ) Summary: 1. Add Arm linear crc32c implementation for RocksDB. 2. Arm runtime check for crc32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5221 Differential Revision: D15013685 Pulled By: siying fbshipit-source-id: 2c2983743d26656d93f212dc7c1a3cf66a1acf12	2019-04-30 10:59:05 -07:00
Maysam Yabandeh	506e8448be	Refresh snapshot list during long compactions (#5099 ) Summary: Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099 Differential Revision: D15086710 Pulled By: maysamyabandeh fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12	2019-04-25 18:17:22 -07:00
jsteemann	de76909464	refactor SavePoints (#5192 ) Summary: Savepoints are assumed to be used in a stack-wise fashion (only the top element should be used), so they were stored by `WriteBatch` in a member variable `save_points_` using an std::stack. Conceptually this is fine, but the implementation had a few issues: - the `save_points_` instance variable was a plain pointer to a heap-allocated `SavePoints` struct. The destructor of `WriteBatch` simply deletes this pointer. However, the copy constructor of WriteBatch just copied that pointer, meaning that copying a WriteBatch with active savepoints would very likely have crashed before. Now a proper copy of the savepoints is made in the copy constructor, and not just a copy of the pointer - `save_points_` was an std::stack, which defaults to `std::deque` for the underlying container. A deque is a bit over the top here, as we only need access to the most recent savepoint (i.e. stack.top()) but never any elements at the front. std::deque is rather expensive to initialize in common environments. For example, the STL implementation shipped with GNU g++ will perform a heap allocation of more than 500 bytes to create an empty deque object. Although the `save_points_` container is created lazily by RocksDB, moving from a deque to a plain `std::vector` is much more memory-efficient. So `save_points_` is now a vector. - `save_points_` was changed from a plain pointer to an `std::unique_ptr`, making ownership more explicit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192 Differential Revision: D15024074 Pulled By: maysamyabandeh fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7	2019-04-19 20:33:04 -07:00
Fosco Marotto	6c2bf9e916	Add copyright headers per FB open-source checkup tool. (#5199 ) Summary: internal task: T35568575 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5199 Differential Revision: D14962794 Pulled By: gfosco fbshipit-source-id: 93838ede6d0235eaecff90d200faed9a8515bbbe	2019-04-18 10:55:01 -07:00
anand76	fefd4b98c5	Introduce a new MultiGet batching implementation (#5011 ) Summary: This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching. Batching is useful when there is some spatial locality to the keys being queried, as well as larger batch sizes. The main benefits are due to - 1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch() 2. Bloom filter cachelines can be prefetched, hiding the cache miss latency The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress. Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32). 
Batch Sizes 1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)- TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011 Differential Revision: D14348703 Pulled By: anand1976 fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b	2019-04-11 14:28:26 -07:00
Sagar Vemuri	d3d20dcdca	Periodic Compactions (#5166 ) Summary: Introducing Periodic Compactions. This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold. - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF. - This works across all levels. - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used). - Compaction filters, if any, are invoked as usual. - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS). This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166 Differential Revision: D14884441 Pulled By: sagar0 fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47	2019-04-10 19:31:18 -07:00
Siying Dong	0bb555630f	Consolidate hash function used for non-persistent data in a new function (#5155 ) Summary: Create new functions NPHash64() and GetSliceNPHash64(), which are currently implemented using murmurhash. Replace the current direct calls of murmurhash() with the new functions where the hash results are not used in the on-disk format. This will make it easier to try out or switch to alternative functions in the uses where data format compatibility doesn't need to be considered. This part shouldn't have any performance impact. Also, the sharded cache hash function is changed to the new format, because it falls into this category. It doesn't show a visible performance impact in db_bench results. CPU shown by perf increased from about 0.2% to 0.4% in an extreme benchmark setting (4KB blocks, no-compression, everything cached in block cache). We've known that the current hash function used, our own Hash(), has a serious hash quality problem. It can generate a lot of conflicts with similar input. In this use case, it means extra lock contention for reads from the same file. This slight CPU regression is worth it to me to counter the potential bad performance with hot keys. And hopefully this will get further improved in the future with a better hash function. cache_test's condition is relaxed a little bit too. The new hash is slightly more skewed in this use case, but I manually checked the data and see the hash results are still in a reasonable range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155 Differential Revision: D14834821 Pulled By: siying fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5	2019-04-08 13:32:06 -07:00
Adam Simpkins	c06c4c01c5	Fix many bugs in log statement arguments (#5089 ) Summary: Annotate all of the logging functions to inform the compiler that these use printf-style formatting arguments. This allows the compiler to emit warnings if the format arguments are incorrect. This also fixes many problems reported now that format string checking is enabled. Many of these are simply mix-ups in the argument type (e.g, int vs uint64_t), but in several cases the wrong number of arguments were being passed in which can cause the code to crash. The primary motivation for this was to fix the log message in `DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s format parameter with no argument supplied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089 Differential Revision: D14574795 Pulled By: simpkins fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a	2019-04-04 12:12:11 -07:00
Zhongyi Xie	26015f3b48	add compression options to table properties (#5081 ) Summary: Since we are planning to use dictionary compression and to use different compression level, it is quite useful to add compression options to TableProperties. For example, in MyRocks, if the feature is available, we can query from information_schema.rocksdb_sst_props to see if all sst files are converted to ZSTD dictionary compressions. Resolves https://github.com/facebook/rocksdb/issues/4992 With this PR, user can query table properties through `GetPropertiesOfAllTables` API and get compression options as std::string: `window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0;` or table_properties->ToString() will also contain it `# data blocks=1; # entries=13; # deletions=0; # merge operands=0; # range deletions=0; raw key size=143; raw average key size=11.000000; raw value size=39; raw average value size=3.000000; data block size=120; index block size (user-key? 0, delta-value? 0)=27; filter block size=0; (estimated) table size=147; filter policy name=N/A; prefix extractor name=nullptr; column family ID=0; column family name=default; comparator name=leveldb.BytewiseComparator; merge operator name=nullptr; property collectors names=[]; SST file compression algo=Snappy; SST file compression options=window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=0; ; creation time=1552946632; time stamp of earliest key=1552946632;` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5081 Differential Revision: D14716692 Pulled By: miasantreble fbshipit-source-id: 7d2f2cf84e052bff876e71b4212cfdebf5be32dd	2019-04-02 14:52:34 -07:00
Yanqin Jin	09957ded1d	Update RepeatableThreadTest with MockTimeEnv (#5107 ) Summary: This PR updates RepeatableThread::wait, breaking some tests on OS X. The rest of the PR fixes the tests on OS X. `RepeatableThreadTest.MockEnvTest` uses `MockTimeEnv` and `RepeatableThread`. If `RepeatableThread::wait` calls `TimedWait` with a time smaller than or equal to the current (real) time, `TimedWait` returns immediately on certain platforms, e.g. OS X. #4560 addresses this issue by replacing `TimedWait` with `Wait` in test. This fixes the test but makes test/production code diverge, which is not optimal for test coverage. This PR proposes an alternative fix which unifies test and production code path for `RepeatableThread::wait`. We obtain the current (real) time in seconds and add 10 extra seconds to ensure that `RepeatableThread::wait` invokes `TimedWait` with a time greater than (real) current time. This is to prevent the `TimedWait` function from returning immediately without sleeping and releasing the mutex. If `TimedWait` returns immediately, the mutex will not be released, and `RepeatableThread::TEST_WaitForRun` never has a chance to execute the callback which, in this case, updates the result returned by `mock_env->NowMicros()`. Consequently, `RepeatableThread::wait` cannot break out of the loop, causing test to hang. The extra 10 seconds is a best-effort approach because there seems no reliable and deterministic way to provide the aforementioned guarantee. By the time `RepeatableThread::wait` is called, there is no guarantee that the `delay + mock_env->NowMicros()` will be greater than the current real time. However, 10 seconds should be sufficient in most cases. We will keep an eye for possible flakiness of this test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5107 Differential Revision: D14680885 Pulled By: riversand963 fbshipit-source-id: d1ecbe10e1dacd110bd464cd01e188bfee72b89e	2019-03-29 10:08:50 -07:00
anand76	dae3b5545c	Smooth the deletion of WAL files (#5116 ) Summary: WAL files are currently not subject to deletion rate limiting by DeleteScheduler. If the size of the WAL files is significant, this can cause a high delete rate on SSDs that may affect other operations. To fix it, force WAL file deletions to go through the SstFileManager. Original PR for this is #2768 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5116 Differential Revision: D14669437 Pulled By: anand1976 fbshipit-source-id: c5f62d0640cebaa1574de841a1d01e4ce2faadf0	2019-03-28 15:17:13 -07:00
Yi Wu	d69241586e	Fix perf_context.user_key_comparison_count for range scan (#5098 ) Summary: Currently `perf_context.user_key_comparison_count` is bumped only in `InternalKeyComparator`. In places where the user comparator is used directly, the counter is not bumped. This patch fixes the majority of those places. Index iterator and filter code also use the user comparator directly and don't bump the counter. That is not fixed in this patch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098 Differential Revision: D14603753 Pulled By: siying fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae	2019-03-27 10:34:27 -07:00
Yanqin Jin	9358178edc	Support for single-primary, multi-secondary instances (#4899 ) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. a `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle a partially-read, trailing record at the end of a log from where a future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886	2019-03-26 16:45:31 -07:00
anand76	959f86e5f8	Use placement new and delete in autovector (#5080 ) Summary: The stack buffer in rocksdb::autovector is currently defined as an array of elements of the template type. This results in unnecessary construction of those objects, which can be a significant overhead in some cases. This PR changes the type of the stack buf to char* and uses placement new to construct new objects when they are inserted into the autovector. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5080 Differential Revision: D14533221 Pulled By: anand1976 fbshipit-source-id: 9378985c7d03f4e1a28951bdd2403c72f10f23d7	2019-03-20 10:42:04 -07:00
Zhongyi Xie	a291f3a1e5	Collect compaction stats by priority and dump to info LOG (#5050 ) Summary: In order to better understand compaction done by different priority thread pool, we now collect compaction stats by priority and also print them to info LOG through stats dump. ``` Compaction Stats [default] Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Low 0/0 0.00 KB 0.0 16.8 11.3 5.5 5.6 0.1 0.0 0.0 406.4 136.1 42.24 34.96 45 0.939 13M 8865K High 0/0 0.00 KB 0.0 0.0 0.0 0.0 11.4 11.4 0.0 0.0 0.0 76.2 153.00 35.74 12185 0.013 0 0 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050 Differential Revision: D14408583 Pulled By: miasantreble fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608	2019-03-19 17:28:19 -07:00
Wenjie Yang	36c2a7cfb1	Add an option to filter traces (#5082 ) Summary: Add an option to filter out READ or WRITE operations while tracing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5082 Differential Revision: D14515083 Pulled By: mrmiywj fbshipit-source-id: 2504c89a9abf1dd629cad44b4104092702d77610	2019-03-19 14:36:51 -07:00
Shobhit Dayal	b45b1cde3e	Feature for sampling and reporting compressibility (#4842 ) Summary: This is a feature to sample data-block compressibility and report it as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms: 1. lz4 or snappy for fast compression 2. zstd or zlib for slow but higher compression. The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType. The db_bench_tool now has a command line option for specifying the sampling rate. Its default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842 Differential Revision: D13629011 Pulled By: shobhitdayal fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa	2019-03-18 12:15:34 -07:00
Zhongyi Xie	fdc72a5c5d	add OptionType kInt32T and kInt64T Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5061 Differential Revision: D14418581 Pulled By: miasantreble fbshipit-source-id: be7f90e16586666ddd0cce36971e403782ab0892	2019-03-12 13:49:52 -07:00
Siying Dong	aef763b6d6	Make statistics's stats_level change thread-safe (#5030 ) Summary: Right now, users can change statistics.stats_level while DB is running, but TSAN may report data race. We make stats_level_ to be atomic, and access them using accessors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5030 Differential Revision: D14267519 Pulled By: siying fbshipit-source-id: 37d7ebeff7a43a406230143422a16af899163f73	2019-03-01 10:42:09 -08:00
Siying Dong	5e298f865b	Add two more StatsLevel (#5027 ) Summary: Statistics cost too much CPU for some use cases. Add two stats levels so that people can choose to skip two types of expensive stats, timers and histograms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027 Differential Revision: D14252765 Pulled By: siying fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592	2019-02-28 10:27:59 -08:00
Siying Dong	06f378d75e	When closing BlobDB, should first wait for all background tasks (#5005 ) Summary: When closing a BlobDB, it only waits for background tasks to finish as the last thing, but the background task may access some variables that are destroyed. The fix is to introduce a shutdown function in the timer queue and call the function as the first thing when destroying BlobDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5005 Differential Revision: D14170342 Pulled By: siying fbshipit-source-id: 081e6a2d99b9765d5956cf6cdfc290c07270c233	2019-02-21 17:26:01 -08:00
Zhongyi Xie	c4f5d0aa15	add GetStatsHistory to retrieve stats snapshots (#4748 ) Summary: This PR adds a public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken; the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the interval between two snapshots; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0. (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748 Differential Revision: D13961471 Pulled By: miasantreble fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04	2019-02-20 15:52:54 -08:00
Maysam Yabandeh	0f4244fe00	WritePrepared: Improve stress tests with slow threads (#4974 ) Summary: The transaction stress tests, stress a high concurrency scenario. In WritePrepared/WriteUnPrepared we need to also stress the scenarios where an inserting/reading transaction is very slow. This would stress the corner cases that the caching is not sufficient and other slower data structures are engaged. To emulate such cases we make use of slow inserter/verifier threads and also reduce the size of cache data structures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4974 Differential Revision: D14143070 Pulled By: maysamyabandeh fbshipit-source-id: 81eb674678faf9fae0f654cd60ebcc74e26aeee7	2019-02-19 16:56:49 -08:00
Zhongyi Xie	ed995c6a69	add whole key bloom filter support in memtables (#4985 ) Summary: MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985 Differential Revision: D14118257 Pulled By: miasantreble fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea	2019-02-19 12:15:39 -08:00
Aubin Sanyal	3231a2e581	Deprecate ttl option from CompactionOptionsFIFO (#4965 ) Summary: We introduced ttl option in CompactionOptionsFIFO when ttl-based file deletion (compaction) was supported only as part of FIFO Compaction. But with the extension of ttl semantics even to Level compaction, CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start using ColumnFamilyOptions.ttl for FIFO compaction as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965 Differential Revision: D14072960 Pulled By: sagar0 fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e	2019-02-15 09:51:41 -08:00
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Changes are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	2019-02-14 14:41:36 -08:00
Yanqin Jin	a69d4deefb	Atomic ingest (#4895 ) Summary: Make file ingestion atomic, as the title says. Ingesting external SST files into multiple column families should be atomic. If a crash occurs and the db reopens, either all column families have successfully ingested the files before the crash, or none of the ingestions have any effect on the state of the db. Also add unit tests for atomic ingestion. Note that the unit test here does not cover the case of an incomplete atomic group in the MANIFEST, which is covered in VersionSetTest already. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895 Differential Revision: D13718245 Pulled By: riversand963 fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198	2019-02-12 19:16:17 -08:00
Sagar Vemuri	eafb09a380	Fix issues found by Clang Analyzer (#4976 ) Summary: Fix issues found by Clang Analyzer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4976 Differential Revision: D14054211 Pulled By: sagar0 fbshipit-source-id: ec2053bae43af3b2ff3425306824c677e3ba70c2	2019-02-12 13:59:44 -08:00
Andrew Kryczka	62f70f6d14	Reduce scope of compression dictionary to single SST (#4952 ) Summary: Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio. So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include: - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called. - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up. - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952 Differential Revision: D13967980 Pulled By: ajkr fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f	2019-02-11 19:47:32 -08:00
Maysam Yabandeh	d6b9b3b884	Enhance transaction_test_util with delays (#4970 ) Summary: Enhance ::Insert and ::Verify test functions to add artificial delay between prepare and commit, and take snapshot and reads respectively. A future PR will make use of these to improve stress tests to test against long-running transactions as well as long-running backup jobs. Also randomly sets set_snapshot to false for inserters to skip setting the snapshot in the initialization phase and let the snapshot be taken later explicitly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4970 Differential Revision: D14031342 Pulled By: maysamyabandeh fbshipit-source-id: b52b453751f0b25b81b23c48892bc1d152464cab	2019-02-11 16:02:37 -08:00
tang-jianfeng	08809f5e6c	Implement trace sampling (#4963 ) Summary: Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963 Differential Revision: D14011190 Pulled By: tang-jianfeng fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a	2019-02-08 18:08:18 -08:00
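The "save one per how many requests" idea in the trace sampling entry above can be sketched as a simple counter. This is only an illustration under stated assumptions: `RequestSampler`, `ShouldTrace`, and `CountSampled` are hypothetical names, not RocksDB's tracer API, and the real implementation may choose which request to keep differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of RocksDB): keeps one request out of every
// `sampling_frequency` requests and skips the rest.
class RequestSampler {
 public:
  explicit RequestSampler(uint64_t sampling_frequency)
      : frequency_(sampling_frequency) {
    assert(sampling_frequency > 0);  // a frequency of 1 keeps every request
  }

  // Returns true when the current request should be written to the trace.
  bool ShouldTrace() {
    const bool keep = (seen_ % frequency_ == 0);
    ++seen_;
    return keep;
  }

 private:
  uint64_t frequency_;
  uint64_t seen_ = 0;
};

// Counts how many of `total` requests a sampler with the given frequency keeps.
inline uint64_t CountSampled(uint64_t total, uint64_t frequency) {
  RequestSampler sampler(frequency);
  uint64_t kept = 0;
  for (uint64_t i = 0; i < total; ++i) {
    if (sampler.ShouldTrace()) ++kept;
  }
  return kept;
}
```

With a frequency of 10, `CountSampled(100, 10)` keeps 10 of 100 requests, so the trace is roughly a tenth of the unsampled size.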
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report. Enable NowCPUNanos() to work for MacOS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	2019-01-29 17:24:00 -08:00
anand76	d0d484b132	Always delete Blob DB files in the background (#4928 ) Summary: Blob DB files are not tracked by the SFM, so they currently don't get deleted in the background. Force them to be deleted in the background so rate limiting can be applied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4928 Differential Revision: D13854649 Pulled By: anand1976 fbshipit-source-id: 8031ce66842ff0af440c715d886b377983dad7d8	2019-01-29 15:50:03 -08:00
Siying Dong	fc53839bfa	Disallow customized hash function in DynamicBloom (#4915 ) Summary: I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915 Differential Revision: D13794452 Pulled By: siying fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819	2019-01-24 10:34:30 -08:00
Andrew Kryczka	8ec3e72551	Cache dictionary used for decompressing data blocks (#4881 ) Summary: - If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`. - If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle. - New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count" Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881 Differential Revision: D13663801 Pulled By: ajkr fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f	2019-01-23 18:15:47 -08:00
Siying Dong	08b8cea69f	Deleting Blob files also goes through SstFileManager (#4904 ) Summary: Right now, deleting blob files is not rate limited, even if SstFileManager is specified. On the other hand, rate limiting blob deletion is not supported. With this change, Blob file deletion will go through SstFileManager too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4904 Differential Revision: D13772545 Pulled By: siying fbshipit-source-id: bd1b1d0beb26d5167385e00b7ecb8b94b879de84	2019-01-22 17:00:29 -08:00
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	2019-01-18 19:12:57 -08:00
Yanqin Jin	e79df377c5	Use chrono::time_point instead of time_t (#4868 ) Summary: By convention, time_t almost always stores the integral number of seconds since 00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/. We surely want more precision than seconds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868 Differential Revision: D13633046 Pulled By: riversand963 fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396	2019-01-16 09:51:05 -08:00
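The precision argument in the `chrono::time_point` entry above can be illustrated with a short sketch. The helper names (`MakeTimePoint`, `MicrosSinceEpoch`, `TruncateViaTimeT`) are made up for this example and do not come from the commit; the sketch also assumes the common case where `time_t` counts whole seconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>

using Clock = std::chrono::system_clock;

// Builds a time_point that lies `micros` microseconds after the clock's epoch.
inline Clock::time_point MakeTimePoint(int64_t micros) {
  assert(micros >= 0);
  return Clock::time_point(std::chrono::duration_cast<Clock::duration>(
      std::chrono::microseconds(micros)));
}

// Reads a time_point back with microsecond resolution.
inline int64_t MicrosSinceEpoch(Clock::time_point tp) {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             tp.time_since_epoch())
      .count();
}

// Round-trips through time_t to show what is lost: on typical platforms the
// sub-second part of the timestamp does not survive the conversion.
inline Clock::time_point TruncateViaTimeT(Clock::time_point tp) {
  const std::time_t secs = Clock::to_time_t(tp);
  return Clock::from_time_t(secs);
}
```

For a timestamp 1,250,000 microseconds after the epoch, the `time_point` keeps the full value, while the round trip through `time_t` leaves only the whole second.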
Yi Wu	77a8d4d476	Detect if Jemalloc is linked with the binary (#4844 ) Summary: Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detects jemalloc, but we assume the main program uses jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844 Differential Revision: D13574934 Pulled By: yiwu-arbug fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db	2019-01-03 16:30:12 -08:00
Burton Li	46e3209e0d	Compaction limiter miscs (#4795 ) Summary: 1. Remove unused API SubtractCompactionTask(). 2. Assert outstanding tasks drop to zero in ConcurrentTaskLimiterImpl destructor. 3. Remove GetOutstandingTask() check from manual compaction test, as TEST_WaitForCompact() isn't synchronized with 'delete prepicked_compaction' in DBImpl::BGWorkCompaction(), which may make the test flaky. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4795 Differential Revision: D13542183 Pulled By: siying fbshipit-source-id: 5eb2a47e62efe4126937149aa0df6e243ebefc33	2018-12-26 13:59:35 -08:00
Abhishek Madan	abf931afa6	Add compaction logic to RangeDelAggregatorV2 (#4758 ) Summary: RangeDelAggregatorV2 now supports ShouldDelete calls on snapshot stripes and creation of range tombstone compaction iterators. RangeDelAggregator is no longer used on any non-test code path, and will be removed in a future commit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758 Differential Revision: D13439254 Pulled By: abhimadan fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401	2018-12-17 13:20:51 -08:00
Burton Li	a8b9891f95	Concurrent task limiter for compaction thread control (#4332 ) Summary: This PR aims to resolve the issue at: https://github.com/facebook/rocksdb/issues/3972#issue-330771918 We have a RocksDB instance created with leveled compaction and multiple column families (CFs); some CFs use HDD to store big and less frequently accessed data, and others use SSD. When there is continuous write traffic to all CFs, the compaction thread pool is mostly occupied by the slow HDD compactions, which prevents us from fully utilizing SSD bandwidth. Since atomic writes and transactions are needed across CFs, splitting it into multiple RocksDB instances is not an option for us. With the compaction thread control, we got a 30%+ HDD write throughput gain, and also much smoother SSD writes since fewer write stalls happen. ConcurrentTaskLimiter can be shared by multiple CFs across RocksDB instances, so the feature works not only for multi-CF scenarios but also for multi-RocksDB scenarios that need disk I/O resource control per tenant. The usage is straightforward: e.g.: // // Enable compaction thread limiter thru ColumnFamilyOptions // std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4)); Options options; ColumnFamilyOptions cf_opt(options); cf_opt.compaction_thread_limiter = ctl; ... // // Compaction thread limiter can be tuned or disabled on-the-fly // ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks ... ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter ctl->SetMaxOutstandingTask(-1); // Same as above ... 
ctl->SetMaxOutstandingTask(0); // full throttle (0 task) // // Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue) // std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8)); std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4)); Options options; ColumnFamilyOptions cf_opt_ssd1(options); ColumnFamilyOptions cf_opt_ssd2(options); ColumnFamilyOptions cf_opt_hdd1(options); ColumnFamilyOptions cf_opt_hdd2(options); ColumnFamilyOptions cf_opt_hdd3(options); // SSD CFs cf_opt_ssd1.compaction_thread_limiter = ctl_ssd; cf_opt_ssd2.compaction_thread_limiter = ctl_ssd; // HDD CFs cf_opt_hdd1.compaction_thread_limiter = ctl_hdd; cf_opt_hdd2.compaction_thread_limiter = ctl_hdd; cf_opt_hdd3.compaction_thread_limiter = ctl_hdd; ... // // The limiter is disabled by default (or set to nullptr explicitly) // Options options; ColumnFamilyOptions cf_opt(options); cf_opt.compaction_thread_limiter = nullptr; Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332 Differential Revision: D13226590 Pulled By: siying fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d	2018-12-13 13:18:28 -08:00
Siying Dong	ae25546a7a	Direct I/O Close() shouldn't rewrite the last block (#4771 ) Summary: In the direct I/O case, WritableFileWriter::Close() rewrites the last block again, even if there is nothing new. The reason is that Close() flushes the buffer. In the non-direct I/O case, the buffer is empty at this point, so the flush is a no-op. However, in the direct I/O case, the partial data in the last block is kept in the buffer because it needs to be rewritten for the next write. This piece of data is flushed again. This commit fixes it by skipping this write-out if the `pending_sync_` flag shows that there isn't new data since the last sync. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4771 Differential Revision: D13420426 Pulled By: siying fbshipit-source-id: 9d39ec9a215b1425d4ed40d85e0eba1f5daa75c6	2018-12-11 13:55:02 -08:00
Maysam Yabandeh	b878f93c70	Extend Transaction::GetForUpdate with do_validate (#4680 ) Summary: Transaction::GetForUpdate is extended with a do_validate parameter with default value of true. If false, it skips validating the snapshot (if there is any) before doing the read. After the read it also returns the latest value (expects the ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked with default value of false. If true, it indicates that the call is assumed to be after a ::GetForUpdate(do_validate=false). The Java APIs are accordingly updated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680 Differential Revision: D13068508 Pulled By: maysamyabandeh fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d	2018-12-06 17:49:00 -08:00
Zhongyi Xie	b0f3d9b478	fix unused param "options" error in jemalloc_nodump_allocator.cc (#4738 ) Summary: Currently tests are failing on master with the following message: > util/jemalloc_nodump_allocator.cc:132:8: error: unused parameter ‘options’ [-Werror=unused-parameter] Status NewJemallocNodumpAllocator( This PR attempts to fix the issue Pull Request resolved: https://github.com/facebook/rocksdb/pull/4738 Differential Revision: D13278804 Pulled By: miasantreble fbshipit-source-id: 64a6204aa685bd85d8b5080655cafef9980fac2f	2018-11-30 12:08:55 -08:00
Yi Wu	cf1df5d3cb	JemallocNodumpAllocator: option to limit tcache memory usage (#4736 ) Summary: Add option to limit tcache usage by allocation size. This is to reduce total tcache size in case there are many user threads accessing the allocator and incurring non-trivial memory usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4736 Differential Revision: D13269305 Pulled By: yiwu-arbug fbshipit-source-id: 95a9b7fc67facd66837c849137e30e137112e19d	2018-11-29 17:33:40 -08:00
Abhishek Madan	8fe1e06ca0	Clean up FragmentedRangeTombstoneList (#4692 ) Summary: Removed `one_time_use` flag, which removed the need for some tests, and changed all `NewRangeTombstoneIterator` methods to return `FragmentedRangeTombstoneIterators`. These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones` and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692 Differential Revision: D13106570 Pulled By: abhimadan fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845	2018-11-28 15:29:02 -08:00
Zhichao Cao	7125e24619	Add the max trace file size limitation option to Tracing (#4610 ) Summary: If the user does not end the trace manually, tracing will continue, which can potentially use up all the storage space and cause problems. In this PR, the max trace file size is added to the TraceOptions; users can set the value if they need to, otherwise the default is 64GB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610 Differential Revision: D12893400 Pulled By: zhichao-cao fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953	2018-11-27 14:27:05 -08:00
Soli	f1837595a3	FIX #3278 : Move global const object definitions from .h to .cc (#4691 ) Summary: We should declare constants in headers and define them in source files. But this commit is only aimed at compound types. I don't know if it is necessary to do the same thing to fundamental types. I used this command to find all of the constant definitions in header files. `find . -name "*.h" | xargs grep -e "^const .*=.*"` And here is what I found: ``` ./db/version_edit.h:const uint64_t kFileNumberMask = 0x3FFFFFFFFFFFFFFF; ./include/rocksdb/env.h:const size_t kDefaultPageSize = 4 * 1024; ./include/rocksdb/statistics.h:const std::vector<std::pair<Tickers, std::string>> TickersNameMap = { ./include/rocksdb/statistics.h:const std::vector<std::pair<Histograms, std::string>> HistogramsNameMap = { ./include/rocksdb/table.h:const uint32_t kPlainTableVariableLength = 0; ./include/rocksdb/utilities/transaction_db.h:const uint32_t kInitialMaxDeadlocks = 5; ./port/port_posix.h:const uint32_t kMaxUint32 = std::numeric_limits<uint32_t>::max(); ./port/port_posix.h:const int kMaxInt32 = std::numeric_limits<int32_t>::max(); ./port/port_posix.h:const uint64_t kMaxUint64 = std::numeric_limits<uint64_t>::max(); ./port/port_posix.h:const int64_t kMaxInt64 = std::numeric_limits<int64_t>::max(); ./port/port_posix.h:const size_t kMaxSizet = std::numeric_limits<size_t>::max(); ./port/win/port_win.h:const uint32_t kMaxUint32 = UINT32_MAX; ./port/win/port_win.h:const int kMaxInt32 = INT32_MAX; ./port/win/port_win.h:const int64_t kMaxInt64 = INT64_MAX; ./port/win/port_win.h:const uint64_t kMaxUint64 = UINT64_MAX; ./port/win/port_win.h:const size_t kMaxSizet = UINT64_MAX; ./port/win/port_win.h:const size_t kMaxSizet = UINT_MAX; ./port/win/port_win.h:const uint32_t kMaxUint32 = std::numeric_limits<uint32_t>::max(); ./port/win/port_win.h:const int kMaxInt32 = std::numeric_limits<int>::max(); ./port/win/port_win.h:const uint64_t kMaxUint64 = std::numeric_limits<uint64_t>::max(); 
./port/win/port_win.h:const int64_t kMaxInt64 = std::numeric_limits<int64_t>::max(); ./port/win/port_win.h:const size_t kMaxSizet = std::numeric_limits<size_t>::max(); ./port/win/port_win.h:const bool kLittleEndian = true; ./table/cuckoo_table_factory.h:const uint32_t kCuckooMurmurSeedMultiplier = 816922183; ./table/data_block_hash_index.h:const uint8_t kNoEntry = 255; ./table/data_block_hash_index.h:const uint8_t kCollision = 254; ./table/data_block_hash_index.h:const uint8_t kMaxRestartSupportedByHashIndex = 253; ./table/data_block_hash_index.h:const size_t kMaxBlockSizeSupportedByHashIndex = 1u << 16; ./table/data_block_hash_index.h:const double kDefaultUtilRatio = 0.75; ./table/filter_block.h:const uint64_t kNotValid = ULLONG_MAX; ./table/format.h:const int kMagicNumberLengthByte = 8; ./third-party/fbson/FbsonJsonParser.h:const char* const kJsonDelim = " ,]}\t\r\n"; ./third-party/fbson/FbsonJsonParser.h:const char* const kWhiteSpace = " \t\n\r"; ./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const BiggestInt kMaxBiggestInt = ./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kDeathTestStyleFlag[] = "death_test_style"; ./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kDeathTestUseFork[] = "death_test_use_fork"; ./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char kInternalRunDeathTestFlag[] = "internal_run_death_test"; ./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const char* pets[] = {"cat", "dog"}; ./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const size_t kProtobufOneLinerMaxLength = 50; ./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const int kMaxStackTraceDepth = 100; ./third-party/gtest-1.7.0/fused-src/gtest/gtest.h:const T* WithParamInterface<T>::parameter_ = NULL; ./util/coding.h:const unsigned int kMaxVarint64Length = 10; ./util/filename.h:const size_t kFormatFileNumberBufSize = 38; ./util/testutil.h:const SliceTransform* RandomSliceTransform(Random* rnd, int pre_defined = -1); ./util/trace_replay.h:const 
std::string kTraceMagic = "feedcafedeadbeef"; ./util/trace_replay.h:const unsigned int kTraceTimestampSize = 8; ./util/trace_replay.h:const unsigned int kTraceTypeSize = 1; ./util/trace_replay.h:const unsigned int kTracePayloadLengthSize = 4; ./util/trace_replay.h:const unsigned int kTraceMetadataSize = ./utilities/cassandra/serialize.h:const int64_t kCharMask = 0xFFLL; ./utilities/cassandra/serialize.h:const int32_t kBitsPerByte = 8; ``` And these 3 lines are related to this commit: ``` ./include/rocksdb/statistics.h:const std::vector<std::pair<Tickers, std::string>> TickersNameMap = { ./include/rocksdb/statistics.h:const std::vector<std::pair<Histograms, std::string>> HistogramsNameMap = { ./util/trace_replay.h:const std::string kTraceMagic = "feedcafedeadbeef"; ``` Any comments would be appreciated. Thanks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4691 Differential Revision: D13208049 Pulled By: ajkr fbshipit-source-id: e5ee55fdaec5447fc5798c6721e2821e7cdc0d5b	2018-11-26 21:32:03 -08:00
Abhishek Madan	85394a96ca	Speed up range scans with range tombstones (#4677 ) Summary: Previously, every range tombstone iterator was seeked on every ShouldDelete call, which quickly degraded performance for long range scans. This PR improves performance by tracking iterator positions and only advancing iterators when necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4677 Differential Revision: D13205373 Pulled By: abhimadan fbshipit-source-id: 80c199dace1e19362a4c61c686bf01913eae87cb	2018-11-26 16:33:41 -08:00
Yi Wu	327097c952	JemallocAllocator: thread-local tcache (#4603 ) Summary: Add option to support thread-local tcache to reduce mutex contention inside Jemalloc arena. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4603 Differential Revision: D12830738 Pulled By: yiwu-arbug fbshipit-source-id: 59bd25b165b903f23a6a8531b18d72e140d69f65	2018-11-19 22:39:08 -08:00
Simon Liu	a2de8e52bb	optimized the performance of autovector::emplace_back. (#4606 ) Summary: autovector::emplace_back simply called autovector::push_back, which was not efficient. This change optimizes the function with perfect forwarding. Here are the source and result of the benchmark (using Google's benchmark library; the element type in autovector was std::string, and emplace_back was called with a "char *" argument): https://gist.github.com/monadbobo/93448b89a42737b08cbada81de75c5cd PS: The benchmark result in the previous PR was not accurate, so I updated the test case and result. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4606 Differential Revision: D13046813 Pulled By: sagar0 fbshipit-source-id: 19cde1bcadafe899aa454b703acb35737a1cc02d	2018-11-13 14:39:03 -08:00
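The difference between an `emplace_back` that delegates to `push_back` and one that uses perfect forwarding, as described in the autovector entry above, can be sketched with a toy container. Everything here is an assumption for illustration: `TinyVector`, `Counted`, and `emplace_back_via_push_back` are hypothetical names, and the container wraps `std::vector` instead of reproducing autovector's actual storage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Element type that counts copy constructions so the two paths can be compared.
struct Counted {
  static int copies;
  std::string text;
  explicit Counted(const char* s) : text(s) {}
  Counted(const Counted& other) : text(other.text) { ++copies; }
  Counted(Counted&& other) noexcept : text(std::move(other.text)) {}
};
int Counted::copies = 0;

// Toy container (hypothetical, not RocksDB's autovector).
template <class T>
class TinyVector {
 public:
  TinyVector() { items_.reserve(8); }

  void push_back(const T& value) { items_.push_back(value); }

  // "Before" style: build a temporary T, then copy it in through push_back.
  template <class... Args>
  void emplace_back_via_push_back(Args&&... args) {
    push_back(T(std::forward<Args>(args)...));
  }

  // "After" style: forward the arguments and construct T in place.
  template <class... Args>
  void emplace_back(Args&&... args) {
    items_.emplace_back(std::forward<Args>(args)...);
  }

  std::size_t size() const { return items_.size(); }

  const T& operator[](std::size_t i) const {
    assert(i < items_.size());
    return items_[i];
  }

 private:
  std::vector<T> items_;
};
```

In this sketch the forwarding path constructs the element directly from the `const char*` argument, while the `push_back`-based path pays for one extra copy of a temporary.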
Andrew Kryczka	ea9454700a	Backup engine support for direct I/O reads (#4640 ) Summary: Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640 Differential Revision: D13015268 Pulled By: ajkr fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269	2018-11-13 11:17:25 -08:00
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	2018-11-09 11:19:58 -08:00
Andrew Gallagher	0148f717ff	Move `#include` outside of namespace (#4629 ) Summary: clang modules warns about `#include`s inside of namespaces. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4629 Reviewed By: ajkr Differential Revision: D12927333 Pulled By: andrewjcg fbshipit-source-id: a9e0b069e63d8224f78b7c3be1c3acf09bb83d3f	2018-11-06 17:18:28 -08:00
Soli	3f8f81cfeb	FIX #3820 : shorter file name in logs (#4616 ) Summary: Long absolute file names in logs make the LOG files hard to read. So we shorten them to paths relative to the root of the RocksDB project. In most cases, they will only have one directory level and one file name. There was [a talk](#4316) about making "util/logging.h" a public header file. But we are concerned about the conflicts that might be introduced by macros named `STRINGIFY`, `TOSTRING`, and `PREPEND_FILE_LINE`. So I prepend a prefix `ROCKS_LOG_` to them. I also remove the line that includes "port.h", which seems unnecessary here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4616 Differential Revision: D12892857 Pulled By: siying fbshipit-source-id: af79aaf82153b8fd66b5966aced39a51fbca9c6c	2018-11-01 16:19:01 -07:00
Bo Hou	cd9404bb77	xxhash 64 support Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607 Reviewed By: siying Differential Revision: D12836696 Pulled By: jsjhoubo fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870	2018-11-01 15:44:06 -07:00
Yi Wu	5f5fddabc7	port folly::JemallocNodumpAllocator (#4534 ) Summary: Introduce `JemallocNodumpAllocator`, which allows exclusion of block cache usage from core dumps. It utilizes the custom hook of a jemalloc arena: when the jemalloc arena requests memory from the system, the allocator uses the hook to set `MADV_DONTDUMP` on the memory. The implementation is basically the same as `folly::JemallocNodumpAllocator`, except for some minor differences: 1. It only supports jemalloc >= 5.0 2. When the allocator is destructed, it explicitly destroys the corresponding arena via `arena.<i>.destroy` via `mallctl`. Depending on #4502. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4534 Differential Revision: D10435474 Pulled By: yiwu-arbug fbshipit-source-id: e80edea755d3853182485d2be710376384ce0bb4	2018-10-26 17:29:18 -07:00
Yi Wu	f560c8f5c8	s/CacheAllocator/MemoryAllocator/g (#4590 ) Summary: Rename the interface, as it is meant to be a generic interface for memory allocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590 Differential Revision: D10866340 Pulled By: yiwu-arbug fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16	2018-10-26 14:30:30 -07:00
Abhishek Madan	8c78348c77	Use only "local" range tombstones during Get (#4449 ) Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key from higher levels, and compute the highest covering sequence number at the current level. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path. In the benchmark results, the following command was used to initialize the database: ``` ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8 ``` ...and the following command was used to measure read throughput: ``` ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32 ``` The filluniquerandom command was only run once, and the resulting database was used to measure read performance before and after the PR. Both binaries were compiled with `DEBUG_LEVEL=0`. Readrandom results before PR: ``` readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found) ``` Readrandom results after PR: ``` readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found) ``` So it's actually slower right now, but this PR paves the way for future optimizations (see #4493). 
---- Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449 Differential Revision: D10370575 Pulled By: abhimadan fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d	2018-10-24 12:31:12 -07:00
Neil Mayhew	43dbd4411e	Adapt three unit tests with newer compiler/libraries (#4562 ) Summary: This fixes three tests that fail with relatively recent tools and libraries: The tests are: * `spatial_db_test` * `table_test` * `db_universal_compaction_test` I'm using: * `gcc` 7.3.0 * `glibc` 2.27 * `snappy` 1.1.7 * `gflags` 2.2.1 * `zlib` 1.2.11 * `bzip2` 1.0.6.0.1 * `lz4` 1.8.2 * `jemalloc` 5.0.1 The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`. However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`: ``` db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3 
db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms) [ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5 db/db_universal_compaction_test.cc:506: Failure Value of: num + 1 Actual: 3 Expected: NumSortedRuns(1) Which is: 4 [ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms) ``` I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562 Differential Revision: D10510917 Pulled By: maysamyabandeh fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189	2018-10-24 08:17:56 -07:00
Yi Wu	c7a45ca91f	BlobDB: handle IO error on write (#4580 ) Summary: A fix similar to #4410 but on the write path. On an IO error in `SelectBlobFile()` we didn't return the error code properly, but simply returned a nullptr `BlobFile`. The `AppendBlob()` method didn't have a null check for the pointer, which caused a crash. The fix makes sure we properly return the error code in this case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4580 Differential Revision: D10513849 Pulled By: yiwu-arbug fbshipit-source-id: 80bca920d1d7a3541149de981015ad83e0aa14b5	2018-10-23 15:03:45 -07:00
jsteemann	d1c0d3f358	Small issues (#4564 ) Summary: Couple of very minor improvements (typos in comments, full qualification of class name, reordering members of a struct to make it smaller) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4564 Differential Revision: D10510183 Pulled By: maysamyabandeh fbshipit-source-id: c7ddf9bfbf2db08cd31896c3fd93789d3fa68c8b	2018-10-23 10:35:57 -07:00
Yi Wu	933250e355	Fix RepeatableThreadTest::MockEnvTest hang (#4560 ) Summary: When `MockTimeEnv` is used in a test to mock time methods, we cannot use `CondVar::TimedWait` because it uses real time, not the mocked time, for the wait timeout. On Mac the method can return immediately without waking other waiting threads if the real time is larger than `wait_until` (which is a mocked time). When that happens, the `wait()` method falls into an infinite loop. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560 Differential Revision: D10472851 Pulled By: yiwu-arbug fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19	2018-10-21 20:17:18 -07:00
Yanqin Jin	da4aa59b4c	Add read retry support to log reader (#4394 ) Summary: Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`. Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394 Differential Revision: D9926508 Pulled By: riversand963 fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1	2018-10-19 11:53:00 -07:00
Yanqin Jin	ce52274640	Replace 'string' with 'const string&' in FileOperationInfo (#4491 ) Summary: Using const string& can avoid one extra string copy. This PR addresses a recent comment made by siying on #3933. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4491 Differential Revision: D10381211 Pulled By: riversand963 fbshipit-source-id: 27fc2d65d84bc7cd07833c77cdc47f06dcfaeb31	2018-10-15 13:46:01 -07:00
Yanqin Jin	729a617b5b	Add listener to sample file io (#3933 ) Summary: We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires adding callbacks to intercept file IO function calls when RocksDB is running. To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener`. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933 Differential Revision: D10219571 Pulled By: riversand963 fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15	2018-10-12 18:36:11 -07:00
zpalmtree	46dd8b1e13	C++17 support (#4482 ) Summary: Closes https://github.com/facebook/rocksdb/issues/4462 I'm not sure if you'll be happy with `std::random_device{}`, perhaps you would want to use your rand instance instead. I didn't test to see if your rand instance supports the requirements that `std::shuffle` takes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4482 Differential Revision: D10325133 Pulled By: yiwu-arbug fbshipit-source-id: 47b7adaf4bb2b8d64cf090ea6b1b48ef53180581	2018-10-11 10:50:04 -07:00
Jiri Appl	b0026e1f5f	Enable building of ARM32 (#4349 ) Summary: The original logic was assuming that the only architectures that the code would build for on Windows were x86 and x64. This change will enable building for arm32 on Windows as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4349 Differential Revision: D10280887 Pulled By: sagar0 fbshipit-source-id: 9ca0bede25505d22e13acf916d38aeeaaf5d981a	2018-10-09 16:58:25 -07:00
Maysam Yabandeh	21b51dfec4	Add inline comments to flush job (#4464 ) Summary: It also renames InstallMemtableFlushResults to MaybeInstallMemtableFlushResults to clarify its contract. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4464 Differential Revision: D10224918 Pulled By: maysamyabandeh fbshipit-source-id: 04e3f2d8542002cb9f8010cb436f5152751b3cbe	2018-10-05 15:41:17 -07:00
Yanqin Jin	b41b2d431e	Improve error message when opening file for truncation (#4454 ) Summary: The old error message was misleading because it led people to believe the truncation operation failed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4454 Differential Revision: D10203575 Pulled By: riversand963 fbshipit-source-id: c76482a132566635cb55d4c73d45c461f295ec43	2018-10-04 14:53:36 -07:00
Zhongyi Xie	ce1fc5af09	fix unused param `allocator` in compression.h (#4453 ) Summary: this should fix currently failing contrun test: rocksdb-contrun-no_compression, rocksdb-contrun-tsan, rocksdb-contrun-tsan_crash Pull Request resolved: https://github.com/facebook/rocksdb/pull/4453 Differential Revision: D10202626 Pulled By: miasantreble fbshipit-source-id: 850b07f14f671b5998c22d8239e2a55b2fc1e355	2018-10-04 13:24:22 -07:00
Igor Canadi	1cf5deb8fd	Introduce CacheAllocator, a custom allocator for cache blocks (#4437 ) Summary: This is a conceptually simple change, but it touches many files to pass the allocator through function calls. We introduce CacheAllocator, which can be used by clients to configure custom allocator for cache blocks. Our motivation is to hook this up with folly's `JemallocNodumpAllocator` (`f43ce6d686/folly/experimental/JemallocNodumpAllocator.h`), but there are many other possible use cases. Additionally, this commit cleans up memory allocation in `util/compression.h`, making sure that all allocations are wrapped in a unique_ptr as soon as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437 Differential Revision: D10132814 Pulled By: yiwu-arbug fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564	2018-10-02 17:24:58 -07:00
Yi Wu	d6f2ecf49c	Utility to run task periodically in a thread (#4423 ) Summary: Introduce `RepeatableThread` utility to run task periodically in a separate thread. It is basically the same as the same class in fbcode, and in addition provides a helper method to let tests mock time and trigger execution one at a time. We can use this class to replace `TimerQueue` in #4382 and `BlobDB`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4423 Differential Revision: D10020932 Pulled By: yiwu-arbug fbshipit-source-id: 3616bef108c39a33c92eedb1256de424b7c04087	2018-09-27 15:28:00 -07:00
Yi Wu	04d373b260	BlobDB: handle IO error on read (#4410 ) Summary: Fix IO error on read not being handled and crashing the DB. With the fix we properly return the error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4410 Differential Revision: D9979246 Pulled By: yiwu-arbug fbshipit-source-id: 111a85675067a29c03cb60e9a34103f4ff636694	2018-09-20 16:58:45 -07:00
Anand Ananthabhotla	30c21df97c	Fix regression test failures introduced by PR #4164 (#4375 ) Summary: 1. Add override keyword to overridden virtual functions in EventListener 2. Fix a memory corruption that can happen during DB shutdown when in read-only mode due to a background write error 3. Fix uninitialized buffers in error_handler_test.cc that cause valgrind to complain Pull Request resolved: https://github.com/facebook/rocksdb/pull/4375 Differential Revision: D9875779 Pulled By: anand1976 fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640	2018-09-17 13:14:07 -07:00
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accommodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e., the database is put in read-only mode and goes back to read-write only after certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurrence of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occurred c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in read-write mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recovery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	2018-09-15 13:43:04 -07:00
Yanqin Jin	8959063c9c	Store the return value of Fsync for check Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4361 Differential Revision: D9803723 Pulled By: riversand963 fbshipit-source-id: 5a0d4cd3e57fd195571dcd5822895ee00547fa6a	2018-09-14 13:29:56 -07:00
Andrew Kryczka	2c14662213	Revert "Digest ZSTD compression dictionary once per SST file (#4251 )" (#4347 ) Summary: Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting. Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347 Differential Revision: D9668365 Pulled By: ajkr fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd	2018-09-06 09:58:34 -07:00
cngzhnp	64324e329e	Support pragma once in all header files and cleanup some warnings (#4339 ) Summary: As you know, almost all compilers support the "pragma once" directive as an alternative to include guards. To keep consistency between header files, all header files are edited. Besides this, try to fix some warnings about loss of data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339 Differential Revision: D9654990 Pulled By: ajkr fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848	2018-09-05 18:13:31 -07:00
Yi Wu	462ed70d64	BlobDB: GetLiveFiles and GetLiveFilesMetadata return relative path (#4326 ) Summary: `GetLiveFiles` and `GetLiveFilesMetadata` should return paths relative to the db path. How to return a relative path when `path_relative` is false is a separate issue, but `DBImpl::GetLiveFiles` doesn't handle it either when there are multiple `db_paths`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4326 Differential Revision: D9545904 Pulled By: yiwu-arbug fbshipit-source-id: 6762d879fcb561df2b612e6fdfb4a6b51db03f5d	2018-08-31 12:12:49 -07:00
Andrew Kryczka	42733637e1	Sync CURRENT file during checkpoint (#4322 ) Summary: For the CURRENT file forged during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it. Differential Revision: D9525939 Pulled By: ajkr fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3	2018-08-28 12:43:18 -07:00
Andrew Kryczka	6c40806e51	Digest ZSTD compression dictionary once per SST file (#4251 ) Summary: In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file. ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary. There are a couple other changes included in this PR: - Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it. - Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251 Differential Revision: D9257078 Pulled By: ajkr fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438	2018-08-23 19:28:18 -07:00
Yanqin Jin	bb5dcea98e	Add path to WritableFileWriter. (#4039 ) Summary: We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039 Differential Revision: D8670178 Pulled By: riversand963 fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a	2018-08-23 10:12:58 -07:00
Andrew Kryczka	b6280d01f9	Require ZSTD 1.1.3+ to use dictionary trainer (#4295 ) Summary: ZSTD's dynamic library exports `ZDICT_trainFromBuffer` symbol since v1.1.3, and its static library exports it since v0.6.1. We don't know whether linkage is static or dynamic, so just require v1.1.3 to use dictionary trainer. Fixes the issue reported here: https://jira.mariadb.org/browse/MDEV-16525. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4295 Differential Revision: D9417183 Pulled By: ajkr fbshipit-source-id: 0e89d2f48d9e7f6eee73e7f4572660a9f7122db8	2018-08-22 18:27:52 -07:00
Yi Wu	4f12d49daf	Suppress clang analyzer error (#4299 ) Summary: Suppress multiple clang-analyzer error. All of them are clang false-positive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4299 Differential Revision: D9430740 Pulled By: yiwu-arbug fbshipit-source-id: fbdd575bdc214d124826d61d35a117995c509279	2018-08-21 16:43:05 -07:00
Siying Dong	d5612b43de	Two code changes to make "clang analyze" happy (#4292 ) Summary: Clang analyze is not happy in two pieces of code, with "Potential memory leak". No idea what the problem is, but slightly changing the code makes clang happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4292 Differential Revision: D9413555 Pulled By: siying fbshipit-source-id: 9428c9d3664530c72129feefd135ee63d8386137	2018-08-20 17:43:41 -07:00
Fenggang Wu	19ec44fd39	Improve point-lookup performance using a data block hash index (#4174 ) Summary: Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with data blocks created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `kDataBlockBinaryAndHash`. The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve the point-lookup efficiency more, but take more space too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174 Differential Revision: D8965914 Pulled By: fgwu fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198	2018-08-15 14:30:03 -07:00
jsteemann	33ad9060d3	fix compilation with g++ option `-Wsuggest-override` (#4272 ) Summary: Fixes compilation warnings (which are turned into compilation errors by default) when compiling with g++ option `-Wsuggest-override`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4272 Differential Revision: D9322556 Pulled By: siying fbshipit-source-id: abd57a29ec8f544bee77c0bb438f31be830b7244	2018-08-14 15:13:10 -07:00
Zhichao Cao	999d955e4f	RocksDB Trace Analyzer (#4091 ) Summary: A framework for trace analysis in RocksDB. After collecting the trace by using the tool from [PR #3837](https://github.com/facebook/rocksdb/pull/3837), users can use the Trace Analyzer to interpret, analyze, and characterize the collected workload. Input: 1. trace file 2. Whole keys space file Statistics: 1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family. 2. Key hotness (access count) of each one 3. Key space separation based on given prefix 4. Key size distribution 5. Value size distribution if applicable 6. Top K accessed keys 7. QPS statistics including the average QPS and peak QPS 8. Top K accessed prefix 9. The query correlation analyzing, output the number of X after Y and the corresponding average time intervals Output: 1. key access heat map (either in the accessed key space or whole key space) 2. trace sequence file (interpret the raw trace file to line-based text file for future use) 3. Time series (The key space ID and its access time) 4. Key access count distribution 5. Key size distribution 6. Value size distribution (in each interval) 7. whole key space separation by the prefix 8. Accessed key space separation by the prefix 9. QPS of each operation and each column family 10. Top K QPS and their accessed prefix range Test: 1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge 2. Generated the trace and analyze the trace Implemented but not tested (due to the limitation of trace_replay): 1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing 2. Analyzing the number of Key found by Get Future Work: 1. Support execution time analyzing of each request 2. Support cache hit situation and block read situation of Get Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091 Differential Revision: D9256157 Pulled By: zhichao-cao fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6	2018-08-13 11:44:02 -07:00
Zhichao Cao	6d75319d95	Add tracing function of Seek() and SeekForPrev() to trace_replay (#4228 ) Summary: In the current trace and replay, Get and WriteBatch are traced. This pull request records Seek() and SeekForPrev() in the trace file. <target_key, timestamp, column_family_id> are written to the file. Replay of Iterator is not supported in the current implementation. Tested with trace_analyzer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228 Differential Revision: D9201381 Pulled By: zhichao-cao fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab	2018-08-10 17:57:40 -07:00
Maysam Yabandeh	caf0f53a74	Index value delta encoding (#3983 ) Summary: Given that index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding on the values. The first value at each index restart interval encodes the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size, which helps use the block cache more efficiently. The feature is enabled by using format_version 4. The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size. Results with sysbench read-only using 4k blocks and using 16 index restart interval: Format 2: 19585 rocksdb read-only range=100 Format 3: 19569 rocksdb read-only range=100 Format 4: 19352 rocksdb read-only range=100 Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983 Differential Revision: D8361343 Pulled By: maysamyabandeh fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651	2018-08-09 16:58:40 -07:00
Jingguo Yao	ceb5fea1e3	Improve FullFilterBitsReader::HashMayMatch's doc (#4202 ) Summary: HashMayMatch is related to AddKey() instead of CreateFilter(). Also applies some minor Fixes #4191 #4200 #3910 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202 Differential Revision: D9180945 Pulled By: maysamyabandeh fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6	2018-08-06 11:13:18 -07:00
Gustav Davidsson	a15354d04e	Expose GetTotalTrashSize in SstFileManager interface (#4206 ) Summary: Hi, it would be great if we could expose this API, so that LogDevice can use it to track the total size of trash files and alarm if it grows too large in relation to disk size. There's probably other customers that would be interested in this as well. :) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4206 Differential Revision: D9115516 Pulled By: gdavidsson fbshipit-source-id: f34993a940e39cb0a0b544ae8298546499b7e047	2018-08-04 17:57:48 -07:00
Sagar Vemuri	12b6cdeed3	Trace and Replay for RocksDB (#3837 ) Summary: A framework for tracing and replaying RocksDB operations. A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench. - Column-families are supported - Multi-threaded tracing is supported. - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it. - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations. Currently supported DB operations: - Writes: -- Put -- Merge -- Delete -- SingleDelete -- DeleteRange -- Write - Reads: -- Get (point lookups) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837 Differential Revision: D7974837 Pulled By: sagar0 fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72	2018-08-01 00:27:08 -07:00
Andrew Kryczka	a1a546a634	Avoid integer division in filter probing (#4071 ) Summary: The cache line size was computed dynamically based on the length of the filter bits, and the number of cache-lines encoded in the footer. This calculation had to be dynamic in case users migrate their data between platforms with different cache line sizes. The downside, though, was bloom filter probing became expensive as it did integer mod and division. However, since we know all possible cache line sizes are powers of two, we should be able to use bit shift to find the cache line, and bitwise-and to find the bit within the cache line. To do this, we compute the log-base-two of cache line size in the constructor, and use that in bitwise operations to replace division/mod. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4071 Differential Revision: D8684067 Pulled By: ajkr fbshipit-source-id: 50298872fba5acd01e8269cd7abcc51a095e0f61	2018-07-30 17:57:44 -07:00
Manuel Ung	ea212e5316	WriteUnPrepared: Implement unprepared batches for transactions (#4104 ) Summary: This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch. Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`. A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104 Differential Revision: D8785717 Pulled By: lth fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91	2018-07-24 00:13:18 -07:00
Chang Su	374c37da5b	move static msgs out of Status class (#4144 ) Summary: The member msgs of class Status contains all types of status messages. When users dump a Status object, msgs will confuse users. So move it out of class Status by making it a file-local static variable. Closes #3831 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/4144 Differential Revision: D8941419 Pulled By: sagar0 fbshipit-source-id: 56b0510258465ff26db15aa6b04e01532e053e3d	2018-07-23 15:44:16 -07:00
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	2018-07-20 14:43:38 -07:00
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	2018-07-20 14:43:37 -07:00
Dmitri Smirnov	78ab11cd71	Return new operator for Status allocations for Windows (#4128 ) Summary: Windows requires new/delete for memory allocations to be overridden. Refactor to be less intrusive. Differential Revision: D8878047 Pulled By: siying fbshipit-source-id: 35f2b5fec2f88ea48c9be926539c6469060aab36	2018-07-19 15:09:06 -07:00
Siying Dong	4bb1e239b5	Cap concurrent arena's shard block size to 128KB (#4147 ) Summary: Users sometimes see their memtable size far smaller than expected. They probably have hit fragmentation of shard blocks. Cap their size anyway to reduce the impact of the problem. 128KB is conservative so I don't imagine it can cause any performance problem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4147 Differential Revision: D8886706 Pulled By: siying fbshipit-source-id: 8528a2a4196aa4457274522e2565fd3ff28f621e	2018-07-18 10:43:54 -07:00
Fenggang Wu	5a59ce4149	Coding.h: Added Fixed16 support (#4142 ) Summary: Added Get Put Encode Decode support for Fixed16 (uint16_t). Unit test added in `coding_test.cc` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4142 Differential Revision: D8873516 Pulled By: fgwu fbshipit-source-id: 331913e0a9a8fe9c95606a08e856e953477d64d3	2018-07-16 23:43:41 -07:00
Zhongyi Xie	91d7c03cdc	Exclude time waiting for rate limiter from rocksdb.sst.read.micros (#4102 ) Summary: Our "rocksdb.sst.read.micros" stat includes time spent waiting for rate limiter. It probably only affects people rate limiting compaction reads, which is fairly rare. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4102 Differential Revision: D8848506 Pulled By: miasantreble fbshipit-source-id: 01258ac5ae56e4eee372978cfc9143a6869f8bfc	2018-07-13 18:44:14 -07:00
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	2018-07-13 17:27:39 -07:00
Fosco Marotto	8527012bb6	Converted db/merge_test.cc to use gtest (#4114 ) Summary: Picked up a task to convert this to use the gtest framework. It can't be this simple, can it? It works, but should all the std::cout be removed? ``` [$] ~/git/rocksdb [gft !]: ./merge_test [==========] Running 2 tests from 1 test case. [----------] Global test environment set-up. [----------] 2 tests from MergeTest [ RUN ] MergeTest.MergeDbTest Test read-modify-write counters... a: 3 1 2 a: 3 b: 1225 3 Compaction started ... Compaction ended a: 3 b: 1225 Test merge-based counters... a: 3 1 2 a: 3 b: 1225 3 Test merge in memtable... a: 3 1 2 a: 3 b: 1225 3 Test Partial-Merge Test merge-operator not set after reopen [ OK ] MergeTest.MergeDbTest (93 ms) [ RUN ] MergeTest.MergeDbTtlTest Opening database with TTL Test read-modify-write counters... a: 3 1 2 a: 3 b: 1225 3 Compaction started ... Compaction ended a: 3 b: 1225 Test merge-based counters... a: 3 1 2 a: 3 b: 1225 3 Test merge in memtable... Opening database with TTL a: 3 1 2 a: 3 b: 1225 3 Test Partial-Merge Opening database with TTL Opening database with TTL Opening database with TTL Opening database with TTL Test merge-operator not set after reopen [ OK ] MergeTest.MergeDbTtlTest (97 ms) [----------] 2 tests from MergeTest (190 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test case ran. (190 ms total) [ PASSED ] 2 tests. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4114 Differential Revision: D8822886 Pulled By: gfosco fbshipit-source-id: c299d008e883c3bb911d2b357a2e9e4423f8e91a	2018-07-13 14:13:07 -07:00
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	2018-07-13 10:58:06 -07:00
Sagar Vemuri	1c912196de	Remove external tracking of AlignedBuffer's size (#4105 ) Summary: Remove external tracking of AlignedBuffer's size in `ReadaheadRandomAccessFile` and `FilePrefetchBuffer`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4105 Differential Revision: D8805724 Pulled By: sagar0 fbshipit-source-id: d61d8c203c7c500e3f36e912132d7852026ed023	2018-07-11 15:57:49 -07:00
Sagar Vemuri	440621aab8	Fix Copying of data between buffers in FilePrefetchBuffer (#4100 ) Summary: Copy data between buffers inside FilePrefetchBuffer only when chunk length is greater than 0. Otherwise AlignedBuffer was accessing memory out of its range causing crashes. Removing the tracking of buffer length outside of `AlignedBuffer`, i.e. in `FilePrefetchBuffer` and `ReadaheadRandomAccessFile`, will follow in a separate PR, as it is not the root cause of the crash reported in #4051. (`FilePrefetchBuffer` itself has been this way from its inception, and `ReadaheadRandomAccessFile` was updated to add the buffer length at some point). Comprehensive tests for `FilePrefetchBuffer` also to follow in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4100 Differential Revision: D8792590 Pulled By: sagar0 fbshipit-source-id: 3578f45761cf6884243e767f749db4016ccc93e1	2018-07-11 12:28:13 -07:00
Siying Dong	926f3a78a6	In delete scheduler, before ftruncate file for slow delete, check whether there are other hard links (#4093 ) Summary: Right now slow deletion with ftruncate doesn't work well with checkpoints because it ruins hard linked files in checkpoints. To fix it, check that the file has no other hard link before ftruncating it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4093 Differential Revision: D8730360 Pulled By: siying fbshipit-source-id: 756eea5bce8a87b9a2ea3a5bfa190b2cab6f75df	2018-07-09 15:28:12 -07:00
Andrew Kryczka	25403c2265	Prefetch cache lines for filter lookup (#4068 ) Summary: Since the filter data is unaligned, even though we ensure all probes are within a span of `cache_line_size` bytes, those bytes can span two cache lines. In that case I doubt hardware prefetching does a great job considering we don't necessarily access those two cache lines in order. This guess seems correct since adding explicit prefetch instructions reduced filter lookup overhead by 19.4%. Closes https://github.com/facebook/rocksdb/pull/4068 Differential Revision: D8674189 Pulled By: ajkr fbshipit-source-id: 747427d9a17900151c17820488e3f7efe06b1871	2018-06-28 13:20:29 -07:00
Anand Ananthabhotla	52d4c9b7f6	Allow DB resume after background errors (#3997 ) Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd	2018-06-28 12:34:40 -07:00
Daniel Black	e5ae1bb465	Remove bogus gcc-8.1 warning (#3870 ) Summary: Various rearrangements of the cch maths failed or replacing = '\0' with memset failed to convince the compiler it was nul terminated. So took the perverse option of changing strncpy to strcpy. Return null if memory couldn't be allocated. util/status.cc: In static member function ‘static const char* rocksdb::Status::CopyState(const char*)’: util/status.cc:28:15: error: ‘char* strncpy(char*, const char*, size_t)’ output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation] std::strncpy(result, state, cch - 1); ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~ util/status.cc:19:18: note: length computed here std::strlen(state) + 1; // +1 for the null terminator ~~~~~~~~~~~^~~~~~~ cc1plus: all warnings being treated as errors make: *** [Makefile:645: shared-objects/util/status.o] Error 1 closes #2705 Closes https://github.com/facebook/rocksdb/pull/3870 Differential Revision: D8594114 Pulled By: anand1976 fbshipit-source-id: ab20f3a456a711e4d29144ebe630e4fe3c99ec25	2018-06-27 12:23:07 -07:00
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	2018-06-26 15:57:26 -07:00
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: The top-level index in partitioned index/filter blocks is small and could be pinned in memory. So far we achieve that by setting cache_index_and_filter_blocks to false. This however makes it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter, which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pins it to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	2018-06-22 15:27:46 -07:00
Sagar Vemuri	7103559f49	Improve direct IO range scan performance with readahead (#3884 ) Summary: This PR extends the improvements in #3282 to also work when using Direct IO. We see 4.5X performance improvement in seekrandom benchmark doing long range scans, when using direct reads, on flash. Description: This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Implementation Details: - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead. - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled. - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer. - Made sure not to re-read partial chunks of data that were already available in the buffer, from device again. - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date. Constraints: - Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously. - Enabled only for user reads, and disabled for compactions. 
Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them. Benchmarks: I used the same benchmark as used in #3282. Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` ``` Before: seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found) With this change: seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found) ``` ~4.5X perf improvement. Taken on an average of 3 runs. Closes https://github.com/facebook/rocksdb/pull/3884 Differential Revision: D8082143 Pulled By: sagar0 fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb	2018-06-21 11:13:08 -07:00
Yanqin Jin	524c6e6b72	Add file name info to SequentialFileReader. (#4026 ) Summary: We potentially need this information for tracing, profiling and diagnosis. Closes https://github.com/facebook/rocksdb/pull/4026 Differential Revision: D8555214 Pulled By: riversand963 fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2	2018-06-21 08:42:24 -07:00
Tomas Kolda	906a602c2c	Build and tests fixes for Solaris Sparc (#4000 ) Summary: Here are some fixes for build on Solaris Sparc. It is also fixing CRC test on BigEndian platforms. Closes https://github.com/facebook/rocksdb/pull/4000 Differential Revision: D8455394 Pulled By: ajkr fbshipit-source-id: c9289a7b541a5628139c6b77e84368e14dc3d174	2018-06-15 12:42:53 -07:00
Andrew Kryczka	1f32dc7d2b	Check with PosixEnv before opening LOCK file (#3993 ) Summary: Rebased and resubmitting #1831 on behalf of stevelittle. The problem is when a single process attempts to open the same DB twice, the second attempt fails due to LOCK file held. If the second attempt had opened the LOCK file, it'll now need to close it, and closing causes the file to be unlocked. Then, any subsequent attempt to open the DB will succeed, which is the wrong behavior. The solution was to track which files a process has locked in PosixEnv, and check those before opening a LOCK file. Fixes #1780. Closes https://github.com/facebook/rocksdb/pull/3993 Differential Revision: D8398984 Pulled By: ajkr fbshipit-source-id: 2755fe66950a0c9de63075f932f9e15768041918	2018-06-13 17:32:04 -07:00
Andrew Kryczka	a720401877	Avoid acquiring SyncPoint mutex when it is disabled (#3991 ) Summary: In `db_stress` profile the vast majority of CPU time is spent acquiring the `SyncPoint` mutex. I mistakenly assumed #3939 had fixed this mutex contention problem by disabling `SyncPoint` processing. But actually the lock was still being acquired just to check whether processing is enabled. We can avoid that overhead by using an atomic to track whether it's enabled. Closes https://github.com/facebook/rocksdb/pull/3991 Differential Revision: D8393825 Pulled By: ajkr fbshipit-source-id: 5bc4e3c722ee7304e7a9c2439998c456b05a6897	2018-06-13 13:13:18 -07:00
Zhongyi Xie	45b6bcca98	ZSTD compression: should also expect type = kZSTDNotFinalCompression (#3964 ) Summary: Depending on the compression type, `CompressBlock` calls the compress method for each compression type. It calls ZSTD_Compress for both kZSTD and kZSTDNotFinalCompression (https://github.com/facebook/rocksdb/blob/master/table/block_based_table_builder.cc#L169). However currently ZSTD_Compress only expects the type to be kZSTD and this is causing assert failures and crashes. The same also applies to ZSTD_Uncompress. Closes https://github.com/facebook/rocksdb/pull/3964 Differential Revision: D8308715 Pulled By: miasantreble fbshipit-source-id: e5125f53edb829c9c33733167bec74e4793d0782	2018-06-06 23:42:29 -07:00
Zhongyi Xie	f1592a06c2	run make format for PR 3838 (#3954 ) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that trigger lint warnings. Run `make format` to fix formatting as suggested by siying. Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f	2018-06-05 12:58:02 -07:00
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	2018-06-04 20:13:00 -07:00
Andrew Kryczka	2210152947	Fix singleton destruction order of PosixEnv and SyncPoint (#3951 ) Summary: Ensure the PosixEnv singleton is destroyed first since its destructor waits for background threads to all complete. This ensures background threads cannot hit sync points after the SyncPoint singleton is destroyed, which was previously possible. Closes https://github.com/facebook/rocksdb/pull/3951 Differential Revision: D8265295 Pulled By: ajkr fbshipit-source-id: 7738dd458c5d993a78377dd0420e82badada81ab	2018-06-04 15:58:46 -07:00
Dmitri Smirnov	f4b72d7056	Provide a way to override windows memory allocator with jemalloc for ZSTD Summary: Windows does not have an LD_PRELOAD mechanism to override all memory allocation functions, and ZSTD makes use of C-runtime calloc. During flushes and compactions the default system allocator fragments and the system slows down considerably. For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache the ZSTD context within the block based table builder while a new SST file is being built; this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance. The change does not address random reads, and currently on Windows reads with ZSTD regress as compared with SNAPPY compression. Closes https://github.com/facebook/rocksdb/pull/3838 Differential Revision: D8229794 Pulled By: miasantreble fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d	2018-06-04 12:12:48 -07:00
Maysam Yabandeh	402b7aa07f	Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys, like the keys in the data blocks, are therefore internal keys, which means that in addition to the user key each also has 8 bytes that encode the sequence number and value type. These extra 8 bytes however are not necessary in index blocks since the index keys act as a separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which case the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index blocks. Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd	2018-05-25 18:42:43 -07:00
Andrew Kryczka	01bcc34896	Introduce library-independent default compression level Summary: Previously we were using -1 as the default for every library, which was legacy from our zlib options. That worked for a while, but after zstd introduced `a146ee04ae`, it started giving poor compression ratios by default in zstd. This PR adds a constant to RocksDB public API, `CompressionOptions::kDefaultCompressionLevel`, which will get translated to the default value specific to the compression library being used in "util/compression.h". The constant uses a number that appears to be larger than any library's maximum compression level. Closes https://github.com/facebook/rocksdb/pull/3895 Differential Revision: D8125780 Pulled By: ajkr fbshipit-source-id: 2db157a89118cd4f94577c2f4a0a5ff31c8391c6	2018-05-23 18:42:08 -07:00
Andrew Kryczka	7b655214d2	Assert keys/values pinned by range deletion meta-block iterators Summary: `RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()`, so it requires that the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion. Closes https://github.com/facebook/rocksdb/pull/3875 Differential Revision: D8063667 Pulled By: ajkr fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41	2018-05-21 09:57:00 -07:00
Siying Dong	17af09fcce	Implement key shortening functions in ReverseBytewiseComparator Summary: Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten the key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too, as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the first key is larger than the second. Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty. Closes https://github.com/facebook/rocksdb/pull/3836 Differential Revision: D7959762 Pulled By: siying fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683	2018-05-17 18:27:16 -07:00
Maysam Yabandeh	718c1c9c1f	Pass manual_wal_flush also to the first wal file Summary: Currently manual_wal_flush, if set in the options, will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also updates the related unit tests. This PR is built on top of https://github.com/facebook/rocksdb/pull/3756 Closes https://github.com/facebook/rocksdb/pull/3824 Differential Revision: D7909153 Pulled By: maysamyabandeh fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce	2018-05-14 10:57:56 -07:00
Andrew Kryczka	072ae671a7	Apply use_direct_io_for_flush_and_compaction to writes only Summary: Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used. This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously. Closes https://github.com/facebook/rocksdb/pull/3829 Differential Revision: D7915443 Pulled By: ajkr fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279	2018-05-09 19:42:58 -07:00
Andrew Kryczka	4bf169f07e	Disable readahead when using mmap for reads Summary: `ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the `mmap`'d memory region. This PR: - prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files - adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer Closes https://github.com/facebook/rocksdb/pull/3813 Differential Revision: D7891513 Pulled By: ajkr fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d	2018-05-08 12:13:18 -07:00
Maysam Yabandeh	cfb86659bf	WritePrepared Txn: enable rollback in stress test Summary: Rollback was disabled in stress test since there was a concurrency issue in WritePrepared rollback algorithm. The issue is fixed by caching the column family handles in WritePrepared to skip getting them from the db when needed for rollback. Tested by running transaction stress test under tsan. Closes https://github.com/facebook/rocksdb/pull/3785 Differential Revision: D7793727 Pulled By: maysamyabandeh fbshipit-source-id: d81ab6fda0e53186ca69944cfe0712ce4869451e	2018-05-02 18:13:05 -07:00
Siying Dong	63c965cdb4	Sync parent directory after deleting a file in delete scheduler Summary: sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want. Closes https://github.com/facebook/rocksdb/pull/3767 Differential Revision: D7760136 Pulled By: siying fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f	2018-04-26 13:58:20 -07:00
Maysam Yabandeh	e5a4dacf6d	WritePrepared Txn: disable rollback in stress test Summary: WritePrepared rollback implementation is not ready to be invoked in the middle of a workload. This is due to the lack of synchronization to obtain the cf handle from db. Temporarily disabling this until the problem with rollback is fixed. Closes https://github.com/facebook/rocksdb/pull/3772 Differential Revision: D7769041 Pulled By: maysamyabandeh fbshipit-source-id: 0e3b0ce679bc2afba82e653a40afa3f045722754	2018-04-26 09:27:55 -07:00
Gabriel Wicke	090c78a0d7	Support lowering CPU priority of background threads Summary: Background activities like compaction can negatively affect latency of higher-priority tasks like request processing. To avoid this, rocksdb already lowers the IO priority of background threads on Linux systems. While this takes care of typical IO-bound systems, it does not help much when CPU (temporarily) becomes the bottleneck. This is especially likely when using more expensive compression settings. This patch adds an API to allow for lowering the CPU priority of background threads, modeled on the IO priority API. Benchmarks (see below) show significant latency and throughput improvements when CPU bound. As a result, workloads with some CPU usage bursts should benefit from lower latencies at a given utilization, or should be able to push utilization higher at a given request latency target. A useful side effect is that compaction CPU usage is now easily visible in common tools, allowing for an easier estimation of the contribution of compaction vs. request processing threads. As with IO priority, the implementation is limited to Linux, degrading to a no-op on other systems. Closes https://github.com/facebook/rocksdb/pull/3763 Differential Revision: D7740096 Pulled By: gwicke fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c	2018-04-24 08:41:51 -07:00
Maysam Yabandeh	bb2a2ec731	WritePrepared Txn: rollback via commit Summary: Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) writing a single rollback batch with rollback_seq, ii) adding <rollback_seq, rollback_seq> to commit cache, iii) removing prepare_seq from PrepareHeap. This is correct assuming that there is no snapshot taken when a transaction is rolled back. This is the case with the way MySQL does rollback, which is after recovery. Otherwise if max_evicted_seq advances the prepare_seq, the live snapshot might assume data as committed since it does not find them in CommitCache. The change is to simply add <prepare_seq, rollback_seq> to commit cache before removing prepare_seq from PrepareHeap. In this way if max_evicted_seq advances prepare_seq, the existing mechanism that we have to check evicted entries against live snapshots will make sure that the live snapshot will not see the data of the rolled back transaction. Closes https://github.com/facebook/rocksdb/pull/3745 Differential Revision: D7696193 Pulled By: maysamyabandeh fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c	2018-04-20 15:28:19 -07:00
Zhongyi Xie	e1e826b980	check return status for Sync() and Append() calls to avoid corruption Summary: Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status is not checked. This could lead to potential corruption as later calls will overwrite the error status. Closes https://github.com/facebook/rocksdb/pull/3740 Differential Revision: D7678848 Pulled By: miasantreble fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77	2018-04-19 14:13:46 -07:00
Andrew Kryczka	3cea61392f	include thread-pool priority in thread names Summary: Previously threads were named "rocksdb:bg\<index in thread pool\>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to bottom-level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority. Closes https://github.com/facebook/rocksdb/pull/3702 Differential Revision: D7581415 Pulled By: ajkr fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1	2018-04-18 17:27:56 -07:00
Maysam Yabandeh	6d06be22c0	Improve db_stress with transactions Summary: db_stress was already capable of running transactions by setting use_txn. Running it under stress showed a couple of problems fixed in this patch. - The uncommitted transaction must be either rolled back or committed after recovery. - Current implementation of WritePrepared transaction cannot handle cf drop before crash. Clarified that in the comments and added safety checks. When running with use_txn, clear_column_family_one_in must be set to 0. Closes https://github.com/facebook/rocksdb/pull/3733 Differential Revision: D7654419 Pulled By: maysamyabandeh fbshipit-source-id: a024bad80a9dc99677398c00d29ff17d4436b7f3	2018-04-18 16:32:35 -07:00
Zhongyi Xie	954b496b3f	fix memory leak in two_level_iterator Summary: this PR fixes a few contbuild failures: 1. ASAN memory leak in Block::NewIterator (table/block.cc:429). The proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406 2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662 3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag. Closes https://github.com/facebook/rocksdb/pull/3718 Reviewed By: maysamyabandeh Differential Revision: D7621192 Pulled By: miasantreble fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146	2018-04-15 17:26:26 -07:00
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	2018-04-12 17:59:16 -07:00
Zhongyi Xie	2770a94c42	make MockTimeEnv::current_time_ atomic to fix data race Summary: fix a new TSAN failure https://gist.github.com/miasantreble/7599c33f4e17da1024c67d4540dbe397 Closes https://github.com/facebook/rocksdb/pull/3694 Differential Revision: D7565310 Pulled By: miasantreble fbshipit-source-id: f672c96e925797b34dec6e20b59527e8eebaa825	2018-04-10 14:13:18 -07:00
Dmitri Smirnov	5ec382b918	Fix up backupable_db stack corruption. Summary: Fix up OACR(Lint) warnings. Closes https://github.com/facebook/rocksdb/pull/3674 Differential Revision: D7563869 Pulled By: ajkr fbshipit-source-id: 8c1e5045c8a6a2d85b2933fdbc60fde93bf0c9de	2018-04-09 19:27:24 -07:00
Phani Shekhar Mantripragada	446b32cfc3	Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d	2018-04-05 19:58:20 -07:00
Sagar Vemuri	04c11b867d	Level Compaction with TTL Summary: Level Compaction with TTL. As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space. Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data. This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space. This functionality can be controlled by the newly introduced column family option -- ttl. TODO for later: - Make ttl mutable - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO) - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option. Closes https://github.com/facebook/rocksdb/pull/3591 Differential Revision: D7275442 Pulled By: sagar0 fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267	2018-04-02 22:14:28 -07:00
Amy Tai	1579626d0d	Enable cancelling manual compactions if they hit the sfm size limit Summary: Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true. Closes https://github.com/facebook/rocksdb/pull/3670 Differential Revision: D7457683 Pulled By: amytai fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5	2018-04-02 19:58:04 -07:00
Fosco Marotto	c3eb762bb0	Update 64-bit shift in compression.h Summary: This was failing the build on windows with zstd, warning treated as an error, 32-bit shift implicitly converted to 64-bit. Closes https://github.com/facebook/rocksdb/pull/3624 Differential Revision: D7307883 Pulled By: gfosco fbshipit-source-id: 68110e9b5b1b59b668dec6cf86b67556402574e7	2018-03-30 11:28:05 -07:00
Anand Ananthabhotla	f9f4d40f93	Align SST file data blocks to avoid spanning multiple pages Summary: Provide a block_align option in BlockBasedTableOptions to allow alignment of SST file data blocks. This will avoid higher IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages. When this option is set to true, the block alignment is set to the lower of block size and 4KB. Closes https://github.com/facebook/rocksdb/pull/3502 Differential Revision: D7400897 Pulled By: anand1976 fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44	2018-03-26 20:26:10 -07:00
Dmitri Smirnov	53d66df0c4	Refactor sync_point to make implementation either customizable or replaceable Summary: Closes https://github.com/facebook/rocksdb/pull/3637 Differential Revision: D7354373 Pulled By: ajkr fbshipit-source-id: 6816c7bbc192ed0fb944942b11c7074bf24eddf1	2018-03-23 12:56:52 -07:00
Siying Dong	118058ba69	SstFileManager: add bytes_max_delete_chunk Summary: Add `bytes_max_delete_chunk` in SstFileManager so that we can drop a large file in multiple batches. Closes https://github.com/facebook/rocksdb/pull/3640 Differential Revision: D7358679 Pulled By: siying fbshipit-source-id: ef17f0da2f5723dbece2669485a9b91b3edc0bb7	2018-03-22 15:58:37 -07:00
Chinmay Kamat	e003d22526	Fix FaultInjectionTestEnv to work with DirectIO Summary: Implemented PositionedAppend() and use_direct_io() for TestWritableFile. With these changes, FaultInjectionTestEnv can be used with DirectIO enabled. Closes https://github.com/facebook/rocksdb/pull/3586 Differential Revision: D7244305 Pulled By: yiwu-arbug fbshipit-source-id: f6b7aece53daa0f9977bc684164a0693693e514c	2018-03-14 00:57:24 -07:00
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	2018-03-08 10:27:25 -08:00
Bruce Mitchener	0de710f5b8	Use nullptr instead of NULL / 0 more consistently. Summary: Closes https://github.com/facebook/rocksdb/pull/3569 Differential Revision: D7170968 Pulled By: yiwu-arbug fbshipit-source-id: 308a6b7dd358a04fd9a7de3d927bfd8abd57d348	2018-03-07 12:42:12 -08:00
amytai	0a3db28d98	Disallow compactions if there isn't enough free space Summary: This diff handles cases where compaction causes an ENOSPC error. This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize. It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC. Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions. Closes https://github.com/facebook/rocksdb/pull/3449 Differential Revision: D7016941 Pulled By: amytai fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445	2018-03-06 16:27:54 -08:00
Andrew Kryczka	6a3eebbab0	support multiple db_paths in SstFileManager Summary: Now that files scheduled for deletion are kept in the same directory, we don't need to constrain deletion scheduler to `db_paths[0]`. Previously this was done because there was a separate trash directory, and this constraint prevented files from being accidentally copied to another filesystem when they're scheduled for deletion. Closes https://github.com/facebook/rocksdb/pull/3544 Differential Revision: D7093786 Pulled By: ajkr fbshipit-source-id: 202f5c92d925eafebec1281fb95bb5828d33414f	2018-03-06 12:43:51 -08:00
Fosco Marotto	d518fe1da6	uint64_t and size_t changes to compile for iOS Summary: In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported. This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types. Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience. Closes https://github.com/facebook/rocksdb/pull/3503 Differential Revision: D7106457 Pulled By: gfosco fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e	2018-03-06 12:43:51 -08:00
Dmitri Smirnov	c364eb42b5	Windows cumulative patch Summary: This patch addresses several issues. Portability including db_test std::thread -> port::Thread Cc: @ and %z to ROCKSDB portable macro. Cc: maysamyabandeh Implement Env::AreFilesSame Make the implementation of file unique number more robust Get rid of C-runtime and go directly to Windows API when dealing with file primitives. Implement GetSectorSize() and align unbuffered read on the value if available. Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976 Fix test running script issue where $status var was of incorrect scope so the failures were swallowed and not reported. DestroyDB() creates a logger and opens a LOG file in the directory being cleaned up. This holds a lock on the folder and the cleanup is prevented. This fails one of the checkpoint tests. We observe the same in production. We close the log file in this change. Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test attempts to open a directory with NewRandomAccessFile which does not work on Windows. Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug Closes https://github.com/facebook/rocksdb/pull/3552 Differential Revision: D7156304 Pulled By: siying fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b	2018-03-06 11:57:43 -08:00
Maysam Yabandeh	62277e15c3	WritePrepared Txn: Move DuplicateDetector to util Summary: Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test. Closes https://github.com/facebook/rocksdb/pull/3567 Differential Revision: D7163268 Pulled By: maysamyabandeh fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778	2018-03-05 23:57:12 -08:00
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	2018-03-05 13:13:41 -08:00
Yi Wu	1209b6db5c	Blob DB: remove existing garbage collection implementation Summary: Red diff to remove the existing implementation of garbage collection. The current approach is a reference-counting kind of approach and requires a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that. CompactionEventListener was added solely for blob db and it adds complexity and overhead to the compaction iterator. Removing it as well. Closes https://github.com/facebook/rocksdb/pull/3551 Differential Revision: D7130190 Pulled By: yiwu-arbug fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df	2018-03-02 12:57:23 -08:00
Anand Ananthabhotla	dfbe52e099	Fix the Logger::Close() and DBImpl::Close() design pattern Summary: The recent Logger::Close() and DBImpl::Close() implementation rely on calling the CloseImpl() virtual function from the destructor, which will not work. Refactor the implementation to have a private close helper function in derived classes that can be called by both CloseImpl() and the destructor. Closes https://github.com/facebook/rocksdb/pull/3528 Reviewed By: gfosco Differential Revision: D7049303 Pulled By: anand1976 fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4	2018-02-23 13:57:26 -08:00
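The pattern described in the entry above can be sketched as follows. This is a minimal illustration, not RocksDB's actual code: `BaseLogger`, `FileLogger`, and `CloseHelper` are hypothetical names. A virtual call made from a base-class destructor cannot reach the derived override, because the derived part of the object has already been destroyed, so the derived class keeps a private non-virtual helper that both `CloseImpl()` and its own destructor call.

```cpp
#include <cassert>

// Hypothetical base class for illustration only.
class BaseLogger {
 public:
  virtual ~BaseLogger() {}

  // Public entry point; dispatches to the most-derived CloseImpl()
  // while the object is still fully alive.
  bool Close() {
    if (closed_) return true;
    closed_ = true;
    return CloseImpl();
  }

 protected:
  virtual bool CloseImpl() { return true; }
  bool closed_ = false;
};

// Hypothetical derived class for illustration only.
class FileLogger : public BaseLogger {
 public:
  explicit FileLogger(int* close_count) : close_count_(close_count) {}

  ~FileLogger() override {
    // Release resources here rather than relying on the base destructor:
    // by the time ~BaseLogger runs, this override is no longer reachable.
    if (!closed_) {
      closed_ = true;
      CloseHelper();
    }
  }

 protected:
  bool CloseImpl() override { return CloseHelper(); }

 private:
  // Non-virtual helper shared by CloseImpl() and the destructor.
  bool CloseHelper() {
    ++*close_count_;
    return true;
  }

  int* close_count_;
};
```

With this layout the close logic runs exactly once, whether the caller invokes `Close()` explicitly or simply destroys the object.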
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	2018-02-22 12:43:17 -08:00
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	2018-02-22 09:44:23 -08:00
Mike Kolupaev	97307d888f	Fix deadlock in ColumnFamilyData::InstallSuperVersion() Summary: Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_. This deadlock is hit all the time on our workload. It blocks our release. In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so. So, probably the only safe way to use ThreadLocalPtr callbacks is to do only simple and lock-free things in them. This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup. I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine. Closes https://github.com/facebook/rocksdb/pull/3510 Reviewed By: sagar0 Differential Revision: D7005346 Pulled By: al13n321 fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f	2018-02-16 08:13:34 -08:00
Andrew Kryczka	0454f781c2	fix advance reservation of arena block addresses Summary: Calling `std::vector::reserve()` causes memory to be reallocated and then data to be moved. It was called prior to adding every block. This reallocation could be done a huge number of times, e.g., for users with large index blocks. Instead, we can simply use `std::vector::emplace_back()` in such a way that preserves the no-memory-leak guarantee, while letting the vector decide when to reallocate space. Now I see reallocation/moving happen O(logN) times, rather than O(N) times, where N is the final size of vector. Closes https://github.com/facebook/rocksdb/pull/3508 Differential Revision: D6994228 Pulled By: ajkr fbshipit-source-id: ab7c11e13ff37c8c6c8249be7a79566a4068cd27	2018-02-15 19:41:52 -08:00
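The reallocation behavior described in the entry above can be observed with a small experiment. This is only a sketch: `CountReallocations` is a hypothetical helper, it uses a plain `std::vector<int>` rather than the arena's block list, and it does not model the no-memory-leak guarantee the commit mentions. It counts how often the vector's capacity changes when `reserve(size() + 1)` is called before every append, compared with letting `emplace_back()` grow the vector on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: appends n ints and counts how many times the
// vector's capacity changes (each change implies a reallocation).
// reserve_each_time mimics reserving room for exactly one more element
// before every append.
std::size_t CountReallocations(std::size_t n, bool reserve_each_time) {
  std::vector<int> v;
  std::size_t reallocations = 0;
  std::size_t last_capacity = v.capacity();
  for (std::size_t i = 0; i < n; ++i) {
    if (reserve_each_time) {
      v.reserve(v.size() + 1);
    }
    v.emplace_back(static_cast<int>(i));
    if (v.capacity() != last_capacity) {
      ++reallocations;
      last_capacity = v.capacity();
    }
  }
  return reallocations;
}
```

On common standard library implementations, `reserve()` allocates exactly the requested capacity, so the first variant tends to reallocate on every append, while the second grows geometrically. The exact counts are implementation-dependent.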
jsteemann	4e7a182d09	Several small "fixes" Summary: - removed a few unneeded variables - fused some variable declarations and their assignments - fixed right-trimming code in string_util.cc to not underflow - simplified an assertion - moved non-nullptr check assertion before dereferencing of that pointer - pass an std::string function parameter by const reference instead of by value (avoiding potential copy) Closes https://github.com/facebook/rocksdb/pull/3507 Differential Revision: D7004679 Pulled By: sagar0 fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4	2018-02-15 16:57:37 -08:00
Siying Dong	b3c5351335	Direct I/O writable file should do fsync in Close() Summary: We don't do fsync() after truncate in the direct I/O writable file (in fact we never do any fsync). This can leave metadata not persisted to disk after the file is generated. We now call fsync() in Close(). Closes https://github.com/facebook/rocksdb/pull/3500 Differential Revision: D6981482 Pulled By: siying fbshipit-source-id: 7e2b591b7e5dd1b96fc0775515b8b9e6092980ef	2018-02-13 16:27:11 -08:00
Siying Dong	74748611a8	Suppress UBSAN error in finer granularity Summary: Currently we suppress the alignment UBSAN error as a whole. Suppressing it only for 3-way CRC and murmurhash feels like a better idea than turning off the alignment check as a whole. Closes https://github.com/facebook/rocksdb/pull/3495 Differential Revision: D6971273 Pulled By: siying fbshipit-source-id: 080b59fed6df494b9f622ef7cb5d42d39e6a8cdf	2018-02-13 12:18:07 -08:00
Chinmay Kamat	9fc72d6f16	Compilation fixes for powerpc build, -Wparentheses-equality error and missing header guards Summary: This pull request contains miscellaneous compilation fixes. Thanks, Chinmay Closes https://github.com/facebook/rocksdb/pull/3462 Differential Revision: D6941424 Pulled By: sagar0 fbshipit-source-id: fe9c26507bf131221f2466740204bff40a15614a	2018-02-09 14:12:43 -08:00
Mike Kolupaev	cb5b8f2090	Fix use-after-free in tailing iterator with merge operator Summary: ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws Closes https://github.com/facebook/rocksdb/pull/3415 Differential Revision: D6817414 Pulled By: al13n321 fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24	2018-02-02 21:26:28 -08:00
Jun Wu	e502839e25	crc32: suppress -Wimplicit-fallthrough warnings Summary: Workaround a bunch of "implicit-fallthrough" compiler errors, like: ``` util/crc32c.cc:533:7: error: this statement may fall through [-Werror=implicit-fallthrough=] crc = _mm_crc32_u64(crc, (uint64_t)(buf + offset)); ^ util/crc32c.cc:1016:9: note: in expansion of macro ‘CRCsinglet’ CRCsinglet(crc0, next, -2 * 8); ^~~~~~~~~~ util/crc32c.cc:1017:7: note: here case 1: ``` Closes https://github.com/facebook/rocksdb/pull/3339 Reviewed By: sagar0 Differential Revision: D6874736 Pulled By: quark-zju fbshipit-source-id: eec9f3bc135e12fca336928d01711006d5c3cb16	2018-02-01 14:27:42 -08:00
Andrew Kryczka	b78ed0460b	fix ReadaheadRandomAccessFile/iterator prefetch bug Summary: `ReadaheadRandomAccessFile` is used by iterators for file reads in several cases, like in compaction when `compaction_readahead_size > 0` or `use_direct_io_for_flush_and_compaction == true`, or in user iterator when `ReadOptions::readahead_size > 0`. `ReadaheadRandomAccessFile` maintains an internal buffer for readahead data. It assumes that, if the buffer's length is less than `ReadaheadRandomAccessFile::readahead_size_`, which is fixed in the constructor, then EOF has been reached so it doesn't try reading further. Recently, `d938226af4` started calling `RandomAccessFile::Prefetch` with various lengths: 8KB, 16KB, etc. When the `RandomAccessFile` is a `ReadaheadRandomAccessFile`, it triggers the above condition and incorrectly determines EOF. If a block is partially in the readahead buffer and EOF is incorrectly decided, the result is a truncated data block. The problem is reproducible: ``` TEST_TMPDIR=/data/compaction_bench ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -block_size=18384 -use_direct_io_for_flush_and_compaction=true ... put error: Corruption: truncated block read from /data/compaction_bench/dbbench/000014.sst offset 20245, expected 10143 bytes, got 8427 ``` Closes https://github.com/facebook/rocksdb/pull/3454 Differential Revision: D6869405 Pulled By: ajkr fbshipit-source-id: 87001c299e7600a37c0dcccbd0368e0954c929cf	2018-02-01 09:42:09 -08:00
Fosco Marotto	6efa8e270c	Update endif/else behavior for unreachable code error on Windows. Summary: Per #3367 Closes https://github.com/facebook/rocksdb/pull/3389 Differential Revision: D6766126 Pulled By: gfosco fbshipit-source-id: e441a15e8aec6747c613d68f4f0621b605eb48a0	2018-01-31 12:13:00 -08:00
Maysam Yabandeh	4927b4e662	Rounddown in FilePrefetchBuffer::Prefetch Summary: FilePrefetchBuffer::Prefetch currently rounds the offset up, which does not fit its new use cases in prefetching index/filter blocks, as it would skip over some of the offsets that were requested to be prefetched. This patch rounds down instead. Fixes #3180 Closes https://github.com/facebook/rocksdb/pull/3413 Differential Revision: D6816392 Pulled By: maysamyabandeh fbshipit-source-id: 3aaeaf59c55d72b61dacfae6d4a8e65eccb3c553	2018-01-26 12:57:25 -08:00
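The effect of rounding direction on a prefetch offset can be illustrated with a short sketch. The helpers `RoundDown` and `RoundUp` below are hypothetical and are not taken from `FilePrefetchBuffer`; the 4096-byte alignment is likewise an assumed value chosen for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: largest multiple of alignment that is <= offset.
uint64_t RoundDown(uint64_t offset, uint64_t alignment) {
  return offset - (offset % alignment);
}

// Hypothetical helper: smallest multiple of alignment that is >= offset.
uint64_t RoundUp(uint64_t offset, uint64_t alignment) {
  return RoundDown(offset + alignment - 1, alignment);
}
```

For a requested offset of 5000 and an alignment of 4096, rounding up starts the read at 8192 and so skips bytes 5000 through 8191 that the caller asked for, whereas rounding down starts at 4096 and covers the requested offset.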
Anand Ananthabhotla	d0f1b49ab6	Add a Close() method to DB to return status when closing a db Summary: Currently, the only way to close an open DB is to destroy the DB object. There is no way for the caller to know the status. In one instance, the destructor encountered an error due to failure to close a log file on HDFS. In order to prevent silent failures, we add DB::Close() that calls CloseImpl() which must be implemented by its descendants. The main failure point in the destructor is closing the log file. This patch also adds a Close() entry point to Logger in order to get status. When DBOptions::info_log is allocated and owned by the DBImpl, it is explicitly closed by DBImpl::CloseImpl(). Closes https://github.com/facebook/rocksdb/pull/3348 Differential Revision: D6698158 Pulled By: anand1976 fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d	2018-01-16 11:08:57 -08:00
Wouter Beek	58b841b356	FIXED: string buffers potentially too small to fit formatted write Summary: This fixes the following warnings when compiled with GCC7: util/transaction_test_util.cc: In static member function ‘static rocksdb::Status rocksdb::RandomTransactionInserter::DBGet(rocksdb::DB, rocksdb::Transaction, rocksdb::ReadOptions&, uint16_t, uint64_t, bool, uint64_t, std::__cxx11::string, bool)’: util/transaction_test_util.cc:75:8: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=] Status RandomTransactionInserter::DBGet( ^~~~~~~~~~~~~~~~~~~~~~~~~ util/transaction_test_util.cc:84:11: note: ‘snprintf’ output between 5 and 6 bytes into a destination of size 5 snprintf(prefix_buf, sizeof(prefix_buf), "%.4u", set_i + 1); ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ util/transaction_test_util.cc: In static member function ‘static rocksdb::Status rocksdb::RandomTransactionInserter::Verify(rocksdb::DB, uint16_t, uint64_t, bool, rocksdb::Random64)’: util/transaction_test_util.cc:245:8: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=] Status RandomTransactionInserter::Verify(DB db, uint16_t num_sets, ^~~~~~~~~~~~~~~~~~~~~~~~~ util/transaction_test_util.cc:268:13: note: ‘snprintf’ output between 5 and 6 bytes into a destination of size 5 snprintf(prefix_buf, sizeof(prefix_buf), "%.4u", set_i + 1); ~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Closes https://github.com/facebook/rocksdb/pull/3295 Differential Revision: D6609411 Pulled By: maysamyabandeh fbshipit-source-id: 33f0add471056eb59db2f8bd4366e6dfbb1a187d	2017-12-20 08:12:22 -08:00
yingsu00	f54d7f5fea	Port 3 way SSE4.2 crc32c implementation from Folly Summary: # Summary RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does so in a single-way fashion (not pipelined on a single CPU core). Intel's whitepaper published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then uses the pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead of its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB calls crc32c on are over 4KB. Initial db_bench runs show tremendous CPU gain. This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If the user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm will be used. If the server does not have SSE4.2 at run time, the slow way (non-SSE) will be used. # Performance Test Results We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm. Before this change the CRC32 value computation took about 11.5% of total CPU cost, and with the new 3-way algorithm it is reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.
1) ReadRandom in db_bench overall metrics

PER RUN
Algorithm | run | micros/op | ops/sec | Throughput (MB/s)
3-way | 1 | 4.143 | 241387 | 26.7
3-way | 2 | 3.775 | 264872 | 29.3
3-way | 3 | 4.116 | 242929 | 26.9
FastCrc32c | 1 | 4.037 | 247727 | 27.4
FastCrc32c | 2 | 4.648 | 215166 | 23.8
FastCrc32c | 3 | 4.352 | 229799 | 25.4

AVG
Algorithm | Average of micros/op | Average of ops/sec | Average of Throughput (MB/s)
3-way | 4.01 | 249,729 | 27.63
FastCrc32c | 4.35 | 230,897 | 25.53

2) Crc32c computation CPU cost (inclusive samples percentage)

PER RUN
Implementation | run | TotalSamples | Crc32c percentage
3-way | 1 | 4,572,250,000 | 4.37%
3-way | 2 | 3,779,250,000 | 4.62%
3-way | 3 | 4,129,500,000 | 4.48%
FastCrc32c | 1 | 4,663,500,000 | 11.24%
FastCrc32c | 2 | 4,047,500,000 | 12.34%
FastCrc32c | 3 | 4,366,750,000 | 11.68%

# Test Plan
make -j64 corruption_test && ./corruption_test
By default it uses 3-way SSE algorithm
NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test
make clean && DEBUG_LEVEL=0 make -j64 db_bench
make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
Closes https://github.com/facebook/rocksdb/pull/3173 Differential Revision: D6330882 Pulled By: yingsu00 fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9	2017-12-19 18:26:49 -08:00
Andrew Kryczka	5a7e08468a	fix ThreadStatus for bottom-pri compaction threads Summary: added `ThreadType::BOTTOM_PRIORITY` which is used in the `ThreadStatus` object to indicate the thread is used for bottom-pri compactions. Previously there was a bug where we mislabeled such threads as `ThreadType::LOW_PRIORITY`. Closes https://github.com/facebook/rocksdb/pull/3270 Differential Revision: D6559428 Pulled By: ajkr fbshipit-source-id: 96b1a50a9c19492b1a5fd1b77cf7061a6f9f1d1c	2017-12-14 14:57:49 -08:00
Orvid King	b4d88d7128	Fix the build with MSVC 2017 Summary: There were a few places where MSVC's implicit truncation warnings were getting triggered, which was causing the MSVC build to fail due to warnings being treated as errors. This resolves the issues by making the truncations in some places explicit, and by making it so there are no truncations of literals. Fixes #3239 Supersedes #3259 Closes https://github.com/facebook/rocksdb/pull/3273 Reviewed By: yiwu-arbug Differential Revision: D6569204 Pulled By: Orvid fbshipit-source-id: c188cf1cf98d9acb6d94b71875041cc81f8ff088	2017-12-14 12:02:22 -08:00
Islam AbdelRahman	9089373a01	Fix DeleteScheduler::MarkAsTrash() handling existing trash Summary: DeleteScheduler::MarkAsTrash() doesn't handle existing .trash files correctly. This causes rocksdb to be unable to delete existing .trash files on restart. Closes https://github.com/facebook/rocksdb/pull/3261 Differential Revision: D6548003 Pulled By: IslamAbdelRahman fbshipit-source-id: c3800639412e587a690062c63076a5a08881e0e6	2017-12-12 18:17:13 -08:00
Souvik Banerjee	4bcb7fb148	Update transaction_test_util.cc Summary: Fixes a compile error on gcc 7.2.1 (-Werror=format-truncation=). Closes https://github.com/facebook/rocksdb/pull/3248 Differential Revision: D6546515 Pulled By: yiwu-arbug fbshipit-source-id: bd78cca63f2af376faceccb1838d2d4cc9208fef	2017-12-12 12:12:38 -08:00
Maysam Yabandeh	36911f55dd	WritePrepared Txn: stress test Summary: Augment the existing MySQLStyleTransactionTest to check for more corner case scenarios. The changes proved effective in revealing the bugs reported in https://github.com/facebook/rocksdb/pull/3205 and https://github.com/facebook/rocksdb/pull/3101 Closes https://github.com/facebook/rocksdb/pull/3222 Differential Revision: D6476862 Pulled By: maysamyabandeh fbshipit-source-id: 5068497702d67ffc206a58ed96f8578fbb510137	2017-12-06 09:42:28 -08:00
Andrew Kryczka	63f1c0a57d	fix gflags namespace Summary: I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google". Closes https://github.com/facebook/rocksdb/pull/3212 Differential Revision: D6456973 Pulled By: ajkr fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656	2017-12-01 10:42:05 -08:00
Gustav Davidsson	2d04ed65e4	Make trash-to-DB size ratio limit configurable Summary: Allow users to configure the trash-to-DB size ratio limit, so that ratelimits for deletes can be enforced even when larger portions of the database are being deleted. Closes https://github.com/facebook/rocksdb/pull/3158 Differential Revision: D6304897 Pulled By: gdavidsson fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3	2017-11-17 11:58:17 -08:00
Andrew Kryczka	e27f60b1c8	distinguish kZSTDNotFinalCompression in compression string Summary: This confused some users who were getting compression type from the logs. Closes https://github.com/facebook/rocksdb/pull/3153 Differential Revision: D6294964 Pulled By: ajkr fbshipit-source-id: 3c813376d33682dc6ccafc9a78df1a2e2528985e	2017-11-15 19:41:59 -08:00
Yi Wu	42564ada53	Blob DB: not using PinnableSlice move assignment Summary: The current implementation of PinnableSlice move assignment has an issue (#3163). We are moving away from it instead of trying to get the move assignment right, since it is too tricky. Closes https://github.com/facebook/rocksdb/pull/3164 Differential Revision: D6319201 Pulled By: yiwu-arbug fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48	2017-11-13 18:12:20 -08:00
Yi Wu	be410dede8	Fix PinnableSlice move assignment Summary: After move assignment, we need to re-initialize the moved PinnableSlice. Also update blob_db_impl.cc to not reuse the moved PinnableSlice since it is supposed to be in an undefined state after move. Closes https://github.com/facebook/rocksdb/pull/3127 Differential Revision: D6238585 Pulled By: yiwu-arbug fbshipit-source-id: bd99f2e37406c4f7de160c7dee6a2e8126bc224e	2017-11-03 18:13:21 -07:00
Prashant D	4c8f336401	util: Fix coverity issues Summary: util/concurrent_arena.h: CID 1396145 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 2. uninit_member: Non-static class member free_begin_ is not initialized in this constructor nor in any functions that it calls. 94 Shard() : allocated_and_unused_(0) {} util/dynamic_bloom.cc: 1. Condition hash_func == NULL, taking true branch. CID 1322821 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 3. uninit_member: Non-static class member data_ is not initialized in this constructor nor in any functions that it calls. 47 hash_func_(hash_func == nullptr ? &BloomHash : hash_func) {} 48 util/file_reader_writer.h: 204 private: 205 AlignedBuffer buffer_; member_not_init_in_gen_ctor: The compiler-generated constructor for this class does not initialize buffer_offset_. 206 uint64_t buffer_offset_; CID 1418246 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) member_not_init_in_gen_ctor: The compiler-generated constructor for this class does not initialize buffer_len_. 207 size_t buffer_len_; 208}; util/thread_local.cc: 341#endif CID 1322795 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 3. uninit_member: Non-static class member pthread_key_ is not initialized in this constructor nor in any functions that it calls. 342} 40struct ThreadData { 2. uninit_member: Non-static class member next is not initialized in this constructor nor in any functions that it calls. CID 1400668 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 4. uninit_member: Non-static class member prev is not initialized in this constructor nor in any functions that it calls. 41 explicit ThreadData(ThreadLocalPtr::StaticMeta* _inst) : entries(), inst(_inst) {} 42 std::vector<Entry> entries; 1. member_decl: Class member declaration for next. 43 ThreadData* next; 3. member_decl: Class member declaration for prev. 
44 ThreadData* prev; 45 ThreadLocalPtr::StaticMeta* inst; 46}; Closes https://github.com/facebook/rocksdb/pull/3123 Differential Revision: D6233566 Pulled By: sagar0 fbshipit-source-id: aa2068790ea69787a0035c0db39d59b0c25108db	2017-11-03 14:42:08 -07:00
Andrew Kryczka	cfb120f737	fix CopyFile status checks Summary: copied from internal diff D6156261 Closes https://github.com/facebook/rocksdb/pull/3124 Differential Revision: D6230167 Pulled By: ajkr fbshipit-source-id: 17926bb1152d607556364e3aacfec0ef3c115748	2017-11-03 11:57:10 -07:00
Yi Wu	d956169563	Fix clang build error Summary: Fix cast from size_t to unsigned int. Closes https://github.com/facebook/rocksdb/pull/3125 Differential Revision: D6232863 Pulled By: yiwu-arbug fbshipit-source-id: 4c6131168b1faec26f7820b2cf4a09c242d323b7	2017-11-03 11:26:54 -07:00
Andrew Kryczka	24ad430600	pass key/value samples through zstd compression dictionary generator Summary: Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before. Note this is the first step of #2987, extracted into a separate PR per reviewer request. Closes https://github.com/facebook/rocksdb/pull/3057 Differential Revision: D6116891 Pulled By: ajkr fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497	2017-11-02 22:56:36 -07:00
Zhongyi Xie	30e4e01e05	add missing else Summary: Closes https://github.com/facebook/rocksdb/pull/3121 Differential Revision: D6229415 Pulled By: miasantreble fbshipit-source-id: 57c7ad2fddf5dd6b8d7e3aaf6f62348151327dfb	2017-11-02 22:28:06 -07:00
Shaohua Li	33c7d4ccd9	Make writable_file_max_buffer_size dynamic Summary: The DBOptions::writable_file_max_buffer_size can be changed dynamically. Closes https://github.com/facebook/rocksdb/pull/3053 Differential Revision: D6152720 Pulled By: shligit fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681	2017-10-31 13:56:35 -07:00
Islam AbdelRahman	05993155ef	Mark files as trash by using .trash extension Summary: SstFileManager moves files that need to be deleted into a trash directory. Deprecate this behaviour and instead add a ".trash" extension to files that need to be deleted. Closes https://github.com/facebook/rocksdb/pull/2970 Differential Revision: D5976805 Pulled By: IslamAbdelRahman fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343	2017-10-27 13:27:12 -07:00
Sagar Vemuri	f0804db7f7	Make FIFO compaction options dynamically configurable Summary: ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now. Some of the ways in which the fifo compaction options can be set are: - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})` - `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})` - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})` - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})` Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes. Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`. The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now. Closes https://github.com/facebook/rocksdb/pull/3006 Differential Revision: D6058619 Pulled By: sagar0 fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92	2017-10-19 15:26:36 -07:00
Dmitri Smirnov	ebab2e2d42	Enable MSVC W4 with a few exceptions. Fix warnings and bugs Summary: Closes https://github.com/facebook/rocksdb/pull/3018 Differential Revision: D6079011 Pulled By: yiwu-arbug fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57	2017-10-19 10:57:12 -07:00
Nikhil Benesch	c0208dffbe	arena: derive alignment unit from std::max_align_t Summary: As raised in #2265, the arena allocator will return memory that is improperly aligned to store a `std::function` on macOS. Oddly, I'm unable to tickle this bug without adding a `std::function` field to `struct ReadOptions`—but my proposal in #2265 does exactly that. In any case, here's a simple reproduction. Apply this bogus patch to get a `std::function` into `struct ReadOptions` ``` --- a/include/rocksdb/options.h +++ b/include/rocksdb/options.h @@ -1035,6 +1035,8 @@ struct ReadOptions { // Default: 0 uint64_t max_skippable_internal_keys; + std::function<void()> foo; + ReadOptions(); ReadOptions(bool cksum, bool cache); }; ``` then compile `db_properties_test` with ubsan and run `ReadLatencyHistogramByLevel`: ``` $ make COMPILE_WITH_UBSAN=1 db_properties_test $ ./db_properties_test --gtest_filter=DBPropertiesTest.ReadLatencyHistogramByLevel ``` ubsan will complain about several misaligned accesses: ``` Note: Google Test filter = DBPropertiesTest.ReadLatencyHistogramByLevel [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. 
[----------] 1 test from DBPropertiesTest [ RUN ] DBPropertiesTest.ReadLatencyHistogramByLevel util/coding.h:372:12: runtime error: load of misaligned address 0x00010d85516c for type 'const unsigned long', which requires 8 byte alignment 0x00010d85516c: note: pointer points here 01 00 34 57 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 78 24 82 0a 01 00 00 00 ^ util/coding.h:362:3: runtime error: store to misaligned address 0x7fff5733fac4 for type 'unsigned long', which requires 8 byte alignment 0x7fff5733fac4: note: pointer points here 01 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80 1d 96 0d 01 00 00 00 ^ util/coding.h:372:12: runtime error: load of misaligned address 0x00010d85516c for type 'const unsigned long', which requires 8 byte alignment 0x00010d85516c: note: pointer points here 01 00 34 57 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 78 24 82 0a 01 00 00 00 ^ version_set.cc:854: runtime error: constructor call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ version_set.cc:512: runtime error: constructor call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ version_set.cc:505: runtime error: constructor call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ options.h:931: runtime error: constructor call on misaligned address 0x00010dbfa5e8 for type 
'rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ options.h:931: runtime error: constructor call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ functional:1583: runtime error: constructor call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1585:9: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1585:9: runtime error: store to misaligned address 0x00010dbfa648 for type '__base ' (aka '__base<void ()> '), which requires 16 byte alignment 0x00010dbfa648: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ db/version_set.cc:864:29: runtime error: upcast of misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:521:12: runtime error: member access within misaligned 
address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:521:12: runtime error: load of misaligned address 0x00010dbfa5d8 for type 'rocksdb::TableCache ', which requires 16 byte alignment 0x00010dbfa5d8: note: pointer points here 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 ^ db/version_set.cc:522:9: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:522:9: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ db/version_set.cc:522:24: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:522:38: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:522:57: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous 
namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:522:57: runtime error: load of misaligned address 0x00010dbfa678 for type 'rocksdb::RangeDelAggregator ', which requires 16 byte alignment 0x00010dbfa678: note: pointer points here 01 00 00 00 d0 a1 bf 0d 01 00 00 00 00 00 00 00 00 00 00 00 f8 db 70 0a 01 00 00 00 00 00 00 00 ^ db/version_set.cc:523:54: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:523:54: runtime error: load of misaligned address 0x00010dbfa668 for type 'rocksdb::HistogramImpl ', which requires 16 byte alignment 0x00010dbfa668: note: pointer points here 01 00 00 00 c8 88 a5 0d 01 00 00 00 00 00 00 00 01 00 00 00 d0 a1 bf 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:524:9: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:524:47: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:524:62: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte 
alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/table_cache.cc:228:33: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ table/block_based_table_reader.cc:1554:41: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ table/block_based_table_reader.cc:1396:21: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ include/rocksdb/options.h:931:8: runtime error: reference binding to misaligned address 0x00010dbfa628 for type 'const std::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1584:13: runtime error: load of misaligned address 0x00010dbfa648 for type '__base const' (aka '__base<void ()> const'), which requires 16 byte alignment 0x00010dbfa648: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c8 a5 97 0d 01 00 00 00 38 36 9b 0d ^ table/block_based_table_reader.cc:1555:24: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which 
requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ db/table_cache.cc:244:54: runtime error: load of misaligned address 0x00010dbfa618 for type 'const bool', which requires 16 byte alignment 0x00010dbfa618: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ db/table_cache.cc:246:49: runtime error: reference binding to misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ db/version_set.cc:532:12: runtime error: member access within misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 00 00 00 00 00 ^ db/version_set.cc:532:12: runtime error: member access within misaligned address 0x00010dbfa5e8 for type 'const rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ db/version_set.cc:532:26: runtime error: load of misaligned address 0x00010dbfa5f8 for type 'const rocksdb::Slice const', which requires 16 byte alignment 0x00010dbfa5f8: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ version_set.cc:493: runtime error: member call on misaligned address 0x00010dbfa5c8 for type 'rocksdb::(anonymous namespace)::LevelFileIteratorState', which requires 16 byte alignment 0x00010dbfa5c8: note: pointer points here 00 00 00 00 a0 db 70 0a 01 00 00 00 00 00 00 00 00 00 00 00 90 14 98 0d 01 00 00 
00 00 00 00 00 ^ version_set.cc:493: runtime error: member call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ options.h:931: runtime error: member call on misaligned address 0x00010dbfa5e8 for type 'rocksdb::ReadOptions', which requires 16 byte alignment 0x00010dbfa5e8: note: pointer points here 00 00 00 00 01 01 ff ff ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ options.h:931: runtime error: member call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ functional:1765: runtime error: member call on misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:9: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:9: runtime error: load of misaligned address 0x00010dbfa648 for type '__base ' (aka '__base<void ()> '), which requires 16 byte alignment 0x00010dbfa648: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c8 a5 97 0d 01 00 00 00 38 36 9b 0d ^ 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1766:27: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1768:14: runtime error: member access within misaligned address 0x00010dbfa628 for type 'std::__1::function<void ()>', which requires 16 byte alignment 0x00010dbfa628: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1768:14: runtime error: load of misaligned address 0x00010dbfa648 for type '__base ' (aka '__base<void ()> '), which requires 16 byte alignment 0x00010dbfa648: note: pointer points here 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c8 a5 97 0d 01 00 00 00 38 36 9b 0d ^ [ OK ] DBPropertiesTest.ReadLatencyHistogramByLevel (1599 ms) [----------] 1 test from DBPropertiesTest (1599 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (1599 ms total) [ PASSED ] 1 test. ``` So it seems the root cause is that the internal implementation of `std::function` on macOS (and perhaps with libc++ generally?) requires 16-byte aligned memory, but the arena allocator only guarantees that the returned memory will be `sizeof(void*)` aligned, which is only 8-byte alignment on my machine. This patch solves the problem by adjusting the allocator to derive the necessary alignment from `alignof(std::max_align_t)`, which is properly 16 bytes on my machine. 
As I mentioned in #2265, none of RocksDB's tests will cause this unaligned access to actually abort the process, but, on macOS, linking CockroachDB against a version of RocksDB with the above patch and letting it run for just a few seconds will cause a SIGABRT. ``` Process 19792 stopped thread #2, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT) frame #0: 0x0000000004f5e78f cockroach`DBNewIter + 95 cockroach`DBNewIter: -> 0x4f5e78f <+95>: callq 0x28(%rax) 0x4f5e792 <+98>: jmp 0x4f5e79e ; <+110> 0x4f5e794 <+100>: movq -0x50(%rbp), %rcx 0x4f5e798 <+104>: movq %rax, %rdi (lldb) bt thread #2, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT) * frame #0: 0x0000000004f5e78f cockroach`DBNewIter + 95 ``` I'd get you a backtrace, but [Go doesn't include cgo debug information on macOS](https://github.com/golang/go/issues/6942). I've also tried building against libc++ on Linux, where debug information would be available, but I can't seem to trigger the bug there. In any case, this PR both fixes the segfault in CockroachDB and fixes the warnings reported by ubsan. Closes https://github.com/facebook/rocksdb/pull/2347 Differential Revision: D5108596 Pulled By: yiwu-arbug fbshipit-source-id: bd5e4323b2ce915ed4fe78e123cb8996aec75a00	2017-10-17 11:13:19 -07:00
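The alignment fix described in the commit above can be illustrated with a minimal sketch. This is not RocksDB's `Arena`; `BumpArena` and its members are hypothetical names, and the only idea taken from the commit message is deriving the alignment unit from `alignof(std::max_align_t)` rather than from the pointer size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bump allocator (not RocksDB's Arena). Every returned pointer
// is aligned to alignof(std::max_align_t), which is enough for any object
// with a standard (non-extended) alignment requirement.
class BumpArena {
 public:
  static constexpr std::size_t kAlignUnit = alignof(std::max_align_t);
  static_assert((kAlignUnit & (kAlignUnit - 1)) == 0,
                "alignment unit must be a power of two");

  explicit BumpArena(std::size_t capacity) : buf_(capacity), used_(0) {}

  // Returns nullptr when the request does not fit in the remaining space.
  void* AllocateAligned(std::size_t bytes) {
    const std::uintptr_t cur =
        reinterpret_cast<std::uintptr_t>(buf_.data()) + used_;
    const std::size_t mod = cur & (kAlignUnit - 1);
    // Padding needed to reach the next aligned address.
    const std::size_t slop = (mod == 0) ? 0 : kAlignUnit - mod;
    if (used_ + slop + bytes > buf_.size()) {
      return nullptr;
    }
    char* result = buf_.data() + used_ + slop;
    used_ += slop + bytes;
    return result;
  }

 private:
  std::vector<char> buf_;
  std::size_t used_;
};
```

Aligning the actual address, rather than only the offset into the buffer, keeps the guarantee independent of how the backing buffer itself happens to be aligned.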
codeeply	f7843f30a8	Move ~Comparator define to comparator.h Summary: When I implement my own comparator and build in release mode, the following compile error occurs: undefined reference to `typeinfo for rocksdb::Comparator' This fix allows users who have their own comparator to build with RTTI off. Closes https://github.com/facebook/rocksdb/pull/3008 Differential Revision: D6077354 Pulled By: yiwu-arbug fbshipit-source-id: 914c26dbab72f0ad1f0e15f8666a3fb2f10bfed8	2017-10-17 09:58:13 -07:00
Andrew Kryczka	731895214b	db_bench randomtransaction print throughput Summary: print throughput in MB/s upon finishing randomtransaction benchmark Closes https://github.com/facebook/rocksdb/pull/3016 Differential Revision: D6070426 Pulled By: ajkr fbshipit-source-id: 69df43beed4c374a36d826e761ca3a83e1fdcbf5	2017-10-16 18:42:25 -07:00
Yi Wu	31d3e41810	PinnableSlice move assignment Summary: Allow `std::move(pinnable_slice)`. Closes https://github.com/facebook/rocksdb/pull/2997 Differential Revision: D6036782 Pulled By: yiwu-arbug fbshipit-source-id: 583fb0419a97e437ff530f4305822341cd3381fa	2017-10-12 18:28:24 -07:00
Kefu Chai	019aa7074c	cmake: pass "-msse4.2" to when building crc32c.cc if HAVE_SSE42 Summary: it turns out that, with the older GCC shipped with CentOS 7, the SSE4.2 intrinsics are not available even with "target" specified. so we need to pass "-msse4.2" both when checking the compiler's SSE4.2 support and when building crc32c.cc, which uses SSE4.2 intrinsics for crc32. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes https://github.com/facebook/rocksdb/pull/2950 Differential Revision: D6032298 Pulled By: siying fbshipit-source-id: 124c946321043661b3fb0a70b6cdf4c9c5126ab4	2017-10-11 12:26:46 -07:00
Andrew Kryczka	1026e794a3	rate limit auto-tuning Summary: Dynamic adjustment of rate limit according to demand for background I/O. It increases by a factor when limiter is drained too frequently, and decreases by the same factor when limiter is not drained frequently enough. The parameters for this behavior are fixed in `GenericRateLimiter::Tune`. Other changes: - make rate limiter's `Env*` configurable for testing - track num drain intervals in RateLimiter so we don't have to rely on stats, which may be shared across different DB instances from the ones that share the RateLimiter. Closes https://github.com/facebook/rocksdb/pull/2899 Differential Revision: D5858704 Pulled By: ajkr fbshipit-source-id: cc2bac30f85e7f6fd63655d0a6732ef9ed7403b1	2017-10-04 19:15:01 -07:00
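The auto-tuning rule in the commit above (raise the limit by a factor when the limiter is drained too often, lower it by the same factor when it is drained too rarely) can be sketched as follows. This is not the code in `GenericRateLimiter::Tune`; the struct, the function name, the watermark percentages, the adjustment step, and the clamping bounds are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical tuning parameters; the values are illustrative only.
struct TuneParams {
  int low_watermark_pct = 50;    // drained less often than this: slow down
  int high_watermark_pct = 90;   // drained more often than this: speed up
  int adjust_pct = 5;            // multiplicative step, in percent
  std::int64_t min_rate = 1;     // lower bound on bytes per second
  std::int64_t max_rate = 1 << 30;  // upper bound on bytes per second
};

// drained_pct is the percentage of recent refill intervals in which the
// limiter ran out of budget. Returns the new rate in bytes per second.
std::int64_t TuneRate(std::int64_t rate, int drained_pct,
                      const TuneParams& p) {
  std::int64_t next = rate;
  if (drained_pct > p.high_watermark_pct) {
    next = rate * (100 + p.adjust_pct) / 100;  // demand is high: raise limit
  } else if (drained_pct < p.low_watermark_pct) {
    next = rate * 100 / (100 + p.adjust_pct);  // demand is low: lower limit
  }
  return std::min(p.max_rate, std::max(p.min_rate, next));
}
```

Using the same factor in both directions means that one increase followed by one decrease returns roughly to the starting rate, up to integer rounding.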
Andrew Kryczka	5b2cb64bfb	Prevent threads from respawning during joining Summary: Previously the thread pool might be non-empty after joining since concurrent submissions could spawn new threads. This problem didn't affect our background flush/compaction thread pools because the `shutting_down_` flag prevented new jobs from being submitted during/after joining. But I wanted to be able to reuse the `ThreadPool` without such external synchronization. Closes https://github.com/facebook/rocksdb/pull/2953 Differential Revision: D5951920 Pulled By: ajkr fbshipit-source-id: 0efec7d0056d36d1338367da75e8b0c089bbc973	2017-10-03 16:27:28 -07:00
Siying Dong	64b6452e0c	Make InternalKeyComparator final and directly use it in merging iterator Summary: The merging iterator invokes InternalKeyComparator.Compare() frequently during heap merges. By making InternalKeyComparator final and having the merging iterator use InternalKeyComparator directly rather than through the Iterator interface, we give the compiler a chance to avoid one more virtual function call where possible. I ran the readseq benchmark in a memory-only use case to make sure the performance at least doesn't regress. I had to disable the final keyword in debug builds, as a hacky test class depends on overriding the class. Closes https://github.com/facebook/rocksdb/pull/2860 Differential Revision: D5800461 Pulled By: siying fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53	2017-09-11 12:04:21 -07:00
Yi Wu	dcd36a6aee	Make it explicit blob db doesn't support CF Summary: Blob db doesn't currently support column families. Return NotSupported status explicitly. Closes https://github.com/facebook/rocksdb/pull/2825 Differential Revision: D5757438 Pulled By: yiwu-arbug fbshipit-source-id: 44de9408fd032c98e8ae337d4db4ed37169bd9fa	2017-09-08 11:11:04 -07:00
Kefu Chai	ba3c58cab6	specify SSE42 'target' attribute for Fast_CRC32() Summary: if we enable SSE42 globally when compiling the tree to prepare a portable binary, the binary could end up running on a CPU without SSE42 instructions, because the GCC on the build host is able to emit SSE42 code anywhere. this leads to illegal instruction errors on machines that do not support SSE42. to solve this problem, crc32 detects the supported instructions at runtime, and selects the supported CRC32 implementation according to the result of `cpuid`. but intrinsics like "_mm_crc32_u64()" will not be available unless the "target" machine is appropriately specified on the command line, like "-msse4.2", or by using the "target" attribute. we could pass "-msse4.2" only when compiling crc32c.cc, and allow the compiler to generate the SSE42 instructions, but we would still be at risk of executing illegal instructions on machines that do not support SSE42 if the compiler emits code that is not guarded by our runtime detection. and we would need to make the change in both the Makefile and the CMake file. or, we can use GCC's "target" attribute to enable the machine-specific instructions for a certain function. in this way, we have finer-grained control of the "target" used, and there is no need to change the makefiles or to duplicate the change across make and cmake as in the previous approach. this problem surfaces when preparing a package for a GNU/Linux distribution, and it only applies to the optimization for SSE42, so using a feature only available on GCC/Clang is not that formidable. Closes https://github.com/facebook/rocksdb/pull/2807 Differential Revision: D5786084 Pulled By: siying fbshipit-source-id: bca5c0f877b8d6fb55f58f8f122254a26422843d	2017-09-07 12:40:57 -07:00
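The approach in the commit above, a per-function `target` attribute combined with runtime CPU detection, can be sketched roughly as below. This is not the code in crc32c.cc; the function names and the `SKETCH_HAVE_SSE42_TARGET` macro are hypothetical, the sketch uses the byte-wise `_mm_crc32_u8` intrinsic for brevity, and it assumes a reasonably recent GCC or Clang on x86-64 (other platforms fall through to the portable path).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>  // SSE4.2 intrinsics such as _mm_crc32_u8
#define SKETCH_HAVE_SSE42_TARGET 1
#endif

// Portable bit-by-bit CRC32C update (reflected polynomial 0x82F63B78).
static std::uint32_t Crc32cSlow(std::uint32_t crc, const unsigned char* p,
                                std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    crc ^= p[i];
    for (int k = 0; k < 8; ++k) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return crc;
}

#ifdef SKETCH_HAVE_SSE42_TARGET
// Only this function is compiled with SSE4.2 enabled; the rest of the
// translation unit keeps the baseline instruction set.
__attribute__((target("sse4.2")))
static std::uint32_t Crc32cFast(std::uint32_t crc, const unsigned char* p,
                                std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    crc = _mm_crc32_u8(crc, p[i]);
  }
  return crc;
}
#endif

// Picks an implementation at runtime so the binary stays portable.
std::uint32_t Crc32c(const void* data, std::size_t n) {
  const unsigned char* p = static_cast<const unsigned char*>(data);
  const std::uint32_t init = 0xFFFFFFFFu;
#ifdef SKETCH_HAVE_SSE42_TARGET
  if (__builtin_cpu_supports("sse4.2")) {
    return ~Crc32cFast(init, p, n);
  }
#endif
  return ~Crc32cSlow(init, p, n);
}
```

Because the SSE4.2 code is confined to one function that is only reached after the runtime check, the compiler has no licence to emit those instructions elsewhere in the file.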
Kamalalochana Subbaiah	e612e31740	Updated CRC32 Power Optimization Changes Summary: Support for PowerPC Architecture Detecting AltiVec Support Closes https://github.com/facebook/rocksdb/pull/2716 Differential Revision: D5606836 Pulled By: siying fbshipit-source-id: 720262453b1546e5fdbbc668eff56848164113f3	2017-08-31 14:16:30 -07:00
Maysam Yabandeh	26ac24f199	Add more unit test to write_prepared txns Summary: Closes https://github.com/facebook/rocksdb/pull/2798 Differential Revision: D5724173 Pulled By: maysamyabandeh fbshipit-source-id: fb6b782d933fb4be315b1a231a6a67a66fdc9c96	2017-08-31 09:41:27 -07:00
Siying Dong	666a005f9b	Support prefetch last 512KB with direct I/O in block based file reader Summary: Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except for compaction inputs or when readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in together with the random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too. Closes https://github.com/facebook/rocksdb/pull/2708 Differential Revision: D5593091 Pulled By: siying fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7	2017-08-11 12:16:45 -07:00
James Page	36375de76f	gcc-7/i386: markup intentional fallthroughs Summary: Markup i386 code paths resolving compilation failure under i386 with gcc-7. Signed-off-by: James Page <james.page@ubuntu.com> Closes https://github.com/facebook/rocksdb/pull/2700 Differential Revision: D5583047 Pulled By: maysamyabandeh fbshipit-source-id: fe31bcfeaf7cd2d3f51b55f5ae0b3b0cb3788fbc	2017-08-08 08:56:52 -07:00
Andrew Kryczka	cc01985db0	Introduce bottom-pri thread pool for large universal compactions Summary: When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs. This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions. - Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads. - Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default. - Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic. - Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`). - Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free. Closes https://github.com/facebook/rocksdb/pull/2580 Differential Revision: D5422916 Pulled By: ajkr fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333	2017-08-03 15:43:29 -07:00
Siying Dong	a84cee8127	Add a missing "once" in .h Summary: Closes https://github.com/facebook/rocksdb/pull/2670 Differential Revision: D5529018 Pulled By: siying fbshipit-source-id: 10a378933d509035d2dbe502247dd85fcea09789	2017-07-31 12:12:03 -07:00
Siying Dong	21696ba502	Replace dynamic_cast<> Summary: Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object and get a tiny bit more memory available. Some nontrivial changes: 1. Add Comparator::GetRootComparator() to get around the internal comparator hack 2. Add the two experimental functions to DB 3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string 4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric. Closes https://github.com/facebook/rocksdb/pull/2645 Differential Revision: D5502723 Pulled By: siying fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c	2017-07-28 16:27:16 -07:00
Mike Kolupaev	e85f2c64cb	Prevent empty memtables from using a lot of memory Summary: This fixes OOMs that we (logdevice) are currently having in production. SkipListRep constructor does a couple of small allocations from ConcurrentArena (see InlineSkipList constructor). ConcurrentArena would sometimes allocate an entire block for that, which is a few megabytes (we use Options::arena_block_size = 4 MB). So an empty memtable can take 4 MB of memory. We have ~40k column families (spread across 15 DB instances), so 4 MB per empty memtable easily OOMs a machine for us. This PR makes ConcurrentArena always allocate from Arena's inline block when possible. So as long as InlineSkipList's initial allocations are below 2 KB there would be no blocks allocated for empty memtables. Closes https://github.com/facebook/rocksdb/pull/2569 Differential Revision: D5404029 Pulled By: al13n321 fbshipit-source-id: 568ec22a3fd1a485c06123f6b2dfc5e9ef67cd23	2017-07-28 15:58:43 -07:00
Siying Dong	e7697b8ce8	Fix LITE unit tests Summary: Closes https://github.com/facebook/rocksdb/pull/2649 Differential Revision: D5505778 Pulled By: siying fbshipit-source-id: 7e935603ede3d958ea087ed6b8cfc4121e8797bc	2017-07-26 21:11:47 -07:00
Siying Dong	c281b44829	Revert "CRC32 Power Optimization Changes" Summary: This reverts commit `2289d38115`. Closes https://github.com/facebook/rocksdb/pull/2652 Differential Revision: D5506163 Pulled By: siying fbshipit-source-id: 105e31dd9d99090453a6b9f32c165206cd3affa3	2017-07-26 19:31:36 -07:00
Kamalalochana Subbaiah	2289d38115	CRC32 Power Optimization Changes Summary: Support for PowerPC Architecture Detecting AltiVec Support Closes https://github.com/facebook/rocksdb/pull/2353 Differential Revision: D5210948 Pulled By: siying fbshipit-source-id: 859a8c063d37697addd89ba2b8a14e5efd5d24bf	2017-07-26 09:42:29 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Islam AbdelRahman	3e5ea29a83	Fix Flaky DeleteSchedulerTest::ImmediateDeleteOn25PercDBSize Summary: In this test we are deleting 100 files, and we are expecting DeleteScheduler to delete 26 files in the background and 74 files immediately in the foreground. The main purpose of the test is to make sure that we delete files in the foreground thread, which is verified in line 546. But sometimes we may end up with 26 files or 25 files in the trash directory because the background thread may be slow and not be able to delete the first file fast enough, so sometimes this test fails. Remove ``` ASSERT_EQ(CountFilesInDir(trash_dir_), 25); ``` since it does not have any benefit anyway. Closes https://github.com/facebook/rocksdb/pull/2618 Differential Revision: D5458674 Pulled By: IslamAbdelRahman fbshipit-source-id: 5556a9edfa049db71dce80b8e6ae0fdd25e1e74e	2017-07-20 11:29:01 -07:00
Yedidya Feldblum	f1a056e005	CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera Summary: CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera. The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures. Reviewed By: Orvid Differential Revision: D5432398 Tags: codemod, codemod-opensource fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914	2017-07-16 21:26:02 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Giuseppe Ottaviano	8f927e5f75	Fix undefined behavior in Hash Summary: Instead of ignoring UBSan checks, fix the negative shifts in Hash(). Also add test to make sure the hash values are stable over time. The values were computed before this change, so the test also verifies the correctness of the change. Closes https://github.com/facebook/rocksdb/pull/2546 Differential Revision: D5386369 Pulled By: yiwu-arbug fbshipit-source-id: 6de4b44461a544d6222cc5d72d8cda2c0373d17e	2017-07-10 12:29:24 -07:00
Maysam Yabandeh	b5fb85ec51	fix valgrind init complaint Summary: Closes https://github.com/facebook/rocksdb/pull/2549 Differential Revision: D5386307 Pulled By: maysamyabandeh fbshipit-source-id: 3032c95c54755053b6450765ec4dacbecb734f9d	2017-07-07 18:27:08 -07:00
Maysam Yabandeh	45b9bb0331	Cut filter partition based on metadata_block_size Summary: Currently metadata_block_size controls only index partition size. With this patch a partition is cut after either the index or the filter partition reaches metadata_block_size. Closes https://github.com/facebook/rocksdb/pull/2452 Differential Revision: D5275651 Pulled By: maysamyabandeh fbshipit-source-id: 5057e4424b4c8902043782e6bf8c38f0c4f25160	2017-07-02 10:42:12 -07:00
Siying Dong	afbef65187	Bug fix: Fast CRC Support printing is not honest Summary: `11c5d4741a` introduced a bug where IsFastCrc32Supported() returns the wrong result. Fix it. Also fix some FB internal scripts. Closes https://github.com/facebook/rocksdb/pull/2513 Differential Revision: D5343802 Pulled By: yiwu-arbug fbshipit-source-id: 057dc7ae3b262fe951413d1190ce60afc788cc05	2017-06-28 21:41:42 -07:00
Mike Kolupaev	397ab11152	Improve Status message for block checksum mismatches Summary: We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages. It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild. Closes https://github.com/facebook/rocksdb/pull/2507 Differential Revision: D5345702 Pulled By: al13n321 fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339	2017-06-28 21:27:01 -07:00
Aaron Gao	0025a36409	revert perf_context and io_stats to __thread Summary: https://github.com/facebook/rocksdb/pull/2380 introduces a regression by replacing __thread with ThreadLocalPtr. Revert the thread local implementation back. Closes https://github.com/facebook/rocksdb/pull/2485 Differential Revision: D5308050 Pulled By: lightmark fbshipit-source-id: 2676e9c22edf76e8133d3f4c50e2711e11a95480	2017-06-26 15:27:17 -07:00
Maysam Yabandeh	499ebb3ab5	Optimize for serial commits in 2PC Summary: Throughput: 46k tps in our sysbench settings (filling the details later) The idea is to have the simplest change that gives us a reasonable boost in 2PC throughput. Major design changes: 1. The WAL file internal buffer is not flushed after each write. Instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_. 2. Use two sequence numbers: last seq, and last seq for write. Last seq is the last visible sequence number for reads. Last seq for write is the next sequence number that should be used to write to WAL/memtable. This allows a memtable write to proceed in parallel with WAL writes. 3. BatchGroup is not used for writes. This means that we can have parallel writers, which changes a major assumption in the code base. To accommodate that i) allow only 1 WriteImpl that intends to write to memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes come via group commit phase which is serial anyway, ii) make all the parts in the code base that assumed to be the only writer (via EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are protected via a stat_mutex_. Note: the first commit has the approach figured out but is not clean. Submitting the PR anyway to get the early feedback on the approach. If we are ok with the approach I will go ahead with these updates: 0) Rebase with Yi's pipelining changes 1) Currently batching is disabled by default to make sure that it will be consistent with all unit tests. Will make this optional via a config. 2) A couple of unit tests are disabled. They need to be updated with the serial commit of 2PC taken into account. 3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up. 
Closes https://github.com/facebook/rocksdb/pull/2345 Differential Revision: D5210732 Pulled By: maysamyabandeh fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4	2017-06-24 14:11:29 -07:00
Dmitri Smirnov	a21db161c9	Implement ReopenWritableFile on Windows and other fixes Summary: Make default impl return NotSupported so the db_blob tests exit in a meaningful manner. Replace std::thread with port::Thread Closes https://github.com/facebook/rocksdb/pull/2465 Differential Revision: D5275563 Pulled By: yiwu-arbug fbshipit-source-id: cedf1a18a2c05e20d768c1308b3f3224dbd70ab6	2017-06-20 10:31:13 -07:00
Sagar Vemuri	53dda8797d	Do not run RateLimiterTest.Rate test on Travis+Mac OSX. Summary: RateLimiterTest.Rate test has been failing continuously for many days on Travis in the Mac OSX PLATFORM_DEPENDENT test suite. Check https://travis-ci.org/facebook/rocksdb/pull_requests. Disabling this test for now, so that we can investigate in more depth. Closes https://github.com/facebook/rocksdb/pull/2451 Differential Revision: D5250147 Pulled By: sagar0 fbshipit-source-id: d58476a3c2792d20e875754d1516c4bc7174e86c	2017-06-14 14:58:02 -07:00
hyunwoo	6b5a5dc5d8	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2430 Differential Revision: D5242471 Pulled By: IslamAbdelRahman fbshipit-source-id: 832eb3a4c70221444ccd2ae63217823fec56c748	2017-06-13 16:58:01 -07:00
haoxiang	0f228be3bb	fixed typo in util/dynamic_bloom.h Summary: fixed a typo in util/dynamic_bloom.h Closes https://github.com/facebook/rocksdb/pull/2442 Differential Revision: D5242397 Pulled By: IslamAbdelRahman fbshipit-source-id: c47fd18cc79afff6b022201a0410c0cd47626576	2017-06-13 16:41:36 -07:00
Andrew Kryczka	c217e0b9c7	Call RateLimiter for compaction reads Summary: Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed. Closes https://github.com/facebook/rocksdb/pull/2433 Differential Revision: D5216946 Pulled By: ajkr fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e	2017-06-13 14:56:46 -07:00
Siying Dong	0175d58c3c	Make direct I/O write use incremental buffer Summary: Currently for direct I/O, the large maximum buffer is always allocated. This is wasteful if users flush the data in much smaller chunks. This diff fixes this by changing how the incremental buffer works. When we enlarge the buffer, we try to copy the existing data in the buffer to the enlarged buffer, rather than flush the buffer first. This ensures that no extra I/O is introduced because of buffer enlargement. Closes https://github.com/facebook/rocksdb/pull/2403 Differential Revision: D5178403 Pulled By: siying fbshipit-source-id: a8fe1e7304bdb8cab2973340022fe80ff83449fd	2017-06-13 04:41:37 -07:00
Islam AbdelRahman	d713471da8	Limit trash directory to be 25% of total DB Summary: Update DeleteScheduler to delete files immediately if trash directory is >= 25% of DB size Closes https://github.com/facebook/rocksdb/pull/2436 Differential Revision: D5230384 Pulled By: IslamAbdelRahman fbshipit-source-id: 5cbda8ac536a3cc72c774641621edc02c8202482	2017-06-12 16:57:21 -07:00
Siying Dong	db818d2d1a	Fix RocksDB Lite build with CLANG Summary: Closes https://github.com/facebook/rocksdb/pull/2419 Differential Revision: D5193976 Pulled By: siying fbshipit-source-id: 62d115edee6043237e9d6ad3c2a05481e162c9eb	2017-06-12 06:41:27 -07:00
Siying Dong	52a7f38b19	WriteOptions.low_pri which can throttle low pri writes if needed Summary: If WriteOptions.low_pri=true and compaction is behind, the write will either return immediately or be slowed down based on WriteOptions.no_slowdown. Closes https://github.com/facebook/rocksdb/pull/2369 Differential Revision: D5127619 Pulled By: siying fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0	2017-06-05 15:02:35 -07:00
hyunwoo	c7662a44a4	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2376 Differential Revision: D5183630 Pulled By: ajkr fbshipit-source-id: 133cfd0445959e70aa2cd1a12151bf3c0c5c3ac5	2017-06-05 11:27:34 -07:00
Aaron Gao	7f6c02dda1	using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… Summary: … headers https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB. We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term. make check -j64 Closes https://github.com/facebook/rocksdb/pull/2380 Differential Revision: D5177896 Pulled By: lightmark fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390	2017-06-02 17:26:19 -07:00
Siying Dong	95b0e89b5d	Improve write buffer manager (and allow the size to be tracked in block cache) Summary: Improve write buffer manager in several ways: 1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower. 2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit. 3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable. Closes https://github.com/facebook/rocksdb/pull/2350 Differential Revision: D5110648 Pulled By: siying fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017	2017-06-02 14:26:56 -07:00
Tamir Duberstein	103d0692ea	Avoid unsupported attributes when not building with UBSAN Summary: yiwu-arbug see individual commits. Closes https://github.com/facebook/rocksdb/pull/2318 Differential Revision: D5141520 Pulled By: yiwu-arbug fbshipit-source-id: 7987c92ab4461eef36afce5a133d3a0ee0c96300	2017-05-30 11:13:01 -07:00
Siying Dong	41cbb72749	options.delayed_write_rate use the rate of rate_limiter by default. Summary: It's hard for RocksDB to come up with a good default of delayed write rate. Use the rate given by the rate limiter if it is available. This provides the I/O order of magnitude. Closes https://github.com/facebook/rocksdb/pull/2357 Differential Revision: D5115324 Pulled By: siying fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754	2017-05-24 09:58:24 -07:00
Andrew Kryczka	6cc9aef162	New API for background work in single thread pool Summary: Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool. Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool. Closes https://github.com/facebook/rocksdb/pull/2204 Differential Revision: D4936256 Pulled By: ajkr fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01	2017-05-23 11:12:27 -07:00
Andrew Kryczka	ac39d6bec5	Core-local statistics Summary: This diff changes `StatisticsImpl` from a thread-local approach to a core-local one. The goal is to perform faster aggregations, particularly for applications that have many threads. There should be no behavior change. Closes https://github.com/facebook/rocksdb/pull/2258 Differential Revision: D5016258 Pulled By: ajkr fbshipit-source-id: 7d4d165b4a91d8110f0409d113d1be91f22d31a9	2017-05-23 10:42:59 -07:00
Aaron Gao	3e86c0f07c	disable direct reads for log and manifest and add direct io to tests Summary: Disable direct reads for log and manifest. Direct reads should not affect sequential_file Also add kDirectIO for option_config_ in db_test_util Closes https://github.com/facebook/rocksdb/pull/2337 Differential Revision: D5100261 Pulled By: lightmark fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b	2017-05-22 18:41:28 -07:00
Dmitri Smirnov	15ba4d6c4b	Address MS Visual Studio 2017 issue with autovector Summary: This addresses https://github.com/facebook/rocksdb/issues/2262 Closes https://github.com/facebook/rocksdb/pull/2333 Differential Revision: D5097941 Pulled By: siying fbshipit-source-id: fb33582bfe7883ecc3f6da028703982522b5f75f	2017-05-22 10:57:06 -07:00
Aaron Gao	a36220ccfb	fix unity test Summary: unity test will fail even if we have the same function names in different anonymous namespaces in different files. Closes https://github.com/facebook/rocksdb/pull/2321 Differential Revision: D5083783 Pulled By: lightmark fbshipit-source-id: 1347aaf866900af30d23cdd4f29c1b96f17352af	2017-05-17 18:56:55 -07:00
Nikhil Benesch	11c5d4741a	cross-platform compatibility improvements Summary: We've had a couple CockroachDB users fail to build RocksDB on exotic platforms, so I figured I'd try my hand at solving these issues upstream. The problems stem from a) `USE_SSE=1` being too aggressive about turning on SSE4.2, even on toolchains that don't support SSE4.2 and b) RocksDB attempting to detect support for thread-local storage based on OS, even though it can vary by compiler on the same OS. See the individual commit messages for details. Regarding SSE support, this PR should change virtually nothing for non-CMake based builds. `make`, `PORTABLE=1 make`, `USE_SSE=1 make`, and `PORTABLE=1 USE_SSE=1 make` function exactly as before, except that SSE support will be automatically disabled when a simple SSE4.2-using test program fails to compile, as it does on OpenBSD. (OpenBSD's ports GCC supports SSE4.2, but its binutils do not, so `__SSE_4_2__` is defined but an SSE4.2-using program will fail to assemble.) A warning is emitted in this case. The CMake build is modified to support the same set of options, except that `USE_SSE` is spelled `FORCE_SSE42` because `USE_SSE` is rather useless now that we can automatically detect SSE support, and I figure changing options in the CMake build is less disruptive than changing the non-CMake build. I've tested these changes on all the platforms I can get my hands on (macOS, Windows MSVC, Windows MinGW, and OpenBSD) and it all works splendidly. Let me know if there's anything you object to—I obviously don't mean to break any of your build pipelines in the process of fixing ours downstream. Closes https://github.com/facebook/rocksdb/pull/2199 Differential Revision: D5054042 Pulled By: yiwu-arbug fbshipit-source-id: 938e1fc665c049c02ae15698e1409155b8e72171	2017-05-15 16:15:38 -07:00
Andrew Kryczka	bbe9ee7dd4	core-local array type conversions Summary: try to clean up the type conversions and hope it passes on windows. one interesting thing I learned is that bitshift operations are special: in `x << y`, the result type depends only on the type of `x`, unlike most arithmetic operations where the result type depends on both operands' types. Closes https://github.com/facebook/rocksdb/pull/2277 Differential Revision: D5050145 Pulled By: ajkr fbshipit-source-id: f3309e77526ac9612c632bf93a62d99757af9a29	2017-05-12 09:28:07 -07:00
Andrew Kryczka	cda5fde2d9	CoreLocalArray class Summary: Moved the logic for core-local array out of ConcurrentArena and into a separate class because I want to reuse it for core-local stats. Closes https://github.com/facebook/rocksdb/pull/2256 Differential Revision: D5011518 Pulled By: ajkr fbshipit-source-id: a75a7b8f7b7a42fd6273489ada405f14c6be196a	2017-05-10 18:25:36 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
Tamir Duberstein	fdaefa0309	travis: add Windows cross-compilation Summary: - downcase includes for case-sensitive filesystems - give targets the same name (librocksdb) on all platforms With this patch it is possible to cross-compile RocksDB for Windows from a Linux host using mingw. cc yuslepukhin orgads Closes https://github.com/facebook/rocksdb/pull/2107 Differential Revision: D4849784 Pulled By: siying fbshipit-source-id: ad26ed6b4d393851aa6551e6aa4201faba82ef60	2017-05-05 23:20:01 -07:00
Aaron Gao	a30a696034	do not read next datablock if upperbound is reached Summary: Now if we have iterate_upper_bound set, we continue reading until we get a key >= upper_bound. In a lot of cases where neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks: [a b c d][f g h] then the index key for the first block will be 'e'. If the upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block, and we also know that we have finished our iteration by already reaching the last key that is smaller than the upper bound. This diff can reduce RA in most cases. Closes https://github.com/facebook/rocksdb/pull/2239 Differential Revision: D4990693 Pulled By: lightmark fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff	2017-05-05 23:20:01 -07:00
Aaron Gao	2d42cf5ea9	Roundup read bytes in ReadaheadRandomAccessFile Summary: Fix alignment in ReadaheadRandomAccessFile Closes https://github.com/facebook/rocksdb/pull/2253 Differential Revision: D5012336 Pulled By: lightmark fbshipit-source-id: 10d2c829520cb787227ef653ef63d5d701725778	2017-05-05 12:14:14 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Dmitri Smirnov	cdad04b051	Remove double buffering on RandomRead on Windows. Summary: Remove double buffering on RandomRead on Windows. With more logic appearing in the file reader/writer, Read no longer obeys forwarding calls to the Windows implementation. Previously direct_io (unbuffered) was only available on Windows but now is supported as generic. We remove intermediate buffering on Windows. Remove random_access_max_buffer_size option which was Windows-specific. Non-zero values for that option introduced unnecessary lock contention. Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are no longer necessary. Add aligned buffer reads for cases when requested reads exceed read ahead size. Closes https://github.com/facebook/rocksdb/pull/2105 Differential Revision: D4847770 Pulled By: siying fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f	2017-04-27 12:30:05 -07:00
Aaron Gao	2150cc1f3e	fix WritableFile buffer size in direct IO Summary: fix the buffer size in case people use buffer size as their block_size. Closes https://github.com/facebook/rocksdb/pull/2198 Differential Revision: D4956878 Pulled By: lightmark fbshipit-source-id: 8bb0dc9c133887aadcd625d5261a3d1110b71473	2017-04-26 15:57:02 -07:00
Tomas Kolda	04d58970cb	AIX and Solaris Sparc Support Summary: Replacement of #2147 The change was squashed due to a lot of conflicts. Closes https://github.com/facebook/rocksdb/pull/2194 Differential Revision: D4929799 Pulled By: siying fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581	2017-04-21 20:48:04 -07:00
Yi Wu	0fcdccc33e	Blob storage helper methods Summary: Split out interfaces needed for blob storage from #1560, including * CompactionEventListener and OnFlushBegin listener interfaces. * Blob filename support. Closes https://github.com/facebook/rocksdb/pull/2169 Differential Revision: D4905463 Pulled By: yiwu-arbug fbshipit-source-id: 564e73448f1b7a367e5e46216a521e57ea9011b5	2017-04-18 12:42:38 -07:00
Aaron Gao	6e8d6f429d	readahead backwards from sst end Summary: prefetch some data from the end of the file for each compaction to reduce IO. Closes https://github.com/facebook/rocksdb/pull/2149 Differential Revision: D4880576 Pulled By: lightmark fbshipit-source-id: aa767cd1afc84c541837fbf1ad6c0d45b34d3932	2017-04-14 18:56:14 -07:00
Aaron Gao	44fa8ece9b	change use_direct_writes to use_direct_io_for_flush_and_compaction Summary: Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get(). Closes https://github.com/facebook/rocksdb/pull/2117 Differential Revision: D4860912 Pulled By: lightmark fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19	2017-04-13 16:12:04 -07:00
Aaron Gao	10d7546961	set readahead buffer size from roundup(user_size) + 4k to roundup(use… Summary: Users usually set readahead buffer to a multiple of 4k, more than that, usually a multiple of blocks. So previously we set real buffer size 512 * n + 4k, which may introduce an additional block reading. Closes https://github.com/facebook/rocksdb/pull/2138 Differential Revision: D4871504 Pulled By: lightmark fbshipit-source-id: b070faa51d92e976e8e8468c00692699e585e243	2017-04-11 17:13:33 -07:00
Manuel Ung	1f8b119ed6	Limit maximum memory used in the WriteBatch representation Summary: Extend TransactionOptions to include max_write_batch_size which determines the maximum size of the writebatch representation. If memory limit is exceeded, the operation will abort with subcode kMemoryLimit. Closes https://github.com/facebook/rocksdb/pull/2124 Differential Revision: D4861842 Pulled By: lth fbshipit-source-id: 46fd172ea67cc90bbba829bf0d70cfab2261c161	2017-04-10 15:42:26 -07:00
Sagar Vemuri	343b59d6ee	Move various string utility functions into string_util Summary: This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper. Check the sub-commits for easier reviewing. Closes https://github.com/facebook/rocksdb/pull/2094 Differential Revision: D4837730 Pulled By: sagar0 fbshipit-source-id: 344278a	2017-04-06 14:54:12 -07:00
Yi Wu	df6f5a3772	Move memtable related files into memtable directory Summary: Move memtable related files into memtable directory. Closes https://github.com/facebook/rocksdb/pull/2087 Differential Revision: D4829242 Pulled By: yiwu-arbug fbshipit-source-id: ca70ab6	2017-04-06 14:09:13 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Islam AbdelRahman	c50e3750dc	Use a human readable size for level report Summary: Current ``` Compaction Stats [default] Level Files Size(MB} Score Read(GB} Rn(GB} Rnp1(GB} Write(GB} Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt} Avg(sec} KeyIn KeyDrop ---------------------------------------------------------------------------------------------------------------------------------------------------------- L0 2/0 49.02 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 76.1 1 2 0.322 0 0 Sum 2/0 49.02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 76.1 1 2 0.322 0 0 Int 0/0 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 76.1 1 2 0.322 0 0 ``` New ``` Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn Key Closes https://github.com/facebook/rocksdb/pull/2055 Differential Revision: D4804576 Pulled By: IslamAbdelRahman fbshipit-source-id: 719be6a	2017-04-05 17:24:19 -07:00
Aaron Gao	02799ad77a	Revert "delete fallocate with punch_hole" Summary: This reverts commit 0fd574926c. It breaks tmpfs on kernel 4.0 or earlier. We will wait for the fix before removing this part. Closes https://github.com/facebook/rocksdb/pull/2096 Differential Revision: D4839661 Pulled By: lightmark fbshipit-source-id: 574a51f	2017-04-05 16:10:09 -07:00
Andrew Kryczka	e2c6c06366	add TimedEnv Summary: I've needed Env timing measurements a few times now, so finally built something for it. Closes https://github.com/facebook/rocksdb/pull/2073 Differential Revision: D4811231 Pulled By: ajkr fbshipit-source-id: 218a249	2017-04-04 11:24:12 -07:00
Siying Dong	6ef8c620d3	Move auto_roll_logger and filename out of db/ Summary: It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency. Closes https://github.com/facebook/rocksdb/pull/2080 Differential Revision: D4821141 Pulled By: siying fbshipit-source-id: ca7d768	2017-04-03 18:39:14 -07:00
Orgad Shaneh	6401a8b76b	Fix build with MinGW Summary: There still are many warnings (most of them about invalid printf format for long long), but it builds if FAIL_ON_WARNINGS is disabled. Closes https://github.com/facebook/rocksdb/pull/2052 Differential Revision: D4807355 Pulled By: siying fbshipit-source-id: ef03786	2017-03-30 16:54:52 -07:00
Andrew Kryczka	80fe5b3855	disable test: DeleteSchedulerTest.DynamicRateLimiting1 Summary: temporarily disable since it isn't working on travis. Closes https://github.com/facebook/rocksdb/pull/2064 Differential Revision: D4807373 Pulled By: ajkr fbshipit-source-id: f2bb2b0	2017-03-30 16:54:52 -07:00
Sagar Vemuri	c6d04f2ecf	Option to fail a request as incomplete when skipping too many internal keys Summary: Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to be skipped. Adding an option, max_skippable_internal_keys -- which could be used to set a threshold for the maximum number of keys that can be skipped, will help to address these cases where it is much better to fail a request (as incomplete) than to wait for a considerable time for the request to complete. This feature -- to fail an iterator seek request as incomplete, is disabled by default when max_skippable_internal_keys = 0. It is enabled only when max_skippable_internal_keys > 0. This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084. Closes https://github.com/facebook/rocksdb/pull/2000 Differential Revision: D4753223 Pulled By: sagar0 fbshipit-source-id: 1c973f7	2017-03-30 12:09:21 -07:00
Herman Lee	58179ec4a6	Cleanup of ThreadStatusUtil structures should use the DB's reference Summary: instead of thread_local The cleanup path for the rocksdb database might not have the thread_updater_local_cache_ pointer initialized because the thread executing the cleanup is likely not a rocksdb thread. This results in a memory leak detected by Valgrind. The cleanup code path should use the thread_status_updater pointer obtained from the DB object instead of a thread local one. Closes https://github.com/facebook/rocksdb/pull/2059 Differential Revision: D4801611 Pulled By: hermanlee fbshipit-source-id: 407d7de	2017-03-30 10:39:13 -07:00
Min Wei	8a8c967460	Enable Fast CRC32 for Win64 Summary: Currently the fast crc32 path is not enabled on Windows. I am trying to enable it here, hopefully, with the minimum impact to the existing code structure. Closes https://github.com/facebook/rocksdb/pull/2033 Differential Revision: D4770635 Pulled By: siying fbshipit-source-id: 676f8b8	2017-03-29 17:39:19 -07:00
Aaron Gao	0fd574926c	delete fallocate with punch_hole Summary: As discussed in this thread: https://www.facebook.com/groups/rocksdb.dev/permalink/1218043868294125/ We remove fallocate with FALLOC_FL_PUNCH_HOLE because of the recent bug on xfs in kernel 4.x+ that aligns file size to page size even with FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. Closes https://github.com/facebook/rocksdb/pull/2038 Differential Revision: D4779974 Pulled By: siying fbshipit-source-id: 5f54625	2017-03-28 15:54:12 -07:00
Maysam Yabandeh	e7731d119a	Configure index partition size Summary: Allow the users to specify the target index partition size. With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut. Closes https://github.com/facebook/rocksdb/pull/2041 Differential Revision: D4780216 Pulled By: maysamyabandeh fbshipit-source-id: 95a0831	2017-03-28 12:09:12 -07:00
Shu Zhang	8dee8cad9e	Enable fifo compaction benchmark to db_bench Summary: Added fifo benchmark to db_bench. One thing I am not sure about is that I am using CompactRange() instead of CompactFiles(). (may cause performance skew because CompactRange() is not happening in current thread?) For CompactFiles(), for some reason FIFO compaction doesn't work as expected. More insight is welcomed. I guess FIFO compaction doesn't work with file names? igorcanadi test cmd: ./db_bench --compaction_style=2 --benchmarks=fillseqdeterministic --disable_auto_compactions --num_levels=1 --fifo_compaction_max_table_files_size_mb=10 ---------------------- DB 0 LSM --------------------- Level[0]: /000014.sst(size: 4211014 bytes) fillseqdeterministic : 4.731 micros/op 211381 ops/sec; 23.4 MB/s Closes https://github.com/facebook/rocksdb/pull/1734 Differential Revision: D4774964 Pulled By: siying fbshipit-source-id: 9d08df6	2017-03-24 17:09:15 -07:00
Raza Hussain	6908e24b56	dynamic setting of stats_dump_period_sec through SetDBOption() Summary: Resolved the following issue: https://github.com/facebook/rocksdb/issues/1930 Closes https://github.com/facebook/rocksdb/pull/2004 Differential Revision: D4736764 Pulled By: yiwu-arbug fbshipit-source-id: 64fe0b7	2017-03-20 22:54:13 -07:00
Aaron Gao	9272e12f19	avoid ftruncate twice in buffered io Summary: in buffered io, the filesize_ is the real size. Closes https://github.com/facebook/rocksdb/pull/1991 Differential Revision: D4711433 Pulled By: lightmark fbshipit-source-id: ad604b9	2017-03-17 11:39:13 -07:00
Islam AbdelRahman	d52f334cbd	Break stalls when no bg work is happening Summary: Current stall will keep sleeping even if there is no Flush/Compaction to wait for. I changed the logic to break the stall if we are not flushing or compacting. db_bench command used: ``` # fillrandom # memtable size = 10MB # value size = 1 MB # num = 1000 # use /dev/shm ./db_bench --benchmarks="fillrandom,stats" --value_size=1048576 --write_buffer_size=10485760 --num=1000 --delayed_write_rate=XXXXX --db="/dev/shm/new_stall" | grep "Cumulative stall" ``` ``` Current results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.031 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00:00:22.314 H:M:S # delayed_write_rate = 100 Kb/sec Cumulative stall: 00:00:42.784 H:M:S # delayed_write_rate = 50 Kb/sec Cumulative stall: 00:01:23.785 H:M:S # delayed_write_rate = 25 Kb/sec Cumulative stall: 00:02:45.702 H:M:S ``` ``` New results # delayed_write_rate = 1000 Kb/sec Cumulative stall: 00:00:9.017 H:M:S # delayed_write_rate = 200 Kb/sec Cumulative stall: 00 Closes https://github.com/facebook/rocksdb/pull/1884 Differential Revision: D4585439 Pulled By: IslamAbdelRahman fbshipit-source-id: aed2198	2017-03-16 18:24:17 -07:00
Islam AbdelRahman	995618a821	Support SstFileManager::SetDeleteRateBytesPerSecond() Summary: Update DeleteScheduler component to support changing delete rate in runtime by introducing SstFileManager::SetDeleteRateBytesPerSecond() Closes https://github.com/facebook/rocksdb/pull/1994 Differential Revision: D4719906 Pulled By: IslamAbdelRahman fbshipit-source-id: e6b8d9e	2017-03-16 12:09:15 -07:00
Islam AbdelRahman	e19163688b	Add macros to include file name and line number during Logging Summary: current logging ``` 2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25 2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2. 2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK 2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK 2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0. 2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6 2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1 2017/03/14-14:20:31. Closes https://github.com/facebook/rocksdb/pull/1990 Differential Revision: D4708695 Pulled By: IslamAbdelRahman fbshipit-source-id: cb8968f	2017-03-15 19:39:12 -07:00
Aaron Gao	d525718a93	cleanup direct io flag in WritableFileWriter Summary: remove unnecessary field `direct_io_`, use `use_direct_io()` instead. Closes https://github.com/facebook/rocksdb/pull/1992 Differential Revision: D4712195 Pulled By: lightmark fbshipit-source-id: 57d34f9	2017-03-14 22:39:09 -07:00
Maysam Yabandeh	11526252cc	Pinnableslice (2nd attempt) Summary: PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incurs an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: value 100 byte: 1.8% regular, 1.2% merge values value 1k byte: 11.5% regular, 7.5% merge values value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The d Closes https://github.com/facebook/rocksdb/pull/1756 Differential Revision: D4391738 Pulled By: maysamyabandeh fbshipit-source-id: 6f3edd3	2017-03-13 11:54:10 -07:00
Maysam Yabandeh	e6725e8c8d	Fix some bugs in MockEnv Summary: Fixing some bugs in MockEnv so it can actually be used. Closes https://github.com/facebook/rocksdb/pull/1914 Differential Revision: D4609923 Pulled By: maysamyabandeh fbshipit-source-id: ca25735	2017-03-13 09:54:11 -07:00
Min Wei	900c62be61	fix compile for VS2015 Summary: Without the cast, the build will break on Windows. Closes https://github.com/facebook/rocksdb/pull/1982 Differential Revision: D4690462 Pulled By: ajkr fbshipit-source-id: c493b6c	2017-03-10 11:24:09 -08:00
Andrew Kryczka	5b11124e39	add max to histogram stats Summary: Domas enlightened me about p100 (i.e., max) stats. Let's add them to our histograms. Closes https://github.com/facebook/rocksdb/pull/1968 Differential Revision: D4678716 Pulled By: ajkr fbshipit-source-id: 65e7118	2017-03-08 22:24:15 -08:00
Maysam Yabandeh	54b434110e	Builders for partition filter Summary: This is the second split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes only the builder part. The testing will be included in the third split, where the reader is also included. Closes https://github.com/facebook/rocksdb/pull/1952 Differential Revision: D4660272 Pulled By: maysamyabandeh fbshipit-source-id: 36b3cf0	2017-03-07 13:54:12 -08:00
Andrew Kryczka	7c80a6d7d1	Statistic for how often rate limiter is drained Summary: This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case. The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor. Closes https://github.com/facebook/rocksdb/pull/1946 Differential Revision: D4646489 Pulled By: ajkr fbshipit-source-id: d8e0161	2017-03-02 17:54:15 -08:00
Islam AbdelRahman	f89b3893c0	Remove skip_table_builder_flush and default it to true Summary: This option is needed to be enabled for Direct IO and I cannot think of a reason where we need to disable it remove it and default it to true Closes https://github.com/facebook/rocksdb/pull/1944 Differential Revision: D4641088 Pulled By: IslamAbdelRahman fbshipit-source-id: d7085b9	2017-03-02 16:54:10 -08:00
Siying Dong	8432bcf555	Make compaction_pri settable through option string Summary: Closes https://github.com/facebook/rocksdb/pull/1941 Differential Revision: D4637253 Pulled By: siying fbshipit-source-id: a59dcdb	2017-03-02 10:24:12 -08:00
Aaron Gao	e877afa08b	Remove bulk loading and auto_roll_logger in rocksdb_lite Summary: shrink lite size Closes https://github.com/facebook/rocksdb/pull/1929 Differential Revision: D4622059 Pulled By: siying fbshipit-source-id: 050b796	2017-02-28 11:09:11 -08:00
xiusir	90d8355075	Fix the wrong address for PREFETCH in DynamicBloom::Prefetch Summary: - Change data_[b] to data_[b / 8] in DynamicBloom::Prefetch, as b means the b-th bit in data_ and data_[b / 8] is the proper byte in data_. Closes https://github.com/facebook/rocksdb/pull/1935 Differential Revision: D4628696 Pulled By: siying fbshipit-source-id: bc5a0c6	2017-02-28 10:39:11 -08:00
Islam AbdelRahman	08864df212	Move advanced column family options to advanced_options.h Summary: For the sake of making our options simpler, we should keep options.h as simple as possible and move more advanced/less common options to advanced_options.h I started with ColumnFamilyOptions and also did some re-ordering I have moved all ColumnFamilyOptions to advanced_options.h and only left these options in options.h ``` const Comparator* comparator = BytewiseComparator(); std::shared_ptr<MergeOperator> merge_operator = nullptr; const CompactionFilter* compaction_filter = nullptr; std::shared_ptr<CompactionFilterFactory> compaction_filter_factory = nullptr; size_t write_buffer_size = 64 << 20; CompressionType compression; int level0_file_num_compaction_trigger = 4; bool disable_auto_compactions = false; ``` Please feel free to comment on specific options if you think they should be advanced or should not be Closes https://github.com/facebook/rocksdb/pull/1847 Differential Revision: D4519996 Pulled By: IslamAbdelRahman fbshipit-source-id: abebd9a	2017-02-27 17:54:14 -08:00
Tamir Duberstein	253799c06d	Add missing include for `abort()` Summary: Fixes #1233 (again). Closes https://github.com/facebook/rocksdb/pull/1931 Differential Revision: D4625289 Pulled By: ajkr fbshipit-source-id: 70e774e	2017-02-27 17:24:13 -08:00
Siying Dong	8efb5ffa2a	[rocksdb][PR] Remove option min_partial_merge_operands and verify_checksums_in_comp… Summary: …action The two options, min_partial_merge_operands and verify_checksums_in_compaction, are seldom used. Remove them to reduce the total number of options. Also remove them from Java and C interface. Closes https://github.com/facebook/rocksdb/pull/1902 Differential Revision: D4601219 Pulled By: siying fbshipit-source-id: aad4cb2	2017-02-23 15:09:12 -08:00
Siying Dong	1ba2804b7f	Remove XFunc tests Summary: Xfunc is hardly used. Remove it to keep the code simple. Closes https://github.com/facebook/rocksdb/pull/1905 Differential Revision: D4603220 Pulled By: siying fbshipit-source-id: 731f96d	2017-02-23 12:09:11 -08:00
Aaron Gao	1ef5f50e84	detect logical sector size Summary: querying logical sector size from the device instead of hardcoding it for linux platform. Closes https://github.com/facebook/rocksdb/pull/1875 Differential Revision: D4591946 Pulled By: ajkr fbshipit-source-id: 4e9805c	2017-02-23 11:25:36 -08:00
Aaron Gao	f206af56fc	add use_direct_io() to ReadaheadRandomAccessFile Summary: Without this function, RandomAccessFileReader does not do alignment in Direct IO mode, which causes an IOError: invalid argument. Closes https://github.com/facebook/rocksdb/pull/1900 Differential Revision: D4601261 Pulled By: lightmark fbshipit-source-id: c3eadf1	2017-02-22 14:54:11 -08:00
Aaron Gao	286a36db7f	posix writablefile truncate Summary: We occasionally miss this call, so the file size will be wrong. Closes https://github.com/facebook/rocksdb/pull/1894 Differential Revision: D4598446 Pulled By: lightmark fbshipit-source-id: 42b6ef5	2017-02-22 10:09:14 -08:00
Daniel Black	f0879e4c39	Page size isn't always 4k on linux Summary: Some places autodetected. These are the two places that didn't. closes #1498 Still unsure if the following instances of 4 * 1024 need fixing in: util/io_posix.h include/rocksdb/table.h (appears to be blocksize and different) utilities/persistent_cache/block_cache_tier.cc utilities/persistent_cache/persistent_cache_test.h include/rocksdb/env.h util/env_posix.cc db/column_family.cc Closes https://github.com/facebook/rocksdb/pull/1499 Differential Revision: D4593640 Pulled By: yiwu-arbug fbshipit-source-id: efc48de	2017-02-21 16:39:14 -08:00
Yulia Kartseva	ebc8a79980	alignment is on in ReadaheadRandomAccessFile::Read() Summary: Closes https://github.com/facebook/rocksdb/pull/1857 Differential Revision: D4534518 Pulled By: wat-ze-hex fbshipit-source-id: b456946	2017-02-18 12:09:12 -08:00
Marcin Dlugajczyk	a618a16f44	New subcode for IOError to detect the ESTALE errno Summary: I'd like to propose a patch to expose a new IOError type with subcode kStaleFile to allow to detect when ESTALE error is returned. This allows the rocksdb consumers to handle this error separately from other IOErrors. I've also added a missing string representation for the kDeadlock subcode, I believe calling ToString() on Status object with that subcode would result in an out of band access in the msgs array, Please let me know if you have any questions or would like me to make any changes to this pull request. Closes https://github.com/facebook/rocksdb/pull/1748 Differential Revision: D4387675 Pulled By: IslamAbdelRahman fbshipit-source-id: 67feb13	2017-02-17 10:54:13 -08:00
Aaron Gao	db2b4eb50e	avoid direct io in rocksdb_lite Summary: fix lite bugs disable direct io in lite mode Closes https://github.com/facebook/rocksdb/pull/1870 Differential Revision: D4559866 Pulled By: yiwu-arbug fbshipit-source-id: 3761c51	2017-02-16 10:39:13 -08:00
Xiaofei Du	7106a994fe	Use monotonic time points in write_controller.cc and rate_limiter.cc Summary: NowMicros() provides non-monotonic time. When wall clock is synchronized or changed, the non-monotonicity time points will affect write rate controllers. This patch changes write_controller.cc and rate_limiter.cc to use monotonic time points. Closes https://github.com/facebook/rocksdb/pull/1865 Differential Revision: D4561732 Pulled By: siying fbshipit-source-id: 95ece62	2017-02-14 18:24:24 -08:00
Sagar Vemuri	eb912a927e	Remove disableDataSync option Summary: Remove disableDataSync, and another similarly named disable_data_sync options. This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods. Closes https://github.com/facebook/rocksdb/pull/1859 Differential Revision: D4541292 Pulled By: sagar0 fbshipit-source-id: 5b3a6ca	2017-02-13 11:09:13 -08:00
James Sun	53bb01516d	[rocksdb][PR] compaction_style and compaction_pri should output their value as a st… Summary: …ring Replace the numerical output for compaction_style and compaction_pri with strings Closes https://github.com/facebook/rocksdb/pull/1817 Differential Revision: D4482796 Pulled By: highker fbshipit-source-id: 5785768	2017-02-07 10:39:12 -08:00
Maysam Yabandeh	69d5262c81	Two-level Indexes Summary: Partition Index blocks and use a Partition-index as a 2nd level index. The two-level index can be used by setting BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and configuring BlockBasedTableOptions::index_per_partition t15539501 Closes https://github.com/facebook/rocksdb/pull/1814 Differential Revision: D4473535 Pulled By: maysamyabandeh fbshipit-source-id: bffb87e	2017-02-06 16:39:12 -08:00
Dmitri Smirnov	0a4cdde50a	Windows thread Summary: introduce new methods into a public threadpool interface, - allow submission of std::functions as they allow greater flexibility. - add Joining methods to the implementation to join scheduled and submitted jobs with an option to cancel jobs that did not start executing. - Remove ugly `#ifdefs` between pthread and std implementation, make it uniform. - introduce pimpl for a drop in replacement of the implementation - Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread. - Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation. - should be no functionality changes. Closes https://github.com/facebook/rocksdb/pull/1823 Differential Revision: D4492902 Pulled By: siying fbshipit-source-id: c74cb11	2017-02-06 14:54:18 -08:00
Dmitri Smirnov	add8b50cc9	Move ThreadLocal implementation into .cc Summary: Closes https://github.com/facebook/rocksdb/pull/1829 Differential Revision: D4502314 Pulled By: siying fbshipit-source-id: f46fac1	2017-02-02 14:09:12 -08:00
Siying Dong	f289d9f4ac	Fix OSX build break after the fallocate change Summary: The recent update about fallocate failed OSX build. Fix it. Closes https://github.com/facebook/rocksdb/pull/1830 Differential Revision: D4500235 Pulled By: siying fbshipit-source-id: a5f2b40	2017-02-02 10:39:11 -08:00
Siying Dong	4a3e7d320c	Change the default of delayed slowdown value to 16MB/s Summary: Change the default of delayed slowdown value to 16MB/s and further increase the L0 stop condition to 36 files. Closes https://github.com/facebook/rocksdb/pull/1821 Differential Revision: D4489229 Pulled By: siying fbshipit-source-id: 1003981	2017-02-01 20:39:17 -08:00
Siying Dong	0513e21f9b	RangeSync() should work with ROCKSDB_FALLOCATE_PRESENT not set Summary: Closes https://github.com/facebook/rocksdb/pull/1824 Differential Revision: D4493862 Pulled By: siying fbshipit-source-id: c168446	2017-02-01 10:24:20 -08:00
Islam AbdelRahman	8b369ae5bd	Cleaner default options using C++11 in-class init Summary: C++11 in-class initialization is cleaner and makes it the default more explicit to our users and more visible. Use it for ColumnFamilyOptions and DBOptions Closes https://github.com/facebook/rocksdb/pull/1822 Differential Revision: D4490473 Pulled By: IslamAbdelRahman fbshipit-source-id: c493a87	2017-01-31 18:09:15 -08:00
Islam AbdelRahman	ec79a7b53c	Dedup code in option.cc and db_options.cc Summary: The code in DBOptions::Dump is simply a duplicate of the code in ImmutableDBOptions::Dump and MutableDBOptions.Dump consolidate duplicate code. tested visually Closes https://github.com/facebook/rocksdb/pull/1818 Differential Revision: D4486710 Pulled By: IslamAbdelRahman fbshipit-source-id: 7085189	2017-01-31 17:39:12 -08:00
Siying Dong	2d75cd40d3	NewLRUCache() to pick number of shard bits based on capacity if not given Summary: If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity. Closes https://github.com/facebook/rocksdb/pull/1584 Differential Revision: D4242517 Pulled By: siying fbshipit-source-id: 86b0f18	2017-01-27 06:39:12 -08:00
Andrew Kryczka	94a0c32e73	Fix LRU Ref() for handles with external references only Summary: For case !handle->InCache() && handle->refs >= 1 (the third case mentioned in lru_cache.h), the key was overwritten by Insert(). In this case, the refcount can still be incremented, and the cache handle will never enter LRU list. Fix Ref() logic for this case. Closes https://github.com/facebook/rocksdb/pull/1808 Differential Revision: D4467656 Pulled By: ajkr fbshipit-source-id: c0784d8	2017-01-26 10:54:15 -08:00
Andrew Kryczka	17c1180603	Generalize Env registration framework Summary: The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config. Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file. This diff generalizes the Env registration logic to work with arbitrary types. - Generalized registration and instantiation code by templating them - The entire implementation is in a header file as that's Google style guide's recommendation for template definitions - Pattern match with std::regex_match rather than checking prefix, which was the previous behavior - Rename functions/files to be non-Env-specific Closes https://github.com/facebook/rocksdb/pull/1776 Differential Revision: D4421933 Pulled By: ajkr fbshipit-source-id: 34647d1	2017-01-25 16:09:14 -08:00
sdong	07dddd5f7e	EnvPosixTestWithParam should wait for all threads to finish Summary: If we don't wait for the threads to finish after each run, the thread queue may not be empty while the next test starts to run, which can cause unexpected behaviors. Also make some of the relaxed read/write more restrict. Closes https://github.com/facebook/rocksdb/pull/1590 Reviewed By: AsyncDBConnMarkedDownDBException Differential Revision: D4245922 Pulled By: AsyncDBConnMarkedDownDBException fbshipit-source-id: f83b74b	2017-01-25 15:54:13 -08:00
Hyeonseok Oh	f2b4939da4	fixed typo Summary: I fixed exisit -> exist Closes https://github.com/facebook/rocksdb/pull/1799 Differential Revision: D4451466 Pulled By: yiwu-arbug fbshipit-source-id: b447c3a	2017-01-23 12:54:13 -08:00
Siying Dong	0e8dfd6062	Fix OptimizeForPointLookup() Summary: If users directly call OptimizeForPointLookup(), it is broken as the option isn't compatible with parallel memtable insert. Fix it by using memtable bloom filter instead. Closes https://github.com/facebook/rocksdb/pull/1791 Differential Revision: D4442836 Pulled By: siying fbshipit-source-id: bf6c9cd	2017-01-20 10:54:12 -08:00
Yi Wu	9239103cd4	Flush job should release reference current version if sync log failed Summary: Fix the bug where, when sync log fails, FlushJob::Run() will not be executed and the reference to cfd->current() will not be released. Closes https://github.com/facebook/rocksdb/pull/1792 Differential Revision: D4441316 Pulled By: yiwu-arbug fbshipit-source-id: 5523e28	2017-01-19 23:09:15 -08:00
Yi Wu	602c13a964	Remove fadvise with direct IO read Summary: Remove the logic since we don't use buffer cache with direct IO. Resolve read regression we currently have. Closes https://github.com/facebook/rocksdb/pull/1782 Differential Revision: D4430408 Pulled By: yiwu-arbug fbshipit-source-id: 5557bba	2017-01-18 12:09:10 -08:00
Kefu Chai	e8a096000b	util/thread_local.h: silence a clang-build warning Summary: otherwise clang complains with /home/jenkins/workspace/ceph-master/src/rocksdb/util/thread_local.h:205:5: error: macro expansion producing 'defined' has undefined behavior [-Werror,-Wexpansion-to-defined] ^ /home/jenkins/workspace/ceph-master/src/rocksdb/util/thread_local.h:22:4: note: expanded from macro 'ROCKSDB_SUPPORT_THREAD_LOCAL' !defined(OS_WIN) && !defined(OS_MACOSX) && !defined(IOS_CROSS_COMPILE) ^` Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes https://github.com/facebook/rocksdb/pull/1757 Differential Revision: D4394140 Pulled By: siying fbshipit-source-id: f0beda0	2017-01-15 13:24:16 -08:00
Aaron Gao	3e6899d116	change UseDirectIO() to use_direct_io() Summary: also change variable name `direct_io_` to `use_direct_io_` in WritableFile to make it consistent with read path. Closes https://github.com/facebook/rocksdb/pull/1770 Differential Revision: D4416435 Pulled By: lightmark fbshipit-source-id: 4143c53	2017-01-13 12:09:15 -08:00
Aaron Gao	d4e07a8459	fix warning of unused direct io helper functions Summary: add build guard Closes https://github.com/facebook/rocksdb/pull/1771 Differential Revision: D4410779 Pulled By: siying fbshipit-source-id: 3796c30	2017-01-12 12:39:14 -08:00
Aaron Gao	dc2584eea0	direct reads refactor Summary: direct IO reads refactoring remove unnecessary classes and unified interfaces tested with db_bench need more change for options and ON/OFF for different files. Since disabled is default, it should be fine now Closes https://github.com/facebook/rocksdb/pull/1636 Differential Revision: D4307189 Pulled By: lightmark fbshipit-source-id: 6991e22	2017-01-11 16:54:12 -08:00
Anirban Rahut	62384ebe9c	Guarding extra fallocate call with TRAVIS because its not working pro… Summary: …perly on travis There is some old code in PosixWritableFile::Close(), which truncates the file to the measured size and then does an extra fallocate with KEEP_SIZE. This is commented as a failsafe because in some cases ftruncate doesn't do the right job (I don't know of an instance of this btw). However doing an fallocate with KEEP_SIZE should not increase the file size. However on Travis Worker which is Docker (likely AUFS ) its not working. There are comments on web that show that the AUFS author had initially not implemented fallocate, and then did it later. So not sure what is the quality of the implementation. Closes https://github.com/facebook/rocksdb/pull/1765 Differential Revision: D4401340 Pulled By: anirbanr-fb fbshipit-source-id: e2d8100	2017-01-11 14:24:13 -08:00
Andrew Kryczka	fe395fb63d	Allow incrementing refcount on cache handles Summary: Previously the only way to increment a handle's refcount was to invoke Lookup(), which (1) did hash table lookup to get cache handle, (2) incremented that handle's refcount. For a future DeleteRange optimization, I added a function, Ref(), for when the caller already has a cache handle and only needs to do (2). Closes https://github.com/facebook/rocksdb/pull/1761 Differential Revision: D4397114 Pulled By: ajkr fbshipit-source-id: 9addbe5	2017-01-10 16:54:20 -08:00
Dmitri Smirnov	3c233ca4ea	Fix Windows environment issues Summary: Enable directIO on WritableFileImpl::Append with offset being current length of the file. Enable UniqueID tests on Windows, disable others but letting them compile. Unique tests are valuable to detect failures on different filesystems and upcoming ReFS. Clear output in WinEnv Getchildren. This is different from previous strategy, do not touch output on failure. Make sure DBTest.OpenWhenOpen works with windows error message Closes https://github.com/facebook/rocksdb/pull/1746 Differential Revision: D4385681 Pulled By: IslamAbdelRahman fbshipit-source-id: c07b702	2017-01-09 15:54:12 -08:00
Maysam Yabandeh	d0ba8ec8f9	Revert "PinnableSlice" Summary: This reverts commit `54d94e9c2c`. The pull request was landed by mistake. Closes https://github.com/facebook/rocksdb/pull/1755 Differential Revision: D4391678 Pulled By: maysamyabandeh fbshipit-source-id: 36d5149	2017-01-08 14:24:12 -08:00
Maysam Yabandeh	54d94e9c2c	PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incurs an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: 1. value 100 byte: 1.8% regular, 1.2% merge values 2. value 1k byte: 11.5% regular, 7.5% merge values 3. value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled Closes https://github.com/facebook/rocksdb/pull/1732 Differential Revision: D4374613 Pulled By: maysamyabandeh fbshipit-source-id: a077f1a	2017-01-08 13:54:13 -08:00
Islam AbdelRahman	ac73d7558b	Add GetSupportedCompressions() convenience function Summary: This function will return a list of supported compression types in RocksDB This is needed for MyRocks https://github.com/facebook/mysql-5.6/pull/446 Closes https://github.com/facebook/rocksdb/pull/1747 Differential Revision: D4385921 Pulled By: IslamAbdelRahman fbshipit-source-id: 2f5b59f	2017-01-06 11:24:14 -08:00
Adam Retter	85ac1a320a	Fix rocksdb::Status::getState Summary: This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState. Closes https://github.com/facebook/rocksdb/issues/1688 Closes https://github.com/facebook/rocksdb/pull/1714 Differential Revision: D4364181 Pulled By: yiwu-arbug fbshipit-source-id: 8e073b4	2017-01-03 18:39:14 -08:00
Siying Dong	17a4b75cc3	Always fsync the file after file copying Summary: File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large. Closes https://github.com/facebook/rocksdb/pull/1728 Differential Revision: D4371083 Pulled By: siying fbshipit-source-id: 579e14c	2016-12-28 19:09:16 -08:00
Yi Wu	ab48c165a9	Print cache options to info log Summary: Improve cache options logging to info log. Also print the value of cache_index_and_filter_blocks_with_high_priority. Closes https://github.com/facebook/rocksdb/pull/1709 Differential Revision: D4358776 Pulled By: yiwu-arbug fbshipit-source-id: 8f030a0	2016-12-22 14:54:19 -08:00
Aaron Gao	972f96b3fb	direct io write support Summary: rocksdb direct io support ``` [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 5.0 Date: Wed Nov 23 13:17:43 2016 CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz CPUCache: 25600 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 4.393 micros/op 227639 ops/sec; 25.2 MB/s [gzh@dev11575.prn2 ~/roc Closes https://github.com/facebook/rocksdb/pull/1564 Differential Revision: D4241093 Pulled By: lightmark fbshipit-source-id: 98c29e3	2016-12-22 13:09:19 -08:00
Islam AbdelRahman	989e644ed8	Remove sst_file_manager option from LITE Summary: Remove sst_file_manager option from LITE Closes https://github.com/facebook/rocksdb/pull/1690 Differential Revision: D4341331 Pulled By: IslamAbdelRahman fbshipit-source-id: 9f9328d	2016-12-21 17:54:21 -08:00
Jianpeng Ma	bd6cf7b51d	WritableFileWriter: default buffer size equal min(64k,options.writabl… Summary: …e_file_max_buffer_size) If we overwrite WritableFile and has a buffer which has the same function of buf_. We hope remove the cache function of WritableFileWriter. So using options.writable_file_max_buffer_size = 0 to disable cache function. Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com> Closes https://github.com/facebook/rocksdb/pull/1628 Differential Revision: D4307219 Pulled By: yiwu-arbug fbshipit-source-id: 77a6e26	2016-12-16 13:09:14 -08:00
Daniel Black	816c1e30ca	gcc-7 requires include <functional> for std::function Summary: Fixes compile error: In file included from ./util/statistics.h:17:0, from ./util/stop_watch.h:8, from ./util/perf_step_timer.h:9, from ./util/iostats_context_imp.h:8, from ./util/posix_logger.h:27, from ./port/util_logger.h:18, from ./db/auto_roll_logger.h:15, from db/auto_roll_logger.cc:6: ./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type typedef std::function<void(void*, void*)> FoldFunc; Closes https://github.com/facebook/rocksdb/pull/1656 Differential Revision: D4318702 Pulled By: yiwu-arbug fbshipit-source-id: 8c5d17a	2016-12-16 11:24:18 -08:00
Daniel Black	0ab6fc167f	Gcc-7 buffer size insufficient Summary: Bunch of commits related to insufficient buffer size. Errors in individual commits. Closes https://github.com/facebook/rocksdb/pull/1673 Differential Revision: D4332127 Pulled By: IslamAbdelRahman fbshipit-source-id: 878f73c	2016-12-14 19:24:26 -08:00
Daniel Black	b7239bf7e0	Gcc 7 fallthrough Summary: hopefully the last of the gcc-7 compile errors Closes https://github.com/facebook/rocksdb/pull/1675 Differential Revision: D4332106 Pulled By: IslamAbdelRahman fbshipit-source-id: 139448c	2016-12-14 19:24:25 -08:00


2011 Commits
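
Commit 90d8355075 in the log above changes `data_[b]` to `data_[b / 8]` in DynamicBloom::Prefetch because `b` is a bit position, not a byte position. The sketch below illustrates that distinction in isolation. It is not RocksDB's DynamicBloom code: the helper names (`ByteIndexForBit`, `SetBit`, `TestBit`, `PrefetchBit`) and the use of `std::vector<std::uint8_t>` as the bit array are assumptions made for illustration, and the bit order within a byte is chosen arbitrarily.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: the byte that holds bit number `bit` is bit / 8.
inline std::size_t ByteIndexForBit(std::size_t bit) { return bit / 8; }

// Set bit `bit` in a byte array used as a bit set (illustrative only).
inline void SetBit(std::vector<std::uint8_t>& data, std::size_t bit) {
  data[ByteIndexForBit(bit)] |= static_cast<std::uint8_t>(1u << (bit % 8));
}

// Test bit `bit` in the same layout.
inline bool TestBit(const std::vector<std::uint8_t>& data, std::size_t bit) {
  return ((data[ByteIndexForBit(bit)] >> (bit % 8)) & 1u) != 0;
}

// Prefetch the byte that contains `bit`. Indexing with `bit` directly would
// point up to 8x too far into (or past) the array, which is the kind of
// mistake the commit message describes.
inline void PrefetchBit(const std::vector<std::uint8_t>& data, std::size_t bit) {
#if defined(__GNUC__) || defined(__clang__)
  __builtin_prefetch(&data[ByteIndexForBit(bit)]);
#else
  (void)data;  // No portable prefetch; treat as a no-op on other compilers.
  (void)bit;
#endif
}
```

With a 4-byte array (32 bits), bit 19 lives in byte 2; using 19 as a byte index would already be out of range, which is why the division by 8 matters for the prefetch address.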