Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502
Differential Revision: D7400897
Pulled By: anand1976
fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556
Differential Revision: D7139328
Pulled By: yiwu-arbug
fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
Summary:
The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutex to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining uniq log entires with prepare which avoids a very large heap with many duplicate entries.
Closes https://github.com/facebook/rocksdb/pull/3545
Differential Revision: D7106071
Pulled By: maysamyabandeh
fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48
Summary:
- made `CreateCheckpoint` properly return `InvalidArgument` when called with an empty directory. Previously it triggered an assertion failure due to a bug in the logic.
- made `ldb` set empty `checkpoint_dir` if that's what the user specifies, so that we can use it to properly test `CreateCheckpoint` in the future.
Differential Revision: D6874562
fbshipit-source-id: dcc1bd41768261d9338987fa7711444289707ed7
Summary:
Add a legocastle job to continuously build the last 10 commits every 4 hours and report lite build binary size to scuba.
Closes https://github.com/facebook/rocksdb/pull/3511
Differential Revision: D7001730
Pulled By: yiwu-arbug
fbshipit-source-id: 7c8ca87c46d663c786a0d32be69ebbe7b19a5eb9
Summary:
Some workloads (like my current benchmarking) may want partitioned indexes without partitioned filters. Particularly, when `-optimize_filters_for_hits=true`, the total index size may be larger than the total filter size, so it can make sense to hold all filters in-memory but not all indexes.
Closes https://github.com/facebook/rocksdb/pull/3492
Differential Revision: D6970092
Pulled By: ajkr
fbshipit-source-id: b7fa1828e1d13829339aefb90fd56eb7c5337f61
Summary:
This is to avoid run time error. Fail the db_bench immediately if cuckoo table is used but mmap_read is not specified.
Closes https://github.com/facebook/rocksdb/pull/3420
Differential Revision: D6838284
Pulled By: siying
fbshipit-source-id: 20893fa28d40fadc31e4ff154bed02f5a1bad341
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues.
Reviewed By: yiwu-arbug
Differential Revision: D6821806
fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
Summary:
It failed every time. I guess people usually ran with assertions disabled.
Closes https://github.com/facebook/rocksdb/pull/3422
Differential Revision: D6822984
Pulled By: ajkr
fbshipit-source-id: 2e90db75618b26ac1c46ddfa9e03c095c7bf16e3
Summary:
The macro was added by mistake in #2372
Closes https://github.com/facebook/rocksdb/pull/3343
Differential Revision: D6681356
Pulled By: yiwu-arbug
fbshipit-source-id: 4180172fb0eaef4189c07f219241e0c261c03461
Summary:
This patch addresses a couple of minor TODOs for WritePrepared Txn such as double checking some assert statements at runtime as well, skip extra AddPrepared in non-2pc transactions, and safety check for infinite loops.
Closes https://github.com/facebook/rocksdb/pull/3302
Differential Revision: D6617002
Pulled By: maysamyabandeh
fbshipit-source-id: ef6673c139cb49f64c0879508d2f573b78609aca
Summary:
**# Summary**
RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain.
This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used.
**# Performance Test Results**
We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm.
Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.
1) ReadRandom in db_bench overall metrics
PER RUN
Algorithm | run | micros/op | ops/sec |Throughput (MB/s)
3-way | 1 | 4.143 | 241387 | 26.7
3-way | 2 | 3.775 | 264872 | 29.3
3-way | 3 | 4.116 | 242929 | 26.9
FastCrc32c|1 | 4.037 | 247727 | 27.4
FastCrc32c|2 | 4.648 | 215166 | 23.8
FastCrc32c|3 | 4.352 | 229799 | 25.4
AVG
Algorithm | Average of micros/op | Average of ops/sec | Average of Throughput (MB/s)
3-way | 4.01 | 249,729 | 27.63
FastCrc32c | 4.35 | 230,897 | 25.53
2) Crc32c computation CPU cost (inclusive samples percentage)
PER RUN
Implementation | run | TotalSamples | Crc32c percentage
3-way  | 1  | 4,572,250,000 | 4.37%
3-way  | 2  | 3,779,250,000 | 4.62%
3-way  | 3  | 4,129,500,000 | 4.48%
FastCrc32c   | 1  | 4,663,500,000 | 11.24%
FastCrc32c   | 2  | 4,047,500,000 | 12.34%
FastCrc32c   | 3  | 4,366,750,000 | 11.68%
**# Test Plan**
make -j64 corruption_test && ./corruption_test
By default it uses 3-way SSE algorithm
NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test
make clean && DEBUG_LEVEL=0 make -j64 db_bench
make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
Closes https://github.com/facebook/rocksdb/pull/3173
Differential Revision: D6330882
Pulled By: yingsu00
fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9
Summary:
We added a new verification that ensures a value that snapshot reads when is released is the same as when it was created. This test however fails when the cf is dropped in between. The patch skips the tests if that was the case.
Closes https://github.com/facebook/rocksdb/pull/3279
Differential Revision: D6581584
Pulled By: maysamyabandeh
fbshipit-source-id: afe37d371c0f91818d2e279b3949b810e112e8eb
Summary:
Add "--use_txn" option to use transactional API in db_stress, default being WRITE_PREPARED policy, which is the main intention of modifying db_stress. It also extend the existing snapshots to verify that before releasing a snapshot a read from it returns the same value as before.
Closes https://github.com/facebook/rocksdb/pull/3243
Differential Revision: D6556912
Pulled By: maysamyabandeh
fbshipit-source-id: 1ae31465be362d44bd06e635e2e9e49a1da11268
Summary:
clang-analyzer complaint about db_ being nullptr, but it couldn't be because it checks exec_stats before proceed. Add an assert to get around the false-positive.
Test Plan
`make analyze`
Closes https://github.com/facebook/rocksdb/pull/3236
Differential Revision: D6505417
Pulled By: yiwu-arbug
fbshipit-source-id: e5b65764ea994dd9e4bab3e697b97dc70dc22cab
Summary:
This is to fix tools/check_format_compatible.sh. The tool try to open
old versions of rocksdb with the provided options file. When options
file is missing (e.g. rocksdb 2.2), it should still proceed with default
options.
Closes https://github.com/facebook/rocksdb/pull/3232
Differential Revision: D6503955
Pulled By: yiwu-arbug
fbshipit-source-id: e44cfcce7ddc7d12cf83466ed3f3fe7624aa78b8
Summary:
Add recent versions for format compatible test. We should probably update the script to auto include available versions (by looking at include/rocksdb/versions.h and deduce branch names), but we can do it later.
Closes https://github.com/facebook/rocksdb/pull/3233
Differential Revision: D6503631
Pulled By: yiwu-arbug
fbshipit-source-id: e2b01d1ef6e784ff6ffa1bd75d741755e3c69a8c
Summary:
I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google".
Closes https://github.com/facebook/rocksdb/pull/3212
Differential Revision: D6456973
Pulled By: ajkr
fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205
Differential Revision: D6438959
Pulled By: maysamyabandeh
fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
Summary:
- Made CLI arguments take precedence over options file when both are provided. Note some of the CLI args are not settable via options file, like `--compression_max_dict_bytes`, so it's necessary to allow both ways of providing options simultaneously.
- Changed `PrepareOptionsForOpenDB` to update the proper `ColumnFamilyOptions` if one exists for the user's `--column_family_name` argument. I supported this only in the base class, `LDBCommand`, so it works for the general arguments. Will defer adding support for subcommand-specific arguments.
- Made the command fail if `--try_load_options` is provided and loading options file returns NotFound. I found the previous behavior of silently continuing confusing.
Closes https://github.com/facebook/rocksdb/pull/3144
Differential Revision: D6270544
Pulled By: ajkr
fbshipit-source-id: 7c2eac9f9b38720523d74466fb9e78db53561367
Summary:
tools/ldb_cmd.cc:
```
310 ignore_unknown_options_ = IsFlagPresent(flags, ARG_IGNORE_UNKNOWN_OPTIONS);
CID 1322798 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
5. uninit_member: Non-static class member db_ttl_ is not initialized in this constructor nor in any functions that it calls.
311}
```
Closes https://github.com/facebook/rocksdb/pull/3122
Differential Revision: D6428576
Pulled By: sagar0
fbshipit-source-id: d77f04dd201f7f1d9f59ef88a215ee7ad7b934e9
Summary:
We hit "Illegal instruction" error in regression test with "shlx" instruction. Setting PORTABLE=1 to resolve it.
Closes https://github.com/facebook/rocksdb/pull/3165
Differential Revision: D6321972
Pulled By: yiwu-arbug
fbshipit-source-id: cc9fe0dbd4698d1b66a750a0b062f66899862719
Summary:
- moved existing compression options to `InitializeOptionsGeneral` since they cannot be set through options file
- added flag for `zstd_max_train_bytes` which was recently introduced by #3057
Closes https://github.com/facebook/rocksdb/pull/3128
Differential Revision: D6240460
Pulled By: ajkr
fbshipit-source-id: 27dbebd86a55de237ba6a45cc79cff9214e82ebc
Summary:
Without this option, running the compact benchmark on a DB containing only bottommost files simply returned immediately.
Closes https://github.com/facebook/rocksdb/pull/3138
Differential Revision: D6256660
Pulled By: ajkr
fbshipit-source-id: e3b64543acd503d821066f4200daa201d4fb3a9d
Summary:
- Release all snapshots before crashing and reopening the DB. Without this, we may attempt to release snapshots from an old DB using a new DB. That tripped an assertion.
- Release multiple snapshots in the same operation if needed. Without this, we would sometimes leak snapshots.
Closes https://github.com/facebook/rocksdb/pull/3098
Differential Revision: D6194923
Pulled By: ajkr
fbshipit-source-id: b9c89bcca7ebcbb6c7802c616f9d1175a005aadf
Summary:
Add options to `db_stress` (correctness testing tool) to randomly acquire snapshot and release it after some period of time. It's useful for correctness testing of #3009, as well as other parts of compaction that behave differently depending on which snapshots are held.
Closes https://github.com/facebook/rocksdb/pull/3038
Differential Revision: D6086501
Pulled By: ajkr
fbshipit-source-id: 3ec0d8666c78ac507f1f808887c4ff759ba9b865
Summary:
Dynamic adjustment of rate limit according to demand for background I/O. It increases by a factor when limiter is drained too frequently, and decreases by the same factor when limiter is not drained frequently enough. The parameters for this behavior are fixed in `GenericRateLimiter::Tune`. Other changes:
- make rate limiter's `Env*` configurable for testing
- track num drain intervals in RateLimiter so we don't have to rely on stats, which may be shared across different DB instances from the ones that share the RateLimiter.
Closes https://github.com/facebook/rocksdb/pull/2899
Differential Revision: D5858704
Pulled By: ajkr
fbshipit-source-id: cc2bac30f85e7f6fd63655d0a6732ef9ed7403b1
Summary:
it's unsupported in options file, so the flag should be respected by db_bench even when an options file is provided.
Closes https://github.com/facebook/rocksdb/pull/2910
Differential Revision: D5869836
Pulled By: ajkr
fbshipit-source-id: f67f591ae083e95e989f86b6fad50765d2e3d855
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868
Differential Revision: D5804152
Pulled By: axxufb
fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
Summary:
**Summary**:
Set defaults for high-pri and low-pri thread pools in regression test script.
**Reason for this change**:
With #2680 , high-pri and low-pri thread pools get different numbers than before if `num_high_pri_threads` and `num_low_pri_threads` options are not explicitly passed to db_bench in regression test script ... leading to a false-positive regression.
**Test Plan**:
REMOTE_HOST=udb1671.prn3 TEST_MODE=1 FBSOURCE=~/fbsource ~/fbsource/fbcode/rocks/tools/debug_regression_test.sh viewstate (with very minor changes to the internals).
Observe P50 and P99 which showed up as regressions in our graphs.
Stats with the commit prior to #2680 , ie. 4f81ab3 :
seekrandomwhilewriting : 75.096 micros/op 13316 ops/sec; 168.6 MB/s (7499074 of 7500000 found)
Microseconds per seek:
Count: 120000000 Average: 1197.7254 StdDev: 33.35
Min: 187 Median: 980.5292 Max: 1816424
Percentiles: **P50: 980.53** P75: 1494.57 **P99: 4185.64** P99.9: 7800.11 P99.99: 15039.64
Stats at #2680, ie. at commit dce6d5a (false-positive regression):
seekrandomwhilewriting : 85.330 micros/op 11719 ops/sec; 148.4 MB/s (7499073 of 7500000 found)
Microseconds per seek:
Count: 120000000 Average: 1362.3261 StdDev: 27.86
Min: 185 Median: 1088.1915 Max: 652760
Percentiles: **P50: 1088.19** P75: 1658.12 **P99: 5361.15** P99.9: 7997.95 P99.99: 11730.07
Stats with the current change on top of dce6d5a :
seekrandomwhilewriting : 77.780 micros/op 12856 ops/sec; 162.8 MB/s (7499102 of 7500000 found)
Microseconds per seek:
Count: 120000000 Average: 1226.6744 StdDev: 17.16
Min: 185 Median: 994.2956 Max: 2553530
Percentiles: **P50: 994.30** P75: 1513.68 **P99: 4284.30** P99.9: 9338.64 P99.99: 23008.86
Closes https://github.com/facebook/rocksdb/pull/2801
Differential Revision: D5742338
Pulled By: sagar0
fbshipit-source-id: cc5d727c1a131f2a7070d1bb892efbe929b976ff
Summary:
store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length.
Closes https://github.com/facebook/rocksdb/pull/2781
Differential Revision: D5694702
Pulled By: ajkr
fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd
Summary:
Clang complain about an cast from uint64_t to int32 in db_stress. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/2755
Differential Revision: D5655947
Pulled By: yiwu-arbug
fbshipit-source-id: cfac10e796e0adfef4727090b50975b0d6e2c9be
Summary:
fix some things that made this command hard to use from CLI:
- use default values for `target_file_size_base` and `max_bytes_for_level_base`. previously we were using small values for these but default value of `write_buffer_size`, which led to enormous number of L1 files.
- failure message for `value_size_mult` too big. previously there was just an assert, so in non-debug mode it'd overrun the value buffer and crash mysteriously.
- only print verification success if there's no failure. before it'd print both in the failure case.
- support `memtable_prefix_bloom_size_ratio`
- support `num_bottom_pri_threads` (universal compaction)
Closes https://github.com/facebook/rocksdb/pull/2741
Differential Revision: D5629495
Pulled By: ajkr
fbshipit-source-id: ddad97d6d4ba0884e7c0f933b0a359712514fc1d
Summary:
Support a window of `active_width` keys that rolls through `[0, max_key)` over the duration of the test. Operations only affect keys inside the window. This gives us the ability to detect L0->L0 deletion bug (#2722).
Closes https://github.com/facebook/rocksdb/pull/2739
Differential Revision: D5628555
Pulled By: ajkr
fbshipit-source-id: 9cb2d8f4ab1a7c73f7797b8e19f7094970ea8749
Summary:
- like other subcommands, reporting compression sizes should be specified with the `--command` CLI arg.
- also added `--compression_types` arg as it's useful to restrict the types of compression used, at least in my dictionary compression experiments.
Closes https://github.com/facebook/rocksdb/pull/2706
Differential Revision: D5589520
Pulled By: ajkr
fbshipit-source-id: 305bb4ebcc95eecc8a85523cd3b1050619c9ddc5
Summary:
Previously we could only select the CF on which to operate uniformly at random. This is a limitation, e.g., when testing universal compaction as all CFs would need to run full compaction at roughly the same time, which isn't realistic.
This PR allows the user to specify the probability distribution for selecting CFs via the `--column_family_distribution` argument.
Closes https://github.com/facebook/rocksdb/pull/2677
Differential Revision: D5544436
Pulled By: ajkr
fbshipit-source-id: 478d56260995236ae90895ce5bd51f38882e185a
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708
Differential Revision: D5593091
Pulled By: siying
fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
Summary:
TSAN shows error when we grab too many locks at the same time. In TSAN crash test, make one shard key cover 2^22 keys so that no many keys will be hold at the same time.
Closes https://github.com/facebook/rocksdb/pull/2719
Differential Revision: D5609035
Pulled By: siying
fbshipit-source-id: 930e5d63fff92dbc193dc154c4c615efbdf06c6a
Summary:
We need a tool to check any sst file corruption in the db.
It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status.
Closes https://github.com/facebook/rocksdb/pull/2498
Differential Revision: D5324269
Pulled By: lightmark
fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
Summary:
The background thread pools' sizes weren't easily configurable by `max_background_compactions` and `max_background_flushes` in multi-instance setups. Introduced separate arguments for their sizes.
Closes https://github.com/facebook/rocksdb/pull/2680
Differential Revision: D5550675
Pulled By: ajkr
fbshipit-source-id: bab5f0a7bc5db63bb084d0c10facbe437096367d
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580
Differential Revision: D5422916
Pulled By: ajkr
fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
Summary:
Move an option necessary for running db_bench on multiple CFs into the general initialization area, so it works with both flag-based init and OPTIONS-based init.
Closes https://github.com/facebook/rocksdb/pull/2675
Differential Revision: D5541378
Pulled By: ajkr
fbshipit-source-id: 169926cb4ae95c17974f744faf7cc794d41e5c0a
Summary:
* Dump blob db options to info log
* Remove BlobDBOptionsImpl to disallow dynamic cast *BlobDBOptions into *BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645
* Change some of the default options
* Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon.
Closes https://github.com/facebook/rocksdb/pull/2671
Differential Revision: D5529912
Pulled By: yiwu-arbug
fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645
Differential Revision: D5502723
Pulled By: siying
fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
Summary:
it should be a bool
Closes https://github.com/facebook/rocksdb/pull/2653
Differential Revision: D5506148
Pulled By: ajkr
fbshipit-source-id: f142f0f3aa8b678c68adef12e5ac6e1e163306f3
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
For forward_compatible_checkout_objs the local branch is already created in previous step. This patch avoid recreating it. This should address "fatal: A branch named '3.10.fb' already exists." errors.
Closes https://github.com/facebook/rocksdb/pull/2606
Differential Revision: D5443786
Pulled By: maysamyabandeh
fbshipit-source-id: 69d5a67b87677429cf36e3a467bd114d341f3b9c
Summary:
The new local branch specified with -b cannot be called master. Use tmp prefix to avoid name collision.
Closes https://github.com/facebook/rocksdb/pull/2600
Differential Revision: D5442944
Pulled By: maysamyabandeh
fbshipit-source-id: 4a623d9b21d6cc01bee812b2799790315bdf5f6e
Summary:
This will fix the error: "error: pathspec '2.2.fb.branch' did not match any file(s) known to git."
Tested by manually sshing to sandcastle and running the command.
Closes https://github.com/facebook/rocksdb/pull/2599
Differential Revision: D5441130
Pulled By: maysamyabandeh
fbshipit-source-id: a22fd6a52221471bafbba8990394b499535e5812
Summary:
We have FLAGS_benchmark_write_rate_limit to limit write rate in
db_bench, but it was not in use for updaterandom benchmark.
Closes https://github.com/facebook/rocksdb/pull/2578
Differential Revision: D5420328
Pulled By: yiwu-arbug
fbshipit-source-id: 5fa48c2b88f2f2dc83d615cb9c40c472bc916835
Summary:
* Create info log before db open to make blob db able to log to LOG file.
* Properly destroy blob db.
Closes https://github.com/facebook/rocksdb/pull/2567
Differential Revision: D5400034
Pulled By: yiwu-arbug
fbshipit-source-id: a49cfaf4b5c67d42d4cbb872bd5a9441828c17ce
Summary:
Found by gcc warning:
x86_64-pc-linux-gnu-g++ --version
x86_64-pc-linux-gnu-g++ (GCC) 7.1.1 20170710
tools/db_bench_tool.cc: In member function 'void rocksdb::Benchmark::RandomWithVerify(rocksdb::ThreadState*)':
tools/db_bench_tool.cc:4430:8: error: '%lu' directive output may be truncated writing between 1 and 19 bytes into a region of size between 0 and 66 [-Werror=format-truncation=]
void RandomWithVerify(ThreadState* thread) {
^~~~~~~~~~~~~~~~
tools/db_bench_tool.cc:4430:8: note: directive argument in the range [0, 9223372036854775807]
tools/db_bench_tool.cc:4492:13: note: 'snprintf' output between 37 and 128 bytes into a destination of size 100
snprintf(msg, sizeof(msg),
~~~~~~~~^~~~~~~~~~~~~~~~~~
"( get:%" PRIu64 " put:%" PRIu64 " del:%" PRIu64 " total:%" \
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PRIu64 " found:%" PRIu64 ")",
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
gets_done, puts_done, deletes_done, readwrites_, found);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
Makefile:1707: recipe for target 'tools/db_bench_tool.o' failed
Closes https://github.com/facebook/rocksdb/pull/2558
Differential Revision: D5398703
Pulled By: siying
fbshipit-source-id: 6ffa552bbd8b59cfc2c36289f86ff9b9acca8ca6
Summary:
As titled. Also fixed an off-by-one error causing us to add one less range deletion than the user specified.
Closes https://github.com/facebook/rocksdb/pull/2544
Differential Revision: D5383451
Pulled By: ajkr
fbshipit-source-id: cbd5890c33f09bbb5c0c1f4bb952a1add32336e0
Summary:
Valgrind had false positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to only overwrite it in success case.
Closes https://github.com/facebook/rocksdb/pull/2526
Differential Revision: D5358689
Pulled By: ajkr
fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
Summary:
The comma "," is not a valid separator for bash arrays.
Closes https://github.com/facebook/rocksdb/pull/2516
Differential Revision: D5348101
Pulled By: yiwu-arbug
fbshipit-source-id: 8f0afdac368e21076eb7366b7df7dbaaf158cf96
Summary:
We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.
It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
Closes https://github.com/facebook/rocksdb/pull/2507
Differential Revision: D5345702
Pulled By: al13n321
fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
Summary:
Introducing FIFO compactions with TTL.
FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.
To address that request:
- Added a new TTL option to FIFO compaction options.
- Updated FIFO compaction score to take TTL into consideration.
- Added a new table property, creation_time, to keep track of when the SST file is created.
- Creation_time is set as below:
- On Flush: Set to the time of flush.
- On Compaction: Set to the max creation_time of all the files involved in the compaction.
- On Repair and Recovery: Set to the time of repair/recovery.
- Old files created prior to this code change will have a creation_time of 0.
- FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
- FIFO compaction will fall back to the prior way of deleting files based on size if:
- the creation_time of all files involved in compaction is 0.
- the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
**Test Plan:**
Added tests.
**Benchmark results:**
Base: FIFO with max size: 100MB ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found)
```
With TTL (a low one for testing) ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found)
```
Example Log lines:
```
2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
...
2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
```
SST Files remaining in the dbbench dir, after db_bench execution completed:
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst
-rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
-rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
-rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
```
Closes https://github.com/facebook/rocksdb/pull/2480
Differential Revision: D5305116
Pulled By: sagar0
fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
Summary:
update compatible test to include 5.5 and 5.6 branch.
Closes https://github.com/facebook/rocksdb/pull/2501
Differential Revision: D5325220
Pulled By: yiwu-arbug
fbshipit-source-id: 5f5271491e6dd2d7b2cf73a7142f38a571553bc4
Summary:
Added a flag, `ignore_unknown_options`, to skip unknown options when loading an options file (using `LoadLatestOptions`/`LoadOptionsFromFile`) or while verifying options (using `CheckOptionsCompatibility`). This will help in downgrading the db to an older version.
Also added `--ignore_unknown_options` flag to ldb
**Example Use case:**
In MyRocks, if copying from newer version to older version, it is often impossible to start because of new RocksDB options that don't exist in older version, even though data format is compatible.
MyRocks uses these load and verify functions in [ha_rocksdb.cc::check_rocksdb_options_compatibility](e004fd9f41/storage/rocksdb/ha_rocksdb.cc (L3348-L3401)).
**Test Plan:**
Updated the unit tests.
`make check`
ldb:
$ ./ldb --db=/tmp/test_db --create_if_missing put a1 b1
OK
Now edit /tmp/test_db/<OPTIONS-file> and add an unknown option.
Try loading the options now, and it fails:
$ ./ldb --db=/tmp/test_db --try_load_options get a1
Failed: Invalid argument: Unrecognized option DBOptions:: abcd
Passes with the new --ignore_unknown_options flag
$ ./ldb --db=/tmp/test_db --try_load_options --ignore_unknown_options get a1
b1
Closes https://github.com/facebook/rocksdb/pull/2423
Differential Revision: D5212091
Pulled By: sagar0
fbshipit-source-id: 2ec17636feb47dc0351b53a77e5f15ef7cbf2ca7
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433
Differential Revision: D5216946
Pulled By: ajkr
fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
Summary:
clean up the current test dir if the last regression test is still running.
Closes https://github.com/facebook/rocksdb/pull/2401
Differential Revision: D5177882
Pulled By: lightmark
fbshipit-source-id: 91d899fcc2bde841948eae71af8584d4bdb35468
Summary:
… headers
https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB.
We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term.
make check -j64
Closes https://github.com/facebook/rocksdb/pull/2380
Differential Revision: D5177896
Pulled By: lightmark
fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350
Differential Revision: D5110648
Pulled By: siying
fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
Summary:
fix regression test by not reporting stats when building db
Closes https://github.com/facebook/rocksdb/pull/2390
Differential Revision: D5159909
Pulled By: lightmark
fbshipit-source-id: c3f4b9deb9c6799ff84207fd341c529144f8158d
Summary:
- `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
- `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
- `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
- The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
Closes https://github.com/facebook/rocksdb/pull/2205
Differential Revision: D4937461
Pulled By: ajkr
fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
Summary:
Release builds are failing on Linux with the error:
```
tools/db_stress.cc: In function ‘int main(int, char**)’:
tools/db_stress.cc:2365:12: error: ‘rocksdb::SyncPoint’ has not been declared
rocksdb::SyncPoint::GetInstance()->SetCallBack(
^
tools/db_stress.cc:2370:12: error: ‘rocksdb::SyncPoint’ has not been declared
rocksdb::SyncPoint::GetInstance()->SetCallBack(
^
tools/db_stress.cc:2375:12: error: ‘rocksdb::SyncPoint’ has not been declared
rocksdb::SyncPoint::GetInstance()->EnableProcessing();
^
make[1]: *** [tools/db_stress.o] Error 1
make[1]: Leaving directory `/data/sandcastle/boxes/trunk-git-rocksdb-public'
make: *** [release] Error 2
```
Closes https://github.com/facebook/rocksdb/pull/2355
Differential Revision: D5113552
Pulled By: sagar0
fbshipit-source-id: 351df707277787da5633ba4a40e52edc7c895dc4
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337
Differential Revision: D5100261
Pulled By: lightmark
fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
Summary:
- Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
- Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type.
- Migrated the "ldb idump" subcommand to use this function
- The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
Closes https://github.com/facebook/rocksdb/pull/2232
Differential Revision: D4976007
Pulled By: ajkr
fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163
Differential Revision: D4895953
Pulled By: siying
fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
Summary:
Replacement of #2147
The change was squashed due to a lot of conflicts.
Closes https://github.com/facebook/rocksdb/pull/2194
Differential Revision: D4929799
Pulled By: siying
fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581
Summary:
In a previous commit, I changed the way to checkout release branches from "git checkout <branch_name>" to "git checkout origin/<branch_name>". However, this doesn't seem to work in our CI environment. Revert it.
Closes https://github.com/facebook/rocksdb/pull/2189
Differential Revision: D4922294
Pulled By: siying
fbshipit-source-id: 482c17f9b05e6ccb190876b050682fe5a458103d
Summary:
tools/check_format_compatible.sh will check a newer version of RocksDB can open option files generated by older version releases. In order to achieve that, a new parameter "--try_load_options" is added to ldb. With this parameter set, if option file exists, we load the option file and use it to open the DB. With this opiton set, we can validate option loading logic.
Closes https://github.com/facebook/rocksdb/pull/2178
Differential Revision: D4914989
Pulled By: siying
fbshipit-source-id: db114f7724fcb41e5e9483116d84d7c4b8389ca4
Summary:
Tested by running it on a remote machine.
I could not run it on the particular remote machine which has a different location for time command since it is busy and the script does not allow concurrent runs. So I tested it by hacking the script and replacing the command with "\$(hostname)" and confirmed that the scripts prints out the host name of the remote machine.
Closes https://github.com/facebook/rocksdb/pull/2181
Differential Revision: D4921654
Pulled By: maysamyabandeh
fbshipit-source-id: 8abb5ea9f7234f3c50a749576ccbb47ff605beb9
Summary:
Need to add more recent versions to tools/check_format_compatible.sh to meka sure backward and forward compatibility.
Closes https://github.com/facebook/rocksdb/pull/2175
Differential Revision: D4911585
Pulled By: siying
fbshipit-source-id: 943e6488757efb11bb6720d811c7ba949915c9de
Summary:
Add a function to allow users to reset internal stats without restarting the DB.
Closes https://github.com/facebook/rocksdb/pull/2167
Differential Revision: D4907939
Pulled By: siying
fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68
Summary:
Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction
Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get().
Closes https://github.com/facebook/rocksdb/pull/2117
Differential Revision: D4860912
Pulled By: lightmark
fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
Summary:
filter the warning out and only print it once.
Closes https://github.com/facebook/rocksdb/pull/2137
Differential Revision: D4870925
Pulled By: lightmark
fbshipit-source-id: 91b363ce7f70bce88b0780337f408fc4649139b8
Summary:
Run the time command before regression tests, parse the output, and add the numbers to the report.
Closes https://github.com/facebook/rocksdb/pull/2101
Differential Revision: D4862781
Pulled By: maysamyabandeh
fbshipit-source-id: 4a81caa5d14187d67093aad154c8f0ad56aba901
Summary:
Check the result of the benchmark againt a specified truth_db, which is
expected to be produced using the same benchmark but perhaps on a
different commit or with different configs.
The verification is simple and assumes that key/values are generated
deterministically. This assumption would break if db_bench using rand
variable differently from the benchmark that produced truth_db.
Currently it is checked to work on fillrandom and readwhilewriting.
A param finish_after_writes is added to ensure that the background
writing thread will write the same number of entries between two
benchmarks.
Example:
$ TEST_TMPDIR=/dev/shm/truth_db ./db_bench
--benchmarks="fillrandom,readwhilewriting" --num=200000
--finish_after_writes=true
$ TEST_TMPDIR=/dev/shm/tmpdb ./db_bench
--benchmarks="fillrandom,readwhilewriting,verify" --truth_db
/dev/shm/truth_db/dbbench --num=200000 --finish_after_writes=true
Verifying db <= truth_db...
Verifying db >= truth_db...
...Verified
Closes https://github.com/facebook/rocksdb/pull/2098
Differential Revision: D4839233
Pulled By: maysamyabandeh
fbshipit-source-id: 2f4ed31
Summary:
This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper.
Check the sub-commits for easier reviewing.
Closes https://github.com/facebook/rocksdb/pull/2094
Differential Revision: D4837730
Pulled By: sagar0
fbshipit-source-id: 344278a
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
Rebuilding regression db at the first run every monday.
Closes https://github.com/facebook/rocksdb/pull/2093
Differential Revision: D4836961
Pulled By: lightmark
fbshipit-source-id: 22d6c25
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080
Differential Revision: D4821141
Pulled By: siying
fbshipit-source-id: ca7d768
Summary:
Fix the bug that previous test existence locally
Closes https://github.com/facebook/rocksdb/pull/2074
Differential Revision: D4813631
Pulled By: lightmark
fbshipit-source-id: eceb0d9
Summary:
add ldb to regression test in order to enable reuse db by creating checkpoint
Closes https://github.com/facebook/rocksdb/pull/2030
Differential Revision: D4800549
Pulled By: lightmark
fbshipit-source-id: d3a7325
Summary:
Added fifo benchmark to db_bench.
One thing i am not sure is that i am using CompactRange() instead of CompactFiles(). (may cause performance skew because CompactionRange() is not happening in current thread?) For CompactFiles(), for some reason FIFO compaction doesn't work as expected. More insight is welcomed. I guess FIFO compaction doesn't work with file names? igorcanadi
test cmd:
./db_bench --compaction_style=2 --benchmarks=fillseqdeterministic --disable_auto_compactions --num_levels=1 --fifo_compaction_max_table_files_size_mb=10
---------------------- DB 0 LSM ---------------------
Level[0]: /000014.sst(size: 4211014 bytes)
fillseqdeterministic : 4.731 micros/op 211381 ops/sec; 23.4 MB/s
Closes https://github.com/facebook/rocksdb/pull/1734
Differential Revision: D4774964
Pulled By: siying
fbshipit-source-id: 9d08df6
Summary:
PlainTable now supports non-mmap mode. We don't need to check it anymore.
Closes https://github.com/facebook/rocksdb/pull/1882
Differential Revision: D4751643
Pulled By: siying
fbshipit-source-id: ab14540
Summary:
Removed max_grandparent_overlap_factor from benchmark.sh since it is not a valid option anymore.
Closes https://github.com/facebook/rocksdb/pull/2015
Differential Revision: D4748229
Pulled By: lgalanis
fbshipit-source-id: c3869ea
Summary:
PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the
user. This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756
Differential Revision: D4391738
Pulled By: maysamyabandeh
fbshipit-source-id: 6f3edd3
Summary:
Add the flag --prefix to the sst_dump tool
This flag is similar to, and exclusive from, the --from flag.
--prefix=0x00FF will return all rows prefixed with 0x00FF.
The --to flag may also be specified and will work as expected.
These changes were used to help in debugging the power cycle corruption issue and theses changes were tested by scanning through a udb.
Closes https://github.com/facebook/rocksdb/pull/1984
Differential Revision: D4691814
Pulled By: reidHoruff
fbshipit-source-id: 027f261
Summary:
This reverts commit d43adf21bb.
The patch has caused problems in regression tests. Will revert it for now until we figure how to debug the problems regression tests.
Closes https://github.com/facebook/rocksdb/pull/1975
Differential Revision: D4682880
Pulled By: maysamyabandeh
fbshipit-source-id: 84df83a
Summary:
It augments the regression benchmarks with a time command, parses the output, and print them to the SUMMARY.csv file.
I tested a variation of the script locally. Any idea how to do run a test that also involves writing to scuba tables?
Closes https://github.com/facebook/rocksdb/pull/1967
Differential Revision: D4679470
Pulled By: maysamyabandeh
fbshipit-source-id: 44dac30
Summary:
Also extracted the common logic into a base class, BackupableCommand.
Closes https://github.com/facebook/rocksdb/pull/1939
Differential Revision: D4630121
Pulled By: ajkr
fbshipit-source-id: 04bb067
Summary:
This option is needed to be enabled for Direct IO
and I cannot think of a reason where we need to disable it
remove it and default it to true
Closes https://github.com/facebook/rocksdb/pull/1944
Differential Revision: D4641088
Pulled By: IslamAbdelRahman
fbshipit-source-id: d7085b9
Summary:
The default behavior was too weird because, previously, we got the L0 file size limit (64MB) from Options default and L1+ file size limit (2MB) from the hardcoded value. We should get both from Options default.
Closes https://github.com/facebook/rocksdb/pull/1943
Differential Revision: D4640301
Pulled By: ajkr
fbshipit-source-id: fd8c0fd
Summary:
add direct_io and compaction_readahead_size in db_stress
test direct_io under db_stress with compaction_readahead_size enabled to capture bugs found in production.
`./db_stress --allow_concurrent_memtable_write=0 --use_direct_reads --use_direct_writes --compaction_readahead_size=4096`
Closes https://github.com/facebook/rocksdb/pull/1906
Differential Revision: D4604514
Pulled By: IslamAbdelRahman
fbshipit-source-id: ebbf0ee
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859
Differential Revision: D4541292
Pulled By: sagar0
fbshipit-source-id: 5b3a6ca
Summary:
move the argument description to the right section, make it clearer that the '=' sign is required, and use it in one subcommand where it seemed forgotten before.
Closes https://github.com/facebook/rocksdb/pull/1840
Differential Revision: D4515096
Pulled By: ajkr
fbshipit-source-id: 7b5b1c1
Summary:
I want to be able to, e.g., DECLARE_string(statistics_string); in my application such that I can override the default value of statistics_string. For this to work, we need to remove the unnamed namespace containing all the flags, and make sure all variables/functions covered by that namespace are static.
Replaces #1828 due to internal tool issues.
Closes https://github.com/facebook/rocksdb/pull/1844
Differential Revision: D4515124
Pulled By: ajkr
fbshipit-source-id: 23b695e
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823
Differential Revision: D4492902
Pulled By: siying
fbshipit-source-id: c74cb11
Summary:
Added -statistics_string to deserialize a Statistics object using the factory functions registered by applications.
Closes https://github.com/facebook/rocksdb/pull/1812
Differential Revision: D4469811
Pulled By: ajkr
fbshipit-source-id: 2d80862
Summary:
The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config.
Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file.
This diff generalizes the Env registration logic to work with arbitrary types.
- Generalized registration and instantiation code by templating them
- The entire implementation is in a header file as that's Google style guide's recommendation for template definitions
- Pattern match with std::regex_match rather than checking prefix, which was the previous behavior
- Rename functions/files to be non-Env-specific
Closes https://github.com/facebook/rocksdb/pull/1776
Differential Revision: D4421933
Pulled By: ajkr
fbshipit-source-id: 34647d1
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
1. value 100 byte: 1.8% regular, 1.2% merge values
2. value 1k byte: 11.5% regular, 7.5% merge values
3. value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732
Differential Revision: D4374613
Pulled By: maysamyabandeh
fbshipit-source-id: a077f1a
Summary:
?tion
Add flags in db_bench to test with block cache mid-point insertion.
Also update sst_dump to dump total block sizes of each type. I find it
useful to look at these test db stats and I don't know if we have them
elsewhere.
Closes https://github.com/facebook/rocksdb/pull/1706
Differential Revision: D4355812
Pulled By: yiwu-arbug
fbshipit-source-id: 3e4a348
Summary:
Add the parameter in db_bench to help users to measure latency histogram with constant read rate.
Closes https://github.com/facebook/rocksdb/pull/1683
Differential Revision: D4341387
Pulled By: siying
fbshipit-source-id: 1b4b276
Summary:
Fixes compile error:
In file included from ./util/statistics.h:17:0,
from ./util/stop_watch.h:8,
from ./util/perf_step_timer.h:9,
from ./util/iostats_context_imp.h:8,
from ./util/posix_logger.h:27,
from ./port/util_logger.h:18,
from ./db/auto_roll_logger.h:15,
from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656
Differential Revision: D4318702
Pulled By: yiwu-arbug
fbshipit-source-id: 8c5d17a
Summary:
Include a dump of user_collected_properties in sst_dump
Closes https://github.com/facebook/rocksdb/pull/1668
Differential Revision: D4325078
Pulled By: IslamAbdelRahman
fbshipit-source-id: 226b6d6
Summary:
char is unsigned on power by default causing this test to fail with the FF case. ppc64 return 255 while x86 returned -1. Casting works on both platforms.
Closes https://github.com/facebook/rocksdb/pull/1500
Differential Revision: D4308775
Pulled By: yiwu-arbug
fbshipit-source-id: db3e6e0
Summary:
The info log file ("LOG") is stored in the db directory by default. When the db is on a distributed env, this is unnecessarily slow. So, I added an option to db_bench to just print the info log messages to stderr.
Closes https://github.com/facebook/rocksdb/pull/1641
Differential Revision: D4309348
Pulled By: ajkr
fbshipit-source-id: 1e6f851
Summary:
made db_stress capable of adding range deletions to its db and verifying their correctness. i'll make db_crashtest.py use this option later once the collapsing optimization (https://github.com/facebook/rocksdb/pull/1614) is committed because currently it slows down the test too much.
Closes https://github.com/facebook/rocksdb/pull/1625
Differential Revision: D4293939
Pulled By: ajkr
fbshipit-source-id: d3beb3a
Summary:
When enabled, this option replaces range tombstones with a sequence of
point tombstones covering the same range. This can be used to A/B test perf of
range tombstones vs sequential point tombstones, and help us find the cross-over
point, i.e., the size of the range above which range tombstones outperform point
tombstones.
Closes https://github.com/facebook/rocksdb/pull/1594
Differential Revision: D4246312
Pulled By: ajkr
fbshipit-source-id: 3b00b23
Summary:
Reduce number of comparisons in heap by caching which child node in the first level is smallest (left_child or right_child)
So next time we can compare directly against the smallest child
I see that the total number of calls to comparator drops significantly when using this optimization
Before caching (~2mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq : 0.338 micros/op 2959201 ops/sec; 327.4 MB/s user_key_comparison_count = 2000008
```
After caching (~1mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq : 0.309 micros/op 3236801 ops/sec; 358.1 MB/s user_key_comparison_count = 1000011
```
It also improves
Closes https://github.com/facebook/rocksdb/pull/1600
Differential Revision: D4256027
Pulled By: IslamAbdelRahman
fbshipit-source-id: 76fcc66
Summary:
Now that we have userspace persisted cache, we don't need flashcache anymore.
Closes https://github.com/facebook/rocksdb/pull/1588
Differential Revision: D4245114
Pulled By: igorcanadi
fbshipit-source-id: e2c1c72
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548
Differential Revision: D4208169
Pulled By: ajkr
fbshipit-source-id: 2fd65cf
Summary:
Added a few options to configure when to add range tombstones during
any benchmark involving writes.
Closes https://github.com/facebook/rocksdb/pull/1522
Differential Revision: D4187388
Pulled By: ajkr
fbshipit-source-id: 2c8a473
Summary:
Add a subcommand to ldb with which we can delete a range of keys.
Closes https://github.com/facebook/rocksdb/pull/1521
Differential Revision: D4186338
Pulled By: ajkr
fbshipit-source-id: b8e9861
Summary:
enhancing sst_dump to be able to parse internal key
Closes https://github.com/facebook/rocksdb/pull/1482
Differential Revision: D4154175
Pulled By: siying
fbshipit-source-id: b0e28b1
Summary:
Use 16384 as e.g .value for ldb the --compression_max_dict_bytes option.
I think 14 was copy and pasted from the options in the lines above.
Closes https://github.com/facebook/rocksdb/pull/1483
Differential Revision: D4154393
Pulled By: siying
fbshipit-source-id: ef53a69
Summary:
Note: reviewed in https://reviews.facebook.net/D65115
- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464
Differential Revision: D4131753
Pulled By: ajkr
fbshipit-source-id: be86559
Summary:
Fix the bug that --dump_malloc_stats is set before opening the DB.
Closes https://github.com/facebook/rocksdb/pull/1447
Differential Revision: D4106001
Pulled By: siying
fbshipit-source-id: 4e746da
Summary:
Some older tags don't run GCC 4.8 with FB internal setting. Fixed them and created branches. Change the format compatible script accordingly.
Also add more releases to check format compatibility.
* enable cmake to work on linux and osx also
* port part of build_detect_platform not covered by thirdparty.inc
to cmake.
- detect fallocate()
- detect malloc_usable_size()
- detect JeMalloc
- detect snappy
* check for asan,tsan,ubsan
* create 'build_version.cc' in build directory.
* add `check` target to support 'make check'.
* add `tools` target to match its counterpart in Makefile.
* use `date` on non-win32 platforms.
* pass different cflags on non-win32 platforms
* detect pthead library using FindThread cmake module.
* enable CMP0042 to silence the cmake warning on osx
* reorder the linked libraries. because testutillib references gtest, to
enable the linker to find the referenced symbols, we need to put gtest
after testutillib.
Signed-off-by: Marcus Watts <mwatts@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
* hash_table_bench.cc: fix build without gflags
Signed-off-by: Kefu Chai <kchai@redhat.com>
* remove gtest from librocksdb linkage
testharness.cc is included in librocksdb sources, and it uses gtest. but
gtest is not supposed to be part of the public API of librocksdb. so, in
this change, the testharness.cc is moved out out librocksdb, and is
built as an object target, then linked with the tools and tests instead.
Signed-off-by: Marcus Watts <mwatts@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177
Summary: Fix using `arg[i].thread` after deleting it
Test Plan: run clang_analyze
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63171
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.
We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification
Test Plan: added new tests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: yiwu, leveldb, march, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58707
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.
Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.
Reviewers: kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61647
Summary:
This is a proof of concept of a RocksDB blob log file. The actual value of the Put() is appended to a blob log using normal data block format, and the handle of the block is written as the value of the key in RocksDB.
The prototype only supports Put() and Get(). It doesn't support DB restart, garbage collection, Write() call, iterator, snapshots, etc.
Test Plan: Add unit tests.
Reviewers: arahut
Reviewed By: arahut
Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61485
Summary: As title.
Test Plan: Run the benchmark with and without the parameter.
Reviewers: yiwu, andrewkr, lightmark, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61491
Summary: Add a benchmark to `db_bench`. In this benchmark, a write thread will populate time series data in the format of 'id | timestamp', and multiple read threads will randomly retrieve all data from one id at a time.
Test Plan: Run the benchmark: `num=134217728;bpl=536870912;mb=67108864;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4;delay=8;stop=12;wbn=3;mbc=20;wbs=134217728;dds=0;sync=0;t=32;vs=800;bs=4096;cs=17179869184;of=500000;wps=0;si=10000000; kir=100000; dir=/data/users/jhli/test/; ./db_bench --benchmarks=timeseries --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=$num --threads=$t --value_size=$vs --block_size=$bs --cache_size=$cs --bloom_bits=10 --cache_numshardbits=6 --open_files=$of --verify_checksum=1 --db=$dir --sync=$sync --disable_wal=0 --compression_type=none --stats_interval=$si --compression_ratio=1 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --max_write_buffer_number=$wbn --max_background_compactions=$mbc --level0_file_num_compaction_trigger=$ctrig --level0_slowdown_writes_trigger=$delay --level0_stop_writes_trigger=$stop --num_levels=$levels --delete_obsolete_files_period_micros=$del --min_level_to_compress=$mcz --max_grandparent_overlap_factor=$overlap --stats_per_interval=1 --max_bytes_for_level_base=$bpl --use_existing_db=0 --key_id_range=$kir`
Reviewers: andrewkr, sdong
Reviewed By: sdong
Subscribers: lgalanis, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60651
Summary:
db_stress test is now failing because of this scenario
- run db_stress with merge_operator enabled (now we have a db with merge operands)
- run db_stress with merge_operator disabled (now when we fail to open the db)
the solution is to pass the merge_operator to the DB even if we are not going to do any merge operations
Test Plan: Check the failure
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61311
Summary: OnTableFileCreated() now is also called when the file creaion fails. In that case, we shouldn't assert the file size is not 0.
Test Plan: Run crash test
Reviewers: yiwu, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61137
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60513
Summary:
- Added a new subcommand, 'ldb restore', that restores from backup
- Made backup_env_uri optional (also for 'ldb backup') because it can use db_env when backup_env isn't provided
Test Plan:
verify backup and restore commands work:
$ ./ldb backup --db=/data/users/andrewkr/rocksdb/db_bench-out/ --num_threads 1 --backup_dir=/data/users/andrewkr/rocksdb/db_bench-out/backup/
$ ./ldb restore --db=/data/users/andrewkr/rocksdb/db_bench-out/restore/ --num_threads 1 --backup_dir=/data/users/andrewkr/rocksdb/db_bench-out/backup/
Reviewers: sdong, wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60849
Summary:
Tsan crash white-box test was disabled because it never ends. Re-enable it with reduced killing odds.
Add a parameter in crash test script to allow we pass it through an environment variable.
Test Plan: Run it manually.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61053
getline on std::cin can be very inefficient when ldb is loading large values, with high CPU usage in libc _IO_(un)getc, this is because of the performance penalty that comes from synchronizing stdio and iostream buffers.
See the reproducers and tests in #1133 .
If an ifstream on /dev/stdin is used (when available) then using ldb to load large values can be much more efficient.
I thought for ldb load, that this approach is preferable to using <cstdio> or std::ios_base::sync_with_stdio(false).
I couldn't think of a use case where ldb load would need to support reading unbuffered input, an alternative approach would be to add support for passing --input_file=/dev/stdin.
I have a CLA in place, thanks.
The CI tests were failing at the time of https://github.com/facebook/rocksdb/pull/1156, so this change and PR will supersede it.
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build
Summary: add backup support for ldb tool, and use it to run load test for backup on two HDFS envs
Test Plan: first generate some db, then compile against load test in fbcode, run load_test --db=<db path> backup --backup_env_uri=<URI of backup env> --backup_dir=<backup directory> --num_threads=<number of thread>
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60633
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
* Fixed Windows build error in CMakeLists.txt and perf_level error in db_bench_tool.cc
* Changed hard-coded perf levels in db_bench_tool.cc to enum values from perf_level.h
* Replaced remaining FLAGS_perf_level > 0
Summary:
This diff makes the following improvement in regression_test.sh:
1. Add NUM_OPS and DELETE_TEST_PATH to regression_test.sh:
* NUM_OPS: The number of operations that will be issued
in EACH thread.
Default: $NUM_KEYS / $NUM_THREADS
* DELETE_TEST_PATH: If true, then the test directory
will be deleted after the script ends.
Default: 0
2. Add more information in SUMMARY.csv
3. Fix a bug in regression_test.sh where each thread in fillseq will all issue $NUM_KEYS writes.
4. Add --deletes in db_bench, which allows us to control the number of deletes instead of must using FLAGS_num.
Test Plan: run regression test with and without DELETE_TEST_PATH and NUM_OPS
Reviewers: yiwu, sdong, IslamAbdelRahman, gunnarku
Reviewed By: gunnarku
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60039
Summary: filter_deletes option was removed, remove it from crash_test to fix it
Test Plan: make crash_test
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59901
Summary: filter_deltes is not a frequently used feature. Remove it.
Test Plan: Run all test suites.
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59427
Summary:
The sandcastle setup doesn't provide a remote with our branches. Need
to fetch them directly from github.
Test Plan:
ran script on devserver and the relevant commands on a sandcastle
host.
Reviewers: IslamAbdelRahman, lightmark, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59511
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio
Test Plan: Run all existing tests
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59199
Summary:
Added an option, --env_uri. When provided, it is used as an argument to
NewEnvFromUri(), which instantiates an Env based on it.
Test Plan:
built a simple binary that registers ChrootEnv for prefix "/", then
ran:
$ ./tmp --env_uri /tmp/ --db /abcde
/tmp/ is the chroot directory and /abcde is the db_name. Then I verified
db_bench uses /tmp/abcde
Reviewers: sdong, kradhakrishnan, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59325
Summary:
We had to go back and update the g++ path for 4.4.fb-4.8.fb. So the
path is now fixed on the branches, but can't be fixed on the tags since they're
immutable. By making format compatibility tests use branch names (when
available), backported fixes like this will be used without having to re-release.
Also removed v1.5.7 and v2.1 because make fails.
Test Plan:
$ build_tools/rocksdb-lego-determinator run_format_compatible
Reviewers: sdong, lightmark, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59355
Summary:
This patch does the following improvement for regression_test.sh
* Allow regression_test.sh to specify OPTIONS_FILE.
* Add header comments that includes examples on how to run the script
and introduce all configurable parameters.
* bug fix.
Test Plan: Run the example commands in the header comments of regression_test.sh
Reviewers: sdong, yiwu, gunnarku
Reviewed By: gunnarku
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59175
Summary:
This patch allows db_bench to initialize it's RocksDB Options via a
options file, specified by the --options_file flag. Note that if
--options_file flag is set, then it has higher priority than the
command-line argument.
Test Plan: db_bench_tool_test
Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58533
Summary:
This patch does the following improvement on the regression_test.sh
* allows db_bench being executed at a remost host while storing the
benchmark results locally.
* kills all db_bench related processes before running db_bench
* better error handling.
Test Plan:
1. Run regression_test.sh both locally and remotely
2. Run multiple regression_test.sh at the same time and make sure
i. Only one runs successfully.
ii. The one that runs successfully will kill all other db_bench
processes before it runs any benchmark.
Reviewers: sdong, yiwu, gunnarku
Reviewed By: gunnarku
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58611
Summary: The function wrapper for LDBCommand::SelectCommand is too long so that Windows build fails with warning "decorated name length exceeded, name was truncated". Shrink the length by using a struct.
Test Plan: Build on both of Linux and Windows and make sure the warning doesn't show in either platform.
Reviewers: andrewkr, adsharma, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58965
Summary:
A couple of notes from the diff:
- The namespace block I added at the top of table_properties_collector.cc was in reaction to an issue i was having with PutVarint64 and reusing the "val" string. I'm not sure this is the cleanest way of doing this, but abstracting this out at least results in the correct behavior.
- I chose "rocksdb.merge.operands" as the property name. I am open to suggestions for better names.
- The change to sst_dump_tool.cc seems a bit inelegant to me. Is there a better way to do the if-else block?
Test Plan:
I added a test case in table_properties_collector_test.cc. It adds two merge operands and checks to make sure that both of them are reflected by GetMergeOperands. It also checks to make sure the wasPropertyPresent bool is properly set in the method.
Running both of these tests should pass:
./table_properties_collector_test
./sst_dump_test
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58119
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.
RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode.
Blame Rev:
Test Plan: Run unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55707
Summary:
So a customized ldb tool can pass it's own Selector.
Such a selector is expected to call LDBCommand::SelectCommand
and then add some of its own customized commands
Test Plan: make ldb
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57249
Summary:
The implementation remains where it is. Only the
header is exported. This is so that a customized
ldb tool can print help along with its own
extra commands
Test Plan: make ldb
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57255
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.
Test Plan: single unit test in write_batch_test.
Reviewers: hermanlee4, sdong, anthony
Subscribers: leveldb, dhruba, vasilep, andrewkr
Differential Revision: https://reviews.facebook.net/D57867
Summary:
This diff includes an initial script running a set of benchmarks for
regression test. The script does the following things:
checkout the specified rocksdb commit (or origin/master as default)
make clean && DEBUG_LEVEL=0 make db_bench
setup test directories
run set of benchmarks and store results
Currently, the script will run couple benchmarks, store all the benchmark
output, extract micros per op and percentile information for each benchmark
and store them in a single SUMMARY.csv file. The SUMMARY.csv will make the
follow-up regression detection easier.
In addition, the current script only takes env arguments to set important
attributes of db_bench. Will follow-up with a patch that allows db_bench
to construct options from an options file.
Test Plan:
NUM_KEYS=100 ./tools/regression_test.sh
Sample SUMMARY.csv file:
commit id, benchmark, ms-per-op, p50, p75, p99, p99.9, p99.99
7e23ddf575890510e7d2fc7a79b31a1bbf317917, fillseq, 15.28, 54.66, 77.14, 5000.00, 17900.00, 18483.00
7e23ddf575890510e7d2fc7a79b31a1bbf317917, overwrite, 13.54, 57.69, 86.39, 3000.00, 15600.00, 17013.00
7e23ddf575890510e7d2fc7a79b31a1bbf317917, readrandom, 1.04, 0.80, 1.67, 293.33, 395.00, 504.00
7e23ddf575890510e7d2fc7a79b31a1bbf317917, readwhilewriting, 2.75, 1.01, 1.87, 200.00, 460.00, 485.00
7e23ddf575890510e7d2fc7a79b31a1bbf317917, deleterandom, 3.64, 48.12, 70.09, 200.00, 336.67, 347.00
7e23ddf575890510e7d2fc7a79b31a1bbf317917, seekrandom, 24.31, 391.87, 513.69, 872.73, 990.00, 1048.00
7e23ddf575890510e7d2fc7a79b31a1bbf317917, seekrandomwhilewriting, 14.02, 185.14, 294.15, 700.00, 1440.00, 1527.00
Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr, gunnarku
Reviewed By: gunnarku
Subscribers: gunnarku, MarkCallaghan, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57597
Summary:
This is needed so that rocksdb users can add more
commands to the included ldb tool by adding more custom
commands.
Test Plan: make -j ldb
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57243
Summary:
crash_test grep for 'fail' string in the output and if found it consider that we failed.
Update the output to use something else
Test Plan: make crash_test (still running)
Reviewers: yhchiang, sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57381
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
Summary:
As db_stress with CompactFiles possibly catches a previous bug currently,
temporarily disable CompactFiles in db_stress in its default setting
to allows new bug to be detected while investigating the bug in CompactFiles.
Test Plan: crash test
Reviewers: sdong, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57333
Summary:
Introduced option to dump malloc statistics using new option flag.
Added new command line option to db_bench tool to enable this
funtionality.
Also extended build to support environments with/without jemalloc.
Test Plan:
1) Build rocksdb using `make` command. Launch the following command
`./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
--num=10000000` end verified that jemalloc dump is present in LOG file.
2) Build rocksdb using `DISABLE_JEMALLOC=1 make db_bench -j32` and ran
the same db_bench tool and found the following message in LOG file:
"Please compile with jemalloc to enable malloc dump".
3) Also built rocksdb using `make` command on MacOS to verify behavior
in non-FB environment.
Also to debug build configuration change temporary changed
AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
command line. For both 2) and 3) this flag was not present.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57321
Comparable with Snappy on comp ratio.
Implemented using Windows API, does not require external package.
Avaiable since Windows 8 and server 2012.
Use -DXPRESS=1 with CMake to enable.
Summary:
Enable testing CompactFiles in db_stress by adding flag test_compact_files
to db_stress.
Test Plan:
./db_stress --test_compact_files=1 --compaction_style=0 --allow_concurrent_memtable_write=false --ops_per_thread=100000
./db_stress --test_compact_files=1 --compaction_style=1 --allow_concurrent_memtable_write=false --ops_per_thread=100000
Sample output (note that it's normal to have some CompactFiles() failed):
Stress Test : 491.891 micros/op 65054 ops/sec
: Wrote 21.98 MB (0.45 MB/sec) (45% of 3200352 ops)
: Wrote 1440728 times
: Deleted 441616 times
: Single deleted 38181 times
: 319251 read and 19025 found the key
: Prefix scanned 640520 times
: Iterator size sum is 9691415
: Iterated 319704 times
: Got errors 0 times
: 1323 CompactFiles() succeed
: 32 CompactFiles() failed
2016/04/11-15:50:58 Verification successful
Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56565
Summary:
This is taken from the "Write(GB)" column in compaction stats, so the
units should be GB, not MB.
Test Plan: none
Reviewers: sdong, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56889
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.
Test Plan: Try db_bench and see the stats are printed out.
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56769
* Update db_bench_tool.cc
I fixed the wrong letters, LevelDB -> rocksDB, because I thought of LevelDB as the wrong presentation.
the following show my fix :
fprintf(stderr, "LevelDB: version %d.%d\n",
kMajorVersion, kMinorVersion);
----------------->
fprintf(stderr, "rocksDB: version %d.%d\n",
kMajorVersion, kMinorVersion);
* Update db_bench_tool.cc
* Update db_bench_tool.cc
Summary:
strings.h header does not exist on Windows. So, we can try another way
to compare strings ignoring case.
Test Plan:
built and ran:
$ ./ldb_cmd_test
Reviewers: sdong, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56535
Summary:
Current Message
```
sst_dump [--command=check|scan|none|raw] [--verify_checksum] --file=data_dir_OR_sst_file [--output_hex] [--input_key_hex] [--from=<user_key>] [--to=<user_key>] [--read_num=NUM] [--show_properties] [--show_compression_sizes] [--show_compression_sizes [--set_block_size=<block_size>]]
```
New message
```
sst_dump --file=<data_dir_OR_sst_file> [--command=check|scan|raw]
--file=<data_dir_OR_sst_file>
Path to SST file or directory containing SST files
--command=check|scan|raw
check: Iterate over entries in files but dont print anything except if an error is encounterd (default command)
scan: Iterate over entries in files and print them to screen
raw: Dump all the table contents to <file_name>_dump.txt
--output_hex
Can be combined with scan command to print the keys and values in Hex
--from=<user_key>
Key to start reading from when executing check|scan
--to=<user_key>
Key to stop reading at when executing check|scan
--read_num=<num>
Maximum number of entries to read when executing check|scan
--verify_checksum
Verify file checksum when executing check|scan
--input_key_hex
Can be combined with --from and --to to indicate that these values are encoded in Hex
--show_properties
Print table properties after iterating over the file
--show_compression_sizes
Independent command that will recreate the SST file using 16K block size with different
compressions and report the size of the file using such compression
--set_block_size=<block_size>
Can be combined with --show_compression_sizes to set the block size that will be used
when trying different compression algorithms
```
Test Plan: none
Reviewers: yhchiang, andrewkr, kradhakrishnan, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56325
Summary:
Added the column family name to the properties block. This property
is omitted only if the property is unavailable, such as when RepairDB()
writes SST files.
In a next diff, I will change RepairDB to use this new property for
deciding to which column family an existing SST file belongs. If this
property is missing, it will add it to the "unknown" column family (same
as its existing behavior).
Test Plan:
New unit test:
$ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55605
Summary:
Adapted a stderr logger from the option tests. Moved it to a separate
header so we can reuse it, e.g., from ldb subcommands for faster debugging. This
is especially useful to make errors/warnings more visible when running
"ldb repair", which involves potential data loss.
Test Plan:
ran options_test and "ldb repair"
$ ./ldb repair --db=./tmp/
[WARN] **** Repaired rocksdb ./tmp/; recovered 1 files; 588bytes. Some data may have been lost. ****
OK
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56151
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
Summary:
Undefined Behavior Sanitizer (ubsan) is //a good thing// which will help us to find sneaky bugs with low cost. Please see http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/ for more details and official GCC documentation for more context: https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.html.
Changes itself are quite simple and pretty much imitating whatever is implemented for ASan.
Hooking the UBsan validation build to Sandcastle is a separate step and will be dealt as separate diff because code is in internal repository.
Test Plan: Make sure that that there no regressions when it comes to builds and test pass rate.
Reviewers: leveldb, sdong, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56049
Summary:
Expose the inverse of ToString(hex=true) on Slice: Slice::DecodeHex
Refactor the other implementation of to/from hex in ldb_cmd.h to use the Slice
version
(Difference between the 2 is whether 0x is expected/produced in front of the hex
string or not)
Eliminated support for invalid odd length hex string - this is now invalid
instead of having 1/2 byte set
Added (inverse of HexToString) test for LDBCommand::StringToHex which also
indirectly tests Slice::ToString(true)
After moving the original implementation from ldb_cmd.h, updated it to much simpler/efficient version
(originally/inspired from https://github.com/facebook/wdt/blob/master/util/EncryptionUtils.cpp#L140-L169 )
Test Plan: run tests
Reviewers: uddipta, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56121
record.size() should not be less than 12.
This "magic number" seems to be the WriteBatch header (8 byte sequence
and 4 byte count). Replaced all the places where "12" was used
by WriteBatchInternal::kHeader.
Summary: Test seems to fail if we don't use consistent version between testing forward and backward compatibility.
Test Plan: Run the script (with some version removed manually to make it shorter)
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55773
Summary: After introducing a less forward-compatible change, update the backward compatible checking tool.
Test Plan: Run it.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55695
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.
Test Plan:
Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).
DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
Mac: OK.
Linux: with D55287 patched in it's OK.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D54801
Summary: For compactions benchmarks (both level and universal) we'll use `--max_write_buffer_number=4`. For all the other benchmarks which don't customize the value of `--max_background_flushes` we'll continue using `--max_write_buffer_number=8`.
Test Plan:
To validate basic correctness and command-line options:
```
cd ~/rocksdb
NKEYS=10000000 ./tools/run_flash_bench.sh
```
Reviewers: MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55497
Summary: Set of updates to the subcompaction benchmark script which are based on our internal discussions. The intent behind the changes is to make sure that the scripts will correctly reflect how we're doing the actual benchmarking.
Test Plan: Tested by exercising the full set of compaction benchmarks and validating the execution and consistency of results.
Reviewers: MarkCallaghan, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55461
Summary: This will make it easier for admins and devs to use RepairDB.
Test Plan:
Tried deleting the manifest and verified it recovers:
$ ldb --create_if_missing --db=/tmp/test_db put ok ok
$ rm -f /tmp/test_db/MANIFEST-000001
$ ./ldb --db=/tmp/test_db repair
$ ldb --db=/tmp/test_db get ok
ok
Reviewers: yhchiang, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55359
Summary: Refactored db_bench transaction stress tests so that they can be called from unit tests as well.
Test Plan: run new unit test as well as db_bench
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55203
Summary: After closely working with Mark, Siying, and Yueh-Hsuan this set of changes reflects the updates needed to measure RocksDB subcompaction performance in a correct manner. The essence of the benchmark is executing `fillrandom` followed by `compact` with the correct set of options for various number of subcompactions specified.
Test Plan: Tested internally to verify correctness and reliability.
Reviewers: sdong, yhchiang, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D55089
Test Plan:
Built and ran with gflags:
% ./db_bench
LevelDB: version 4.5
Date: Tue Feb 16 12:04:23 2016
CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
...
And without gflags:
% ./db_bench
Please install gflags to run rocksdb tools
%
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: igor, dhruba
Differential Revision: https://reviews.facebook.net/D54243
Summary:
Implement a benchmark for universal compaction based on the feature description (see below), in-person discussions, and reading source code:
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guidehttps://github.com/facebook/rocksdb/wiki/Universal-Compactionhttps://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#universal-compaction
Universal compaction benchmark is based on `overwrite` benchmark, adding compaction specific options to it, and executing it for different values of subcompaction to understand the impact of scaling out subcompactions for a particular scenario.
Test Plan:
- Execute the benchmark on various machines for multiple iterations to verify the reliability.
- Observe the output to make sure that compaction is taking place.
- Observe the execution to make sure that arguments passed to `db_bench` are correct.
Reviewers: sdong, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54045
Summary: Default crash test uses prefix hash memtable, which is not compatible to concurrent memtable. Allow prefix test run with skip list and use skip list memtable when concurrent insert is used.
Test Plan: Run "python -u tools/db_crashtest.py whitebox" and watch sometimes skip list is used.
Reviewers: anthony, yhchiang, kradhakrishnan, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53907
Summary: If users turn on concurrent insert but the memtable doesn't support it, they might see unexcepted crash. Fix it by explicitly fail.
Test Plan:
Run different setting of stress_test and make sure it fails correctly.
Will add a unit test too.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, andrewkr, ngbronson
Reviewed By: ngbronson
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53895
Summary: This set of changes is part of the work to introduce benchmark for universal style compaction in RocksDB. It's conceptually separate from the compaction work, so sending it out as a separate diff to get it out of the way.
Test Plan:
- Run `./tools/run_flash_bench.sh`.
- Look at the contents of `report.txt` and `report2.txt` to make sure that data is reported and attributed correctly.
- During `db_bench` execution time make sure that the correct flags are passed to `--disable_wal` depending on the benchmark being executed.
Reviewers: MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53865
Summary: Add an option of --allow_concurrent_memtable_write in stress test and cover it in crash test
Test Plan: Run crash test and make sure three combinations of the two options show up randomly.
Reviewers: IslamAbdelRahman, yhchiang, andrewkr, anthony, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53811
Summary:
Some of the tests aren't considered to be critical when it comes to getting key benchmarking data for RocksDB. Therefore we'll introduce an environment variable `SKIP_LOW_PRI_TESTS` which enables skipping those test cases. By default all the tests will be run. If you want to optimize the test-case execution then do the following:
`
$ export SKIP_LOW_PRI_TESTS=1
$ ./tools/run_flash_bench.sh
`
Test Plan: Verified that when `SKIP_LOW_PRI_TESTS` is not set then `benchmark.sh` is called for all the scenarios and when `SKIP_LOW_PRI_TESTS` is set to `1` then `benchmark.sh` is called only for the test-cases which are critical.
Reviewers: MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53739
Summary:
Add an option --column_family option, so that users can query or update specific column family.
Also add an create column family parameter to make unit test easier.
Still need to add unit tests.
Test Plan: Will add a test case in ldb python test.
Reviewers: yhchiang, rven, andrewkr, IslamAbdelRahman, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53265
Summary:
In crash test, when coming to each kill point, we start a random class using seed as current second. With this approach, for every second, the random number used is the same. However, in each second, there are multiple kill points with different frequency. It makes it hard to reason about chance of kill point to trigger. With this commit, we use thread local random seed to generate the random number, so that it will take different values per second, hoping it makes chances of killing much easier to reason about.
Also significantly reduce the kill odd to make sure time before kiling is similar as before.
Test Plan: Run white box crash test and see the killing happens as expected and the run time time before killing reasonable.
Reviewers: kradhakrishnan, IslamAbdelRahman, rven, yhchiang, andrewkr, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52971
Summary:
util/testutil.h doesn't seem to be used in tools/sst_dump_tool_imp.h. Remove it.
Also move some other include to tools/sst_dump_tool.cc instead.
Test Plan: Build with GCC, CLANG and with GCC 4.81 and 4.9.
Reviewers: yuslepukhin, yhchiang, rven, anthony, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52791
Summary:
This set of changes implements the following design: `ldb` will utilize `--path` parameter which can be used to specify a file name. Tool will then apply some heuristic to determine how to output the data properly. The design decision is not to probe the file content, but use file names to determine what dumping function to call.
Usage examples:
Understands that path points to a manifest file and dumps it.
`./ldb --path=/tmp/test_db/MANIFEST-000023 dump`
Understands that path points to a WAL file and dumps it.
`./ldb --path=/tmp/test_db/000024.log dump --header`
Understands that path points to a SST file and dumps it.
`./ldb --path=/tmp/test_db/000007.sst dump`
Figures out that none of the supported file types are applicable and outputs
an appropriate error message.
`./ldb --path=/tmp/cron.log dump`
Test Plan:
Basics:
git diff
make clean
make -j 32 commit-prereq
arc lint
More specific testing (done as part of commit-prereq, but can be iterated separately when making isolated changes):
make clean
make ldb
python tools/ldb_test.py
make rocksdb_dump
make rocksdb_undump
sh tools/rocksdb_dump_test.sh
Reviewers: rven, IslamAbdelRahman, yhchiang, kradhakrishnan, anthony, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52269
Summary:
1) changes tools/{benchmark,run_flash_bench}.sh to optionally use the write rate limit
2) removes code for --writes_per_second and switches the 'background' write rate limit
to use --benchmark_write_rate_limit
Replaces https://reviews.facebook.net/D49113
Task ID: #9555881
Blame Rev:
Test Plan:
tools/run_flash_bench.sh
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D52485
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
Summary:
Based on https://github.com/facebook/rocksdb/issues/843
It looks that when the data is hot we spend significant amount of time moving data out of RocksDB blocks. This patch reduce moving memory when possible
Original performance
```
$ time ./ldb --db=/home/tec/local/ellina_test/testdb scan > /dev/null
real 0m16.736s
user 0m11.993s
sys 0m4.725s
```
Performance after reducing memcpy
```
$ time ./ldb --db=/home/tec/local/ellina_test/testdb scan > /dev/null
real 0m11.590s
user 0m6.983s
sys 0m4.595s
```
Test Plan:
dump the output of the scan into 2 files and verifying the are exactly the same
make check
Reviewers: sdong, yhchiang, anthony, rven, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51093
Summary: Now "ldb compact" skips the bottommost level compaction. This is an unintended behavior change. Reverting it now. Maybe we need to add another mode later for it.
Test Plan: Run a manual test of 'ldb' to make sure bottom most level is compacted.
Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50925
Summary: We don't have access to GetLiveFilesMetadata() in RocksDB lite. If compiling write_stress for lite, I skip the check for leaked files, which depends on this function.
Test Plan: OPT=-DROCKSDB_LITE m write_stress
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49647
Summary:
The goal of this diff is to create a simple stress test with focus on catching:
* bugs in compaction/flush processes, especially the ones that cause assertion errors
* bugs in the code that deletes obsolete files
There are two parts of the test:
* write_stress, a binary that writes to the database
* write_stress_runner.py, a script that invokes and kills write_stress
Here are some interesting parts of write_stress:
* Runs with very high concurrency of compactions and flushes (32 threads total) and tries to create a huge amount of small files
* The keys written to the database are not uniformly distributed -- there is a 3-character prefix that mutates occasionally (in prefix mutator thread), in such a way that the first character mutates slower than second, which mutates slower than third character. That way, the compaction stress tests some interesting compaction features like trivial moves and bottommost level calculation
* There is a thread that creates an iterator, holds it for couple of seconds and then iterates over all keys. This is supposed to test RocksDB's abilities to keep the files alive when there are references to them.
* Some writes trigger WAL sync. This is stress testing our WAL sync code.
* At the end of the run, we make sure that we didn't leak any of the sst files
write_stress_runner.py changes the mode in which we run write_stress and also kills and restarts it. There are some interesting characteristics:
* At the beginning we divide the full test runtime into smaller parts -- shorter runtimes (couple of seconds) and longer runtimes (100, 1000) seconds
* The first time we run write_stress, we destroy the old DB. Every next time during the test, we use the same DB.
* We can run in kill mode or clean-restart mode. Kill mode kills the write_stress violently.
* We can run in mode where delete_obsolete_files_with_fullscan is true or false
* We can run with low_open_files mode turned on or off. When it's turned on, we configure table cache to only hold a couple of files -- that way we need to reopen files every time we access them.
Another goal was to create a stress test without a lot of parameters. So tools/write_stress_runner.py should only take one parameter -- runtime_sec and it should figure out everything else on its own.
In a separate diff, I'll add this new test to our nightly legocastle runs.
Test Plan:
The goal of this test was to retroactively catch the following bugs: D33045, D48201, D46899, D42399. I failed to reproduce D48201, but all others have been caught!
When i reverted https://reviews.facebook.net/D33045:
./write_stress --runtime_sec=200 --low_open_files_mode=true
Iterator statuts not OK: IO error: /fast-rocksdb-tmp/rocksdb_test/write_stress/089166.sst: No such file or directory
When i reverted https://reviews.facebook.net/D42399:
python tools/write_stress_runner.py --runtime_sec=5000
Running write_stress, will kill after 5 seconds: ./write_stress --runtime_sec=-1
Running write_stress, will kill after 2 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false --delete_obsolete_files_with_fullscan=true
Running write_stress, will kill after 7 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false
Running write_stress, will kill after 5 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false
Running write_stress, will kill after 8 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false --low_open_files_mode=true
Write to DB failed: IO error: /fast-rocksdb-tmp/rocksdb_test/write_stress/019250.sst: No such file or directory
ERROR: write_stress died with exitcode=-6
When i reverted https://reviews.facebook.net/D46899:
python tools/write_stress_runner.py --runtime_sec=1000
runtime: 1000
Going to execute write stress for [3, 3, 100, 3, 2, 100, 1, 788]
Running write_stress for 3 seconds: ./write_stress --runtime_sec=3 --low_open_files_mode=true
Running write_stress for 3 seconds: ./write_stress --runtime_sec=3 --destroy_db=false --delete_obsolete_files_with_fullscan=true
Running write_stress, will kill after 100 seconds: ./write_stress --runtime_sec=-1 --destroy_db=false --delete_obsolete_files_with_fullscan=true
write_stress: db/db_impl.cc:2070: void rocksdb::DBImpl::MarkLogsSynced(uint64_t, bool, const rocksdb::Status&): Assertion `log.getting_synced' failed.
ERROR: write_stress died with exitcode=-6
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, sdong, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49533
Summary: crash_test still has a very low chance to hit some crash point. Have another mode for covering them more likely.
Test Plan: Run crash_test and see db_stress is called with expected prameters.
Reviewers: kradhakrishnan, igor, anthony, rven, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49473
Summary:
in tools/db_crashtest.py, cmd_params['db'] by default is a lambda expression, not the actual db_name.
fix by get the db_name before passing it to gen_cmd.
Test Plan: run `make crashtest`
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D49119
Summary:
merge tools/db_crashtest2.py into tools/db_crashtest.py
python tools/db_crashtest.py -h # show help message, ALL parameters can be overwrite by arguments
Example usages:
python tools/db_crashtest.py blackbox # run blackbox with default parameters
python tools/db_crashtest.py blackbox --simple
python tools/db_crashtest.py whitebox # run whitebox with default parameters
python tools/db_crashtest.py whitebox --simple
all default parameters are identical to previous version.
Test Plan: `make crash_test` and make sure it can run with expected parameters pased to db_stress.
Reviewers: igor, rven, anthony, IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48567
We will need the log number to validate the recycle-style CRCs. The log
is helpful for debugging, but optional, as not all callers have it.
Signed-off-by: Sage Weil <sage@redhat.com>
Summary:
For half of the crash_test run, disable fail point for file appending, in order to trigger other fail point more frequently.
Also, tune crash test parameter a little bit for it to initialize faster.
Test Plan: Run crash_test and make sure it issues db_stress commands as expected.
Reviewers: yhchiang, igor, rven, anthony, IslamAbdelRahman, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48843
Summary: Mode of data sync off is a much less used than the case of data sync on. Crash test should cover the more common case than a corner case. So turn data sync on in crash tests.
Test Plan: Run crash test and make sure it can run with expected parameters pased to db_stres.
Reviewers: IslamAbdelRahman, anthony, igor, rven, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48729
Summary:
Give a name for every kill point, and allow users to disable some kill points based on prefixes. The kill points can be passed by db_stress through a command line paramter. This provides a way for users to boost the chance of triggering low frequency kill points
This allow follow up changes in crash test scripts to improve crash test coverage.
Test Plan:
Manually run db_stress with variable values of --kill_random_test and --kill_prefix_blacklist. Like this:
--kill_random_test=2 --kill_prefix_blacklist=Posix,WritableFileWriter::Append,WritableFileWriter::WriteBuffered,WritableFileWriter::Sync
Reviewers: igor, kradhakrishnan, rven, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48735
Summary:
This patch will block all tests (not including db_test) that don't compile / fail under ROCKSDB_LITE
Test Plan:
OPT=-DROCKSDB_LITE make db_compaction_filter_test -j64 &&
OPT=-DROCKSDB_LITE make db_compaction_test -j64 &&
OPT=-DROCKSDB_LITE make db_dynamic_level_test -j64 &&
OPT=-DROCKSDB_LITE make db_log_iter_test -j64 &&
OPT=-DROCKSDB_LITE make db_tailing_iter_test -j64 &&
OPT=-DROCKSDB_LITE make db_universal_compaction_test -j64 &&
OPT=-DROCKSDB_LITE make ldb_cmd_test -j64
make clean
make db_compaction_filter_test -j64 &&
make db_compaction_test -j64 &&
make db_dynamic_level_test -j64 &&
make db_log_iter_test -j64 &&
make db_tailing_iter_test -j64 &&
make db_universal_compaction_test -j64 &&
make ldb_cmd_test -j64
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48723
Summary: As part of cleaning up dependencies for tech debt week, we are moving ldb and sst_dump tools from util to tools, since they are tools.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48747
Summary:
Refactor dump_tool and undump_tool so that it's possible to use them with customized options
for example setting a specific comparator similar to what Dragon is doing with the LdbTool
https://phabricator.fb.com/diffusion/FBCODE/browse/master/dragon/tools/Ldb.cpp
Test Plan:
compiles
used it to dump / undump a dragon shard
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, adsharma
Differential Revision: https://reviews.facebook.net/D47853
Summary:
Fixed the following tsan warning in db_stress.cc
WARNING: ThreadSanitizer: data race (pid=3163194)
Read of size 8 at 0x7fd1797cb518 by thread T32:
#0 VerifyDb tools/db_stress.cc:1731 (db_stress+0x000000040674)
#1 rocksdb::StressTest::ThreadBody(void*) tools/db_stress.cc:1191 (db_stress+0x0000000625a9)
#2 StartThreadWrapper util/env_posix.cc:1648 (db_stress+0x00000028bbbd)
Previous write of size 8 at 0x7fd1797cb518 by thread T31:
#0 VerifyDb tools/db_stress.cc:1726 (db_stress+0x00000004072a)
#1 rocksdb::StressTest::ThreadBody(void*) tools/db_stress.cc:1191 (db_stress+0x0000000625a9)
#2 StartThreadWrapper util/env_posix.cc:1648 (db_stress+0x00000028bbbd)
The cause is that in VerifyDb(), the static local const variable long max_key
can be read and written at the same time. This patch fixed it by making it
non-static.
Test Plan: db_stress
Reviewers: igor, sdong, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47703
Summary:
This patch fixes#7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
Summary:
Added tests for two LDBCommands namely i) ManifestDumpCommand and ii) ListColumnFamiliesCommand.
+ Minor fix in the sscanf formatter (along relace C cast with C++ cast) + replacing localtime with localtime_r which is thread safe.
Test Plan: make all && ./tools/ldb_test.py
Reviewers: anthony, igor, IslamAbdelRahman, kradhakrishnan, lgalanis, rven, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45819
Summary: Add ZSTD compression type. The same way as adding LZ4.
Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.
Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45747
Summary:
db_bench output now displays Percentile many times with --statistics after
read IO latency histograms were added. So I only need the last one in the report output.
Task ID: #
Blame Rev:
Test Plan:
run run_flash_bench.sh
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45093
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.
I also added a --subcompactions option to db_bench
Test Plan: make all make check
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45069
Summary:
Changes include:
* don't sync-on-commit for single writer thread in readwhile... tests
* make default block size 8kb rather than 4kb to avoid too small blocks after compression
* use snappy instead of zlib to avoid stalls from compression latency
* disable statistics
* use bytes_per_sync=8M to reduce throughput loss on disk
* use open_files=-1 to reduce mutex contention
Task ID: #
Blame Rev:
Test Plan:
run benchmark
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44961
Summary:
In prepration for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.
Test Plan: ./db_test ./db_compaction_test ./compaction_job_test
Reviewers: sdong, anthony, igor, noetzli, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43239
Summary:
While working on single delete support for db_bench, I realized that
db_bench/db_stress contain a bunch of duplicate code related to
copmression and found some typos. This patch removes duplicate code,
typos and a redundant #ifndef in internal_stats.cc.
Test Plan: make db_stress && make db_bench && ./db_bench --benchmarks=compress,uncompress
Reviewers: yhchiang, sdong, rven, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43965
Summary: In a recent change, crash_test can put data under TEST_TMPDIR. However, the directory is not cleaned before running the test, which may cause unexpected results. Clean it.
Test Plan: Run white and black box crash test against non-existing, or non-empty but not compactible DBs, and make sure it works as expected.
Reviewers: kradhakrishnan, rven, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43515
Summary: Currently, whitebox crash test is not really executed, because the DB is destroyed after each crash. With this fix, in the first half of the time, DB will keep opening the crashed DB and continue from there.
Test Plan: "make whitebox_crash_test" and see the same DB keeps crashing and being reopened.
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43503
Summary: Currently crash_test only puts data under /tmp. It is less flexible if we want to cover different file systems or media. Make crash_test to appreciate TEST_TMPDIR so that users can run it against another file system.
Test Plan: Run blackbox_crash_test and whitebox_crash_test with or without TEST_TMPDIR set and make sure DBs are put in the right place
Reviewers: kradhakrishnan, yhchiang, rven, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43509
Summary:
crash_test now only runs complicated options, multiple column families, prefix hash, frequently changing options, many compaction threads, etc. These options are good to cover new features but we loss coverage in most common use cases. Furthermore, by running only for multiple column families, we are not able to create LSM trees that are large enough to cover some stress cases.
Make half of crash_test runs the simply tests: single column family, default mem table, one compaction thread, no change options.
Test Plan: Run crash_test
Reviewers: rven, yhchiang, IslamAbdelRahman, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43461
Summary:
Crash tests are supposed to restart the same DB after crashing, but it is now opening a different DB. Fix it.
It's probably a leftover of https://reviews.facebook.net/D17073
Test Plan: Run the test and make sure the same Db is opened.
Reviewers: kradhakrishnan, rven, igor, IslamAbdelRahman, yhchiang, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43197
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.
We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.
Test Plan: make check (there might be some unit tests that depend on this behavior)
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41931
Summary: This is failing our tsan tests. Our new behavior is to fail DB::Open() if the requested compression is not available. The easiest fix is to make ldb_test not depend on compression.
Test Plan: python tools/ldb_test.py
Reviewers: sdong, yhchiang, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42075
Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.
Test Plan: none
Reviewers: sdong, yhchiang, anthony, dhruba
Reviewed By: dhruba
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42063
Summary: Block reduce_levels_test in ROCKSDB_LITE as LDBCommand is not supported
Test Plan:
make reduce_levels_test -j64
OPT=-DROCKSDB_LITE make reduce_levels_test -j64
make check -j64
Reviewers: sdong, igor, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D41967
Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more.
Test Plan: Build it.
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41961
Summary: Make RocksDb build and run on Windows to be functionally
complete and performant. All existing test cases run with no
regressions. Performance numbers are in the pull-request.
Test plan: make all of the existing unit tests pass, obtain perf numbers.
Co-authored-by: Praveen Rao praveensinghrao@outlook.com
Co-authored-by: Sherlock Huang baihan.huang@gmail.com
Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
Summary: Hack up rocksdb_dump and rocksdb_undump utilities to get this task rolling/promote discussion.
Test Plan: Dump/undump databases recursively to see if nothing is lost.
Reviewers: sdong, yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37269
Summary:
EventListener::OnFlushCompleted() now passes a structure instead
of a list of parameters. This minimizes the API change in the
future.
Test Plan:
listener_test
compact_files_test
example/compact_files_example
Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor
Reviewed By: rven, igor
Subscribers: IslamAbdelRahman, rven, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39543
Summary:
Add EventListener::OnTableFileCreated(), which will be called
when a table file is created. This patch is part of the
EventLogger and EventListener integration.
Test Plan: Augment existing test in db/listener_test.cc
Reviewers: anthony, kradhakrishnan, rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38865
Summary:
Fixed db_stress by correcting the verification of column family
names in the Listener of db_stress
Test Plan: db_stress
Reviewers: igor, sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39255
Summary: Fixed a compile warning in db_stress in NDEBUG mode.
Test Plan: make OPT=-DNDEBUG db_stress
Reviewers: sdong, anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39213
Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported.
Test Plan: Added a new test, but still need to look into stress testing.
Reviewers: yhchiang, igor, rven, sdong
Reviewed By: sdong
Subscribers: adamretter, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D33435