Summary:
Previously, if the jemalloc was built with nonempty string for
`--with-jemalloc-prefix`, then `HasJemalloc()` would return false on
Linux, so jemalloc would not be used at runtime. On Mac, it would cause
a linker failure due to no definitions found for the weak functions
declared in "port/jemalloc_helper.h". This should be a rare problem
because (1) on Linux the default `--with-jemalloc-prefix` value is the
empty string, and (2) Homebrew's build explicitly sets
`--with-jemalloc-prefix` to the empty string.
However, there are cases where `--with-jemalloc-prefix` is nonempty.
For example, when building jemalloc from source on Mac, the default
setting is `--with-jemalloc-prefix=je_`. Such jemalloc builds should be
usable by RocksDB.
The fix is simple. Defining `JEMALLOC_MANGLE` before including
"jemalloc.h" causes it to define unprefixed symbols that are aliases for
each of the prefixed symbols. Thanks to benesch for figuring this out
and explaining it to me.
Fixes https://github.com/facebook/rocksdb/issues/1462.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5521
Test Plan:
build jemalloc with prefixed symbols:
```
$ ./configure --with-jemalloc-prefix=lol
$ make
```
compile rocksdb against it:
```
$ WITH_JEMALLOC_FLAG=1 JEMALLOC=1 EXTRA_LDFLAGS="-L/home/andrew/jemalloc/lib/" EXTRA_CXXFLAGS="-I/home/andrew/jemalloc/include/" make -j12 ./db_bench
```
run db_bench and verify jemalloc actually used:
```
$ ./db_bench -benchmarks=fillrandom -statistics=true -dump_malloc_stats=true -stats_dump_period_sec=1
$ grep jemalloc /tmp/rocksdbtest-1000/dbbench/LOG
2019/06/29-12:20:52.088658 7fc5fb7f6700 [_impl/db_impl.cc:837] ___ Begin jemalloc statistics ___
...
```
Differential Revision: D16092758
fbshipit-source-id: c2c358346190ed62ceb2a3547a6c4c180b12f7c4
Summary:
Recent commit 3886dddc3b introduced a new test which is not compatible with lite mode and breaks contrun test:
```
[ RUN ] StatsHistoryTest.ForceManualFlushStatsCF
monitoring/stats_history_test.cc:642: Failure
Expected: (cfd_stats->GetLogNumber()) < (cfd_test->GetLogNumber()), actual: 15 vs 15
```
This PR excludes the test from lite mode to appease the failing test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5529
Differential Revision: D16080892
Pulled By: miasantreble
fbshipit-source-id: 2f8a22758f71250cd9f204046404226ddc13b028
Summary:
Formatting fixes in db_bench_tool that were accidentally omitted
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5525
Test Plan: Unit tests
Differential Revision: D16078516
Pulled By: elipoz
fbshipit-source-id: bf8df0e3f08092a91794ebf285396d9b8a335bb9
Summary:
This is to prevent bg flush thread from unrefing and deleting the cfd that has been dropped by a concurrent thread.
Before RocksDB calls `DBImpl::WaitForFlushMemTables`, we should increase the refcount of each `ColumnFamilyData` so that its ref count will not drop to 0 even if the column family is dropped by another thread. Otherwise the bg flush thread can deref the cfd and deletes it, causing a segfault in `WaitForFlushMemtables` upon accessing `cfd`.
Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32
$make check
```
All unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5513
Differential Revision: D16062898
Pulled By: riversand963
fbshipit-source-id: 37dc511f1dc99f036d0201bbd7f0a8f5677c763d
Summary:
This PR creates cache_simulator.h file. It contains a CacheSimulator that runs against a block cache trace record. We can add alternative cache simulators derived from CacheSimulator later. For example, this PR adds a PrioritizedCacheSimulator that inserts filter/index/uncompressed dictionary blocks with high priority.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5517
Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32
Differential Revision: D16043689
Pulled By: HaoyuHuang
fbshipit-source-id: 65f28ed52b866ffb0e6eceffd7f9ca7c45bb680d
Summary:
WAL records RocksDB writes to all column families. When user flushes a a column family, the old WAL will not accept new writes but cannot be deleted yet because it may still contain live data for other column families. (See https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log#life-cycle-of-a-wal for detailed explanation)
Because of this, if there is a column family that receive very infrequent writes and no manual flush is called for it, it could prevent a lot of WALs from being deleted. PR https://github.com/facebook/rocksdb/pull/5046 introduced persistent stats column family which is a good example of such column families. Depending on the config, it may have long intervals between writes, and user is unaware of it which makes it difficult to call manual flush for it.
This PR addresses the problem for persistent stats column family by forcing a flush for persistent stats column family when 1) another column family is flushed 2) persistent stats column family's log number is the smallest among all column families, this way persistent stats column family will keep advancing its log number when necessary, allowing RocksDB to delete old WAL files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5509
Differential Revision: D16045896
Pulled By: miasantreble
fbshipit-source-id: 286837b633e988417f0096ff38384742d3b40ef4
Summary:
Enhancement to MultiGet batching to read data blocks required for keys in a batch in parallel from disk. It uses Env::MultiRead() API to read multiple blocks and reduce latency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464
Test Plan:
1. make check
2. make asan_check
3. make asan_crash
Differential Revision: D15911771
Pulled By: anand1976
fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394
Summary:
This PR is needed for integration into MyRocks. A second call on StartTrace returns Busy so that MyRocks may return an error to the user.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5519
Test Plan: make clean && USE_CLANG=1 make check -j32
Differential Revision: D16055476
Pulled By: HaoyuHuang
fbshipit-source-id: a51772fb0965c873922757eb470a332b1e02a91d
Summary:
`create_column_family` cmd already exists but was somehow missed in the help message.
also add `drop_column_family` cmd which can drop a cf without opening db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5503
Test Plan: Updated existing ldb_test.py to test deleting a column family.
Differential Revision: D16018414
Pulled By: lightmark
fbshipit-source-id: 1fc33680b742104fea86b10efc8499f79e722301
Summary:
Mid-point insertion is a useful feature and is mature now. Make it default. Also changed cache_index_and_filter_blocks_with_high_priority=true as default accordingly, so that we won't evict index and filter blocks easier after the change, to avoid too many surprises to users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508
Test Plan: Run all existing tests.
Differential Revision: D16021179
fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb
Summary:
Add C binding for secondary instance as well as unit test.
Test plan (on devserver)
```
$make clean && COMPILE_WITH_ASAN=1 make -j20 all
$./c_test
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5505
Differential Revision: D16000043
Pulled By: riversand963
fbshipit-source-id: 3361ef6bfdf4ce12438cee7290a0ac203b5250bd
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
Summary:
This PR adds a feature in block cache trace analysis tool to write statistics into csv files.
1. The analysis tool supports grouping the number of accesses per second by various labels, e.g., block, column family, block type, or a combination of them.
2. It also computes reuse distance and reuse interval.
Reuse distance: The cumulated size of unique blocks read between two consecutive accesses on the same block.
Reuse interval: The time between two consecutive accesses on the same block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5490
Differential Revision: D15901322
Pulled By: HaoyuHuang
fbshipit-source-id: b5454fea408a32757a80be63de6fe1c8149ca70e
Summary:
There is a compile error on Windows with MSVC in malloc_stats.cc where malloc_stats_print is referenced. The compiler only knows je_malloc_stats_print from jemalloc.h. Adding JEMALLOC_NO_RENAME replaces malloc_stats_print with je_malloc_stats_print.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5470
Differential Revision: D15978720
fbshipit-source-id: c05757a2e89e2e015a661d9626c352e4f32f97e4
Summary:
As [BlockBasedTableConfig setBlockCacheSize()](1966a7c055/java/src/main/java/org/rocksdb/BlockBasedTableConfig.java (L728)) said, If cacheSize is non-positive, then cache will not be used. but when we configure a negative number or 0, there is an unexpected result: the block cache becomes 8M.
- Allow 0 as a valid size. When block cache size is 0, an 8MB block cache is created, as it is the default C++ API behavior. Also updated the comment.
- Set no_block_cache true if negative value is passed to block cache size, and no block cache will be created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5465
Differential Revision: D15968788
Pulled By: sagar0
fbshipit-source-id: ee02d6e95841c9e2c316a64bfdf192d46ff5638a
Summary:
This adds some compression dependencies to AppVeyor CI (those whose builds can be easily scripted on Windows, i.e. Snappy, LZ4, and ZStd).
Let's see if the CI passes ;-)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5226
Differential Revision: D15967223
fbshipit-source-id: 0914c613ac358cbb248df75cdee8099e836828dc
Summary:
It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest.
Also syncing after writing global sequence when write_global_seqno=true was removed in https://github.com/facebook/rocksdb/issues/4172. Adding it back.
Fixes https://github.com/facebook/rocksdb/issues/5287.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5435
Test Plan:
Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false.
https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9
More test suggestions are welcome.
Differential Revision: D15941675
Pulled By: riversand963
fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
Summary:
This PR adds more callers for table readers. These information are only used for block cache analysis so that we can know which caller accesses a block.
1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.
This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454
Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.
Differential Revision: D15819451
Pulled By: HaoyuHuang
fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
Summary:
~DBWithTTLImpl() fails after calling Close() function (will invoke the
Close() function of DBImpl), because the Close() function deletes
default_cf_handle_ which is used in the GetOptions() function called
in ~DBWithTTLImpl(), hence lead to segfault.
Fix by creating a Close() function for the DBWithTTLImpl class and do
the close and the work originally in ~DBWithTTLImpl(). If the Close()
function is not called, it will be called in the ~DBWithTTLImpl()
function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5485
Test Plan: make clean; USE_CLANG=1 make all check -j
Differential Revision: D15924498
fbshipit-source-id: 567397fb972961059083a1ae0f9f99ff74872b78
Summary:
`Block::restart_index_`, `Block::restarts_`, and `Block::current_` are defined as uint32_t but `BlockBasedTableOptions::block_size` is defined as a size_t so user might see corruption as in https://github.com/facebook/rocksdb/issues/5486.
This PR adds a check in `BlockBasedTableFactory::SanitizeOptions` to disallow such configurations.
yiwu-arbug
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5492
Differential Revision: D15914047
Pulled By: miasantreble
fbshipit-source-id: c943f153d967e15aee7f2795730ab8259e2be201
Summary:
The usage of `AlignedBuffer` in env_encryption.cc writes and reads to/from the AlignedBuffer's internal buffer directly without going through AlignedBuffer's APIs (like `Append` and `Read`), causing encapsulation to break in some cases. The writes are especially problematic as after the data is written to the buffer (directly using either memmove or memcpy), the size of the buffer is not updated ... causing the AlignedBuffer to lose track of the encapsulated buffer's current size.
Fixed this by updating the buffer size after every write.
Todo for later:
Add an overloaded method to AlignedBuffer to support a memmove in addition to a memcopy. Encryption env does a memmove, and hence I couldn't switch to using `AlignedBuffer.Append()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5396
Test Plan: `make check`
Differential Revision: D15764756
Pulled By: sagar0
fbshipit-source-id: 2e24b52bd3b4b5056c5c1da157f91ddf89370183
Summary:
Make the generics of the Options interfaces more strict so they are usable in a Kotlin Multiplatform expect/actual typealias implementation without causing a Violation of Finite Bound Restriction.
This fix would enable the creation of a generic Kotlin multiplatform library by just typealiasing the JVM implementation to the current Java implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5461
Differential Revision: D15903288
Pulled By: sagar0
fbshipit-source-id: 75e83fdf5d2fcede40744a17e767563d6a4b0696
Summary:
Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431
Test Plan:
make check
Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5
Differential Revision: D15772533
Pulled By: vjnadimpalli
fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9
Summary:
When tailing the WAL with TransactionLogIterator, it used to return Corruption status to indicate that the WAL has new tail that is not visible to the iterator, which is a misleading status. The patch replaces it with TryAgain which is more descriptive of a status, indicating that the user needs to create a new iterator to fetch the recent tail.
Fixes https://github.com/facebook/rocksdb/issues/5455
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5474
Differential Revision: D15898953
Pulled By: maysamyabandeh
fbshipit-source-id: 40966f6457cb539e1aeb104daeada6b0e46059fc
Summary:
The patch brings the semantics of per-block-type read performance
context counters in sync with the generic block_read_count by only
incrementing the counter if the block was actually read from the file.
It also fixes index_block_read_count, which fell victim to the
refactoring in PR https://github.com/facebook/rocksdb/issues/5298.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5484
Test Plan: Extended the unit tests.
Differential Revision: D15887431
Pulled By: ltamasi
fbshipit-source-id: a3889759d0ac5759d56625d692cd828d1b9207a6
Summary:
sprintf is unsafe and has buffer overrun risk. Replace it with the safer version snprintf where buffer size is supplied to avoid overrun.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5475
Differential Revision: D15879481
Pulled By: sagar0
fbshipit-source-id: 7ae1958ffc9727fa50261dfbb98ddd74e70a72d8
Summary:
PR https://github.com/facebook/rocksdb/issues/5298 subtly changed how read options are applied to the index block
during a Get, MultiGet, or iteration. Earlier, only the read_tier option
applied to the index block read; since PR https://github.com/facebook/rocksdb/issues/5298, fill_cache and
verify_checksums also have an effect. This patch restores the earlier
behavior to prevent surprise memory increases for clients due to the
index block not being cached.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5481
Test Plan: make check
Differential Revision: D15883082
Pulled By: ltamasi
fbshipit-source-id: 9a065ec3a6db5a365cf6dd5e95190a20c5756356
Summary:
The changes in 8272a6de57 were untested with `USE_HDFS=1`. There were a couple compiler errors. This PR fixes them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5444
Test Plan:
```
$ EXTRA_LDFLAGS="-L/tmp/hadoop-3.1.2/lib/native/" EXTRA_CXXFLAGS="-I/tmp/hadoop-3.1.2/include" USE_HDFS=1 make -j12 check
```
Differential Revision: D15885009
fbshipit-source-id: 2a0a63739e0b9a2819b461ad63ce1292c4833fe2
Summary:
While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows.
The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files **again** to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them.
Test plan (on dev server)
```
$make clean && make -j32 all
$./db_secondary_test
```
All existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472
Differential Revision: D15866771
Pulled By: riversand963
fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed
Summary:
recent commit 671d15cbdd introduced some test failures:
```
===== Running stats_history_test
[==========] Running 9 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 9 tests from StatsHistoryTest
[ RUN ] StatsHistoryTest.RunStatsDumpPeriodSec
monitoring/stats_history_test.cc:63: Failure
dbfull()->SetDBOptions({{"stats_dump_period_sec", "0"}})
Not implemented: Not supported in ROCKSDB LITE
db/db_options_test.cc:28:11: error: unused variable 'kMicrosInSec' [-Werror,-Wunused-const-variable]
const int kMicrosInSec = 1000000;
```
This PR fixes these failures
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5477
Differential Revision: D15871814
Pulled By: miasantreble
fbshipit-source-id: 0a7023914d2c1784d9d2d3f5bfb47310d4855394
Summary:
This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G".
When we replay the trace, we also perform lookups and inserts on the simulated caches.
In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in a output file.
This PR also adds a main source block_cache_trace_analyzer so that we can run the analyzer in command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449
Test Plan:
Added tests for block_cache_trace_analyzer.
COMPILE_WITH_ASAN=1 make check -j32.
Differential Revision: D15797073
Pulled By: HaoyuHuang
fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408
Summary:
`DBImplSecondary` calls `CheckConsistency()` during open. In the past, `DBImplSecondary` did not override this function thus `DBImpl::CheckConsistency()` is called.
The following can happen. The secondary instance is performing consistency check which calls `GetFileSize(file_path)` but the file at `file_path` is deleted by the primary instance. `DBImpl::CheckConsistency` does not account for this and fails the consistency check. This is undesirable. The solution is that, we call `DBImpl::CheckConsistency()` first. If it passes, then we are good. If not, we give it a second chance and handles the case of file(s) being deleted.
Test plan (on dev server):
```
$make clean && make -j20 all
$./db_secondary_test
```
All other existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5469
Differential Revision: D15861845
Pulled By: riversand963
fbshipit-source-id: 507d72392508caed3cd003bb2e2aa43f993dd597
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046
Differential Revision: D15863138
Pulled By: miasantreble
fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
Summary:
When run under TSAN it sometimes goes over 10m and times out. The slowest ones are `DBBloomFilterTestWithParam.BloomFilter` which we have 6 of them. Making the tests run in parallel should take care of the timeout issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5467
Differential Revision: D15856912
Pulled By: maysamyabandeh
fbshipit-source-id: 26c43c55312974c1b809c070342dee037d0219f4