Summary:
In the past, the default value of max_manifest_file_size is uint64_t::MAX,
allowing a long running RocksDB process to grow its MANIFEST file to take up
the entire disk, as reported in [issue 3851](https://github.com/facebook/rocksdb/issues/3851). It is reasonable and common to provide a default non-max value for this option. Therefore, I set the value to 1GB.
siying miasantreble Please let me know whether this looks good to you. Thanks!
Closes https://github.com/facebook/rocksdb/pull/3867
Differential Revision: D8051524
Pulled By: riversand963
fbshipit-source-id: 50251f0804b1fa933a19a30d19d261ea8b9d2b72
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
The patch clarifies the ownership of the root db after TransactionDB::Open. If it is a success the ownership if with the TransactionDB, and the root db will be deleted when the destructor of the base class, StackableDB, is called. If it is failure, the temporarily created root db will also be deleted properly.
The patch also includes lots of useful formatting changes.
Closes https://github.com/facebook/rocksdb/pull/3714 upon which this patch is built.
Closes https://github.com/facebook/rocksdb/pull/3806
Differential Revision: D7878010
Pulled By: maysamyabandeh
fbshipit-source-id: f54f3942e29434143ae5a2423ceec9c7072cd4c2
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.
This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829
Differential Revision: D7915443
Pulled By: ajkr
fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
Summary:
- Original commit: a4fb1f8c04
- Revert commit (we reverted as a quick fix to get crash tests passing): 6afe22db2e
This PR includes the contents of the original commit plus two bug fixes, which are:
- In whitebox crash test, only set `--expected_values_path` for `db_stress` runs in the first half of the crash test's duration. In the second half, a fresh DB is created for each `db_stress` run, so we cannot maintain expected state across `db_stress` runs.
- Made `Exists()` return true for `UNKNOWN_SENTINEL` values. I previously had an assert in `Exists()` that value was not `UNKNOWN_SENTINEL`. But it is possible for post-crash-recovery expected values to be `UNKNOWN_SENTINEL` (i.e., if the crash happens in the middle of an update), in which case this assertion would be tripped. The effect of returning true in this case is there may be cases where a `SingleDelete` deletes no data. But if we had returned false, the effect would be calling `SingleDelete` on a key with multiple older versions, which is not supported.
Closes https://github.com/facebook/rocksdb/pull/3793
Differential Revision: D7811671
Pulled By: ajkr
fbshipit-source-id: 67e0295bfb1695ff9674837f2e05bb29c50efc30
Summary:
A minor change: I wrapped TransactionLogIterator for the C API.
I needed that for the golang binding.
Closes https://github.com/facebook/rocksdb/pull/3304
Differential Revision: D6628736
Pulled By: miasantreble
fbshipit-source-id: 3374f3c64b1d7b225696b8767090917761e2f30a
Summary:
crash-recovery verification is failing in the whitebox testing, which may or may not be a valid correctness issue -- need more time to investigate. In the meantime, reverting so we don't mask other failures.
Closes https://github.com/facebook/rocksdb/pull/3786
Differential Revision: D7794516
Pulled By: ajkr
fbshipit-source-id: 28ccdfdb9ec9b3b0fb08c15cbf9d2e282201ff33
Summary:
Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
I add a `max_subcompactions` options to `CompactionOptions` so that we can choose a proper concurrency dynamically.
Closes https://github.com/facebook/rocksdb/pull/3775
Differential Revision: D7792357
Pulled By: ajkr
fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c
Summary:
This change adds a virtual `Truncate` method to `Env`, which truncates
the named file to the specified size. At the moment, this is only
supported for `MockEnv`, but other `Env's` could be extended to override
the method too. This is the same approach that methods like `LinkFile` and
`AreSameFile` have taken.
This is useful for any user of the in-memory `Env`. The implementation's
header is not exported, so before this change, it was impossible to
access it's already existing `Truncate` method.
Closes https://github.com/facebook/rocksdb/pull/3779
Differential Revision: D7785789
Pulled By: ajkr
fbshipit-source-id: 3bcdaeea7b7180529f7d9b496dc67b791a00bbf0
Summary:
In include/rocksdb/db.h, enum EntryType is redeclared even though
original declaration in types.h in included.
Closes https://github.com/facebook/rocksdb/pull/3766
Differential Revision: D7765504
Pulled By: anand1976
fbshipit-source-id: 622a8ecb306993915be1b9dd5cdd79dbc6a4ea05
Summary:
Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values.
In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`.
For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class.
`db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup.
Closes https://github.com/facebook/rocksdb/pull/3629
Differential Revision: D7463144
Pulled By: ajkr
fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b
Summary:
Background activities like compaction can negatively affect
latency of higher-priority tasks like request processing. To avoid this,
rocksdb already lowers the IO priority of background threads on Linux
systems. While this takes care of typical IO-bound systems, it does not
help much when CPU (temporarily) becomes the bottleneck. This is
especially likely when using more expensive compression settings.
This patch adds an API to allow for lowering the CPU priority of
background threads, modeled on the IO priority API. Benchmarks (see
below) show significant latency and throughput improvements when CPU
bound. As a result, workloads with some CPU usage bursts should benefit
from lower latencies at a given utilization, or should be able to push
utilization higher at a given request latency target.
A useful side effect is that compaction CPU usage is now easily visible
in common tools, allowing for an easier estimation of the contribution
of compaction vs. request processing threads.
As with IO priority, the implementation is limited to Linux, degrading
to a no-op on other systems.
Closes https://github.com/facebook/rocksdb/pull/3763
Differential Revision: D7740096
Pulled By: gwicke
fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.
These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602
Differential Revision: D7251562
Pulled By: al13n321
fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d
Summary:
1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
number of keys successfully read
2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
DBImpl::GetImpl(), but not MultiGet
Closes https://github.com/facebook/rocksdb/pull/3730
Differential Revision: D7677364
Pulled By: anand1976
fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5
Summary:
Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set to ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of partition index iterator.
Also it clarifies the contract of fill_cache that i) it does not apply to filters, ii) it still charges block cache for the size of the data block, it still pin the block if it is already in the block cache.
Closes https://github.com/facebook/rocksdb/pull/3739
Differential Revision: D7678308
Pulled By: maysamyabandeh
fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da
Summary:
Previously threads were named "rocksdb:bg\<index in thread pool\>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to bottom-level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority.
Closes https://github.com/facebook/rocksdb/pull/3702
Differential Revision: D7581415
Pulled By: ajkr
fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1
Summary:
When there are many range deletions in a range, we want to trigger manual compaction on this range to reclaim disk space as soon as possible and speed up read.
After this change, we can collect informations of range deletions and store them into user properties which can guide our manual compaction.
Closes https://github.com/facebook/rocksdb/pull/3695
Differential Revision: D7570322
Pulled By: ajkr
fbshipit-source-id: c358fa43b0aac6cc954d2eadc7d3bd8015373369
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662
Differential Revision: D7426121
Pulled By: Dayvedde
fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
Summary:
This is a hack as temporary fix of MyRocks with rollbacking the merge operands. The way MyRocks uses merge operands is without protection of locks, which violates the assumption behind the rollback algorithm. They are ok with not being rolled back as it would just create a gap in the autoincrement column. The hack add an option to disable the rollback of merge operands by default and only enables it to let the unit test pass.
Closes https://github.com/facebook/rocksdb/pull/3711
Differential Revision: D7597177
Pulled By: maysamyabandeh
fbshipit-source-id: 544be0f666c7e7abb7f651ec8b23124e05056728
Summary:
Add `compaction_reason` as part of event log for event `compaction started`.
Add counters for each `CompactionReason`.
Closes https://github.com/facebook/rocksdb/pull/3679
Differential Revision: D7550348
Pulled By: riversand963
fbshipit-source-id: a19cff3a678c785aa5ef41aac78b9a5968fcc34d
Summary:
1. Remove redundant text.
2. Make terminology consistent across all comments and doc of RocksDB. Also do
our best to conform to conventions. Specifically, use 'callback' instead of
'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)).
Closes https://github.com/facebook/rocksdb/pull/3693
Differential Revision: D7560396
Pulled By: riversand963
fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8
Summary:
Adding some stats that would be helpful to monitor if the DB has gone to unlikely stats that would hurt the performance. These are mostly when we end up needing to acquire a mutex.
Closes https://github.com/facebook/rocksdb/pull/3683
Differential Revision: D7529393
Pulled By: maysamyabandeh
fbshipit-source-id: f7d36279a8f39bd84d8ddbf64b5c97f670c5d6d9
Summary:
Currently dump_wal cannot print the prepared records from the WAL that is generated by WRITE_PREPARED write policy since the default reaction of the handler is to return NotSupported if markers of WRITE_PREPARED are encountered. This patch enables the admin to pass --write_committed=false option, which will be accordingly passed to the handler. Note that DBFileDumperCommand and DBDumperCommand are still not updated by this patch but firstly they are not urgent and secondly we need to revise this approach later when we also add WRITE_UNPREPARED markers so I leave it for future work.
Tested by running it on a WAL generated by WRITE_PREPARED:
$ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log | grep BEGIN_PREARE | head -1
1,2,70,0,BEGIN_PREARE
$ ./ldb dump_wal --walfile=/dev/shm/dbbench/000003.log --write_committed=false | grep BEGIN_PREARE | head -1
1,2,70,0,BEGIN_PREARE PUT(0) : 0x30303031313330313938 PUT(0) : 0x30303032353732313935 END_PREPARE(0x74786E31313535383434323738303738363938313335312D30)
Closes https://github.com/facebook/rocksdb/pull/3682
Differential Revision: D7522090
Pulled By: maysamyabandeh
fbshipit-source-id: a0332207261c61e18b2f9dfbe9feecd9a1339aca
Summary:
When `max_valid_backups_to_open` is set, the `BackupEngine` doesn't know about the files referenced by existing backups. This PR prevents us from deleting valid files when that option is set, in cases where we are unable to accurately determine refcount. There are warnings logged when we may miss deleting unreferenced files, and a recommendation in the header for users to periodically unset this option and run a full `GarbageCollect`.
Closes https://github.com/facebook/rocksdb/pull/3518
Differential Revision: D7008331
Pulled By: ajkr
fbshipit-source-id: 87907f964dc9716e229d08636a895d2fc7b72305
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102
Differential Revision: D6951697
Pulled By: ajkr
fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
Summary:
Adds two stats to allow us measuring the false positive rate of full filters:
- The total count of positives: rocksdb.bloom.filter.full.positive
- The total count of true positives: rocksdb.bloom.filter.full.true.positive
Not the term "full" in the stat name to indicate that they are meaningful in full filters. block-based filters are to be deprecated soon and supporting it is not worth the the additional cost of if-then-else branches.
Closes#3680
Tested by:
$ ./db_bench -benchmarks=fillrandom -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10
$ ./db_bench -benchmarks="readwhilewriting" -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log
$ grep filter.full /tmp/full.log
rocksdb.bloom.filter.full.positive COUNT : 3628593
rocksdb.bloom.filter.full.true.positive COUNT : 3536026
which gives the false positive rate of 2.5%
Closes https://github.com/facebook/rocksdb/pull/3681
Differential Revision: D7517570
Pulled By: maysamyabandeh
fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052
Summary:
This change models Optimistic Tx db after Pessimistic TX db. The motivation for this change is to make the ptr polymorphic so it can be held by the same raw or smart ptr.
Currently, due to the inheritance of the Opt Tx db not being rooted in the manner of Pess Tx from a single DB root it is more difficult to write clean code and have clear ownership of the database in cases when options dictate instantiate of plan DB, Pess Tx DB or Opt tx db.
Closes https://github.com/facebook/rocksdb/pull/3566
Differential Revision: D7184502
Pulled By: yiwu-arbug
fbshipit-source-id: 31d06efafd79497bb0c230e971857dba3bd962c3
Summary:
Level Compaction with TTL.
As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space.
Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data.
This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space.
This functionality can be controlled by the newly introduced column family option -- ttl.
TODO for later:
- Make ttl mutable
- Extend TTL to Universal compaction as well? (TTL is already supported in FIFO)
- Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option.
Closes https://github.com/facebook/rocksdb/pull/3591
Differential Revision: D7275442
Pulled By: sagar0
fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267
Summary:
Updated comments around AllowSingleOperand.
Reason: A couple of users were confused and encountered issues due to no overriding PartialMerge with AllowSingleOperand=true.
I'll also look into modifying the default merge operator implementation so that overriding PartialMerge is not mandatory when AllowSingleOp=true.
Closes https://github.com/facebook/rocksdb/pull/3659
Differential Revision: D7422691
Pulled By: sagar0
fbshipit-source-id: 3d075a6ced0120f5d65cb7ae5412936f1862f342
Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502
Differential Revision: D7400897
Pulled By: anand1976
fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
Summary:
Summary
========
`InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that.
Rough performance comparison
=====
Big keys, no compression.
```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom : 4.222 micros/op 236836 ops/sec; 80.4 MB/s
```
```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom : 4.064 micros/op 246059 ops/sec; 83.5 MB/s
```
TODO
======
In ~~a separated~~ this PR:
- [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely.
- [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key.
Closes https://github.com/facebook/rocksdb/pull/3516
Differential Revision: D7059300
Pulled By: ajkr
fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb
Summary:
Add `bytes_max_delete_chunk` in SstFileManager so that we can drop a large file in multiple batches.
Closes https://github.com/facebook/rocksdb/pull/3640
Differential Revision: D7358679
Pulled By: siying
fbshipit-source-id: ef17f0da2f5723dbece2669485a9b91b3edc0bb7
Summary:
This pull request exposes the interface of PerfContext as C API
Closes https://github.com/facebook/rocksdb/pull/3607
Differential Revision: D7294225
Pulled By: ajkr
fbshipit-source-id: eddcfbc13538f379950b2c8b299486695ffb5e2c
Summary:
This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction.
Closes https://github.com/facebook/rocksdb/pull/3608
Differential Revision: D7275789
Pulled By: siying
fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba
Summary:
We missed updating version.h on master when cutting 5.11.fb and 5.12.fb branches. It should be the same as the version in the latest release branch (or should it be one more?).
I noticed this when trying to run some upgrade/downgrade tests from 5.11 to some new code on master.
Closes https://github.com/facebook/rocksdb/pull/3611
Differential Revision: D7282917
Pulled By: sagar0
fbshipit-source-id: 205ee75b77c5b6bbcea95a272760b427025a4aba
Summary:
Added a stat that counts the number of cancelled compactions.
Closes https://github.com/facebook/rocksdb/pull/3574
Differential Revision: D7190259
Pulled By: amytai
fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3
Summary:
... so that people are not confused.
Closes https://github.com/facebook/rocksdb/pull/3580
Differential Revision: D7187175
Pulled By: sagar0
fbshipit-source-id: bce70093d52e38cd24c9432fd708885d7c2c013e
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.
Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449
Differential Revision: D7016941
Pulled By: amytai
fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
Summary:
This patch addressed several issues.
Portability including db_test std::thread -> port::Thread Cc: @
and %z to ROCKSDB portable macro. Cc: maysamyabandeh
Implement Env::AreFilesSame
Make the implementation of file unique number more robust
Get rid of C-runtime and go directly to Windows API when dealing
with file primitives.
Implement GetSectorSize() and aling unbuffered read on the value if
available.
Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976
Fix test running script issue where $status var was of incorrect scope
so the failures were swallowed and not reported.
DestroyDB() creates a logger and opens a LOG file in the directory
being cleaned up. This holds a lock on the folder and the cleanup is
prevented. This fails one of the checkpoin tests. We observe the same in production.
We close the log file in this change.
Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
attempts to open a directory with NewRandomAccessFile which does not
work on Windows.
Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552
Differential Revision: D7156304
Pulled By: siying
fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556
Differential Revision: D7139328
Pulled By: yiwu-arbug
fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction,
once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing.
- If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number.
- SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected.
Closes https://github.com/facebook/rocksdb/pull/3562
Differential Revision: D7149440
Pulled By: maysamyabandeh
fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.
CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551
Differential Revision: D7130190
Pulled By: yiwu-arbug
fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548
Differential Revision: D7116939
Pulled By: yiwu-arbug
fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.
- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520
Differential Revision: D7018897
Pulled By: ajkr
fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528
Reviewed By: gfosco
Differential Revision: D7049303
Pulled By: anand1976
fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
Summary:
Make use of the index in WriteBatchWithIndex to also count the number of sub-batches. This eliminates the need to separately scan the batch to count the number of sub-batches once a duplicate key is detected.
Closes https://github.com/facebook/rocksdb/pull/3529
Differential Revision: D7049947
Pulled By: maysamyabandeh
fbshipit-source-id: 81cbf12c4e662541c772c7265a8f91631e25c7cd
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes https://github.com/facebook/rocksdb/pull/3525
Differential Revision: D7033694
Pulled By: sagar0
fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
Summary:
The MemTableRep API was broken by this commit: 813719e952
This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
Closes https://github.com/facebook/rocksdb/pull/3513
Differential Revision: D7004134
Pulled By: maysamyabandeh
fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
Summary:
TransactionDB::Write can receive some optimization hints from the user. One is to skip the concurrency control mechanism. WritePreparedTxnDB is currently ignoring such hints. This patch optimizes WritePreparedTxnDB::Write for skip_concurrency_control and skip_duplicate_key_check hints.
Closes https://github.com/facebook/rocksdb/pull/3496
Differential Revision: D6971784
Pulled By: maysamyabandeh
fbshipit-source-id: cbab10ad538fa2b8bcb47e37c77724afe6e30f03
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381
Differential Revision: D6754983
Pulled By: ajkr
fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
Summary:
It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it, logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401
Differential Revision: D6788142
Pulled By: miasantreble
fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
Summary:
Compared to DB::Write, TransactionDB::Write has the additional overhead of creating and initializing an internal transaction object, as well as the overhead of locking/unlocking the keys. This patch extends the TransactionDB::Write with an skip_cc option to allow the users to indicate that the write batch do not conflict with others and the concurrency control and its overhead can be skipped. TransactionDB::Write by default calls DB::Write when skip_cc is set, which works for WriteCommitted WritePolicy. Any other flavor of TransactionDB that is not compatible with this default behavior (such as WritePreparedTxnDB) can extend ::Write and implement their own approach for taking into account the skip_cc optimization.
Closes https://github.com/facebook/rocksdb/pull/3457
Differential Revision: D6877318
Pulled By: maysamyabandeh
fbshipit-source-id: 56f4e21db87ff71492db4e376fb7c2b03dfeab6b
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455
Differential Revision: D6873699
Pulled By: maysamyabandeh
fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404
Differential Revision: D6879954
Pulled By: anand1976
fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
Summary:
Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions.
Closes https://github.com/facebook/rocksdb/pull/3418
Differential Revision: D6822412
Pulled By: maysamyabandeh
fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.
This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.
Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431
Differential Revision: D6849228
Pulled By: ajkr
fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
Summary:
This replaces a vague warning about the mostly-obsolete ext3 filesystem with
a more detailed note about a historical bug in the still-relevant ext4.
Fixes#3410
Closes https://github.com/facebook/rocksdb/pull/3421
Differential Revision: D6834881
Pulled By: siying
fbshipit-source-id: 7771ef5c89a54c0ac17821680779c48178d0b400
Summary:
Allow StackableDB optionally takes a shared_ptr on construction and thus hold shared ownership of the underlying DB.
Closes https://github.com/facebook/rocksdb/pull/3423
Differential Revision: D6824163
Pulled By: yiwu-arbug
fbshipit-source-id: dbdc30c42e007533a987ef413785e192340f03eb
Summary:
The default changed in 6a14f7a976 but this comment was not updated.
Closes https://github.com/facebook/rocksdb/pull/3409
Differential Revision: D6808264
Pulled By: maysamyabandeh
fbshipit-source-id: 0d7e2a054eb181e9a144fcb783cf0b2c77219bc0
Summary:
To prevent user who implement the Iterator interface fail to implement SeekForPrev by mistake.
Closes https://github.com/facebook/rocksdb/pull/3402
Differential Revision: D6790681
Pulled By: yiwu-arbug
fbshipit-source-id: bd75b8ced30208982e0a1414d34384d93496827a
Summary:
We have seen cases where it could be good to change TTL on already open DB.
Change ttl in TtlCompactionFilterFactory on open db.
Next time a filter is created, it will filter accroding to the set TTL.
Is this something that could be useful for others?
Any downsides?
Closes https://github.com/facebook/rocksdb/pull/3292
Differential Revision: D6731993
Pulled By: miasantreble
fbshipit-source-id: 73b94d69237b11e8730734389052429d621a6b1e
Summary:
Currently, the only way to close an open DB is to destroy the DB
object. There is no way for the caller to know the status. In one
instance, the destructor encountered an error due to failure to
close a log file on HDFS. In order to prevent silent failures, we add
DB::Close() that calls CloseImpl() which must be implemented by its
descendants.
The main failure point in the destructor is closing the log file. This
patch also adds a Close() entry point to Logger in order to get status.
When DBOptions::info_log is allocated and owned by the DBImpl, it is
explicitly closed by DBImpl::CloseImpl().
Closes https://github.com/facebook/rocksdb/pull/3348
Differential Revision: D6698158
Pulled By: anand1976
fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d
Summary:
to save a string copy for some use cases.
The change is pretty straightforward, please feel free to let me know if you want to suggest any tests for it.
Closes https://github.com/facebook/rocksdb/pull/3349
Differential Revision: D6706828
Pulled By: yiwu-arbug
fbshipit-source-id: 873ce4442937bdc030b395c7f99228eda7f59eb7
Summary:
Add a new histogram stat called rocksdb.db.flush.micros for memtable
flush
Closes https://github.com/facebook/rocksdb/pull/3269
Differential Revision: D6559496
Pulled By: anand1976
fbshipit-source-id: f5c771ba2568630458751795e8c37a493ff9b14d
Summary:
added `ThreadType::BOTTOM_PRIORITY` which is used in the `ThreadStatus` object to indicate the thread is used for bottom-pri compactions. Previously there was a bug where we mislabeled such threads as `ThreadType::LOW_PRIORITY`.
Closes https://github.com/facebook/rocksdb/pull/3270
Differential Revision: D6559428
Pulled By: ajkr
fbshipit-source-id: 96b1a50a9c19492b1a5fd1b77cf7061a6f9f1d1c
Summary:
Fix for #2833.
- In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
- In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
- In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
Closes https://github.com/facebook/rocksdb/pull/2843
Differential Revision: D5772387
Pulled By: ajkr
fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
Summary:
Adding a list of blob db counters.
Also remove WaStats() which doesn't expose the stats and can be substitute by (BLOB_DB_BYTES_WRITTEN / BLOB_DB_BLOB_FILE_BYTES_WRITTEN).
Closes https://github.com/facebook/rocksdb/pull/3193
Differential Revision: D6394216
Pulled By: yiwu-arbug
fbshipit-source-id: 017508c8ff3fcd7ea7403c64d0f9834b24816803
Summary:
Problem: Option string accepts only cache_size as parameter for block_cache which is specified as "block_cache=1M".
It doesn't accept other parameters like num_shards etc.
Changes :
1) ParseBlockBasedTableOption in block_based_table_factory is edited to accept cache options in the format "block_cache=<cache_size>:<num_shard_bits>:<strict_capacity_limit>:<high_pri_pool_ratio>".
Options other than cache_size are optional to maintain backward compatibility. The changes are valid for block_cache_compressed as well.
For example, "block_cache=1M:6:true:0.5", "block_cache=1M:6:true", "block_cache=1M:6" and "block_cache=1M" are all valid option strings.
2) Corresponding unit tests are added.
Closes https://github.com/facebook/rocksdb/pull/3108
Differential Revision: D6420997
Pulled By: sagar0
fbshipit-source-id: cdea8b785688d2802907974af27225ccc1c0cd43
Summary:
Expose read and write options via the C API
Closes https://github.com/facebook/rocksdb/pull/3185
Differential Revision: D6389658
Pulled By: sagar0
fbshipit-source-id: 1848912750329a476805b3cb2f315e7b71f61472
Summary:
This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the
number of internal keys skipped during iteration. Keys can be skipped
due to deletes, or lower sequence number, or higher sequence number
than the one requested.
Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary,
padding becomes a zero size array, which the Windows compiler doesn't
like. So add a cacheline worth of padding in that case to keep it happy.
We cannot conditionally add padding as gcc doesn't allow using sizeof
in preprocessor directives.
Closes https://github.com/facebook/rocksdb/pull/3177
Differential Revision: D6353897
Pulled By: anand1976
fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2
Summary:
Allow users to configure the trash-to-DB size ratio limit, so
that ratelimits for deletes can be enforced even when larger portions of
the database are being deleted.
Closes https://github.com/facebook/rocksdb/pull/3158
Differential Revision: D6304897
Pulled By: gdavidsson
fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157
Differential Revision: D6304291
Pulled By: maysamyabandeh
fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
Summary:
The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky.
Closes https://github.com/facebook/rocksdb/pull/3164
Differential Revision: D6319201
Pulled By: yiwu-arbug
fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48