Summary:
It'd be nice to use the error status type to distinguish
between user error and system error. For example, GetChildren can fail
listing a backup directory's contents either because a bad path was provided
(user error) or because an operation failed, e.g., a remote storage service
call failed (system error). In the former case, we want to continue and treat
the backup directory as empty; in the latter case, we want to immediately
propagate the error to the caller.
This diff uses NotFound to indicate user error and IOError to indicate
system error. Previously IOError indicated both.
Closes https://github.com/facebook/rocksdb/pull/1644
Differential Revision: D4312157
Pulled By: ajkr
fbshipit-source-id: 51b4f24
Summary:
The info log file ("LOG") is stored in the db directory by default. When the db is on a distributed env, this is unnecessarily slow. So, I added an option to db_bench to just print the info log messages to stderr.
Closes https://github.com/facebook/rocksdb/pull/1641
Differential Revision: D4309348
Pulled By: ajkr
fbshipit-source-id: 1e6f851
Summary:
When compiling with GCC>=7.0.0, "db/internal_stats.cc" fails to compile as the data being written to the buffer potentially exceeds its size.
This fix simply doubles the size of the buffer, thus accommodating the max possible data size.
Closes https://github.com/facebook/rocksdb/pull/1635
Differential Revision: D4302162
Pulled By: yiwu-arbug
fbshipit-source-id: c76ad59
Summary:
Remove "util/testharness.h" from list of includes for "db/db_filesnapshot.cc", as it wasn't being used and thus caused an extraneous dependency on gtest.
Closes https://github.com/facebook/rocksdb/pull/1634
Differential Revision: D4302146
Pulled By: yiwu-arbug
fbshipit-source-id: e900c0b
Summary:
It was doing `&range_del_iters[0]` on an empty vector. Even though the resulting pointer is never dereferenced, it's still bad for two reasons:
* the practical reason: it crashes with `std::out_of_range` exception in our debug build,
* the "C++ standard lawyer" reason: it's undefined behavior because, in `std::vector` implementation, it probably "dereferences" a null pointer, which is invalid even though it doesn't actually read the pointed memory, just converts a pointer into a reference (and then flush_job.cc converts it back to pointer); nullptr references are undefined behavior.
Closes https://github.com/facebook/rocksdb/pull/1612
Differential Revision: D4265625
Pulled By: al13n321
fbshipit-source-id: db26fb9
Summary:
When we Ingest an external file we open it to read some metadata and first/last key
during doing that we insert blocks into the block cache with global_seqno = 0
If we move the file (did not copy it) into the DB, we will use these blocks with the wrong seqno in the read path
Closes https://github.com/facebook/rocksdb/pull/1627
Differential Revision: D4293332
Pulled By: yiwu-arbug
fbshipit-source-id: 3ce5523
Summary:
The second variable "SHELL" simply tells make explicitly which shell to use, instead of allowing it to default to "/bin/sh", which may or may not be Bash.
However, simply defining the second variable by itself causes make to throw an error concerning a circular definition, as it would be attempting to use the "shell" command while simultaneously trying to set which shell to use. Thus, the first variable "BASH_EXISTS" is defined such that make already knows about "/path/to/bash" before trying to use it to set "SHELL".
A more technically correct solution would be to edit the makefile itself to make it compatible with non-bash shells (see the original Issue discussion for details). However, as it seems very few of the people working on this project were building with non-bash shells, I figured this solution would be good enough.
Closes https://github.com/facebook/rocksdb/pull/1631
Differential Revision: D4295689
Pulled By: yiwu-arbug
fbshipit-source-id: e4f9532
Summary:
made db_stress capable of adding range deletions to its db and verifying their correctness. i'll make db_crashtest.py use this option later once the collapsing optimization (https://github.com/facebook/rocksdb/pull/1614) is committed because currently it slows down the test too much.
Closes https://github.com/facebook/rocksdb/pull/1625
Differential Revision: D4293939
Pulled By: ajkr
fbshipit-source-id: d3beb3a
Summary:
IsTrivialMove returns true if no input file overlaps with output_level+1 with more than max_compaction_bytes_ bytes.
Closes https://github.com/facebook/rocksdb/pull/1619
Differential Revision: D4278338
Pulled By: yiwu-arbug
fbshipit-source-id: 994c001
Summary:
When enabled, this option replaces range tombstones with a sequence of
point tombstones covering the same range. This can be used to A/B test perf of
range tombstones vs sequential point tombstones, and help us find the cross-over
point, i.e., the size of the range above which range tombstones outperform point
tombstones.
Closes https://github.com/facebook/rocksdb/pull/1594
Differential Revision: D4246312
Pulled By: ajkr
fbshipit-source-id: 3b00b23
Summary:
This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (ie. it's possible that request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold an set a transaction ids, instead of just one, and adding a flag specifying whether this lock is currently held with exclusive access or not.
Some implementation notes:
- Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`.
- Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit.
- Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now.
Closes https://github.com/facebook/rocksdb/pull/1573
Differential Revision: D4239097
Pulled By: lth
fbshipit-source-id: da7c074
Summary:
I hit the land button too fast and didn't include the line.
Closes https://github.com/facebook/rocksdb/pull/1622
Differential Revision: D4281316
Pulled By: yiwu-arbug
fbshipit-source-id: c7b38e0
Summary:
These changes are included in the new branch-cut.
Closes https://github.com/facebook/rocksdb/pull/1621
Differential Revision: D4281015
Pulled By: yiwu-arbug
fbshipit-source-id: d88858b
Summary:
Embarassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure:
db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed
The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't.
Closes https://github.com/facebook/rocksdb/pull/1611
Differential Revision: D4265622
Pulled By: al13n321
fbshipit-source-id: 747b10f
Summary:
Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF.
This allow us to persist the CF id in the generated file
Closes https://github.com/facebook/rocksdb/pull/1615
Differential Revision: D4270422
Pulled By: IslamAbdelRahman
fbshipit-source-id: 7fb954e
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595
Differential Revision: D4246569
Pulled By: tonek
fbshipit-source-id: d23f560
Summary:
Reduce number of comparisons in heap by caching which child node in the first level is smallest (left_child or right_child)
So next time we can compare directly against the smallest child
I see that the total number of calls to comparator drops significantly when using this optimization
Before caching (~2mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq : 0.338 micros/op 2959201 ops/sec; 327.4 MB/s user_key_comparison_count = 2000008
```
After caching (~1mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq : 0.309 micros/op 3236801 ops/sec; 358.1 MB/s user_key_comparison_count = 1000011
```
It also improves
Closes https://github.com/facebook/rocksdb/pull/1600
Differential Revision: D4256027
Pulled By: IslamAbdelRahman
fbshipit-source-id: 76fcc66
Summary:
Now that we have userspace persisted cache, we don't need flashcache anymore.
Closes https://github.com/facebook/rocksdb/pull/1588
Differential Revision: D4245114
Pulled By: igorcanadi
fbshipit-source-id: e2c1c72
Summary:
Multi-write thread may update the status of the parallel_group in
WriteThread::CompleteParallelWorker if the status of Writer is not ok!
When copy write status to the paralle_group, the write thread just hold the
mutex of the the writer processed by itself. it is useless. The thread
should held the the leader of the parallel_group instead.
Closes https://github.com/facebook/rocksdb/pull/1598
Differential Revision: D4252335
Pulled By: siying
fbshipit-source-id: 3864cf7
Summary:
This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API.
Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :)
The diff is not as big as it may seem, about half of the lines are a test.
Closes https://github.com/facebook/rocksdb/pull/1599
Differential Revision: D4252092
Pulled By: al13n321
fbshipit-source-id: 41e1e48
Summary:
Add C API to set base_backgroud_compactions
Closes https://github.com/facebook/rocksdb/pull/1571
Differential Revision: D4245709
Pulled By: yiwu-arbug
fbshipit-source-id: 792c6b8
Summary:
99c052a34f fixes integer overflow in GetL0ThresholdSpeedupCompaction() by checking if int become -ve.
UBSAN will complain about that since this is still an overflow, we can fix the issue by simply using int64_t
Closes https://github.com/facebook/rocksdb/pull/1582
Differential Revision: D4241525
Pulled By: IslamAbdelRahman
fbshipit-source-id: b3ae21f
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow
These functions are
rocksdb:: Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest:: Key
Closes https://github.com/facebook/rocksdb/pull/1577
Differential Revision: D4240801
Pulled By: IslamAbdelRahman
fbshipit-source-id: 3e1caf6
Summary:
Having -ve value for max_write_buffer_number does not make sense and cause us to do a left shift on a -ve value number
Closes https://github.com/facebook/rocksdb/pull/1579
Differential Revision: D4240798
Pulled By: IslamAbdelRahman
fbshipit-source-id: bd6267e
Summary:
Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted.
Closes https://github.com/facebook/rocksdb/pull/1574
Differential Revision: D4241181
Pulled By: ajkr
fbshipit-source-id: c9aaa15
Summary:
- "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them
- "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them
- s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction
- Move the above class into a separate .h file to avoid circular dependency.
Closes https://github.com/facebook/rocksdb/pull/1520
Differential Revision: D4187179
Pulled By: ajkr
fbshipit-source-id: 10c2103
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:
```
rocksdb::MergingIterator::Next
rocksdb::DBIter::FindNextUserEntryInternal
rocksdb::DBIter::Seek
```
I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.
CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413
Differential Revision: D4083879
Pulled By: IslamAbdelRahman
fbshipit-source-id: a83ddae
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.
Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562
Differential Revision: D4218519
Pulled By: siying
fbshipit-source-id: 95e4088
Summary:
This PR is based on nbronson's diff with small
modifications to wire it up with existing interface. Comparing to
previous version, this approach works better for inserting keys in
decreasing order or updating the same key, and impose less restriction
to the prefix extractor.
---- Summary from original diff ----
This diff introduces a single InlineSkipList::Insert that unifies
the existing sequential insert optimization (prev_), concurrent insertion,
and insertion using externally-managed insertion point hints.
There's a deep symmetry between insertion hints (cursors) and the
concurrent algorithm. In both cases we have partial information from
the recent past that is likely but not certain to be accurate. This diff
introduces the struct InlineSkipList::Splice, which encodes predecessor
and successor information in the same form that was previously only used
within a single call to InsertConcurrently. Splice holds information
about an insertion point that can be used to levera
Closes https://github.com/facebook/rocksdb/pull/1561
Differential Revision: D4217283
Pulled By: yiwu-arbug
fbshipit-source-id: 33ee437
Summary:
The persistent cache is designed to hop over errors and return key not found. So far, it has shown resilience to write errors, encoding errors, data corruption etc. It is not resilient against disappearing files/directories. This was exposed during testing when multiple instances of persistence cache was started sharing the same directory simulating an unpredictable filesystem environment.
This patch
- makes the write code path more resilient to errors while creating files
- makes the read code path more resilient to handle situation where files are not found
- added a test that does negative write/read testing by removing the directory while writes are in progress
Closes https://github.com/facebook/rocksdb/pull/1472
Differential Revision: D4143413
Pulled By: kradhakrishnan
fbshipit-source-id: fd25e9b