Summary: IO Posix depends on too many .h files. Move most of them to .cc files.
Test Plan: make all
Reviewers: anthony, rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49479
Summary: include/posix/io_posix.h doesn't not prevent multiple includes. Need to fix it. It is also breaking unity build.
Test Plan: Run unity build and see error go away.
Reviewers: rven, igor, IslamAbdelRahman, kradhakrishnan, anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49281
Summary: This patch splits the posix storage backend into Env and
the actual *File implementations. The motivation is to allow other Envs
to use posix as a library. This enables a storage backend different from
posix to split its secondary storage between a normal file system
partition managed by posix, and it own media.
Test Plan: No new functionality is added to posix Env or the library,
thus the current tests should suffice.
Summary: Need to pass through the memtable parameter.
Test Plan: built, tested through myrocks
Reviewers: igor, sdong, rven
Reviewed By: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D49167
Summary: Bump the version to 4.2 ( the unreleased version ), so that when fbcode_unittests run it can differentiate between old and new APIs
Test Plan: make check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D49041
Summary: Remove warning about an issue that was resolved. Turns out the issue was a false-alarm.
Test Plan: n/a
Reviewers: igor, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49011
Summary: As above.
Test Plan: USE_CLANG=1 make check -j
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48981
Add an environment method to reuse an existing file. Provide a generic
implementation that does a simple rename + open (writeable), and also a
posix variant that is more careful about error handling (if we fail to
open, do not rename, etc.).
Signed-off-by: Sage Weil <sage@redhat.com>
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.
Test Plan: ran the GetPropertiesOfTablesInRange
Reviewers: rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48651
Summary:
Add rocksdb.num-running-compactions and rocksdb.num-running-flushes
to GetIntProperty() that reports the number of currently running
compactions / flushes.
Test Plan: augmented existing tests in db_test
Reviewers: igor, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48693
Summary:
MyRocks needs the ability to clear a snapshot for Read Committed support
Test Plan: transaction_test
Reviewers: anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48861
Summary: DefaultCompactionFilterFactory is not used anymore after recent changes. Remove it.
Test Plan: Just run existing tests.
Reviewers: rven, igor, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48699
Summary: Include the time unit in the comment of perf_context timers
Test Plan: make perf_context_test
Reviewers: igor, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48663
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
Introduce TableOptions section and support BlockBasedTable in RocksDB
options file. A TableOptions section has the following format:
[TableOptions/<FactoryClassName> "<ColumnFamily Name>"]
which includes information about its TableFactory class and belonging
column family. Below is an example TableOptions section of a
BlockBasedTableOptions that belongs to the default column family:
[TableOptions/BlockBasedTable "default"]
format_version=0
whole_key_filtering=true
block_size_deviation=10
block_size=4096
block_restart_interval=16
filter_policy=nullptr
no_block_cache=false
checksum=kCRC32c
cache_index_and_filter_blocks=false
index_type=kBinarySearch
hash_index_allow_collision=true
flush_block_policy_factory=FlushBlockBySizePolicyFactory
Currently, Cache-type options (i.e., block_cache and block_cache_compressed)
are not supported.
Test Plan: options_test
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48435
Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online.
Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats.
Reviewers: igor, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48267
Summary: Support for Transaction::CreateSnapshotOnNextOperation(). This is to fix a write-conflict race-condition that Yoshinori was running into when testing MyRocks with LinkBench.
Test Plan: New tests
Reviewers: yhchiang, spetrunia, rven, igor, yoshinorim, sdong
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48099
Summary:
MyRocks reported some perfomance issues when inserting many keys into a transaction due to the cost of inserting new keys into WriteBatchWithIndex. Frequently, they don't even need the keys to be indexed as they don't need to read them back. DisableIndexing() can be used to avoid the cost of indexing.
I also plan on eventually investigating if we can improve WriteBatchWithIndex performance. But even if we improved the perf here, it is still beneficial to be able to disable the indexing all together for large transactions.
Test Plan: unit test
Reviewers: igor, rven, yoshinorim, spetrunia, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48471
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.
Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48411
Summary: Add the column family ID to compaction filter context, so it is easier for compaction filter to apply different logic for different column families.
Test Plan: Add a unit test to verify the column family ID passed is correct.
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48357
Summary: Change RocksDB version to 4.1
Test Plan: no code change.
Reviewers: sdong, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48387
Summary:
Two changes:
1. remove *V2 filter stuff. we deprecated that a while ago
2. clarify what happens when user sets max_subcompactions to bigger than 1
Test Plan: none
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47871
Summary: Add 2 new counters BLOCK_CACHE_BYTES_WRITE, BLOCK_CACHE_BYTES_READ to keep track of how many bytes were written to the cache and how many bytes that we read from cache
Test Plan: make check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48195
Summary:
hit and miss bloom filter stats for memtable and SST
stats added to perf_context struct
key matches and prefix matches combined into one stat
Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47859
Summary:
Added boolean variable to guard fallocate() calls.
Set to false to prevent space leaks when tests fail.
Test Plan:
Compliles
Set to false and ran log device tests
Reviewers: sdong, lovro, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48027
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.
The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)
Do we also need a compaction filter to get called on merge results?
Test Plan: make && make check
Reviewers: lovro, tnovak, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: noetzli, kolmike, leveldb, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D47847
Summary:
Refactor dump_tool and undump_tool so that it's possible to use them with customized options
for example setting a specific comparator similar to what Dragon is doing with the LdbTool
https://phabricator.fb.com/diffusion/FBCODE/browse/master/dragon/tools/Ldb.cpp
Test Plan:
compiles
used it to dump / undump a dragon shard
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, adsharma
Differential Revision: https://reviews.facebook.net/D47853
Summary:
To support a new MongoDB capability, we need to make sure that we don't do any IO for a short period of time. For background, see:
* https://jira.mongodb.org/browse/SERVER-20704
* https://jira.mongodb.org/browse/SERVER-18899
To implement that, I add a new API calls PauseBackgroundWork() and ContinueBackgroundWork() which reuse the capability we already have in place for RefitLevel() function.
Test Plan: Added a new test in db_test. Made sure that test fails when PauseBackgroundWork() is commented out.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47901
Summary:
WriteBatchWithIndex::GetFromBatchAndDB only works correctly for overwrite_key=false. Transactions use overwrite_key=true (since WriteBatchWithIndex::GetIteratorWithBase only works when overwrite_key=true). So currently, Transactions could return incorrectly merged results when calling Get/GetForUpdate().
Until a permanent fix can be put in place, Transaction::Get[ForUpdate] and WriteBatchWithIndex::GetFromBatch[AndDB] will now return MergeInProgress if the most recent write to a key in the batch is a Merge.
Test Plan: more tests
Reviewers: sdong, yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47817
Summary:
This patch defines the format of RocksDB options file, which
follows the INI file format, and implements functions for its
serialization and deserialization. An example RocksDB options
file can be found in examples/rocksdb_option_file_example.ini.
A typical RocksDB options file has three sections, which are
Version, DBOptions, and more than one CFOptions. The RocksDB
options file in general follows the basic INI file format
with the following extensions / modifications:
* Escaped characters
We escaped the following characters:
- \n -- line feed - new line
- \r -- carriage return
- \\ -- backslash \
- \: -- colon symbol :
- \# -- hash tag #
* Comments
We support # style comments. Comments can appear at the ending
part of a line.
* Statements
A statement is of the form option_name = value.
Each statement contains a '=', where extra white-spaces
are supported. However, we don't support multi-lined statement.
Furthermore, each line can only contain at most one statement.
* Section
Sections are of the form [SecitonTitle "SectionArgument"],
where section argument is optional.
* List
We use colon-separated string to represent a list.
For instance, n1:n2:n3:n4 is a list containing four values.
Below is an example of a RocksDB options file:
[Version]
rocksdb_version=4.0.0
options_file_version=1.0
[DBOptions]
max_open_files=12345
max_background_flushes=301
[CFOptions "default"]
[CFOptions "the second column family"]
[CFOptions "the third column family"]
Test Plan: Added many tests in options_test.cc
Reviewers: igor, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46059
Summary: Transactional SingleDelete is needed for MyRocks. Note: This diff requires D47529.
Test Plan: Added some new tests in this diff as well as more tests added in D47529
Reviewers: rven, sdong, igor, yhchiang
Reviewed By: yhchiang
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47535
Summary: Fixed some bugs in using SingleDelete on a WriteBatchWithIndex and added some tests.
Test Plan: new tests
Reviewers: sdong, yhchiang, rven, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47529
Summary:
This is an initial version of bulk load feature
This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are
(1) Memtables are empty
(2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0
(3) Added sst files values are not overlapping
Test Plan: unit testing
Reviewers: igor, ott, sdong
Reviewed By: sdong
Subscribers: leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D39081
Summary:
Add options.compaction_pri, which specifies the policy about which file to compact first.
kCompactionPriByLargestSeq will compact oldest files first.
Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable.
Test Plan: Will write unit tests
Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45951
Summary:
This patch fixes#7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
Summary:
In case of huge db backup infromation about progress of downloading would help.
New callback parameter in CreateNewBackup() function will trigger whenever a some amount of data downloaded.
Task: 8057631
Test Plan:
ProgressCallbackDuringBackup test that cover new functionality added to BackupableDBTest tests.
other test succeed as well.
Reviewers: Guenena, benj, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46575
Summary:
RocksDB options can be dumped to the log file, and
up to this point the max_subcompactions option was not included
in this dump. This fixes that.
Test Plan: makek all && make check
Reviewers: MarkCallaghan, igor, noetzli, anthony, yhchiang, sdong
Reviewed By: yhchiang, sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46971
Summary:
There are some use cases in MyRocks to compare two slices
and to return the first byte where they differ. It may be
useful to add it as a RocksDB Slice function.
Test Plan: db_test
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D46935
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.
Test Plan: Add unit test DBTest.HardLimit
Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45999
Summary:
Although there are currently counters to keep track of the
stall caused by having too many L0 files, there is no distinction as
to whether when that stall occurs either (A) L0-L1 compaction is taking
place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
at the moment. This diff adds a counter for (A) so that the nature of L0
stalls can be better understood.
Test Plan: make all && make check
Reviewers: sdong, igor, anthony, noetzli, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D46749
Summary:
There are currently no statistics on seeks, only on gets. This adds the following counters:
rocksdb.number.db.seek
rocksdb.number.db.next
rocksdb.number.db.prev
(number of calls)
rocksdb.db.iterate.bytes.read
(number of bytes read from key + value using seek/next/prev)
rocksdb.number.keys.seek.found
rocksdb.number.keys.next.found
rocksdb.number.keys.prev.found
(number of calls where seek/next/prev found a value)
Test Plan:
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5 -reverse_iterator
Reviewers: yhchiang, rven, kradhakrishnan, IslamAbdelRahman, MarkCallaghan, sdong, igor
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46605
Summary:
Refactoring NewTableReader to accept TableReaderOptions
This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071
Test Plan: run existing tests
Reviewers: igor, yhchiang, anthony, rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46179
Summary. A change https://reviews.facebook.net/differential/diff/224721/
Has attempted to move common functionality out of platform dependent
code to a new facility called file_reader_writer.
This includes:
- perf counters
- Buffering
- RateLimiting
However, the change did not attempt to refactor Windows code.
To mitigate, we introduce new quering interfaces such as UseOSBuffer(),
GetRequiredBufferAlignment() and ReaderWriterForward()
for pure forwarding where required.
Introduce WritableFile got a new method Truncate(). This is to communicate
to the file as to how much data it has on close.
- When space is pre-allocated on Linux it is filled with zeros implicitly,
no such thing exist on Windows so we must truncate file on close.
- When operating in unbuffered mode the last page is filled with zeros but we still want to truncate.
Previously, Close() would take care of it but now buffer management is shifted to the wrappers and the file has
no idea about the file true size.
This means that Close() on the wrapper level must always include
Truncate() as well as wrapper __dtor should call Close() and
against double Close().
Move buffered/unbuffered write logic to the wrapper.
Utilize Aligned buffer class.
Adjust tests and implement Truncate() where necessary.
Come up with reasonable defaults for new virtual interfaces.
Forward calls for RandomAccessReadAhead class to avoid double
buffering and locking (double locking in unbuffered mode on WIndows).
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.
The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.
These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.
Test Plan: make all && make check
Reviewers: yhchiang, anthony, igor, noetzli, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43269
Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566
Test Plan:
make check
I will also verify that max_max_open_files is reasonable.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46551
Summary: Release RocksDB 4.0.0
Test Plan: no test
Reviewers: sdong, yhchiang, anthony, rven, kradhakrishnan, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46515
Summary: Added funtions to fetch the number of locked keys in a transaction, the number of pending puts/merge/deletes, and the elapsed time
Test Plan: unit tests
Reviewers: yoshinorim, jkedgar, rven, sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45417
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB.
Test Plan: unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46467
Summary:
Prototype of API to allow MyRocks to override default Mutex/CondVar used by transactions with their own implementations. They would simply need to pass their own implementations of Mutex/CondVar to the templated TransactionDB::Open().
Default implementation of TransactionDBMutex/TransactionDBCondVar provided (but the code is not currently changed to use this).
Let me know if this API makes sense or if it should be changed
Test Plan: n/a
Reviewers: yhchiang, rven, igor, sdong, spetrunia
Reviewed By: spetrunia
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43761
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46233
Summary: This diff adds a verifyBackup method to BackupEngine. The method verifies the name and size of each file in the backup.
Test Plan: Unit test cases created and passing.
Reviewers: igor, benj
Subscribers: zelaine.fong, yhchiang, sdong, lgalanis, dhruba, AaronFeldman
Differential Revision: https://reviews.facebook.net/D46029
Summary: BackupRateLimiter removed and uses replaced with the existing GenericRateLimiter
Test Plan:
make all check
make clean
USE_CLANG=1 make all
make clean
OPT=-DROCKSDB_LITE make release
Reviewers: leveldb, igor
Reviewed By: igor
Subscribers: igor, dhruba
Differential Revision: https://reviews.facebook.net/D46095
Summary: MyRocks wants to be able to change the lock timeout of a transaction that has already started. Expose existing SetLockTimeout function to users.
Test Plan: unit test
Reviewers: spetrunia, rven, sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45987
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).
Test Plan: make clean check all
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45993
Previous change for the function
555ca3e7b7 (diff-bdc04e0404c2db4fd3ac5118a63eaa4a)
made use of the QueryPerformanceCounter to return microseconds values that do not repeat
as std::chrono::system_clock returned values that made auto_roll_logger_test fail.
The interface documentation does not state that we need to return
system time describing the return value as a number of microsecs since some
moment in time. However, because on Linux it is implemented using gettimeofday
various pieces of code (such as GenericRateLimiter) took advantage of that
and make use of NowMicros() as a system timestamp. Thus the previous change
broke rate_limiter_test on Windows.
In addition, the interface name NowMicros() suggests that it is actually
a timestamp so people use it as such.
This change makes use of the new system call on Windows that returns
system time with required precision. This change preserves the fix
for auto_roll_logger_test and fixes rate_limiter_test.
Note that DBTest.RateLimitingTest still fails due to a separately reported issue.
Summary: Provide a way to specify a detailed static error message for a Status without incurring a memcpy. Let me know what people think of this approach.
Test Plan: added simple test
Reviewers: igor, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44259
Summary: Add ZSTD compression type. The same way as adding LZ4.
Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.
Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45747
Summary:
This patch adds the helper functions and variables to allow a backend
implementing WritableFile to support direct IO when persisting a
memtable.
Test Plan:
Since there is no upstream implementation of WritableFile supporting
direct IO, the new behavior is disabled.
Tests should be provided by the backend implementing WritableFile.
Summary:
This patch adds GetStringFromColumnFamilyOptions(), the inverse function
of the existing GetColumnFamilyOptionsFromString(), and improves
the implementation of GetColumnFamilyOptionsFromString().
Test Plan: Add a test in options_test.cc
Reviewers: igor, sdong, anthony, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: noetzli, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45009
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.
We add ReadaheadRandomAccessFile layer only when file is read during compactions.
D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.
Test Plan: make check
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45123
Summary:
"rocksdb.aggregated-table-properties" and "rocksdb.aggregated-table-properties-at-level<N>"
should belong to GetProperty() instead of GetIntProperty(), but the comment mistakenly
classifies them to GetIntProperty().
This patch fix this comment error.
Test Plan: no code change.
Reviewers: sdong, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45561
Summary:
See task #7983654. The example was triggering an assert in compaction job
because the compaction was not marked as manual. With this patch,
CompactionPicker::FormCompaction() marks compactions as manual. This patch
also fixes a couple of typos, adds optimistic_transaction_example to
.gitignore and librocksdb as a dependency for examples. Adding librocksdb as
a dependency makes sure that the examples are built with the latest changes
in librocksdb.
Test Plan: make clean && cd examples && make all && ./compact_files_example
Reviewers: rven, sdong, anthony, igor, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45117
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.
Test Plan: Added tests in db_test
Reviewers: igor, sdong, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45087
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.
I also added a --subcompactions option to db_bench
Test Plan: make all make check
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45069
Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.
Test Plan: Add the option.
Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang
Reviewed By: igor
Subscribers: igor, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43311
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use threshold of this counter to replace soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about when should slow down and stopping than more abstract soft and hard rate limits.
Test Plan: Add unit tests
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44205
Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions
Test Plan: Unittests for the new property
Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44799
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately
Test Plan: added a unittest
Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: jeanxu2012, dhruba
Differential Revision: https://reviews.facebook.net/D44955
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.
This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage by all the memtables, includes those
have been flushed but pinned by iterators.
Test Plan: Added a test in db_test
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44229
Summary:
This patch implements DBOptions deserialization and improve
the current implementation of DBOptions serialization by
using a static structure that stores the offset of each
DBOptions member variables to perform serialization and
deserialization instead of using tons of if-then-branch
to determine the mapping between string and variables.
Test Plan: Added test in options_test.cc
Reviewers: igor, anthony, sdong, IslamAbdelRahman
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44097
Summary:
In prepration for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.
Test Plan: ./db_test ./db_compaction_test ./compaction_job_test
Reviewers: sdong, anthony, igor, noetzli, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43239
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.
Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentially producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.
Test Plan: make clean all check
Reviewers: rven, sdong, yhchiang, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43353
Summary:
Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:
2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}
Add two more counters in iostats_context.
Also add a parameter of db_bench.
Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44115
Summary: Change master version to 3.14
Test Plan: simple change
Reviewers: sdong, yhchiang, kradhakrishnan, rven, anthony, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44187
Summary: RandomRWFile is not used anywhere in out code base, this patch remove RandomRWFile
Test Plan:
make check -j64
USE_CLANG=1 make all -j64
OPT=-DROCKSDB_LITE make release -j64
Reviewers: sdong, yhchiang, anthony, kradhakrishnan, rven, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44091
Summary:
Clean up transactions to use the new RollbackToSavePoint api in WriteBatchWithIndex.
Note, this diff depends on Pessimistic Transactions diff and ManagedSnapshot diff (D40869 and D43293).
Test Plan: unit tests
Reviewers: rven, yhchiang, kradhakrishnan, spetrunia, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43371
Summary:
Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it.
MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.
Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable.
Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.
Test Plan: Unit tests, db_bench parallel testing.
Reviewers: igor, rven, sdong, yhchiang, yoshinorim
Reviewed By: sdong
Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40869
Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)
Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43755
Summary:
Changed compaction_job_test to support better/more thorough
tests and added two tests. Also changed MockFileContents
to order using InternalKeyComparator.
Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check
Reviewers: sdong, rven, igor, yhchiang, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42837
Summary: Implemented this simple wrapper for something else I was working on. Seemed like it makes sense to expose it instead of burying it in some random code.
Test Plan: added test
Reviewers: rven, kradhakrishnan, sdong, yhchiang
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43293
Summary: The list of info log files of a db can be obtained using the new function.
Test Plan: New test in db_test.cc passed.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: IslamAbdelRahman, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41715
Summary: Measure read latency histogram and put in statistics. Compaction inputs are excluded from it when possible (unfortunately usually no possible as we usually take table reader from table cache.
Test Plan:
Run db_bench and it shows the stats, like:
rocksdb.sst.read.micros statistics Percentiles :=> 50 : 1.238522 95 : 2.529740 99 : 3.912180
Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, MarkCallaghan, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43275
Summary:
While working on https://reviews.facebook.net/D43179 , I found
duplicate code in the tests. This patch removes it.
Test Plan: make clean all check
Reviewers: igor, sdong, rven, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43263
Summary:
Subj. We really need this feature.
Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.
Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba
Differential Revision: https://reviews.facebook.net/D40905
Summary:
Updated DBTest DBCompactionTest and CompactionJobStatsTest
to run compaction-related tests once with subcompactions enabled and
once disabled using the TEST_P test type in the Google Test suite.
Test Plan: ./db_test ./db_compaction-test ./compaction_job_stats_test
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43443
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.
I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile
Test Plan:
added delete_scheduler_test
existing unit tests
Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43221
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.
This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.
Test Plan: Add DBCompactionTest.SkipStatsUpdateTest
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42843
Summary:
This diff adds CompactOnDeletionCollector in utilities/table_properties_collectors,
which applies a sliding window to a sst file and mark this file as need-compaction
when it observe enough deletion entries within the consecutive keys covered by
the sliding window.
Test Plan: compact_on_deletion_collector_test
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yoshinorim, sdong
Reviewed By: sdong
Subscribers: maykov, dhruba
Differential Revision: https://reviews.facebook.net/D41175
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.
This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.
The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are
# Correct aggregation of compaction job statistics across multiple threads
# Proper opening/closing of output files (make sure each thread's is unique)
# Keys that span multiple L1 files
# Skewed distributions of keys within L0 files
Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test
Reviewers: igor, noetzli, anthony, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42699
Summary:
Support RollbackToSavePoint() in WriteBatch and WriteBatchWithIndex. Support for partial transaction rollback is needed for MyRocks.
An alternate implementation of Transaction::RollbackToSavePoint() exists in D40869. However, the other implementation is messier because it is implemented outside of WriteBatch. This implementation is much cleaner and also exposes a potentially useful feature to WriteBatch.
Test Plan: Added unit tests
Reviewers: IslamAbdelRahman, kradhakrishnan, maykov, yoshinorim, hermanlee4, spetrunia, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42723
Summary:
For task #7771355, we would like to log the number of corrupt keys
during a compaction. This patch implements and tests the count
as part of CompactionJobStats.
Test Plan: make && make check
Reviewers: rven, igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42921
Summary:
My latest fix to pragma_error.h caused compilation errors for another internal project. I am now unable to figure out how to get pragma_error working on all platforms and build environments (nor am I able to test any other options).
Seems like the best option is to get rid of this macro. include/utilities has been deprecated for a year now, so lets just deal with a breaking change in 3.13 to remove these files. And I guess we'll have to live with having an extra convenience.h.
Thoughts?
Test Plan: build
Reviewers: igor, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42597
Summary:
Fixes T6548822. Added a new function for estimating the size of the live data
as proposed in the task. The value can be accessed through the property
rocksdb.estimate-live-data-size.
Test Plan:
There are two unit tests in version_set_test and a simple test in db_test.
make version_set_test && ./version_set_test;
make db_test && ./db_test gtest_filter=GetProperty
Reviewers: rven, igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41493
Summary:
From other ones' investigation:
"sync_file_range() behavior highly depends on kernel version and filesystem.
xfs does neighbor page flushing outside of the specified ranges. For example, sync_file_range(fd, 8192, 16384) does not only trigger flushing page #3 to #4, but also flushing many more dirty pages (i.e. up to page#16)... Ranges of the sync_file_range() should be far enough from write() offset (at least 1MB)."
Test Plan: make all check
Reviewers: igor, rven, kradhakrishnan, yhchiang, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: yoshinorim, MarkCallaghan, sumeet, domas, dhruba, leveldb, ljin
Differential Revision: https://reviews.facebook.net/D15807
Summary: Add new CheckFileExists method. Considered changing the FileExists api but didn't want to break anyone's builds.
Test Plan: unit tests
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42003
Summary:
Skipping these tests in ROCKSDB_LITE since they are not supported
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test
Test Plan:
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test
Reviewers: igor, sdong, yhchiang, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D42573
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.
We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.
Test Plan: make check (there might be some unit tests that depend on this behavior)
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41931
Summary:
ROCKSDB_WARNING is only defined if either ROCKSDB_PLATFORM_POSIX or OS_WIN is defined. This works well for building rocksdb with its own build scripts. But this won't work when an outside project(like mongodb) doesn't define ROCKSDB_PLATFORM_POSIX.
This fix defines ROCKSDB_WARNING for all platforms. No idea if its defined correctly on non-posix,non-windows platforms but this is no worse that the current situation where this macro is missing on unexpected platforms.
This fix should hopefully fix anyone whose build broke now that we've switched from using #warning to Pragma (to support windows). Unfortunately, while mongo-rocks compiles, it ignores the Pragma and doesn't print a warning. I have not been able to figure out a way to implement this portably on all platforms.
Of course, an alternate solution would be to just get rid of ROCKSDB_WARNING and live with include file redirects indefinitely. Thoughts?
Test Plan: build rocks, build mongorocks
Reviewers: igor, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42477
Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.
Test Plan: make check
Reviewers: noetzli, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42405
- Remove make file defines from public headers and use _WIN32 because it is compiler defined
- use __GNUC__ and __clang__ to guard non-portable attributes
- add #include "port/port.h" to some new .cc files.
- minor changes in CMakeLists to reflect recent changes
Summary: quick fix for now. will figure out a better fix soon
Test Plan: build
Reviewers: sdong, igor, kradhakrishnan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42381
Summary: Unbreak build for mongo. later, we'll have to figure out if there is a platform independent way to implement ROCKSDB_WARNING
Test Plan: build
Reviewers: igor, sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42369
Summary: master is now 3.13-to-be. I'm testing some internal stuff that has #ifdef based on RocksDB version.
Test Plan: none
Reviewers: sdong, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42249
Summary: Moved convenience.h out of utilities to remove a dependency on utilities in db.
Test Plan: unit tests. Also compiled a link to the old location to verify the _Pragma works.
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42201
Summary:
Couple of changes here:
* NewBackupEngine() and NewReadOnlyBackupEngine() are now removed. They were deprecated since RocksDB 3.8. Changing these to new functions should be pretty straight-forward. As a followup, I'll fix all fbcode callsights
* Instead of initializing backup engine in the constructor, we initialize it in a separate function now. That way, we can catch all errors and return appropriate status code.
* We catch all errors during initializations and return them to the client properly.
* Added new tests to backupable_db_test, to make sure that we can't open BackupEngine when there are Env errors.
* Transitioned backupable_db_test to use BackupEngine rather than BackupableDB. From the two available APIs, judging by the current use-cases, it looks like BackupEngine API won. It's much more flexible since it doesn't require StackableDB.
Test Plan: Added a new unit test to backupable_db_test
Reviewers: yhchiang, sdong, AaronFeldman
Reviewed By: AaronFeldman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41925
Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.
Test Plan: none
Reviewers: sdong, yhchiang, anthony, dhruba
Reviewed By: dhruba
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42063
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.
Test Plan: make check
Reviewers: anthony, kradhakrishnan, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41937
Summary:
Added new statistics in CompactionJobStats to keep track of
deletion entries and the expiration of those entries. Updated these
fields in compaction_job.cc as compaction took place and wrote a new
test in compaction_job_stats_test.cc to verify accuracy.
Test Plan:
Wrote new test DeletionStatsTest in
compaction_job_stats_test.cc to verify
Reviewers: sdong, igor, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41355
Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more.
Test Plan: Build it.
Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41961
Summary:
Public API depends on port/port.h which is wrong. Fix it.
Also with gcc 4.8.1 build was broken as MAX_INT32 was not recognized. Fix it by using ::max in linux.
Test Plan: Build it and try to build an external project on top of it.
Reviewers: anthony, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41745
Summary:
new_table_iterator_nanos is not cleaned in PerfContext::Reset() while new_table_block_iter_nanos is cleaned twice. Fix it.
Also fix a comment.
Test Plan: Build and db_bench with --perf_context to see the value shown.
Reviewers: kradhakrishnan, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41721
Summary: Add a perf context counter to help users figure out time spent on reading indexes and bloom filter blocks.
Test Plan: Will write a unit test
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D41433
Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction.
Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases.
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40875
Summary:
Introduced a new category in the enum InfoLogLevel in env.h.
Modifed Log() in env.cc to use the Header()
when the InfoLogLevel == HEADER_LEVEL.
Updated tests in auto_roll_logger_test to ensure
the header is handled properly in these cases.
Test Plan: Augment existing tests in auto_roll_logger_test
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41067
Summary: About to cut release
Test Plan: none
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41061
Summary:
Add a new field: BackupableDBOptions.max_background_copies.
CreateNewBackup() and RestoreDBFromBackup() will use this number of threads to perform copies.
If there is a backup rate limit, then max_background_copies must be 1.
Update backupable_db_test.cc to test multi-threaded backup and restore.
Update backupable_db_test.cc to test backups when the backup environment is not the same as the database environment.
Test Plan:
Run ./backupable_db_test
Run valgrind ./backupable_db_test
Run with TSAN and ASAN
Reviewers: yhchiang, rven, anthony, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, anthony, sdong, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40725
invalid suffix on literal
no return statement in function returning non-void CuckooStep::operator=
extra qualification ‘rocksdb::spatial::Variant::
dereferencing type-punned pointer will break strict-aliasing rules
Summary: Make RocksDb build and run on Windows to be functionally
complete and performant. All existing test cases run with no
regressions. Performance numbers are in the pull-request.
Test plan: make all of the existing unit tests pass, obtain perf numbers.
Co-authored-by: Praveen Rao praveensinghrao@outlook.com
Co-authored-by: Sherlock Huang baihan.huang@gmail.com
Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
Summary: It's not really nice to call user's API with garbage data in new_value. This diff makes sure that new_value is empty before calling the merge operator.
Test Plan: Added assert to Merge operator in merge_test
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40773
Summary:
BYTES_READ only count the number of logical bytes read from
the DB::Get() function. It neither includes all logical bytes read
nor indicates IO read bytes.
This patch improves the comment for BYTES_READ.
Test Plan: Only change comment.
Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40599
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.
Test Plan: make check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40527
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.
Supports snapshots and merge operations.
Test Plan: Ran `make valgrind_check commit-prereq`
Reviewers: igor, philipp, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39849
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.
1. RecoverAfterRestart (kTolerateCorruptedTailRecords)
This mocks the current recovery mode.
2. RecoverAfterCleanShutdown (kAbsoluteConsistency)
This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.
3. RecoverPointInTime (kPointInTimeRecovery)
This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.
4. RecoverAfterDisaster (kSkipAnyCorruptRecord)
This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.
Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.
Reviewers: leveldb, sdong, igor
Subscribers: yoshinorim, dhruba
Differential Revision: https://reviews.facebook.net/D38487
Summary: MyRocks need a mechanism to track read outliers. We need to expose this
stat.
Test Plan: None
Reviewers: sdong
CC: leveldb
Task ID: #7152512
Blame Rev:
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.
This patch fails DB::Open() if we don't support the compression that is specified in the options.
Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39687
Summary:
Add the funcion Cache.GetPinnedUsage() to return the memory size of entries
that are in use by the system (that is, all the entries not in the LRU list).
Test Plan:
Run ./cache_test and examine PinnedUsageTest.
Reviewers: tnovak, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40305
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level
Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set
Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way
Test Plan:
make check
adding new tests
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40059
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.
Test Plan: Check out the LOG
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39729
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated
Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D40209
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.
Test Plan: Add a test case
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40119
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.
The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work
hard_rate_limit is deprecated.
options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.
Test Plan: Add new unit tests in db_test
Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor
Reviewed By: igor
Subscribers: ikabiljo, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D36351
Summary:
Adding largest sequence number to FlushJobInfo
and passing flushed file metadata to NotifyOnFlushCompleted which include alot of other values that we may want to expose in FlushJobInfo
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39927
Summary:
Add Env::GetThreadID(), which returns the ID of the current thread.
In addition, make GetThreadList() and InfoLog use same unique ID for the same thread.
Test Plan:
db_test
listener_test
Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39735
Summary:
[This is the resubmit of D39813. Tests were failing, so I reverted the diff. I found the bug and I'm now resubmitting]
If we don't do this, any calls to Entry() after WBWI mutation will result in undefined behavior. We need to re-fetch the offset from the skip list and regenerate the new pointer (because string's base pointer can change while mutating).
Test Plan: COMPILE_WITH_ASAN=1 make write_batch_with_index_test && ./write_batch_with_index_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39897
Summary: If we don't do this, any calls to Entry() after WBWI mutation will result in undefined behavior. We need to re-fetch the offset from the skip list and regenerate the new pointer (because string's base pointer can change while mutating).
Test Plan: COMPILE_WITH_ASAN=1 make write_batch_with_index_test && ./write_batch_with_index_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39813
Summary:
Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2
The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI
Test Plan: make check
Reviewers: yoshinorim, ott, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D39693
Summary:
It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file.
It can be used to allow users to suggest DB to clear up delete tombstones faster.
Test Plan: Add a unit test.
Reviewers: igor, yhchiang, kradhakrishnan, rven
Reviewed By: rven
Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39585
Summary:
EventListener::OnFlushCompleted() now passes a structure instead
of a list of parameters. This minimizes the API change in the
future.
Test Plan:
listener_test
compact_files_test
example/compact_files_example
Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor
Reviewed By: rven, igor
Subscribers: IslamAbdelRahman, rven, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39543
Summary:
Keys in RocksDB can be arbitrary byte strings. However, in the current
CompactionJobStats, smallest_output_key_prefix and largest_output_key_prefix
are of type char[] without having a length, which is insufficient to handle
non-null terminated strings.
This patch change their type to std::string.
Test Plan: compaction_job_stats_test
Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39537
Summary:
Add EventListener::OnTableFileDeletion(), which will be
called when a table file is deleted.
Test Plan: Extend three existing tests in db_test to verify the deleted files.
Reviewers: rven, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38931
Summary:
Allow EventListener::OnCompactionCompleted to return CompactionJobStats,
which contains useful information about a compaction.
Example CompactionJobStats returned by OnCompactionCompleted():
smallest_output_key_prefix 05000000
largest_output_key_prefix 06990000
elapsed_time 42419
num_input_records 300
num_input_files 3
num_input_files_at_output_level 2
num_output_records 200
num_output_files 1
actual_bytes_input 167200
actual_bytes_output 110688
total_input_raw_key_bytes 5400
total_input_raw_value_bytes 300000
num_records_replaced 100
is_manual_compaction 1
Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats.
Reviewers: rven, igor, anthony, sdong
Reviewed By: sdong
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38463
Summary: Fixed ROCKSDB_LITE compile error due to the missing of TableFileCreationInfo
Test Plan: make OPT=-DROCKSDB_LITE shared_lib -j32
Reviewers: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39405
Summary:
Add EventListener::OnTableFileCreated(), which will be called
when a table file is created. This patch is part of the
EventLogger and EventListener integration.
Test Plan: Augment existing test in db/listener_test.cc
Reviewers: anthony, kradhakrishnan, rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38865
Summary: In previous change https://reviews.facebook.net/D39099 , while renaming parameters, use a wrong parameter, causing CompactRange() to compact not wrong level.
Test Plan: Run "DBTest.MigrateToDynamicLevelMaxBytesBase" which failed with the patch.
Reviewers: rven, yhchiang, kradhakrishnan, igor, anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39393
Summary:
We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls:
- perf_context.db_mutex_lock_nanos
- perf_context.db_condition_wait_nanos
- iostats_context.open_time
- iostats_context.allocate_time
- iostats_context.write_time
- iostats_context.range_sync_time
- iostats_context.logger_time
In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long.
Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took.
Reviewers: rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: march, dhruba, tnovak
Differential Revision: https://reviews.facebook.net/D39177
Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true.
Test Plan: Add a unit test for it.
Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39099
Summary: It used to be no good (known to me) non-intrusive way to wrap WritableFile - you can't call protected virtual methods of the wrapped pointer to WritableFile. This diff adds a convenience class WritableFileWrapper that makes wrapping WritableFile both possible and easy.
Test Plan: `make clean; make -j release`, `make clean; OPT=-DROCKSDB_LITE make release`, `make clean; USE_CLANG=1 make -j all`.
Reviewers: sdong, yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, tnovak, march
Differential Revision: https://reviews.facebook.net/D39147
Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported.
Test Plan: Added a new test, but still need to look into stress testing.
Reviewers: yhchiang, igor, rven, sdong
Reviewed By: sdong
Subscribers: adamretter, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D33435
Summary: Fixed a compile error in ROCKSDB_LITE
Test Plan: make db_stress OPT=-DROCKSDB_LITE -j32
Reviewers: sdong, igor, anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39201
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to someone keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.
After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.
This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).
However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.
Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here.
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37443
Summary: Having stats in our LOG more often will help a lot with perf debugging.
Test Plan: none
Reviewers: sdong, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38781
Summary:
sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html
Some users don't want us to call sync_file_range on WALs. Some other do.
Thus, I'm adding a separate option wal_bytes_per_sync to control calling
sync_file_range on WAL files. bytes_per_sync will apply only to table
files now.
Test Plan: no more sync_file_range for WAL as evidenced by strace
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38253
Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level.
Test Plan: Run all unit tests
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37125
Summary:
Allow GetThreadList to report Flush properties, which includes:
* job id
* number of bytes that has been written since flush started.
* total size of input mem-tables
Test Plan:
./db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=100 --value_size=1000
Sample output from db_bench which tracks same flush job
ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties
140213879898240 High Pri default Flush 5789 us FlushJob::WriteLevel0Table BytesMemtables 4112835 | BytesWritten 577104 | JobID 8 |
ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties
140213879898240 High Pri default Flush 30.634 ms FlushJob::WriteLevel0Table BytesMemtables 4112835 | BytesWritten 1734865 | JobID 8 |
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38505
Summary: Use a better way to initialize ThreadStatus::kNumOperationProperties.
Test Plan: make
Reviewers: sdong, rven, anthony, krishnanm86, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38547
Summary: Universal compactions with multiple levels should use file preallocation size based on file size if output level is not level 0
Test Plan: Run all tests.
Reviewers: igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D38439
Summary: Make ThreadStatus::InterpretOperationProperties take const uint64_t*
Test Plan:
make
make OPT=-DROCKSDB_LITE shared_lib
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38445
Summary:
Fixed a bug in EventListener::OnCompactionCompleted() that returns
incorrect list of input / output file names.
Test Plan: Extend existing test in listener_test.cc
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38349
Summary:
Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch. (This is needed for transactions).
I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep).
Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc. WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)).
Test Plan: Added new unit tests.
Reviewers: rven, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38037
Summary:
When new capacity is larger than existing capacity, simply update the capacity to the new valie
When new capacity is less than existing capacity, but more than the usage, simply update the capacity to new value
When new capacity is less than the existing capacity and existing usage both, try to purge entries in LRU if feasible to make usage < capacity
Test Plan: Created unit tests in cache_test.cc
Reviewers: sdong, rven, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37527
Summary:
Make it build for CYGWIN.
Need to define "-std=gnu++11" instead of "-std=c++11" and use some replacement functions.
Test Plan: Build it and run some unit tests in CYGWIN
Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37605
Summary:
This diff implements a new `DB` method `PromoteL0` which moves all files in L0
to a given level skipping compaction, provided that the files have disjoint
ranges and all levels up to the target level are empty.
This method provides finer-grain control for trivial compactions, and it is
useful for bulk-loading pre-sorted keys. Compared to D34797, it does not change
the semantics of an existing operation, which can impact existing code.
PromoteL0 is designed to work well in combination with the proposed
`GetSstFileWriter`/`AddFile` interface, enabling to "design" the level structure
by populating one level at a time. Such fine-grained control can be very useful
for static or mostly-static databases.
Test Plan: `make check`
Reviewers: IslamAbdelRahman, philipp, MarkCallaghan, yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D37107
Summary: To further distinguish the corruption cases were caused by storage media or in memory states when writing it, add a paranoid check after writing the file to iterate all the rows.
Test Plan: Add a new unit test for it
Reviewers: rven, igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37335
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
Summary: Other than making some class members private, this is a documentation-only change
Test Plan: unit tests
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36567
Summary:
There are some cases when flachcache file descriptor was
already allocated (i.e. fb-MySQL). Then NewFlashcacheAwareEnv returns an
error at open() because fd was already assigned. This diff adds another
function to instantiate FlashcacheAwareEnv, with pre-allocated fd cachedev_fd.
Test Plan: Tested with MyRocks using this function, then worked
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, MarkCallaghan, rven
Differential Revision: https://reviews.facebook.net/D36447
Summary: Now we add warnings when user configures compression and the compression is not supported.
Test Plan:
Configured compression to non-supported values. Observed messages in my log:
2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2.
2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data.
Reviewers: rven, sdong, yhchiang
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35979
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
Summary: Most of the approach is copied from WebSQL's MySQL branch. It's nice that we can do this without touching core RocksDB code.
Test Plan: Compiles and runs. Didn't test flashback code, as I don't have flashback device and most if it is c/p
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: rven, lgalanis, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35391
Summary:
Assign the string properties to const string variables under the
DB::Properties namespace. This helps catch typos during compilation and
also consolidates the property definition in one place.
Test Plan: Run rocksdb unit tests
Reviewers: sdong, yoshinorim, igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D35991
Summary:
We have addded new stats and perf_context for measuring the merge and filter operation time consumption.
We have bounded all the merge operations within the GUARD statment and collected the total time for these operations in the DB.
Test Plan: WIP
Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D34377
Summary:
Report elapsed time of a thread operation in micros in ThreadStatus
instead of start time of a thread operation in seconds since the
Epoch, 1970-01-01 00:00:00 (UTC).
Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
--max_background_compactions=10 --max_background_flushes=3 \
--thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
--num_column_families=10
Sample Output:
ThreadID ThreadType cfName Operation ElapsedTime Stage State
140667724562496 High Pri column_family_name_000002 Flush 772.419 ms FlushJob::WriteLevel0Table
140667728756800 High Pri default Flush 617.845 ms FlushJob::WriteLevel0Table
140667732951104 High Pri column_family_name_000005 Flush 772.078 ms FlushJob::WriteLevel0Table
140667875557440 Low Pri column_family_name_000008 Compaction 1409.216 ms CompactionJob::Install
140667737145408 Low Pri
140667749728320 Low Pri
140667816837184 Low Pri column_family_name_000007 Compaction 1071.815 ms CompactionJob::ProcessKeyValueCompaction
140667787477056 Low Pri column_family_name_000009 Compaction 772.516 ms CompactionJob::ProcessKeyValueCompaction
140667741339712 Low Pri
140667758116928 Low Pri column_family_name_000004 Compaction 620.739 ms CompactionJob::ProcessKeyValueCompaction
140667753922624 Low Pri
140667842003008 Low Pri column_family_name_000006 Compaction 1260.079 ms CompactionJob::ProcessKeyValueCompaction
140667745534016 Low Pri
Reviewers: sdong, igor, rven
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35769
Summary: There is no alternative to GetLiveFiles() function
Test Plan: none
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35805
Summary: Add a DB property for number of deletions in memtables. It can sometimes help people debug slowness because of too many deletes.
Test Plan: Add test cases.
Reviewers: rven, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D35247
Summary: This feature is going to be useful for mongodb+rocksdb. I'll expose it through mongo's API.
Test Plan: added new unit test. also will run TSAN on the new unit test
Reviewers: meyering, sdong
Reviewed By: meyering, sdong
Subscribers: meyering, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35307
Summary: WriteBatch and WriteBatchWithIndex now both inherit from a common abstract base class. This makes it easier to write code that is agnostic toward the implementation of the particular write batch. In particular, I plan on utilizing this abstraction to allow transactions to support using either implementation of a write batch.
Test Plan: modified existing WriteBatchWithIndex tests to test new functions. Running all tests.
Reviewers: igor, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34017
Summary: It is no longer used by the implementation, so we should also remove it from the public API.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34971
Summary: Mark function as override.
Test Plan: USE_CLANG=1 make -j32 check
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35163
Summary:
On RocksDB, when there are multiple instances doing
flushes/compactions in the background, the close call takes a long time
because the flushes/compactions need to complete before the database can
shut down. If another instance is using the background threads and the compaction for this instance is in the queue since it has been scheduled, we still cannot shutdown. We now remove the scheduled background tasks which have not yet started running, so that shutdown is speeded up.
Test Plan: DB Test added.
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33741
Summary:
Change the way options.compression_per_level is used when options.level_compaction_dynamic_level_bytes=true so that options.compression_per_level[1] determines compression for the level L0 is merged to, options.compression_per_level[2] to the level after that, etc.
Test Plan: run all tests
Reviewers: rven, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D34431
Summary:
Provide an API which enables users to infor Rocksdb that it is
shutting down.
Test Plan: db_test
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34617
Summary: Allow GetThreadList() to report the start time of the current operation.
Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
--max_background_compactions=10 --max_background_flushes=3 \
--thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
--num_column_families=10
Sample output:
ThreadID ThreadType cfName Operation OP_StartTime State
140338840797248 High Pri column_family_name_000003 Flush 2015/03/09-17:49:59
140338844991552 High Pri column_family_name_000004 Flush 2015/03/09-17:49:59
140338849185856 Low Pri
140338983403584 Low Pri
140339008569408 Low Pri
140338861768768 Low Pri
140338924683328 Low Pri
140338899517504 Low Pri
140338853380160 Low Pri
140338882740288 Low Pri
140338865963072 High Pri column_family_name_000006 Flush 2015/03/09-17:49:59
140338954043456 Low Pri
140338857574464 Low Pri
Reviewers: igor, rven, sdong
Reviewed By: sdong
Subscribers: lgalanis, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34689
Summary:
Add --thread_status_per_interval to db_bench, which allows
db_bench to optionally enable print the current thread status
periodically.
Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 --max_background_compactions=10 --max_background_flushes=3 --thread_status_per_interval=1000 --key_size=16 --value_size=1000 --num_column_families=10
Sample output:
ThreadID ThreadType dbName cfName Operation State
140281571770432 Low Pri
140281575964736 High Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000001 Flush
140281710182464 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000008 Compaction
140281638879296 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000007 Compaction
140281592741952 Low Pri
140281580159040 High Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000002 Flush
140281676628032 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000006 Compaction
140281584353344 Low Pri
140281622102080 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000009 Compaction
140281605324864 Low Pri /tmp/rocksdbtest-5297/dbbench column_family_name_000004 Compaction
140281601130560 High Pri /tmp/rocksdbtest-5297/dbbench default Flush
140281596936256 Low Pri
140281588547648 Low Pri
Reviewers: igor, rven, sdong
Reviewed By: rven
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D34515
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
Summary:
This diff fixes a bug introduced by D28521. Read-only backup engine can delete a backup that is later than the latest -- we never check the condition.
I also added a bunch of logging that will help with debugging cases like this in the future.
See more discussion at t6218248.
Test Plan: Added a unit test that was failing before the change. Also, see new LOG file contents: https://phabricator.fb.com/P19738984
Reviewers: benj, sanketh, sumeet, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33897
Summary:
Summary:
Added a new option to ColumnFamllyOptions - optimize_filters_for_hits. This option can be used in the case where most
accesses to the store are key hits and we dont need to optimize performance for key misses.
This is useful when you have a very large database and most of your lookups succeed. The option allows the store to
not store and use filters in the last level (the largest level which contains data). These filters can take a large amount of
space for large databases (in memory and on-disk). For the last level, these filters are only useful for key misses and not
for key hits. If we are not optimizing for key misses, we can choose to not store these filters for that level.
This option is only provided for BlockBasedTable. We skip the filters when we are compacting
Test Plan:
1. Modified db_test toalso run tests with an additonal option (skip_filters_on_last_level)
2. Added another unit test to db_test which specifically tests that filters are being skipped
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: lgalanis, yoshinorim, MarkCallaghan, rven, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33717
Summary: mac compile is fixed in fbson, so it can be returned back from 7ce1b2c
Test Plan:
make all check
make valgrind_check
Reviewers: golovachalexander, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D33855
Summary:
When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes.
Prerequisites: bear and clang 3.5 build with extra tools
```lang=bash
% USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html
% clang-modernize -p . -include . -add-override
% make format
```
Test Plan:
Make sure all tests are passing.
```lang=bash
% #Use default fb code clang.
% make check
```
Verify less error and no missing override errors.
```lang=bash
% # Have trunk clang present in path.
% ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make
```
Reviewers: igor, kradhakrishnan, rven, meyering, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34077
Summary:
The "const" attribute does not make sense on a return type,
and provokes a warning/error from gcc -W -Wextra.
Test Plan:
Run "make EXTRA_CXXFLAGS='-W -Wextra'" and see fewer errors.
Reviewers: ljin, sdong, igor.sugak, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D33759
Summary:
In mongo, we currently have a single column family and I'd like to support setting rocksdb::Options from string. This diff provides an option to GetOptionsFromString()
There's one more problem. Currently GetColumnFamilyOptionsFromString() overwrites block_based options. In mongo I set default values for block_cache and some other values of BlockBasedTableOptions and I don't want them reset to default with GetOptionsFromString().
Test Plan: added unit test
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33729
Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases.
Test Plan: make all check
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33327