Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
In the current implementation, perf_context.db_mutex_lock_nanos and
perf_context.db_condition_wait_nanos also include the mutex-wait time
other than DB Mutex.
This patch fix this issue by incrementing the counters only when it detects
a DB mutex.
Test Plan: perf_context_test
Reviewers: anthony, IslamAbdelRahman, sdong, igor
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48555
Summary:
Fix tests that compile under ROCKSDB_LITE but currently failing.
table_test:
RandomizedLongDB test is using internal stats which is not supported in ROCKSDB_LITE
compaction_job_test:
Using CompactionJobStats which is not supported
perf_context_test:
KeyComparisonCount test try to open DB in ReadOnly mode which is not supported
Test Plan: run the tests under ROCKSDB_LITE
Reviewers: yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48585
Summary:
Remove db_impl_debug from NDEBUG, but allow it in ROCKSDB_LITE
These functions by definition should not be included in NDEBUG and they are only used for testing
This is based on offline discussion with @yhchiang and @igor
Test Plan:
make static_lib
make check
Reviewers: igor, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: igor, yhchiang, dhruba
Differential Revision: https://reviews.facebook.net/D48573
Summary:
Long time ago we add InternalDumpCommand to ldb_tool https://reviews.facebook.net/D11517
This command is using TEST_NewInternalIterator although it's not a test. This patch move TEST_NewInternalIterator outside of db_impl_debug.cc
Test Plan:
make check
make static_lib
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48561
Summary:
As part of tech debt week, we are cleaning up dependencies.
This diff moves db_test_util.[h,cc] from util to db directory.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48543
Summary: My previous commit ('Passing table properties to compaction callback') broke the clang build. Here is the fix.
Test Plan: USE_CLANG=1 make all -j
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48489
Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online.
Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats.
Reviewers: igor, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48267
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.
Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48411
Summary: Now based on environment, DBTest.AggregatedTableProperties has a possibility of issuing a L0->L1 compaction after reopening and the results are not what we expected. We tune the L0 compaction trigger to make it less likely to happen.
Test Plan: I can't repro the failure but I think the change is better. Just run the test and make sure it passes.
Reviewers: kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48423
Summary: With this commit, we add a new format in manifest when adding a new file. Now path ID and need-compaction hint are first two customized fields.
Test Plan: Add a test case in version_edit_test to verify the encoding and decoding logic. Add a unit test in db_test to verify need compaction is persistent after DB restarting.
Reviewers: kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, rven, igor
Reviewed By: igor
Subscribers: javigon, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48123
Summary: Add the column family ID to compaction filter context, so it is easier for compaction filter to apply different logic for different column families.
Test Plan: Add a unit test to verify the column family ID passed is correct.
Reviewers: rven, yhchiang, igor
Reviewed By: igor
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48357
Summary:
Two changes:
1. remove *V2 filter stuff. we deprecated that a while ago
2. clarify what happens when user sets max_subcompactions to bigger than 1
Test Plan: none
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47871
Summary:
hit and miss bloom filter stats for memtable and SST
stats added to perf_context struct
key matches and prefix matches combined into one stat
Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47859
Summary:
Travis is complaining against using {} to initialize KVMap: https://travis-ci.org/facebook/rocksdb/jobs/84132600
db/compaction_job_test.cc:526:26: error: chosen constructor is explicit in copy-initialization
RunCompaction({files}, {});
This diff should fix it
Test Plan: travis
Reviewers: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48309
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.
The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)
Do we also need a compaction filter to get called on merge results?
Test Plan: make && make check
Reviewers: lovro, tnovak, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: noetzli, kolmike, leveldb, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D47847
Summary:
Handle SST files with both ".sst" and ".ldb" suffix.
This enables user to migrate from leveldb to rocksdb.
Test Plan:
Added unit test with DB operating on SSTs with names schema.
See db/dc_test.cc:SSTsWithLdbSuffixHandling for details
Reviewers: yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48003
Summary:
The function GetBoundaryKeys() returns the smallest key from the first file and largest key from the last file. This is good for any level >0, but it's not correct for level 0. In level 0, files can overlap, so we need to check all files for boundary keys. This bug can cause wrong value for bottommost_level in compaction (value of true, although correct is false), which means we can set sequence numbers to 0 even if the key is not the oldest one in the database.
Herman reported corruption while testing MyRocks. Fortunately, the patch that added the bug was not released yet.
Test Plan: added a new test to compaction_picker_test.
Reviewers: hermanlee4, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48201
Summary:
Latest travis failed because of corruption test TableFileIndexData: https://travis-ci.org/facebook/rocksdb/jobs/83732558
This diff makes the test more explicit:
1. create two files
2. corrupt the second's file index
3. expect to get only 5000 keys when range scanning
Test Plan: the test is still passing :)
Reviewers: sdong, rven, yhchiang, kradhakrishnan, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48183
Summary:
To support a new MongoDB capability, we need to make sure that we don't do any IO for a short period of time. For background, see:
* https://jira.mongodb.org/browse/SERVER-20704
* https://jira.mongodb.org/browse/SERVER-18899
To implement that, I add a new API calls PauseBackgroundWork() and ContinueBackgroundWork() which reuse the capability we already have in place for RefitLevel() function.
Test Plan: Added a new test in db_test. Made sure that test fails when PauseBackgroundWork() is commented out.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47901
Summary: Add an option to db_bench for max_file_opening_threads
Test Plan: compile and run db_bench
Reviewers: sdong, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, paultuckfield
Differential Revision: https://reviews.facebook.net/D47811
Summary:
CompactionJobStatsTest.UniversalCompactionTest assumes compaction
kicks in when the number of L0 files equals to the compaction trigger.
However, in some case, the compaction might not catch up the write
speed and thus compaction might not kick in until the number of L0 files
is GREATER than the compaction trigger.
This patch tries to fix this corner case by making the Put thread wait
for a potential compaction whenever it flushes.
Test Plan: ./compaction_job_stats_test
Reviewers: sdong, anthony, IslamAbdelRahman, igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D47589
Summary: Fixed a bug which causes rocksdb.flush.write.bytes stat is always zero
Test Plan: augment existing db_test
Reviewers: sdong, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47595
Summary: As title
Test Plan: make check
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46983
Summary: Test cases for IsBottommostLevel function create FileMetaData objects which were not getting deleted in the destructor.
Test Plan: Valgrind check on compaction_picker_test
Reviewers: yhchiang, igor, sdong
Subscribers: rven, kradhakrishnan, IslamAbdelRahman, dhruba, anthony
Differential Revision: https://reviews.facebook.net/D47463
Summary:
This is an initial version of bulk load feature
This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are
(1) Memtables are empty
(2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0
(3) Added sst files values are not overlapping
Test Plan: unit testing
Reviewers: igor, ott, sdong
Reviewed By: sdong
Subscribers: leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D39081
Summary:
PlainTableReader now only allows mmap-mode. Add the support to non-mmap mode for more flexibility.
Refactor the codes to move all logic of reading data to PlainTableKeyDecoder, and consolidate the calls to Read() call and ReadVarint32() call. Implement the calls for both of mmap and non-mmap case seperately. For non-mmap mode, make copy of keys in several places when we need to move the buffer after reading the keys.
Test Plan: Add the mode of non-mmap case in plain_table_db_test. Run it in valgrind mode too.
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D47187
Summary: https://reviews.facebook.net/D23343 changed WAL sync bytes to extra fsync. This change does the same for internal stats.
Test Plan: Run all existing unit tests and verify results in db_bench.
Reviewers: anthony, rven, igor, MarkCallaghan, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D47349
Summary:
Add options.compaction_pri, which specifies the policy about which file to compact first.
kCompactionPriByLargestSeq will compact oldest files first.
Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable.
Test Plan: Will write unit tests
Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45951
Summary:
This patch fixes#7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
Summary:
Some users have observed errors in the log file when
the log file or sst file is already deleted.
Test Plan:
Make sure that the errors do not appear for already deleted
files.
Reviewers: sdong
Reviewed By: sdong
Subscribers: anthony, kradhakrishnan, yhchiang, rven, igor, IslamAbdelRahman, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47115
Summary: The diff modifies the condition checked to determine the bottommost level during compaction. Previously, absence of files in higher levels alone was used as the condition. Now, the function additionally evaluates if the higher levels have files which have non-overlapping key ranges, then the level can be safely considered as the bottommost level.
Test Plan: Unit test cases added and passing. However, unit tests of universal compaction are failing as a result of the changes made in this diff. Need to understand why that is happening.
Reviewers: igor
Subscribers: dhruba, sdong, lgalanis, meyering
Differential Revision: https://reviews.facebook.net/D46473
Summary: Now DB::Open() flushes info log before printing DB pointer, so it may not show up if no activity after DB open. Move log flushing from after printing options to printing DB pointer.
Test Plan: make commit-prereq
Reviewers: igor, IslamAbdelRahman, yhchiang, kradhakrishnan, anthony, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D47121
Summary: Change the log level of DB start-up log from Warn to Header.
Test Plan: db_bench and observe the LOG header
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47067
Summary: This will unblock the corresponding change in MyRocks
Test Plan: ran rocksdb.write_sync test
Reviewers: sdong, kolmike
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46911
Summary: If we skip a test, we shouldn't mark `make check` as failure. This fixes travis CI test.
Test Plan: Travis CI
Reviewers: noetzli, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47031
Summary:
Although compaction time is recorded in the statistics,
it is helpful to include this value in the log output corresponding
to the end of compaction.
Test Plan: make all && make check
Reviewers: yhchiang, sdong, igor, noetzli, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D47007
Summary: There was a merge issue with SleepingBackgroundTask
Test Plan: compiles now
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46977
Summary:
This patch finally fixes the ColumnFamilyTest.ReadDroppedColumnFamily test. The test has been failing very sporadically and it was hard to repro. However, I managed to write a new tests that reproes the failure deterministically.
Here's what happens:
1. We start the flush for the column family
2. We check if the column family was dropped here: a3fc49bfdd/db/flush_job.cc (L149)
3. This check goes through, ends up in InstallMemtableFlushResults() and it goes into LogAndApply()
4. At about this time, we start dropping the column family. Dropping the column family process gets to LogAndApply() at about the same time as LogAndApply() from flush process
5. Drop column family goes through LogAndApply() first, marking the column family as dropped.
6. Flush process gets woken up and gets a chance to write to the MANIFEST. However, this is where it gets stuck: a3fc49bfdd/db/version_set.cc (L1975)
7. We see that the column family was dropped, so there is no need to write to the MANIFEST. We return OK.
8. Flush gets OK back from LogAndApply() and it deletes the memtable, thinking that the data is now safely persisted to sst file.
The fix is pretty simple. Instead of OK, we return ShutdownInProgress. This is not really true, but we have been using this status code to also mean "this operation was canceled because the column family has been dropped".
The fix is only one LOC. All other code is related to tests. I added a new test that reproes the failure. I also moved SleepingBackgroundTask to util/testutil.h (because I needed it in column_family_test for my new test). There's plenty of other places where we reimplement SleepingBackgroundTask, but I'll address that in a separate commit.
Test Plan:
1. new test
2. make check
3. Make sure the ColumnFamilyTest.ReadDroppedColumnFamily doesn't fail on Travis: https://travis-ci.org/facebook/rocksdb/jobs/79952386
Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46773
Summary:
There are some use cases in MyRocks to compare two slices
and to return the first byte where they differ. It may be
useful to add it as a RocksDB Slice function.
Test Plan: db_test
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D46935
Summary:
Releasing mutex between getting min_pending_output and scanning files may cause min_pending_output to be max but some non-final files are found in file scanning, ending up with deleting wrong files.
As a recent regression, mutex can be released while waiting for log sync. We move it to after file scanning.
Test Plan: Run all existing tests. Don't think it is easy to write a unit test. Maybe we should find a way to assert lock not released so that we can have some test verification for similar cases.
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, kolmike, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D46899
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.
Test Plan: Add unit test DBTest.HardLimit
Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45999
Summary:
Although there are currently counters to keep track of the
stall caused by having too many L0 files, there is no distinction as
to whether when that stall occurs either (A) L0-L1 compaction is taking
place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
at the moment. This diff adds a counter for (A) so that the nature of L0
stalls can be better understood.
Test Plan: make all && make check
Reviewers: sdong, igor, anthony, noetzli, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D46749
Summary:
RocksDB debug version failed to build under gcc-4.8.1 on sandcastle with the following error:
```
db/db_compaction_filter_test.cc:570:33: error: ‘snapshot’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
```
Test Plan: make db_compaction_filter_test && ./db_compaction_filter_test
Reviewers: rven, anthony, yhchiang, aekmekji, igor, sdong
Reviewed By: igor, sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46725
Summary:
There are currently no statistics on seeks, only on gets. This adds the following counters:
rocksdb.number.db.seek
rocksdb.number.db.next
rocksdb.number.db.prev
(number of calls)
rocksdb.db.iterate.bytes.read
(number of bytes read from key + value using seek/next/prev)
rocksdb.number.keys.seek.found
rocksdb.number.keys.next.found
rocksdb.number.keys.prev.found
(number of calls where seek/next/prev found a value)
Test Plan:
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5 -reverse_iterator
Reviewers: yhchiang, rven, kradhakrishnan, IslamAbdelRahman, MarkCallaghan, sdong, igor
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46605
Summary:
Refactoring NewTableReader to accept TableReaderOptions
This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071
Test Plan: run existing tests
Reviewers: igor, yhchiang, anthony, rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46179
Summary:
During the refactoring, the condition that makes sure that compaction
filters are only applied to records newer than the latest snapshot
got butchered. This patch fixes the condition and adds a test case.
Test Plan: make db_compaction_filter_test && ./db_compaction_filter_test
Reviewers: rven, anthony, yhchiang, sdong, aekmekji, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46707
Summary. A change https://reviews.facebook.net/differential/diff/224721/
Has attempted to move common functionality out of platform dependent
code to a new facility called file_reader_writer.
This includes:
- perf counters
- Buffering
- RateLimiting
However, the change did not attempt to refactor Windows code.
To mitigate, we introduce new quering interfaces such as UseOSBuffer(),
GetRequiredBufferAlignment() and ReaderWriterForward()
for pure forwarding where required.
Introduce WritableFile got a new method Truncate(). This is to communicate
to the file as to how much data it has on close.
- When space is pre-allocated on Linux it is filled with zeros implicitly,
no such thing exist on Windows so we must truncate file on close.
- When operating in unbuffered mode the last page is filled with zeros but we still want to truncate.
Previously, Close() would take care of it but now buffer management is shifted to the wrappers and the file has
no idea about the file true size.
This means that Close() on the wrapper level must always include
Truncate() as well as wrapper __dtor should call Close() and
against double Close().
Move buffered/unbuffered write logic to the wrapper.
Utilize Aligned buffer class.
Adjust tests and implement Truncate() where necessary.
Come up with reasonable defaults for new virtual interfaces.
Forward calls for RandomAccessReadAhead class to avoid double
buffering and locking (double locking in unbuffered mode on WIndows).
Summary:
The current build is failing on some platforms due to an __unused__ attribute.
This patch prevents the problem by using a pattern similar to MergeHelper
(assert not on the variable but inside a condition that uses the variable). We
should have better error handling in both cases in the future.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, sdong, igor, aekmekji
Reviewed By: aekmekji
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46623
Summary:
The test SuggestCompactRangeNoTwoLevel0Compactions in
DBCompactionTest fails when there are parallel L0-L1 compactions
taking place because the test makes sure that only one compaction
involving L0 takes place at any given time (since before having
parallel compactions this was impossible). I changed the test to only
run with DBOptions.max_subcompactions=1 so as to not hit this issue
which is not a correctness issue but just an inherent changing of
assumptions after introducing parallel compactions.
This failed after landing https://reviews.facebook.net/D43269#inline-321303
so now this should fix it
Test Plan: make all && make check
Reviewers: yhchiang, igor, anthony, noetzli, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46617
Summary:
Builder and CompactionJob share a lot of fairly complex code. This patch
refactors this code into a separate class, the CompactionIterator. Because the
shared code is fairly complex, this patch hopefully improves maintainability.
While there are is a lot of potential for further improvements, the patch is
intentionally pretty close to the original structure because the change is
already complex enough.
Test Plan: make clean all check && ./db_stress
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46197
Summary: ReadDroppedColumnFamily is consistently failing in Travis CI environment (can't repro locally). I suspect it might be failing with non-OK status. This diff will give us more info about the failure.
Test Plan: none
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46611
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.
The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.
These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.
Test Plan: make all && make check
Reviewers: yhchiang, anthony, igor, noetzli, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43269
Summary: The current code, considers data to be consistent if the record
checksum passes. We do have customer issues where the record checksum passed but
the data was incomprehensible. There is no way to get out of this error case
since all WAL recovery model will consider this error as unrelated to WAL.
Relaxing the definition and including errors while inserting to memtable as WAL
errors and handing them as per the recovery level.
Test Plan: Used customer dump to verify the fix for different level. The db
opens for kSkipAnyCorruptedRecords and kPointInTimeRecovery, but fails for
kAbsoluteConsistency and kTolerateCorruptedTailRecords.
Reviewers: sdon igor
CC: leveldb@
Task ID: #7918721
Blame Rev:
Summary: DBTest.ReadLatencyHistogramByLevel was not written as expected. After writes, reads aren't guaranteed to hit data written. It was not expected. Fix it.
Test Plan: Run the test multiple times
Reviewers: IslamAbdelRahman, rven, anthony, kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D46587
Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566
Test Plan:
make check
I will also verify that max_max_open_files is reasonable.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46551
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB.
Test Plan: unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46467
Summary:
This commit makes DBTest.OptimizeFiltersForHits more deterministic by:
(1) make key inserts more random
(2) make sure L0 has one file
(3) make file size smaller compared to level target so L1 will cover more range.
Test Plan: Run the test many times.
Reviewers: rven, IslamAbdelRahman, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D46461
Summary:
It looks like in some cases an assert in SeekInternal failed when computing the
hints for the next level because user_key was the same as the largest key and
not strictly smaller. Relaxing the assert to expect smaller or equal keys.
Test Plan: make clean all check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46443
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46233
Summary: When computing the hint for GetNextLevelIndex(), ForwardIterator was doing a redundant comparison. This patch fixes the comparison (using https://github.com/facebook/rocksdb/blob/master/db/version_set.cc#L158 as a reference) and moves it inside an assert because we expect `level_files[f_idx]` to contain the next key after Seek(), so user_key should always be smaller than the largest key.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: tnovak, sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46227
Summary:
This diff fixes a case when the forward iterator misses a new
insert when the mutable iterator is not current. The test is also
improved and the check for deleted iterators is made more informative.
Test Plan: DBTailingIteratorTest.*Trim
Reviewers: tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D46167
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).
Test Plan: make clean check all
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45993
Summary: TableCache::Get() puts parameters in the wrong places so that table readers created by Get() will not have the histogram updated.
Test Plan: Will write a unit test for that.
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D46035
Summary:
I noticed that memtable iterator usually crosses the `iterate_upper_bound`
threshold when tailing. Changes introduced in D43833 made `NeedToSeekImmutable`
always return true in such case, even when `Seek()` only needs to rewind the
memtable iterator. In a test I ran, this caused the "tailing efficiency"
(ratio of calls to `Seek()` that only affect the memtable versus all seeks)
to drop almost to zero.
This diff attempts to fix the regression by using a different flag to indicate
that `current_` is over the limit instead of resetting `valid_` in
`UpdateCurrent()`.
Test Plan: `DBTestTailingIterator.TailingIteratorUpperBound`
Reviewers: sdong, rven
Reviewed By: rven
Subscribers: dhruba, march
Differential Revision: https://reviews.facebook.net/D45909
Summary:
Flushes in listener_test happened to early when ROCKSDB_MALLOC_USABLE_SIZE was
active (e.g. when compiling with ROCKSDB_FBCODE_BUILD_WITH_481=1) due to
malloc_usable_size() reporting a better estimate (similar to
https://reviews.facebook.net/D43317 ). This patch grows the write buffer size
slightly to compensate for this.
Test Plan: ROCKSDB_FBCODE_BUILD_WITH_481=1 make listener_test && ./listener_test
Reviewers: rven, anthony, yhchiang, igor, sdong
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45921
Summary:
Pessimistic Transaction expiration time checking currently causes a performace regression, Lets disable it in db_bench by default.
Also, in order to be able to better tune how much contention we're simulating, added new optinos to set lock timeout and snapshot.
Test Plan: run db_bench randomtranansaction
Reviewers: sdong, igor, yhchiang, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45831
Summary:
Now that the approach to parallelizing L0-L1 level-based
compactions by breaking the compaction job into subcompactions is
being extended to apply to universal compactions as well, the unit
tests need to account for this and run the universal compaction
tests with subcompactions both enabled and disabled.
Test Plan: make all && make check
Reviewers: sdong, igor, noetzli, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45657
Summary: malloc_usable_size() gets a better estimation of memory usage. It is already used to calculate block cache memory usage. Use it in arena too.
Test Plan: Run all unit tests
Reviewers: anthony, kradhakrishnan, rven, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43317
Summary:
MarkLogsSynced() was doing `logs_.erase(it++);`. The standard is saying:
```
all iterators and references are invalidated, unless the erased members are at an end (front or back) of the deque (in which case only iterators and references to the erased members are invalidated)
```
Because `it` is an iterator to the first element of the container, it is
invalidated, only one iteration is executed and `log.getting_synced = false;`
is not being done, so `while (logs_.front().getting_synced)` in `WriteImpl()`
is not terminating.
Test Plan: make db_bench && ./db_bench --benchmarks=fillsync
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong, tnovak
Reviewed By: tnovak
Subscribers: kolmike, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45807
Summary:
Just realized that after D45675, part of the code in
DBTest.ApproximateMemoryUsage, does not really test anything anymore, so I
removed it.
Test Plan: make clean all check
Reviewers: rven, igor, sdong, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45783
Summary:
The immutable memtable iterators are allocated from an arena and there
is no benefit from deleting these. Also the immutable memtables
themselves will continue to be in memory until the version set
containing it is alive. We will not remove immutable memtable iterators
over the upper bound. We now add immutable iterators to the test.
Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext
Reviewers: tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45597
Summary: Add ZSTD compression type. The same way as adding LZ4.
Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.
Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45747
Summary:
This patch fixes two issues in DBTest.ApproximateMemoryUsage:
- It was possible that a flush happened between getting the two properties in
Phase 1, resulting in different numbers for the properties and failing the
assertion. This is fixed by waiting for the flush to finish before getting
the properties.
- There was a similar issue in Phase 2 and additionally there was an issue that
rocksdb.size-all-mem-tables was not monotonically increasing because it was
possible that a flush happened just after getting the properties and then
another flush just before getting the properties in the next round. In this
situation, the reported memory usage decreased. This is fixed by forcing a
flush before getting the properties.
Note: during testing, I found that kFlushesPerRound does not seem very
accurate. I added a TODO for this and it would be great to get some input on
what to do there.
Test Plan:
The first issue can be made more likely to trigger by inserting a
`usleep(10000);` between the calls to GetIntProperty() in Phase 1.
The second issue can be made more likely to trigger by inserting a
`if (r != 0) usleep(10000);` before the calls to GetIntProperty() and a
`usleep(10000);` after the calls.
Then execute make db_test && ./db_test --gtest_filter=DBTest.ApproximateMemoryUsage
Reviewers: rven, yhchiang, igor, sdong, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45675
Summary:
Add argument --show_table_properties to db_bench
-show_table_properties (If true, then per-level table properties will be
printed on every stats-interval when stats_interval is set and
stats_per_interval is on.) type: bool default: false
Test Plan:
./db_bench --show_table_properties=1 --stats_interval=100000 --stats_per_interval=1
./db_bench --show_table_properties=1 --stats_interval=100000 --stats_per_interval=1 --num_column_families=2
Sample Output:
Compaction Stats [column_family_name_000001]
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt) KeyIn KeyDrop
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 3/0 5 0.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 86.3 0 17 0.021 0 0 0
L1 5/0 9 0.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0 0
L2 9/0 16 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0 0
Sum 17/0 31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 86.3 0 17 0.021 0 0 0
Int 0/0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 83.9 0 2 0.022 0 0 0
Flush(GB): cumulative 0.030, interval 0.004
Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard
Level[0]: # data blocks=2571; # entries=84813; raw key size=2035512; raw average key size=24.000000; raw value size=8481300; raw average value size=100.000000; data block size=5690119; index block size=82415; filter block size=0; (estimated) table size=5772534; filter policy name=N/A;
Level[1]: # data blocks=4285; # entries=141355; raw key size=3392520; raw average key size=24.000000; raw value size=14135500; raw average value size=100.000000; data block size=9487353; index block size=137377; filter block size=0; (estimated) table size=9624730; filter policy name=N/A;
Level[2]: # data blocks=7713; # entries=254439; raw key size=6106536; raw average key size=24.000000; raw value size=25443900; raw average value size=100.000000; data block size=17077893; index block size=247269; filter block size=0; (estimated) table size=17325162; filter policy name=N/A;
Level[3]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
Level[4]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
Level[5]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
Level[6]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
Reviewers: anthony, IslamAbdelRahman, MarkCallaghan, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45651
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.
We add ReadaheadRandomAccessFile layer only when file is read during compactions.
D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.
Test Plan: make check
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45123
Summary:
When DBIter changes iterating direction from forward to backward, it might see some much larger keys with higher sequence ID. With this commit, these rows will be actively filtered out. It should fix existing disabled tests in db_iter_test.
This may not be a perfect fix, but it introduces least impact on existing codes, in order to be safe.
Test Plan:
Enable existing tests and make sure they pass. Add a new test DBIterWithMergeIterTest.InnerMergeIteratorDataRace8.
Also run all existing tests.
Reviewers: yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45567
Summary:
DBTest.GetProperty was failing occasionally (see task #8131266). The reason was
that the test closed the database before the compaction was done. When the test
reopened the database, RocksDB would schedule a compaction which in turn
created table readers and lead the test to fail the assertion that
rocksdb.estimate-table-readers-mem is 0. In most cases, GetIntProperty() of
rocksdb.estimate-table-readers-mem happened before the compaction created the
table readers, hiding the problem. This patch changes the
WaitForFlushMemTable() to WaitForCompact(). WaitForFlushMemTable() is not
necessary because it is already being called a couple of lines before without
any insertions in-between.
Test Plan:
Insert `usleep(10000);` just after `Reopen(options);` on line 2333 to make the issue more likely, then run:
make db_test && while ./db_test --gtest_filter=DBTest.GetProperty; do true; done
Reviewers: rven, yhchiang, anthony, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45603
Summary:
It was pointed out to me that the members of SubCompactionState
'earliest_snapshot', 'latest_snapshot' and 'visible_at_tip' are never
modified by the subcompactions, so they can stay as global varaibles
instead to make things simpler.
Test Plan: make all && make check
Reviewers: sdong, igor, noetzli, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45477
Summary: There was a bad merge during refresh.
Test Plan: make -j all; make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45555