Summary:
This exposes a transactions state through a public api rather than through a public member variable. I also do some name refactoring.
ExecutionStatus => TransactionState
exec_status_ => trx_state_
Test Plan: It compiles and transaction_test passes.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, mung, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D64689
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.
```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628
RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```
Test Plan: Two unit tests added.
Reviewers: sdong, yiwu, IslamAbdelRahman
Reviewed By: yiwu
Subscribers: hermanlee4, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D64599
* env_mirror: fix leak from LockFile
Signed-off-by: Sage Weil <sage@redhat.com>
* env_mirror: instruct EnvMirror whether mirrored Envs should be destroyed
The lifecycle rules for Env are frustrating and undocumented. Notably,
Env::Default() should *not* be freed, but any Env instances we created
should be.
Explicitly instruct EnvMirror whether to clean up child Env instances.
Default to false so that we do not affect existing callers.
Signed-off-by: Sage Weil <sage@redhat.com>
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: andrewkr, yhchiang, yiwu, sdong
Reviewed By: sdong
Subscribers: hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62523
Summary:
Currently there is no mechanism to create persistent cache from
headers. Adding a simple factory method to create a simple persistent cache with
default or NVM optimized settings.
note: Any idea to test this factory is appreciated.
Test Plan: None
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64527
Summary:
This diff does 3 things:
Expose TransactionID so that we can identify transactions when we retrieve locking and lock wait information. This is exposed as `Transaction::GetID`.
Expose lock state information by locking all stripes in all column families and copying their contents to a data structure. This is exposed as `TransactionDB::GetLockStatusData`.
Adds support for tracking the transaction and the key being waited on, and exposes this as `Transaction::GetWaitingTxn`.
Test Plan: unit tests
Reviewers: horuff, sdong
Reviewed By: sdong
Subscribers: vasilep, hermanlee4, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64413
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary: add ColumnFamilyHandleDeletionStarted listener which can be called when user deletes handler.
Test Plan: ./listener_test
Reviewers: yiwu, IslamAbdelRahman, sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60717
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file
I will implement it for other Envs later after finishing the whole task for AddFile()
Test Plan: unit tests
Reviewers: andrewkr, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62433
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).
* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.
Test Plan: make all check
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62193
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.
We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification
Test Plan: added new tests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: yiwu, leveldb, march, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58707
Summary:
Make sure prefix extractor name is stored in SST files and if DB is opened with a prefix extractor of a different name, prefix bloom is skipped when read the file.
Also add unit tests for that.
Test Plan:
before change:
```
Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BlockBasedTableTest
[ RUN ] BlockBasedTableTest.SkipPrefixBloomFilter
table/table_test.cc:1421: Failure
Value of: db_iter->Valid()
Actual: false
Expected: true
[ FAILED ] BlockBasedTableTest.SkipPrefixBloomFilter (1 ms)
[----------] 1 test from BlockBasedTableTest (1 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] BlockBasedTableTest.SkipPrefixBloomFilter
1 FAILED TEST
```
after:
```
Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BlockBasedTableTest
[ RUN ] BlockBasedTableTest.SkipPrefixBloomFilter
[ OK ] BlockBasedTableTest.SkipPrefixBloomFilter (0 ms)
[----------] 1 test from BlockBasedTableTest (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[ PASSED ] 1 test.
```
Reviewers: sdong, andrewkr, yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61215
Summary: as title
Test Plan: make all check
Reviewers: yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62493
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)
This allow us to expose ThreadPool to the user so we can use it as an option later
Test Plan: existing unit tests
Reviewers: andrewkr, yiwu, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62085
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.
Depends on D61977.
Test Plan: See unit test.
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D62241
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.
Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.
In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.
Test Plan: unit test.
Reviewers: IslamAbdelRahman, sdong, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D61977
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader
Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )
Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61473
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.
Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.
Reviewers: kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61647
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).
Test Plan: Tested the statistics using write_stress and compaction_job_stats_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61749
Summary:
Add API to WriteBatch to store range deletions in its buffer
which are later added to memtable. In the WriteBatch buffer, a range
deletion is encoded as "<optype><CF ID (optional)><begin key><end key>".
With this diff, the range tombstones are stored inline with the data in
the memtable. It's useful for now because the test cases rely on the
data being accessible via memtable. My next step is to store range
tombstones in a separate area in the memtable.
Test Plan: unit tests
Reviewers: IslamAbdelRahman, sdong, wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61401
Summary: preparation for detecting Cache type. If SimCache, we then may trigger some command like "setSimCapacity()" with setOptions()
Test Plan: make all check
Reviewers: yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61953
Summary:
This diff improves the documentation for GetOptionsFromMap APIs and
fixes a bug in GetOptionsFromMap functions in convenience.h
where new_options will still be changed when the function
call is not successful.
Test Plan: options_test
Reviewers: IslamAbdelRahman, kradhakrishnan, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61731
Summary: make transactionDB working with StackableDB
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60705
Summary:
add SIM_BLOCK_CACHE_HIT and SIM_BLOCK_CACHE_MISS tickers.
maybe can be combined with Histograms like DB_GET to evaluate the current setting of the size of block cache.
Test Plan: make all check
Reviewers: sdong, andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61803
Summary: RocksDB 4.11 has been branched out. Now need to move to 4.12.
Test Plan: N/A
Reviewers: kradhakrishnan, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61845
* Added check_snapshot option in the DB's AddFile function
* change check_snapshot to skip_snapshot_check
* add unit test for skip_snapshot_check
* Add skip_snapshot_check comment
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).
Test Plan: Add `date_tiered_test`.
Reviewers: dhruba, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61653
Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options.
Test Plan: Add unit tests for it.
Reviewers: andrewkr, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60783
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.
Test Plan: make all check -j64
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61239
Summary: MyRocks is adding support for the user of the SstFileWriter which needs a comparator. It would be more convenient to get the comparator from the column family (which already has to have it) than to have caller keep track of it.
Test Plan: Standard tests (adding one for the new method)
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61155
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60513
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)
Test Plan: added new unit test
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61083
Summary: Bump master version to the next potential release version of 4.11
Test Plan: Compile
Reviewers: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60993
EnvLibrados is a customized RocksDB Env to use RADOS as the backend file system of RocksDB. It overrides all file system related API of default Env. The easiest way to use it is just like following:
std::string db_name = "test_db";
std::string config_path = "path/to/ceph/config";
DB* db;
Options options;
options.env = EnvLibrados(db_name, config_path);
Status s = DB::Open(options, kDBPath, &db);
Then EnvLibrados will forward all file read/write operation to the RADOS cluster assigned by config_path. Default pool is db_name+"_pool".
There are some options that users could set for EnvLibrados.
- write_buffer_size. This variable is the max buffer size for WritableFile. After reaching the buffer_max_size, EnvLibrados will sync buffer content to RADOS, then clear buffer.
- db_pool. Rather than using default pool, users could set their own db pool name
- wal_dir. The dir for WAL files. Because RocksDB only has 2-level structure (dir_name/file_name), the format of wal_dir is "/dir_name"(CAN'T be "/dir1/dir2"). Default wal_dir is "/wal".
- wal_pool. Corresponding pool name for WAL files. Default value is db_name+"_wal_pool"
The example of setting options looks like following:
db_name = "test_db";
db_pool = db_name+"_pool";
wal_dir = "/wal";
wal_pool = db_name+"_wal_pool";
write_buffer_size = 1 << 20;
env_ = new EnvLibrados(db_name, config, db_pool, wal_dir, wal_pool, write_buffer_size);
DB* db;
Options options;
options.env = env_;
// The last level dir name should match the dir name in prefix_pool_map
options.wal_dir = "/tmp/wal";
// open DB
Status s = DB::Open(options, kDBPath, &db);
Librados is required to compile EnvLibrados. Then use "$make LIBRADOS=1" to compile RocksDB. If you want to only compile EnvLibrados test, just run "$ make env_librados_test LIBRADOS=1". To run env_librados_test, you need to have a running RADOS cluster with the configure file located in "../ceph/src/ceph.conf" related to "rocksdb/".
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).
Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59913
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.
Test Plan:
make check -j64
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59655
Summary: If options.write_buffer_size is not set, nor options.write_buffer_manager, no need to update the bytes allocated counter in MemTableAllocator, which is expensive in parallel memtable insert case. Remove it can improve parallel memtable insert throughput by 10% with write batch size 128.
Test Plan:
Run benchmarks
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=32 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield
The throughput grows 10% with the benchmark.
Reviewers: andrewkr, yiwu, IslamAbdelRahman, igor, ngbronson
Reviewed By: ngbronson
Subscribers: ngbronson, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60465
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60765
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).
Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held
Reviewers: andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58587
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
Summary:
Overload RepairDB to take vector-of-ColumnFamilyDescriptor, which tells
us CF name + options. Also takes a ColumnFamilyOptions for unspecified column
families encountered during the repair.
One potentially confusing thing is that we store options in the constructor and
don't invoke AddColumnFamily() until discovering the CF in ScanTable. This is
because we don't know the CF ID until we find a table belonging to that CF.
Depends on D59781.
Test Plan:
$ ./repair_test
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59853
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.
Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59499
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level
Test Plan: unit tests
Reviewers: jkedgar, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D59637
Summary: filter_deltes is not a frequently used feature. Remove it.
Test Plan: Run all test suites.
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59427
Summary: char is not signed in some platforms. Having negative values confuse those compilers.
Test Plan: Run all existing tests.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59619
Summary: allow_concurrent_memtable_write and enable_write_thread_adaptive_yield went through extensive stress test and benchmarks. Should be considered stable.
Test Plan: N/A
Reviewers: ngbronson, yhchiang, IslamAbdelRahman, dhruba
Reviewed By: dhruba
Subscribers: MarkCallaghan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59415
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.
Test Plan: See unit test.
Reviewers: dhruba, horuff, sdong
Reviewed By: sdong
Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57813
Summary:
Try to decompress compressed blocks when a special flag is set.
assert and crash in debug builds if we can't decompress the just-compressed input.
Test Plan: Run unit-tests.
Reviewers: dhruba, andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59145
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio
Test Plan: Run all existing tests
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59199
Summary: AFIK, options builder is not used by anyone. Remove it.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, andrewkr, igor
Reviewed By: igor
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59319
Summary:
Rocksdb backup and restore rate limiting is currently done per backup/restore.
So, it is difficult to control rate across multiple backup/restores. With this
change, a throttler can be provided. If a throttler is provided, it is used.
Otherwise, a new throttler is created based on the actual rate limits specified
in the options.
Test Plan: Added unit tests
Reviewers: ldemailly, andrewkr, sdong
Reviewed By: andrewkr
Subscribers: igor, yiwu, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56265
Summary: With `table_options.cache_index_and_filter_blocks = true`, index and filter blocks are stored in block cache. Then people are curious how much of the block cache total size is used by indexes and bloom filters. It will be nice we have a way to report that. It can help people tune performance and plan for optimized hardware setting. We add several enum values for db Statistics. BLOCK_CACHE_INDEX/FILTER_BYTES_INSERT - BLOCK_CACHE_INDEX/FILTER_BYTES_ERASE = current INDEX/FILTER total block size in bytes.
Test Plan:
write a test case called `DBBlockCacheTest.IndexAndFilterBlocksStats`. The result is:
```
[gzh@dev9927.prn1 ~/local/rocksdb] make db_block_cache_test -j64 && ./db_block_cache_test --gtest_filter=DBBlockCacheTest.IndexAndFilterBlocksStats
Makefile:101: Warning: Compiling in debug mode. Don't use the resulting binary in production
GEN util/build_version.cc
make: `db_block_cache_test' is up to date.
Note: Google Test filter = DBBlockCacheTest.IndexAndFilterBlocksStats
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBBlockCacheTest
[ RUN ] DBBlockCacheTest.IndexAndFilterBlocksStats
[ OK ] DBBlockCacheTest.IndexAndFilterBlocksStats (689 ms)
[----------] 1 test from DBBlockCacheTest (689 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (689 ms total)
[ PASSED ] 1 test.
```
Reviewers: IslamAbdelRahman, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58677
Summary:
This enables configurable Envs without recompiling. For example, my
next diff will make env_test test an Env created by NewEnvFromUri(). Then,
users can determine which Env is tested simply by providing the URI for
NewEnvFromUri() (e.g., through a CLI argument or environment variable).
The registration process allows us to register any Env that is linked with the
RocksDB library, so we can register our internal Envs as well.
The registration code is inspired by our internal InitRegistry.
Test Plan: new unit test
Reviewers: IslamAbdelRahman, lightmark, ldemailly, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D58449
* Create a callback for memtable becoming immutable
Create a callback for memtable becoming immutable
Create a callback for memtable becoming immutable
moved notification outside the lock
Move sealed notification to unlocked portion of SwitchMemtable
* fix lite build
Summary: The function wrapper for LDBCommand::SelectCommand is too long so that Windows build fails with warning "decorated name length exceeded, name was truncated". Shrink the length by using a struct.
Test Plan: Build on both of Linux and Windows and make sure the warning doesn't show in either platform.
Reviewers: andrewkr, adsharma, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58965
Summary:
Add under and over limits for rocksdb::PerfLevel enum
to allow us to do boundary checks before casting ints or unints
to this enum.
Test Plan: make all check -j32
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58521
Summary: I don't think util/mutable_cf_options.h is needed inside include/rocksdb/sst_file_writer.h and it is not allowed. Remove it.
Test Plan: Run all tests
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: lightmark, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58827
Summary:
This patch adds direct IO capability to RocksDB Env.
The direct IO capability is required for persistent cache since NVM is best
accessed as 4K direct IO. SSDs can leverage direct IO for reading.
Direct IO requires the offset and size be sector size aligned, and memory to
be kernel page aligned. Since neither RocksDB/Persistent read cache data
layout is aligned to sector size, the code can accommodate reading unaligned IO size
(or unaligned memory) at the cost of an alloc/copy.
The write code path expects the size and memory to be aligned.
Test Plan: Run RocksDB unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57393
Summary: Deprecate this one option and delete code and tests that are now superfluous.
Test Plan: all tests pass
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: msalib, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55317
Summary:
A couple of notes from the diff:
- The namespace block I added at the top of table_properties_collector.cc was in reaction to an issue i was having with PutVarint64 and reusing the "val" string. I'm not sure this is the cleanest way of doing this, but abstracting this out at least results in the correct behavior.
- I chose "rocksdb.merge.operands" as the property name. I am open to suggestions for better names.
- The change to sst_dump_tool.cc seems a bit inelegant to me. Is there a better way to do the if-else block?
Test Plan:
I added a test case in table_properties_collector_test.cc. It adds two merge operands and checks to make sure that both of them are reflected by GetMergeOperands. It also checks to make sure the wasPropertyPresent bool is properly set in the method.
Running both of these tests should pass:
./table_properties_collector_test
./sst_dump_test
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58119
Summary:
Expose a simple function to convert CompressionType to it's corresponding option string
This is for a diff @yoshinorim is working on for MyRocks
Test Plan: unittest
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D58215
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.
RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode.
Blame Rev:
Test Plan: Run unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55707
Summary:
So a customized ldb tool can pass it's own Selector.
Such a selector is expected to call LDBCommand::SelectCommand
and then add some of its own customized commands
Test Plan: make ldb
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57249
Summary: Store SST file compression algorithm as a TableProperty.
Test Plan: Modified and ran the table_test UT that checks for TableProperties
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: lgalanis, andrewkr, dhruba, IslamAbdelRahman
Differential Revision: https://reviews.facebook.net/D58017
Summary:
The implementation remains where it is. Only the
header is exported. This is so that a customized
ldb tool can print help along with its own
extra commands
Test Plan: make ldb
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57255
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.
modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
- log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.
- Recovery/writepath is now aware of prepared batches and commit and rollback markers.
Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.
Reviewers: IslamAbdelRahman, sdong
Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D56919