Summary:
Make blob_dump tool able to show uncompressed values if the blob file is compressed. Also show total compressed vs. raw size at the end if --show_summary is provided.
Closes https://github.com/facebook/rocksdb/pull/3633
Differential Revision: D7348926
Pulled By: yiwu-arbug
fbshipit-source-id: ca709cb4ed5cf6a550ff2987df8033df81516f8e
Summary:
This change models Optimistic Tx db after Pessimistic TX db. The motivation for this change is to make the ptr polymorphic so it can be held by the same raw or smart ptr.
Currently, due to the inheritance of the Opt Tx db not being rooted in the manner of Pess Tx from a single DB root it is more difficult to write clean code and have clear ownership of the database in cases when options dictate instantiate of plan DB, Pess Tx DB or Opt tx db.
Closes https://github.com/facebook/rocksdb/pull/3566
Differential Revision: D7184502
Pulled By: yiwu-arbug
fbshipit-source-id: 31d06efafd79497bb0c230e971857dba3bd962c3
Summary:
The is an optimization to reduce lookup in the CommitCache when querying IsInSnapshot. The optimization takes the smallest uncommitted data at the time that the snapshot was taken and if the sequence number of the read data is lower than that number it assumes the data as committed.
To implement this optimization two changes are required: i) The AddPrepared function must be called sequentially to avoid out of order insertion in the PrepareHeap (otherwise the top of the heap does not indicate the smallest prepare in future too), ii) non-2PC transactions also call AddPrepared if they do not commit in one step.
Closes https://github.com/facebook/rocksdb/pull/3649
Differential Revision: D7388630
Pulled By: maysamyabandeh
fbshipit-source-id: b79506238c17467d590763582960d4d90181c600
Summary:
Currently if the CommitTimeWriteBatch is set to be used only as a state that is required only for recovery , the user cannot see that in DB until it is restarted. This while the state is already inserted into the DB after the memtable flush. It would be useful for debugging if make this state visible to the user after the flush by committing it. The patch does it by a invoking a callback that does the commit on the recoverable state.
Closes https://github.com/facebook/rocksdb/pull/3661
Differential Revision: D7424577
Pulled By: maysamyabandeh
fbshipit-source-id: 137f9408662f0853938b33fa440f27f04c1bbf5c
Summary:
Current commit cache size is 2^21. This was due to a type. With 2^23 commit entries we can have transactions as long as 64s without incurring the cost of having them evicted from the commit cache before their commit. Here is the math:
2^23 / 2 (one out of two seq numbers are for commit) / 2^16 TPS = 2^6 = 64s
Closes https://github.com/facebook/rocksdb/pull/3657
Differential Revision: D7411211
Pulled By: maysamyabandeh
fbshipit-source-id: e7cacf40579f3acf940643d8a1cfe5dd201caa35
Summary:
Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if two_write_queues_ option is set. The patch fixes the problem by i) skip log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protects log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not.
Fixes#3599
Closes https://github.com/facebook/rocksdb/pull/3656
Differential Revision: D7405608
Pulled By: maysamyabandeh
fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934
Summary:
Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches.
Closes https://github.com/facebook/rocksdb/pull/3651
Differential Revision: D7388635
Pulled By: maysamyabandeh
fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4
Summary:
This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and yet use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue.
Closes https://github.com/facebook/rocksdb/pull/3641
Differential Revision: D7361508
Pulled By: maysamyabandeh
fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7
Summary:
The 10MB buffer in BackupEngineImpl::BackupMeta::StoreToFile can be corrupted with a large number of files. Added a check to determine current buffer length and append data to file if buffer becomes full.
Resolves https://github.com/facebook/rocksdb/issues/3228
Closes https://github.com/facebook/rocksdb/pull/3636
Differential Revision: D7354160
Pulled By: ajkr
fbshipit-source-id: eec12d38095a0d17551a4aaee52b99d30a555722
Summary:
* Fix BlobDBImpl::GCFileAndUpdateLSM doesn't close the new file, and the new file will not be able to be garbage collected later.
* Fix BlobDBImpl::GCFileAndUpdateLSM doesn't copy over metadata from old file to new file.
Closes https://github.com/facebook/rocksdb/pull/3639
Differential Revision: D7355092
Pulled By: yiwu-arbug
fbshipit-source-id: 4fa3594ac5ce376bed1af04a545c532cfc0088c4
Summary:
`Writer::WriteBuffer` was always called at the beginning of checkpoint/backup. But that log writer has no internal synchronization, which meant the same buffer could be flushed twice in a race condition case, causing a WAL entry to be duplicated. Then subsequent WAL entries would be at unexpected offsets, causing the 32KB block boundaries to be overlapped and manifesting as a corruption.
This PR fixes the behavior to only use `WriteBuffer` (via `FlushWAL`) in checkpoint/backup when manual WAL flush is enabled. In that case, users are responsible for providing synchronization between WAL flushes. We can also consider removing the call entirely.
Closes https://github.com/facebook/rocksdb/pull/3603
Differential Revision: D7277447
Pulled By: ajkr
fbshipit-source-id: 1b15bd7fd930511222b075418c10de0aaa70a35a
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.
Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449
Differential Revision: D7016941
Pulled By: amytai
fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
Summary:
This patch addressed several issues.
Portability including db_test std::thread -> port::Thread Cc: @
and %z to ROCKSDB portable macro. Cc: maysamyabandeh
Implement Env::AreFilesSame
Make the implementation of file unique number more robust
Get rid of C-runtime and go directly to Windows API when dealing
with file primitives.
Implement GetSectorSize() and aling unbuffered read on the value if
available.
Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976
Fix test running script issue where $status var was of incorrect scope
so the failures were swallowed and not reported.
DestroyDB() creates a logger and opens a LOG file in the directory
being cleaned up. This holds a lock on the folder and the cleanup is
prevented. This fails one of the checkpoin tests. We observe the same in production.
We close the log file in this change.
Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
attempts to open a directory with NewRandomAccessFile which does not
work on Windows.
Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552
Differential Revision: D7156304
Pulled By: siying
fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
Summary:
Improving blob db FIFO eviction with the following changes,
* Change blob_dir_size to max_db_size. Take into account SST file size when computing DB size.
* FIFO now only take into account live sst files and live blob files. It is normal for disk usage to go over max_db_size because there are obsolete sst files and blob files pending deletion.
* FIFO eviction now also evict TTL blob files that's still open. It doesn't evict non-TTL blob files.
* If FIFO is triggered, it will pass an expiration and the current sequence number to compaction filter. Compaction filter will then filter inlined keys to evict those with an earlier expiration and smaller sequence number. So call LSM FIFO.
* Compaction filter also filter those blob indexes where corresponding blob file is gone.
* Add an event listener to listen compaction/flush event and update sst file size.
* Implement DB::Close() to make sure base db, as well as event listener and compaction filter, destruct before blob db.
* More blob db statistics around FIFO.
* Fix some locking issue when accessing a blob file.
Closes https://github.com/facebook/rocksdb/pull/3556
Differential Revision: D7139328
Pulled By: yiwu-arbug
fbshipit-source-id: ea5edb07b33dfceacb2682f4789bea61de28bbfa
Summary:
Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test.
Closes https://github.com/facebook/rocksdb/pull/3567
Differential Revision: D7163268
Pulled By: maysamyabandeh
fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778
Summary:
Fix the following bugs:
- During recovery a duplicate key was inserted twice into the write batch of the recovery transaction,
once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing.
- If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number.
- SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected.
Closes https://github.com/facebook/rocksdb/pull/3562
Differential Revision: D7149440
Pulled By: maysamyabandeh
fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.
CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551
Differential Revision: D7130190
Pulled By: yiwu-arbug
fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
Summary:
The zeroed entries were not removed from prepared_section_completed_ map. This patch adds a unit test to show the problem and fixes that by refactoring the code. The new code is more efficient since i) it uses two separate mutex to avoid contention between commit and prepare threads, ii) it uses a sorted vector for maintaining uniq log entires with prepare which avoids a very large heap with many duplicate entries.
Closes https://github.com/facebook/rocksdb/pull/3545
Differential Revision: D7106071
Pulled By: maysamyabandeh
fbshipit-source-id: b3ae17cb6cd37ef10b6b35e0086c15c758768a48
Summary:
Make use of the index in WriteBatchWithIndex to also count the number of sub-batches. This eliminates the need to separately scan the batch to count the number of sub-batches once a duplicate key is detected.
Closes https://github.com/facebook/rocksdb/pull/3529
Differential Revision: D7049947
Pulled By: maysamyabandeh
fbshipit-source-id: 81cbf12c4e662541c772c7265a8f91631e25c7cd
Summary:
Use the rsync tempfile naming convention in our `BackupEngine`. The temp file follows the format, `.<filename>.<suffix>`, which is later renamed to `<filename>`. We fix `tmp` as the `<suffix>` as we don't need to use random bytes for now. The benefit is gluster treats this tempfile naming convention specially and applies hashing only to `<filename>`, so the file won't need to be linked or moved when it's renamed. Our gluster team suggested this will make things operationally easier.
Closes https://github.com/facebook/rocksdb/pull/3463
Differential Revision: D6893333
Pulled By: ajkr
fbshipit-source-id: fd7622978f4b2487fce33cde40dd3124f16bcaa8
Summary:
Under a certain sequence of accessing PreparedHeap, there was a bug that would not successfully empty the heap. This would result in performance issues when the heap content is moved to old_prepared_ after max_evicted_seq_ advances the orphan prepared sequence numbers. The patch fixed the bug and add more unit tests. It also does more logging when the unlikely scenarios are faced
Closes https://github.com/facebook/rocksdb/pull/3526
Differential Revision: D7038486
Pulled By: maysamyabandeh
fbshipit-source-id: f1e40bea558f67b03d2a29131fcb8734c65fce97
Summary:
In case it is found that a key is already marked as locked in a
stripe's map of locked keys, it is not necessary to look it up
again using `std::unordered_map<std::string, ...>::at(size_t)`.
Instead, we can use the already found position using the iterator
produced by the previous `find` operation. Reusing the iterator
will avoid having to hash the key again and do additional "random"
memory lookups in the map of keys (though the data will very
likely sit available in caches here already due to the previous
find operation)
Closes https://github.com/facebook/rocksdb/pull/3505
Differential Revision: D7036446
Pulled By: sagar0
fbshipit-source-id: cced51547b2bd2d49394f6bc8c5896f09fa80f68
Summary:
- made `CreateCheckpoint` properly return `InvalidArgument` when called with an empty directory. Previously it triggered an assertion failure due to a bug in the logic.
- made `ldb` set empty `checkpoint_dir` if that's what the user specifies, so that we can use it to properly test `CreateCheckpoint` in the future.
Differential Revision: D6874562
fbshipit-source-id: dcc1bd41768261d9338987fa7711444289707ed7
Summary:
Add a static cast to perform the left shift as with an unsigned type.
make ubsan_check
Closes https://github.com/facebook/rocksdb/pull/3517
Reviewed By: sagar0
Differential Revision: D7016044
Pulled By: igorsugak
fbshipit-source-id: baf72f6197edd8f7220d010b15a23d6de6a72c49
Summary:
These are optimization that we applied to improve sysbech's update_noindex performance.
1. Make use of LIKELY compiler hint
2. Move std::atomic so the subclass
3. Make use of skip_prepared in non-2pc transactions.
Closes https://github.com/facebook/rocksdb/pull/3512
Differential Revision: D7000075
Pulled By: maysamyabandeh
fbshipit-source-id: 1ab8292584df1f6305a4992973fb1b7933632181
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507
Differential Revision: D7004679
Pulled By: sagar0
fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
Summary:
Somehow the indentation was incorrect in this file.
The only change in this PR is to get it right again in order to make the code more readable.
Please reject if you think it's not worth it.
Closes https://github.com/facebook/rocksdb/pull/3504
Differential Revision: D6996011
Pulled By: miasantreble
fbshipit-source-id: 060514a3a8c910d34bad795b36eb4d278512b154
Summary:
TransactionDB::Write can receive some optimization hints from the user. One is to skip the concurrency control mechanism. WritePreparedTxnDB is currently ignoring such hints. This patch optimizes WritePreparedTxnDB::Write for skip_concurrency_control and skip_duplicate_key_check hints.
Closes https://github.com/facebook/rocksdb/pull/3496
Differential Revision: D6971784
Pulled By: maysamyabandeh
fbshipit-source-id: cbab10ad538fa2b8bcb47e37c77724afe6e30f03
Summary:
WritePreparedTransactionTest has the UBSAN error because the wrong order of its parent class construction. Fix it.
Closes https://github.com/facebook/rocksdb/pull/3478
Differential Revision: D6928975
Pulled By: siying
fbshipit-source-id: 13edfd5cb9cf73f1ac5ae3b6f53061d32783733d
Summary:
Compared to DB::Write, TransactionDB::Write has the additional overhead of creating and initializing an internal transaction object, as well as the overhead of locking/unlocking the keys. This patch extends the TransactionDB::Write with an skip_cc option to allow the users to indicate that the write batch do not conflict with others and the concurrency control and its overhead can be skipped. TransactionDB::Write by default calls DB::Write when skip_cc is set, which works for WriteCommitted WritePolicy. Any other flavor of TransactionDB that is not compatible with this default behavior (such as WritePreparedTxnDB) can extend ::Write and implement their own approach for taking into account the skip_cc optimization.
Closes https://github.com/facebook/rocksdb/pull/3457
Differential Revision: D6877318
Pulled By: maysamyabandeh
fbshipit-source-id: 56f4e21db87ff71492db4e376fb7c2b03dfeab6b
Summary:
Deletes the transaction object at the end of the test.
Verified by:
- COMPILE_WITH_ASAN=1 make -j32 transaction_test
- ./transaction_test --gtest_filter="DBA**Duplicate*"
Closes https://github.com/facebook/rocksdb/pull/3470
Differential Revision: D6916473
Pulled By: maysamyabandeh
fbshipit-source-id: 8303df25408635d5d3ac2b25f309a3d15957c937
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455
Differential Revision: D6873699
Pulled By: maysamyabandeh
fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
Summary:
Without this patch, ubsan_check is currently failing with this error:
```
utilities/transactions/write_prepared_transaction_test.cc:369:63: runtime error: member call on address 0x0000051649f8 which does not point to an object of type 'WithParamInterface'
0x0000051649f8: note: object has invalid vptr
```
Tested by `COMPILE_WITH_UBSAN=1 make -j32 transaction_test` and running `./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccessTest1/0`
Closes https://github.com/facebook/rocksdb/pull/3444
Differential Revision: D6850087
Pulled By: maysamyabandeh
fbshipit-source-id: 5b254da8504b8757f7aec8a820ad464154da1a1d
Summary:
previously if `checkpoint_dir` contained a trailing slash, we'd attempt to create the `.tmp` directory under `checkpoint_dir` due to simply concatenating `checkpoint_dir + ".tmp"`. This failed because `checkpoint_dir` hadn't been created yet and our directory creation is non-recursive. This PR fixes the issue by always creating the `.tmp` directory in the same parent as `checkpoint_dir` by stripping trailing slashes before concatenating.
Closes https://github.com/facebook/rocksdb/pull/3275
Differential Revision: D6574952
Pulled By: ajkr
fbshipit-source-id: a6daa6777a901eac2460cd0140c9515f7241aefc
Summary:
SnapshotConcurrentAccessTest sometimes times out when running on the test infra. This patch splits the test into smaller sub-tests to avoid the timeout. It also benefits from lower run-time of each sub-test and increases the coverage of the test. The overall run-time of each final sub-test is at most half of the original test so we should no longer see a timeout.
Closes https://github.com/facebook/rocksdb/pull/3435
Differential Revision: D6839427
Pulled By: maysamyabandeh
fbshipit-source-id: d53fdb157109e2438ca7fe447d0cf4b71f304bd8
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues.
Reviewed By: yiwu-arbug
Differential Revision: D6821806
fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
Summary:
When blob_files is empty, std::min_element will return blobfiles.end(), which cannot be dereference. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3387
Differential Revision: D6764927
Pulled By: yiwu-arbug
fbshipit-source-id: 86f78700132be95760d35ac63480dfd3a8bbe17a
Summary:
We have seen cases where it could be good to change TTL on already open DB.
Change ttl in TtlCompactionFilterFactory on open db.
Next time a filter is created, it will filter accroding to the set TTL.
Is this something that could be useful for others?
Any downsides?
Closes https://github.com/facebook/rocksdb/pull/3292
Differential Revision: D6731993
Pulled By: miasantreble
fbshipit-source-id: 73b94d69237b11e8730734389052429d621a6b1e
Summary:
- Change directory name from "db_test" to "checkpoint_test". Previously it used the same directory as `db_test`
- Systematically cleanup snapshot and snapshot staging directories before each test. Previously a failed test run caused subsequent runs to fail, particularly when the first failure caused "snapshot.tmp" to not be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/3351
Differential Revision: D6691015
Pulled By: ajkr
fbshipit-source-id: 4fc2ac2e21ff2617ea0e96297c5132b5f2eefd79
Summary:
This patch addresses a couple of minor TODOs for WritePrepared Txn such as double checking some assert statements at runtime as well, skip extra AddPrepared in non-2pc transactions, and safety check for infinite loops.
Closes https://github.com/facebook/rocksdb/pull/3302
Differential Revision: D6617002
Pulled By: maysamyabandeh
fbshipit-source-id: ef6673c139cb49f64c0879508d2f573b78609aca
Summary:
Previously on a blob db read, we are making a read of the blob value, and then make another read to get CRC checksum. I'm combining the two read into one.
readrandom db_bench with 1G database with base db size of 13M, value size 1k:
`./db_bench --db=/home/yiwu/tmp/db_bench --use_blob_db --value_size=1024 --num=1000000 --benchmarks=readrandom --use_existing_db --cache_size=32000000`
master: throughput 234MB/s, get micros p50 5.984 p95 9.998 p99 20.817 p100 787
this PR: throughput 261MB/s, get micros p50 5.157 p95 9.928 p99 20.724 p100 190
Closes https://github.com/facebook/rocksdb/pull/3301
Differential Revision: D6615950
Pulled By: yiwu-arbug
fbshipit-source-id: 052410c6d8539ec0cc305d53793bbc8f3616baa3
Summary:
DestroyDB that is used in tests loops over the files returned by ::GetChildren and delete them one by one. Such files might be already deleted in the file system (during DeleteObsoleteFileImpl for example) but will get actually deleted with a delay sometimes before ::DeleteFile is called on the file name. We have some test failures where FaultInjectionTestEnv::DeleteFile fails on assert(s.ok()) during DestroyDB. This patch removes the assert statement to fix that.
Closes https://github.com/facebook/rocksdb/pull/3324
Differential Revision: D6659545
Pulled By: maysamyabandeh
fbshipit-source-id: 4c9552fbcd494dcf3e61d475c11fc965c4388b2c
Summary:
We dump blob db options on blob db open, but it was removed by mistake in #3246. Adding it back.
Closes https://github.com/facebook/rocksdb/pull/3298
Differential Revision: D6607177
Pulled By: yiwu-arbug
fbshipit-source-id: 2a4aacbfa52fd8f1878dc9e1fbb95fe48faf80c0
Summary:
Previously, if blob_db_options.bytes_per_sync, there is a background job to call fsync() for every bytes_per_sync bytes written to a blob file. With the change we simply pass bytes_per_sync as env_options_ to blob files so that sync_file_range() will be used instead.
Closes https://github.com/facebook/rocksdb/pull/3297
Differential Revision: D6606994
Pulled By: yiwu-arbug
fbshipit-source-id: 452424be52e32ba92f5ea603b564e9b88929af47
Summary:
A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require release and acquire another snapshot. Since MyRocks don't make use of Iterator::Refresh(), we just simply mark it as not supported.
Closes https://github.com/facebook/rocksdb/pull/3290
Differential Revision: D6599931
Pulled By: yiwu-arbug
fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf
Summary:
* Include `unistd.h` for `sleep(3)`
* Include `sys/time.h` for `gettimeofday(3)`
* Include `utils/random.h` for `Random64`
Error messages:
utilities/persistent_cache/hash_table_bench.cc: In constructor ‘rocksdb::HashTableBenchmark::HashTableBenchmark(rocksdb::HashTableImpl<long unsigned int, std::__cxx11::basic_string<char> >*, size_t, size_t, size_t, size_t)’:
utilities/persistent_cache/hash_table_bench.cc:76:28: error: ‘sleep’ was not declared in this scope
/* sleep override */ sleep(1);
^~~~~
utilities/persistent_cache/hash_table_bench.cc:76:28: note: suggested alternative: ‘strsep’
/* sleep override */ sleep(1);
^~~~~
strsep
utilities/persistent_cache/hash_table_bench.cc: In member function ‘void rocksdb::HashTableBenchmark::RunRead()’:
utilities/persistent_cache/hash_table_bench.cc:107:5: error: ‘Random64’ was not declared in this scope
Random64 rgen(time(nullptr));
^~~~~~~~
utilities/persistent_cache/hash_table_bench.cc:107:5: note: suggested alternative: ‘random_r’
Random64 rgen(time(nullptr));
^~~~~~~~
random_r
utilities/persistent_cache/hash_table_bench.cc:110:18: error: ‘rgen’ was not declared in this scope
size_t k = rgen.Next() % max_prepop_key;
^~~~
utilities/persistent_cache/hash_table_bench.cc: In static member function ‘static uint64_t rocksdb::HashTableBenchmark::NowInMillSec()’:
utilities/persistent_cache/hash_table_bench.cc:153:5: error: ‘gettimeofday’ was not declared in this scope
gettimeofday(&tv, /*tz=*/nullptr);
^~~~~~~~~~~~
make[2]: *** [CMakeFiles/hash_table_bench.dir/build.make:63: CMakeFiles/hash_table_bench.dir/utilities/persistent_cache/hash_table_bench.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:3346: CMakeFiles/hash_table_bench.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
Closes https://github.com/facebook/rocksdb/pull/3283
Differential Revision: D6594850
Pulled By: ajkr
fbshipit-source-id: fd83957338c210cdfd253763347aafd39476824f
Summary:
Currently non-2pc writes do the 2nd dummy write to actually commit the transaction. This was necessary to ensure that publishing the commit sequence number will be done only from one queue (the queue that does not write to memtable). This is however not necessary when we have only one write queue, which is actually the setup that would be used by non-2pc writes. This patch eliminates the 2nd write when two_write_queues are disabled by updating the commit map in the 1st write.
Closes https://github.com/facebook/rocksdb/pull/3277
Differential Revision: D6575392
Pulled By: maysamyabandeh
fbshipit-source-id: 8ab458f7ca506905962f9166026b2ec81e749c46
Summary:
Previously we store sequence number range of each blob files, and use the sequence number range to check if the file can be possibly visible by a snapshot. But it adds complexity to the code, since the sequence number is only available after a write. (The current implementation get sequence number by calling GetLatestSequenceNumber(), which is wrong.) With the patch, we are not storing sequence number range, and check if snapshot_sequence < obsolete_sequence to decide if the file is visible by a snapshot (previously we check if first_sequence <= snapshot_sequence < obsolete_sequence).
Closes https://github.com/facebook/rocksdb/pull/3274
Differential Revision: D6571497
Pulled By: yiwu-arbug
fbshipit-source-id: ca06479dc1fcd8782f6525b62b7762cd47d61909
Summary:
- check most times after calling snprintf that the buffer didn't fill up. Previously we'd proceed and use `buf_size - len` as the length in subsequent calls, which underflowed as those are unsigned size_t.
- replace some memcpys with snprintf for consistency
Closes https://github.com/facebook/rocksdb/pull/3255
Differential Revision: D6541464
Pulled By: ajkr
fbshipit-source-id: 8610ea6a24f38e0a37c6d17bc65b7c712da6d932
Summary:
There were a few places where MSVC's implicit truncation warnings were getting triggered, which was causing the MSVC build to fail due to warnings being treated as errors. This resolves the issues by making the truncations in some places explicit, and by making it so there are no truncations of literals.
Fixes#3239
Supersedes #3259
Closes https://github.com/facebook/rocksdb/pull/3273
Reviewed By: yiwu-arbug
Differential Revision: D6569204
Pulled By: Orvid
fbshipit-source-id: c188cf1cf98d9acb6d94b71875041cc81f8ff088
Summary:
Refactor BlobDB open logic. List of changes:
Major:
* On reopen, mark blob files found as immutable, do not use them for writing new keys.
* Not to scan the whole file to find file footer. Instead just seek to the end of the file and try to read footer.
Minor:
* Move most of the real logic from blob_db.cc to blob_db_impl.cc.
* Not to hold shared_ptr of event listeners in global maps in blob_db.cc
* Some changes to BlobFile interface.
* Improve logging and error handling.
Closes https://github.com/facebook/rocksdb/pull/3246
Differential Revision: D6526147
Pulled By: yiwu-arbug
fbshipit-source-id: 9dc4cdd63359a2f9b696af817086949da8d06952
Summary:
I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google".
Closes https://github.com/facebook/rocksdb/pull/3212
Differential Revision: D6456973
Pulled By: ajkr
fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205
Differential Revision: D6438959
Pulled By: maysamyabandeh
fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
Summary:
1. Class BackupMeta
```
52 : timestamp_(0), size_(0), meta_filename_(meta_filename),
CID 1168103 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member sequence_number_ is not initialized in this constructor nor in any functions that it calls.
153 file_infos_(file_infos), env_(env) {}
```
2. class BackupEngineImpl
```
513 }
7. uninit_member: Non-static class member latest_backup_id_ is not initialized in this constructor nor in any functions that it calls.
CID 1322803 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
9. uninit_member: Non-static class member latest_valid_backup_id_ is not initialized in this constructor nor in any functions that it calls.
514}
```
3. struct BackupAfterCopyOrCreateWorkItem
```
368 struct BackupAfterCopyOrCreateWorkItem {
369 std::future<CopyOrCreateResult> result;
1. member_decl: Class member declaration for shared.
370 bool shared;
3. member_decl: Class member declaration for needed_to_copy.
371 bool needed_to_copy;
5. member_decl: Class member declaration for backup_env.
372 Env* backup_env;
373 std::string dst_path_tmp;
374 std::string dst_path;
375 std::string dst_relative;
2. uninit_member: Non-static class member shared is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member needed_to_copy is not initialized in this constructor nor in any functions that it calls.
CID 1396122 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
6. uninit_member: Non-static class member backup_env is not initialized in this constructor nor in any functions that it calls.
376 BackupAfterCopyOrCreateWorkItem() {}
```
4. struct CopyOrCreateWorkItem
```
318 struct CopyOrCreateWorkItem {
319 std::string src_path;
320 std::string dst_path;
321 std::string contents;
1. member_decl: Class member declaration for src_env.
322 Env* src_env;
3. member_decl: Class member declaration for dst_env.
323 Env* dst_env;
5. member_decl: Class member declaration for sync.
324 bool sync;
7. member_decl: Class member declaration for rate_limiter.
325 RateLimiter* rate_limiter;
9. member_decl: Class member declaration for size_limit.
326 uint64_t size_limit;
327 std::promise<CopyOrCreateResult> result;
328 std::function<void()> progress_callback;
329
2. uninit_member: Non-static class member src_env is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member dst_env is not initialized in this constructor nor in any functions that it calls.
6. uninit_member: Non-static class member sync is not initialized in this constructor nor in any functions that it calls.
8. uninit_member: Non-static class member rate_limiter is not initialized in this constructor nor in any functions that it calls.
CID 1396123 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
10. uninit_member: Non-static class member size_limit is not initialized in this constructor nor in any functions that it calls.
330 CopyOrCreateWorkItem() {}
```
5. struct RestoreAfterCopyOrCreateWorkItem
```
struct RestoreAfterCopyOrCreateWorkItem {
410 std::future<CopyOrCreateResult> result;
1. member_decl: Class member declaration for checksum_value.
411 uint32_t checksum_value;
CID 1396153 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member checksum_value is not initialized in this constructor nor in any functions that it calls.
412 RestoreAfterCopyOrCreateWorkItem() {}
```
Closes https://github.com/facebook/rocksdb/pull/3131
Differential Revision: D6428556
Pulled By: sagar0
fbshipit-source-id: a86675444543eff028e3cae6942197a143a112c4
Summary:
```
utilities/persistent_cache/block_cache_tier_file.cc
64struct CacheRecordHeader {
2. uninit_member: Non-static class member magic_ is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member crc_ is not initialized in this constructor nor in any functions that it calls.
6. uninit_member: Non-static class member key_size_ is not initialized in this constructor nor in any functions that it calls.
CID 1396161 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
8. uninit_member: Non-static class member val_size_ is not initialized in this constructor nor in any functions that it calls.
65 CacheRecordHeader() {}
66 CacheRecordHeader(const uint32_t magic, const uint32_t key_size,
67 const uint32_t val_size)
68 : magic_(magic), crc_(0), key_size_(key_size), val_size_(val_size) {}
69
1. member_decl: Class member declaration for magic_.
70 uint32_t magic_;
3. member_decl: Class member declaration for crc_.
71 uint32_t crc_;
5. member_decl: Class member declaration for key_size_.
72 uint32_t key_size_;
7. member_decl: Class member declaration for val_size_.
73 uint32_t val_size_;
74};
utilities/simulator_cache/sim_cache.cc:
157 miss_times_(0),
CID 1396124 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member stats_ is not initialized in this constructor nor in any functions that it calls.
158 hit_times_(0) {}
159
```
Closes https://github.com/facebook/rocksdb/pull/3155
Differential Revision: D6427237
Pulled By: sagar0
fbshipit-source-id: 97e493da5fc043c5b9a3e0d33103442cffb75aad
Summary:
utilities/blob_db/blob_db_impl.cc
265 : bdb_options_.blob_dir;
3. uninit_member: Non-static class member env_ is not initialized in this constructor nor in any functions that it calls.
5. uninit_member: Non-static class member ttl_extractor_ is not initialized in this constructor nor in any functions that it calls.
7. uninit_member: Non-static class member open_p1_done_ is not initialized in this constructor nor in any functions that it calls.
CID 1418245 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
9. uninit_member: Non-static class member debug_level_ is not initialized in this constructor nor in any functions that it calls.
266}
4. past_the_end: Function end creates an iterator.
CID 1418258 (#1 of 1): Using invalid iterator (INVALIDATE_ITERATOR)
5. deref_iterator: Dereferencing iterator file_nums.end() though it is already past the end of its container.
utilities/col_buf_decoder.h:
nullable_(nullable),
2. uninit_member: Non-static class member remain_runs_ is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member run_val_ is not initialized in this constructor nor in any functions that it calls.
CID 1396134 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
6. uninit_member: Non-static class member last_val_ is not initialized in this constructor nor in any functions that it calls.
46 big_endian_(big_endian) {}
Closes https://github.com/facebook/rocksdb/pull/3134
Differential Revision: D6340607
Pulled By: sagar0
fbshipit-source-id: 25c52566e2ff979fe6c7abb0f40c27fc16597054
Summary:
Adding a list of blob db counters.
Also remove WaStats() which doesn't expose the stats and can be substitute by (BLOB_DB_BYTES_WRITTEN / BLOB_DB_BLOB_FILE_BYTES_WRITTEN).
Closes https://github.com/facebook/rocksdb/pull/3193
Differential Revision: D6394216
Pulled By: yiwu-arbug
fbshipit-source-id: 017508c8ff3fcd7ea7403c64d0f9834b24816803
Summary:
This patch implements MultiGet API for WritePreparedTxnDB and update the existing unit tests.
Closes https://github.com/facebook/rocksdb/pull/3196
Differential Revision: D6401493
Pulled By: maysamyabandeh
fbshipit-source-id: 51501a1e32645fc2da8680e77a50035f6530f2cc
Summary:
Garbage collection checks if the offset in blob index matches the offset of the blob value in the file. If it is a mismatch, the value is the current version. However it failed to check if the blob index is an inlined type, which don't even have an offset. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3194
Differential Revision: D6394270
Pulled By: yiwu-arbug
fbshipit-source-id: 7c2b9d795f1116f55f4d728086980f9b6e88ea78
Summary:
Saw some redundant log lines when trying to benchmark blob db. So, removed the lines from blob_file.cc, and let the lines in blob_db_impl.cc take the lead.
Closes https://github.com/facebook/rocksdb/pull/3189
Differential Revision: D6381726
Pulled By: sagar0
fbshipit-source-id: 5f0b1e56fe4bc3b715d89ea9b5749bd935cd0606
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157
Differential Revision: D6304291
Pulled By: maysamyabandeh
fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
Summary:
The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky.
Closes https://github.com/facebook/rocksdb/pull/3164
Differential Revision: D6319201
Pulled By: yiwu-arbug
fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140
Differential Revision: D6290258
Pulled By: maysamyabandeh
fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118
Differential Revision: D6227247
Pulled By: maysamyabandeh
fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
Summary:
Add per-exe execution capability
Add fix parsing of groups/tests
Add timer test exclusion
Fix unit tests
Ifdef threadpool specific tests that do not pass on Vista threadpool.
Remove spurious outout from prefix_test so test case listing works
properly.
Fix not using standard test directories results in file creation errors
in sst_dump_test.
BlobDb fixes:
In C++ end() iterators can not be dereferenced. They are not valid.
When deleting blob_db_ set it to nullptr before any other code executes.
Not fixed:. On Windows you can not delete a file while it is open.
[ RUN ] BlobDBTest.ReadWhileGC
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
write_batch
Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152
Differential Revision: D6293274
Pulled By: sagar0
fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
Summary:
A race condition will happen when:
* a user thread writes a value, but it hits the write stop condition because there are too many un-flushed memtables, while holding blob_db_impl.write_mutex_.
* Flush is triggered and call flush begin listener and try to acquire blob_db_impl.write_mutex_.
Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3149
Differential Revision: D6279805
Pulled By: yiwu-arbug
fbshipit-source-id: 0e3c58afb78795ebe3360a2c69e05651e3908c40
Summary:
`compression` shadow the method name in `BlobFile`. Rename it.
Closes https://github.com/facebook/rocksdb/pull/3148
Differential Revision: D6274498
Pulled By: yiwu-arbug
fbshipit-source-id: 7d293596530998b23b6b8a8940f983f9b6343a98
Summary:
To fix the issue of failing to decompress existing value after reopen DB with a different compression settings.
Closes https://github.com/facebook/rocksdb/pull/3142
Differential Revision: D6267260
Pulled By: yiwu-arbug
fbshipit-source-id: c7cf7f3e33b0cd25520abf4771cdf9180cc02a5f
Summary:
Adds two new counters:
`key_lock_wait_count` counts how many times a lock was blocked by another transaction and had to wait, instead of being granted the lock immediately.
`key_lock_wait_time` counts the time spent acquiring locks.
Closes https://github.com/facebook/rocksdb/pull/3107
Differential Revision: D6217332
Pulled By: lth
fbshipit-source-id: 55d4f46da5550c333e523263422fd61d6a46deb9
Summary:
After move assignment, we need to re-initialized the moved PinnableSlice.
Also update blob_db_impl.cc to not reuse the moved PinnableSlice since it is supposed to be in an undefined state after move.
Closes https://github.com/facebook/rocksdb/pull/3127
Differential Revision: D6238585
Pulled By: yiwu-arbug
fbshipit-source-id: bd99f2e37406c4f7de160c7dee6a2e8126bc224e
Summary:
Fix unreleased snapshot at the end of the test.
Closes https://github.com/facebook/rocksdb/pull/3126
Differential Revision: D6232867
Pulled By: yiwu-arbug
fbshipit-source-id: 651ca3144fc573ea2ab0ab20f0a752fb4a101d26
Summary:
After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries.
Closes https://github.com/facebook/rocksdb/pull/3090
Differential Revision: D6183812
Pulled By: yiwu-arbug
fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83
Summary:
Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is
1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish
2. otherwise, it is the latest sequence of the file
Closes https://github.com/facebook/rocksdb/pull/3087
Differential Revision: D6182519
Pulled By: yiwu-arbug
fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
Summary:
Add an option to enable/disable auto garbage collection, where we keep counting how many keys have been evicted by either deletion or compaction and decide whether to garbage collect a blob file.
Default disable auto garbage collection for now since the whole logic is not fully tested and we plan to make major change to it.
Closes https://github.com/facebook/rocksdb/pull/3117
Differential Revision: D6224756
Pulled By: yiwu-arbug
fbshipit-source-id: cdf53bdccec96a4580a2b3a342110ad9e8864dfe
Summary:
The test intent to wait until key being overwritten until proceed with garbage collection. It failed to wait for `PutUntil` finally finish. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3116
Differential Revision: D6222833
Pulled By: yiwu-arbug
fbshipit-source-id: fa9b57a772b92a66cf250b44e7975c43f62f45c5
Summary:
Evict oldest blob file and put it in obsolete_files list when close to blob db size limit. The file will be delete when the `DeleteObsoleteFiles` background job runs next time.
For now I set `kEvictOldestFileAtSize` constant, which controls when to evict the oldest file, at 90%. It could be tweaked or made into an option if really needed; I didn't want to expose it as an option pre-maturely as there are already too many :) .
Closes https://github.com/facebook/rocksdb/pull/3094
Differential Revision: D6187340
Pulled By: sagar0
fbshipit-source-id: 687f8262101b9301bf964b94025a2fe9d8573421
Summary:
Move WritePreparedTxnDB from pessimistic_transaction_db.h to its own header, write_prepared_txn_db.h
Closes https://github.com/facebook/rocksdb/pull/3114
Differential Revision: D6220987
Pulled By: maysamyabandeh
fbshipit-source-id: 18893fb4fdc6b809fe117dabb544080f9b4a301b
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101
Differential Revision: D6199405
Pulled By: maysamyabandeh
fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071
Differential Revision: D6148023
Pulled By: maysamyabandeh
fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
Summary:
The collapse of duplicate keys in write batch needs to sort the indexes of duplicate keys since it only checks the index in the batch with the head of the list of duplicate keys.
Closes https://github.com/facebook/rocksdb/pull/3093
Differential Revision: D6186800
Pulled By: maysamyabandeh
fbshipit-source-id: abc9ae8c2f1840445a5584f925cf86ecc6f37154
Summary:
* cleanup num_concurrent_simple_blobs. We don't do concurrent writes (by taking write_mutex_) so it doesn't make sense to have multiple non TTL files open. We can revisit later when we want to improve writes.
* cleanup eviction callback. we don't have plan to use it now.
* rename s/open_simple_blob_files_/open_non_ttl_file_/ and s/open_blob_files_/open_ttl_files_/ to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/3088
Differential Revision: D6182598
Pulled By: yiwu-arbug
fbshipit-source-id: 99e6f5e01fa66d31309cdb06ce48502464bac6ad
Summary:
Changing blob file format and some code cleanup around the change. The change with blob log format are:
* Remove timestamp field in blob file header, blob file footer and blob records. The field is not being use and often confuse with expiration field.
* Blob file header now come with column family id, which always equal to default column family id. It leaves room for future support of column family.
* Compression field in blob file header now is a standalone byte (instead of compact encode with flags field)
* Blob file footer now come with its own crc.
* Key length now being uint64_t instead of uint32_t
* Blob CRC now checksum both key and value (instead of value only).
* Some reordering of the fields.
The list of cleanups:
* Better inline comments in blob_log_format.h
* rename ttlrange_t and snrange_t to ExpirationRange and SequenceRange respectively.
* simplify blob_db::Reader
* Move crc checking logic to inside blob_log_format.cc
Closes https://github.com/facebook/rocksdb/pull/3081
Differential Revision: D6171304
Pulled By: yiwu-arbug
fbshipit-source-id: e4373e0d39264441b7e2fbd0caba93ddd99ea2af
Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.
Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.
See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066
Differential Revision: D6142115
Pulled By: yiwu-arbug
fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0