Summary:
This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
Closes https://github.com/facebook/rocksdb/pull/2691
Differential Revision: D5569464
Pulled By: maysamyabandeh
fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325
Summary:
This opens space for the new implementations of TransactionDBImpl such as WritePreparedTxnDBImpl that has a different policy of how to write to DB.
Closes https://github.com/facebook/rocksdb/pull/2689
Differential Revision: D5568918
Pulled By: maysamyabandeh
fbshipit-source-id: f7eac866e175daf3793ae79da108f65cc7dc7b25
Summary:
This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation.
Closes https://github.com/facebook/rocksdb/pull/2676
Differential Revision: D5549998
Pulled By: maysamyabandeh
fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
Summary:
* Dump blob db options to info log
* Remove BlobDBOptionsImpl to disallow dynamic cast *BlobDBOptions into *BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645
* Change some of the default options
* Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon.
Closes https://github.com/facebook/rocksdb/pull/2671
Differential Revision: D5529912
Pulled By: yiwu-arbug
fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7
Summary:
Introducing blob_db::TTLExtractor to replace extract_ttl_fn. The TTL
extractor can be use to extract TTL from keys insert with Put or
WriteBatch. Change over existing extract_ttl_fn are:
* If value is changed, it will be return via std::string* (rather than Slice*). With Slice* the new value has to be part of the existing value. With std::string* the limitation is removed.
* It can optionally return TTL or expiration.
Other changes in this PR:
* replace `std::chrono::system_clock` with `Env::NowMicros` so that I can mock time in tests.
* add several TTL tests.
* other minor naming change.
Closes https://github.com/facebook/rocksdb/pull/2659
Differential Revision: D5512627
Pulled By: yiwu-arbug
fbshipit-source-id: 0dfcb00d74d060b8534c6130c808e4d5d0a54440
Summary:
Major changes in this PR:
* Implement CassandraCompactionFilter to remove expired columns and rows (if all column expired)
* Move cassandra related code from utilities/merge_operators/cassandra to utilities/cassandra/*
* Switch to use shared_ptr<> from uniqu_ptr for Column membership management in RowValue. Since columns do have multiple owners in Merge and GC process, use shared_ptr helps make RowValue immutable.
* Rename cassandra_merge_test to cassandra_functional_test and add two TTL compaction related tests there.
Closes https://github.com/facebook/rocksdb/pull/2588
Differential Revision: D5430010
Pulled By: wpc
fbshipit-source-id: 9566c21e06de17491d486a68c70f52d501f27687
Summary:
target `utilities/merge_operators/cassandra/test_utils.d' given more than once in the same rule
remove the duplicate one
Closes https://github.com/facebook/rocksdb/pull/2542
Differential Revision: D5379570
Pulled By: lightmark
fbshipit-source-id: 353f38d085e9f627c1b3c53acef7a2c4bc71014d
Summary:
1. The buckifier script assume each test "foo" comes with a .cc file of the same name (i.e. foo.cc). Update cassandra tests to follow this pattern so that the buckifier script can recognize them.
2. add blob_db_test
Closes https://github.com/facebook/rocksdb/pull/2506
Differential Revision: D5331517
Pulled By: yiwu-arbug
fbshipit-source-id: 86f3eba471fc621186ab44cbd073b6162cde8e57
Summary:
This PR adds support for encrypting data stored by RocksDB when written to disk.
It adds an `EncryptedEnv` override of the `Env` class with matching overrides for sequential&random access files.
The encryption itself is done through a configurable `EncryptionProvider`. This class creates is asked to create `BlockAccessCipherStream` for a file. This is where the actual encryption/decryption is being done.
Currently there is a Counter mode implementation of `BlockAccessCipherStream` with a `ROT13` block cipher (NOTE the `ROT13` is for demo purposes only!!).
The Counter operation mode uses an initial counter & random initialization vector (IV).
Both are created randomly for each file and stored in a 4K (default size) block that is prefixed to that file. The `EncryptedEnv` implementation is such that clients of the `Env` class do not see this prefix (nor data, nor in filesize).
The largest part of the prefix block is also encrypted, and there is room left for implementation specific settings/values/keys in there.
To test the encryption, the `DBTestBase` class has been extended to consider a new environment variable called `ENCRYPTED_ENV`. If set, the test will setup a encrypted instance of the `Env` class to use for all tests.
Typically you would run it like this:
```
ENCRYPTED_ENV=1 make check_some
```
There is also an added test that checks that some data inserted into the database is or is not "visible" on disk. With `ENCRYPTED_ENV` active it must not find plain text strings, with `ENCRYPTED_ENV` unset, it must find the plain text strings.
Closes https://github.com/facebook/rocksdb/pull/2424
Differential Revision: D5322178
Pulled By: sdwilsh
fbshipit-source-id: 253b0a9c2c498cc98f580df7f2623cbf7678a27f
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350
Differential Revision: D5110648
Pulled By: siying
fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
Summary:
This is a manual commit of this PR:
Retire InMemoryEnv in favor of MockEnv #2082
With MockEnv doing the same yet being more mature, InMemoryEnv is redundant.
Reviewed By: IslamAbdelRahman
Differential Revision: D5162323
fbshipit-source-id: 59fd0082a891dc99cc531e4da9d68bf891eae3f5
Summary:
This fixes a compilation failure on Linux when the system libc is not
glibc. jemalloc's configure script incorrectly assumes that glibc is
always used on Linux systems, producing glibc-style signatures; when
the system libc is e.g. musl, the following error is observed:
```
[ 0%] Building CXX object CMakeFiles/rocksdb.dir/db/db_impl.cc.o
In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/table/block.h:19:0,
from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:77:
/x-tools/x86_64-unknown-linux-musl/x86_64-unknown-linux-musl/sysroot/usr/include/malloc.h:19:8: error: declaration of 'size_t malloc_usable_size(void*)' has a different exception specifier
size_t malloc_usable_size(void *);
^~~~~~~~~~~~~~~~~~
In file included from /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb.src/db/db_impl.cc:20:0:
/go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:78:33: note: from previous declaration 'size_t malloc_usable_size(void*) throw ()'
# define je_malloc_usable_size malloc_usable_size
^
/go/native/x86_64-unknown-linux-musl/jemalloc/include/jemalloc/jemalloc.h:239:41: note: in expansion of macro 'je_malloc_usable_size'
JEMALLOC_EXPORT size_t JEMALLOC_NOTHROW je_malloc_usable_size(
^~~~~~~~~~~~~~~~~~~~~
CMakeFiles/rocksdb.dir/build.make:350: recipe for target 'CMakeFiles/rocksdb.dir/db/db_impl.cc.o' failed
```
This works around the issue by rearranging the sources such that
jemalloc's headers are never in the same scope as the system's malloc
header. The jemalloc issue has been reported as well, see:
https://github.com/jemalloc/jemalloc/issues/778.
cc tschottdorf
Closes https://github.com/facebook/rocksdb/pull/2188
Differential Revision: D5163048
Pulled By: siying
fbshipit-source-id: c553125458892def175c1be5682b0330d80b2a0d
Summary:
Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch.
Stacking on #2375.
Closes https://github.com/facebook/rocksdb/pull/2385
Differential Revision: D5148358
Pulled By: yiwu-arbug
fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
Summary:
Re-enable blob_db_test with some update:
* Commented out delay at the end of GC tests. Will update the logic later with sync point to properly trigger GC.
* Added some helper functions.
Also update make files to include blob_dump tool.
Closes https://github.com/facebook/rocksdb/pull/2375
Differential Revision: D5133793
Pulled By: yiwu-arbug
fbshipit-source-id: 95470b26d0c1f9592ba4b7637e027fdd263f425c
Summary:
Previously the Java implementation of `RocksDB#addFile` was both incomplete and not inline with the C++ API.
Rather than fix it, as I see that `rocksdb::DB::AddFile` is now deprecated in favour of `rocksdb::DB::IngestExternalFile`, I have removed the old broken implementation and implemented `RocksDB#ingestExternalFile`.
Closes https://github.com/facebook/rocksdb/issues/2261
Closes https://github.com/facebook/rocksdb/pull/2291
Differential Revision: D5061264
Pulled By: sagar0
fbshipit-source-id: 85df0899fa1b1fc3535175cac4f52353511d4104
Summary:
snprintf is in <stdio.h> and not in namespace std.
Closes https://github.com/facebook/rocksdb/pull/2287
Reviewed By: anirbanr-fb
Differential Revision: D5054752
Pulled By: yiwu-arbug
fbshipit-source-id: 356807ec38f3c7d95951cdb41f31a3d3ae0714d4
Summary:
- Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
- Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type.
- Migrated the "ldb idump" subcommand to use this function
- The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
Closes https://github.com/facebook/rocksdb/pull/2232
Differential Revision: D4976007
Pulled By: ajkr
fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
Summary:
I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there.
Closes https://github.com/facebook/rocksdb/pull/2211
Differential Revision: D4951558
Pulled By: ajkr
fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4
Summary:
These code paths forked when checkpoint was introduced by copy/pasting the core backup logic. Over time they diverged and bug fixes were sometimes applied to one but not the other (like fix to include all relevant WALs for 2PC), or it required extra effort to fix both (like fix to forge CURRENT file). This diff reunites the code paths by extracting the core logic into a function, CreateCustomCheckpoint(), that is customizable via callbacks to implement both checkpoint and backup.
Related changes:
- flush_before_backup is now forcibly enabled when 2PC is enabled
- Extracted CheckpointImpl class definition into a header file. This is so the function, CreateCustomCheckpoint(), can be called by internal rocksdb code but not exposed to users.
- Implemented more functions in DummyDB/DummyLogFile (in backupable_db_test.cc) that are used by CreateCustomCheckpoint().
Closes https://github.com/facebook/rocksdb/pull/1932
Differential Revision: D4622986
Pulled By: ajkr
fbshipit-source-id: 157723884236ee3999a682673b64f7457a7a0d87
Summary:
1. Move universal compaction picker to separate files compaction_picker_universal.cc and compaction_picker_universal.h.
2. Rename some functions to make the code easier to understand.
3. Move leveled compaction picking code to a dedicated class, so that we we don't need to pass some common variable around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() to smaller functions.
Closes https://github.com/facebook/rocksdb/pull/2100
Differential Revision: D4845948
Pulled By: siying
fbshipit-source-id: efa0ab4
Summary:
This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper.
Check the sub-commits for easier reviewing.
Closes https://github.com/facebook/rocksdb/pull/2094
Differential Revision: D4837730
Pulled By: sagar0
fbshipit-source-id: 344278a
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
db_impl.cc is too large to manage. Divide db_impl.cc into db/db_impl.cc, db/db_impl_compaction_flush.cc, db/db_impl_files.cc, db/db_impl_open.cc and db/db_impl_write.cc.
Closes https://github.com/facebook/rocksdb/pull/2095
Differential Revision: D4838188
Pulled By: siying
fbshipit-source-id: c5f3059
Summary:
I've needed Env timing measurements a few times now, so finally built something for it.
Closes https://github.com/facebook/rocksdb/pull/2073
Differential Revision: D4811231
Pulled By: ajkr
fbshipit-source-id: 218a249
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080
Differential Revision: D4821141
Pulled By: siying
fbshipit-source-id: ca7d768
Summary:
This adds almost all missing options to RocksJava
Closes https://github.com/facebook/rocksdb/pull/2039
Differential Revision: D4779991
Pulled By: siying
fbshipit-source-id: 4a1bf28
Summary:
Separate the platform dependent tests from external_sst_file_test. Only those tests need to run on platforms like OSX
Closes https://github.com/facebook/rocksdb/pull/1923
Differential Revision: D4622461
Pulled By: siying
fbshipit-source-id: d2d6f04
Summary:
Separate a smal subset of tests in DBTest to DBBasicTest. Tests in DBTest don't have to run in CI tests on platforms like OSX, as long as they are covered by Linux.
Closes https://github.com/facebook/rocksdb/pull/1924
Differential Revision: D4616702
Pulled By: siying
fbshipit-source-id: 13e6549
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836
Differential Revision: D4505357
Pulled By: IslamAbdelRahman
fbshipit-source-id: 07b28d8
Summary:
The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config.
Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file.
This diff generalizes the Env registration logic to work with arbitrary types.
- Generalized registration and instantiation code by templating them
- The entire implementation is in a header file as that's Google style guide's recommendation for template definitions
- Pattern match with std::regex_match rather than checking prefix, which was the previous behavior
- Rename functions/files to be non-Env-specific
Closes https://github.com/facebook/rocksdb/pull/1776
Differential Revision: D4421933
Pulled By: ajkr
fbshipit-source-id: 34647d1
Summary:
Iterator should be in corrupted status if merge operator return false.
Also add test to make sure if max_successive_merges is hit during write,
data will not be lost.
Closes https://github.com/facebook/rocksdb/pull/1665
Differential Revision: D4322695
Pulled By: yiwu-arbug
fbshipit-source-id: b327b05
Summary:
Now that we have userspace persisted cache, we don't need flashcache anymore.
Closes https://github.com/facebook/rocksdb/pull/1588
Differential Revision: D4245114
Pulled By: igorcanadi
fbshipit-source-id: e2c1c72
Summary:
Changes in the diff
API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API
Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers
Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)
Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: jkedgar, hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65061
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.
For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.
To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.
RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.
One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.
Depends on D61473
Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927
Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62205
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177