rocksdb

Author	SHA1	Message	Date
Andrew Kryczka	1c84660457	prototype status check enforcement (#6798 ) Summary: Tried making Status object enforce that it is checked in some way. In cases it is not checked, `PermitUncheckedError()` must be called explicitly. Added a way to run tests (`ASSERT_STATUS_CHECKED=1 make -j48 check`) on a whitelist. The effort appears significant to get each test to pass with this assertion, so I only fixed up enough to get one test (`options_test`) working and added it to the whitelist. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6798 Reviewed By: pdillinger Differential Revision: D21377404 Pulled By: ajkr fbshipit-source-id: 73236f9c8df38f01cf24ecac4a6d1661b72d077e	2020-05-08 12:40:43 -07:00
mrambacher	394f2bbd13	Add OptionTypeInfo::Enum and related methods (#6423 ) Summary: Add methods and constructors for handling enums to the OptionTypeInfo. This change allows enums to be converted/compared without adding a special "type" to the OptionType. This change addresses a couple of issues: - It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods) - It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes - It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values - It reduces the number of types and cases that need to be handled in the various OptionType methods Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423 Reviewed By: siying Differential Revision: D21408713 Pulled By: zhichao-cao fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626	2020-05-05 15:04:04 -07:00
mrambacher	618bf638aa	Add Functions to OptionTypeInfo (#6422 ) Summary: Added functions for parsing, serializing, and comparing elements to OptionTypeInfo. These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map. Using these functions, every type can be handled via the map rather than special cased. By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422 Test Plan: pass make check Reviewed By: siying Differential Revision: D21269005 Pulled By: zhichao-cao fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077	2020-04-28 18:04:26 -07:00
Cheng Chang	4cd859edf1	Fix build under LITE (#6758 ) Summary: GetSupportedCompressions needs to be defined under LITE. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6758 Test Plan: build under LITE Reviewed By: zhichao-cao Differential Revision: D21247937 Pulled By: cheng-chang fbshipit-source-id: 880e59d3e107cdd736d16427a68c5641d1318fb4	2020-04-27 16:55:14 -07:00
mrambacher	4cbc19d2a1	Add a ConfigOptions for use in comparing objects and converting to/from strings (#6389 ) Summary: The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389 Reviewed By: siying Differential Revision: D21163707 Pulled By: zhichao-cao fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd	2020-04-21 17:38:17 -07:00
Akanksha Mahajan	03a1d95db0	Set max_background_flushes dynamically (#6701 ) Summary: 1. Add changes so that max_background_flushes can be set dynamically. 2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the max_background_flushes dynamically using SetDBOptions. TestPlan: 1. make -j64 check 2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701 Reviewed By: ajkr Differential Revision: D21028010 Pulled By: akankshamahajan15 fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1	2020-04-20 16:19:02 -07:00
sdong	94f90ac6bc	compression related options are not copied back from MutableCFOptions… (#6668 ) Summary: … to CFOptions https://github.com/facebook/rocksdb/pull/6615 made several compression related options dynamically changeable. They are moved to MutableCFOptions. However, they are not copied back to ColumnFamilyOptions, so the changed values are not written to option files and for some other uses. Fix it by copying them back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6668 Test Plan: Add a unit test to make sure that when a MutableCFOptions is converted to CFOptions and back to MutableCFOptions, they stay the same. This test would fail without the fix. Reviewed By: ajkr Differential Revision: D20923999 fbshipit-source-id: c3bccd6923b00d677764e2269bed6a95ad7ed780	2020-04-08 14:40:46 -07:00
mrambacher	259b6ec8da	Move the OptionTypeMap code closer to home (#6198 ) Summary: This is a predecessor to the Configurable PR. This change moves the OptionTypeInfo maps closer to where they will be used. When the Configurable changes are adopted, these values will become static and not associated with the OptionsHelper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6198 Reviewed By: siying Differential Revision: D20778108 Pulled By: zhichao-cao fbshipit-source-id: a9f85fc73bc53503656e1958ecc1e764052fd1aa	2020-04-03 10:52:38 -07:00
Ziyue Yang	03a781a90c	Add pipelined & parallel compression optimization (#6262 ) Summary: This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262 Reviewed By: ajkr Differential Revision: D20651306 fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb	2020-04-01 16:40:18 -07:00
sdong	80979f81c7	Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615 ) Summary: These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615 Test Plan: Add a unit test to make sure that SetOptions() can change the options. Reviewed By: riversand963 Differential Revision: D20755951 fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff	2020-03-31 12:11:42 -07:00
Zhichao Cao	e8d332d97e	Use FileChecksumGenFactory for SST file checksum (#6600 ) Summary: In the current implementation, sst file checksum is calculated by a shared checksum function object, which may make some checksum function hard to be applied here such as SHA1. In this implementation, each sst file will have its own checksum generator obejct, created by FileChecksumGenFactory. User needs to implement its own FilechecksumGenerator and Factory to plugin the in checksum calculation method. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600 Test Plan: tested with make asan_check Reviewed By: riversand963 Differential Revision: D20717670 Pulled By: zhichao-cao fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6	2020-03-29 15:58:46 -07:00
anand76	a9d168cfd7	Simplify migration to FileSystem API (#6552 ) Summary: The current Env/FileSystem API separation has a couple of issues - 1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation. 2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes. This PR solves the above issues and simplifies the migration in the following ways - 1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```. 1a. - This also makes it more robust by ensuring that even if RocksDB has some stray calls to Env APIs rather than FileSystem, they will go through the same object and thus there is no risk of getting out of sync. 2. Provide a ```NewCompositeEnv()``` API that can be used to construct a PosixEnv with a custom FileSystem implementation. This eliminates an indirection to call Env APIs, and relieves the FileSystem developer of the burden of having to implement wrappers for the Env APIs. 3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and ```NewLogger()``` Tests: 1. New unit tests 2. make check and make asan_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552 Reviewed By: riversand963 Differential Revision: D20592038 Pulled By: anand1976 fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7	2020-03-23 21:54:21 -07:00
Yanqin Jin	fb09ef05dc	Attempt to recover from db with missing table files (#6334 ) Summary: There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status. This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version. `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will not be performed. To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush. Test plan (on devserver): ``` $make check $COMPILE_WITH_ASAN=1 make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334 Reviewed By: anand1976 Differential Revision: D19778960 Pulled By: riversand963 fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8	2020-03-20 19:30:48 -07:00
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	2020-02-20 12:09:57 -08:00
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	2020-02-10 15:52:52 -08:00
Mike Kolupaev	637e64b9ac	Add an option to prevent DB::Open() from querying sizes of all sst files (#6353 ) Summary: When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open. If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be really slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes. We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353 Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble. Differential Revision: D19656425 Pulled By: al13n321 fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82	2020-02-04 01:27:26 -08:00
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	2019-12-13 14:48:41 -08:00
Maysam Yabandeh	6ec6a4a9a4	Remove snap_refresh_nanos option (#5826 ) Summary: The snap_refresh_nanos option didn't bring much benefit. Remove the feature to simplify the code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5826 Differential Revision: D17467147 Pulled By: maysamyabandeh fbshipit-source-id: 4f950b046990d0d1292d7fc04c2ccafaf751c7f0	2019-09-18 20:26:04 -07:00
Ronak Sisodia	d05c0fe4d1	Option to make write group size configurable (#5759 ) Summary: The max batch size that we can write to the WAL is controlled by a static manner. So if the leader write is less than 128 KB we will have the batch size as leader write size + 128 KB else the limit will be 1 MB. Both of them are statically defined. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5759 Differential Revision: D17329298 fbshipit-source-id: a3d910629d8d8ca84ea39ad89c2b2d284571ded5	2019-09-11 18:28:33 -07:00
Vijay Nadimpalli	979fbdc696	Persistent globally unique DB ID in manifest (#5725 ) Summary: Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong. The goal of this PR is solve this problem by moving the DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725 Test Plan: Added unit tests. Differential Revision: D16963840 Pulled By: vjnadimpalli fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9	2019-09-03 08:52:24 -07:00
Zhongyi Xie	2f41ecfe75	Refactor trimming logic for immutable memtables (#5022 ) Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5	2019-08-23 13:55:34 -07:00
Mark Rambacher	cfcf045acc	The ObjectRegistry class replaces the Registrar and NewCustomObjects.… (#5293 ) Summary: The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method). This change is necessary for a few reasons: - By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable. - By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not. When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program. Test plan (on riversand963's devserver) ``` $COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check ``` All tests pass. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293 Differential Revision: D16363396 Pulled By: riversand963 fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180	2019-07-23 17:13:05 -07:00
Eli Pozniansky	c129c75fb7	Added log_readahead_size option to control prefetching for Log::Reader (#5592 ) Summary: Added log_readahead_size option to control prefetching for Log::Reader. This is mostly useful for reading a remotely located log, as it can save the number of round-trips when reading it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5592 Differential Revision: D16362989 Pulled By: elipoz fbshipit-source-id: c5d4d5245a44008cd59879640efff70c091ad3e8	2019-07-19 12:00:19 -07:00
Mike Kolupaev	b4d7209428	Add an option to put first key of each sst block in the index (#5289 ) Summary: The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes. Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it. So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks. Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files. This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289 Differential Revision: D15256423 Pulled By: al13n321 fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a	2019-06-24 20:54:04 -07:00
Zhongyi Xie	671d15cbdd	Persistent Stats: persist stats history to disk (#5046 ) Summary: This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046 Differential Revision: D15863138 Pulled By: miasantreble fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b	2019-06-17 15:21:50 -07:00
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	2019-05-30 17:44:09 -07:00
Vijay Nadimpalli	50e470791d	Organizing rocksdb/table directory by format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373 Differential Revision: D15559425 Pulled By: vjnadimpalli fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55	2019-05-30 14:51:11 -07:00
Zhongyi Xie	87fe4bcab8	Fix FIFO dynamic options sanitization (#5367 ) Summary: When dynamically setting options, we check the option type info and skip options that are marked deprecated. However this check is only done at top level, which results in bugs where SetOptions will corrupt option values and cause unexpected system behavior iff a deprecated second level option is set dynamically. For exmaple, the following call: ``` dbfull()->SetOptions( {{"compaction_options_fifo", "{allow_compaction=true;max_table_files_size=1024;ttl=731;}"}}); ``` was from pre 6.0 release when `ttl` was part of `compaction_options_fifo`. Now that it got moved out of `compaction_options_fifo`, this call will incorrectly set `compaction_options_fifo.max_table_files_size` to 731 (as `max_table_files_size` is the first one in `OptionsHelper::fifo_compaction_options_type_info` struct) and cause files to gett evicted much faster than expected. This PR adds verification to second level options like `compaction_options_fifo.ttl` or `compaction_options_fifo.max_table_files_size` when set dynamically, and filter out those marked as deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5367 Differential Revision: D15530998 Pulled By: miasantreble fbshipit-source-id: 818258be5c3abe09cd82d62f3c083572d70fecdd	2019-05-30 10:46:28 -07:00
Dave Rigby	8149bb9d6a	Pass OptionTypeInfo maps by const& (#5295 ) Summary: In options_helper.cc various functions take a const unordered_map of string -> TypeInfo for options handling. These functions pass by-value the (const) maps, resulting in unnecessary copies. Change to pass by reference. This results in a noticable reduction in the amount of time spent parsing options - in my case a set of unit tests using RocksDB which call SetOptions() to modify options see a ~25% runtime reduction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5295 Differential Revision: D15296334 Pulled By: riversand963 fbshipit-source-id: 4d4be3db635264943607911b296dda27fd7ce1a7	2019-05-15 14:25:57 -07:00
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	2019-05-13 17:47:21 -07:00
Maysam Yabandeh	6a40ee5eb1	Refresh snapshot list during long compactions (2nd attempt) (#5278 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list. This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278 Differential Revision: D15203291 Pulled By: maysamyabandeh fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258	2019-05-03 17:30:22 -07:00
Maysam Yabandeh	521d234bda	Revert snap_refresh_nanos feature (#5269 ) Summary: Our daily stress tests are failing after this feature. Reverting temporarily until we figure the reason for test failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5269 Differential Revision: D15151285 Pulled By: maysamyabandeh fbshipit-source-id: e4002b99690a97df30d4b4b58bf0f61e9591bc6e	2019-05-01 10:07:30 -07:00
Maysam Yabandeh	506e8448be	Refresh snapshot list during long compactions (#5099 ) Summary: Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099 Differential Revision: D15086710 Pulled By: maysamyabandeh fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12	2019-04-25 18:17:22 -07:00
Andrew Kryczka	6eb317bb4c	Option string/map/file can set env from object registry (#5237 ) Summary: - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object. - Currently factory functions for `Env` registered with object registry should only return pointer to static `Env` objects. That's because `DBOptions::env` is a raw pointer so we cannot easily delegate cleanup. - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237 Differential Revision: D15056360 Pulled By: siying fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e	2019-04-25 11:35:09 -07:00
Andrew Kryczka	8272a6de57	Optionally wait on bytes_per_sync to smooth I/O (#5183 ) Summary: The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways. My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes. Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE \| SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it. There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically). The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183 Differential Revision: D14953553 Pulled By: siying fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9	2019-04-22 11:51:39 -07:00
Mike Kolupaev	df38c1ce66	Add BlockBasedTableOptions::index_shortening (#5174 ) Summary: Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174 Differential Revision: D14884185 Pulled By: al13n321 fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534	2019-04-22 08:20:35 -07:00
Sagar Vemuri	d3d20dcdca	Periodic Compactions (#5166 ) Summary: Introducing Periodic Compactions. This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold. - Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF. - This works across all levels. - The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used). - Compaction filters, if any, are invoked as usual. - A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS). This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166 Differential Revision: D14884441 Pulled By: sagar0 fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47	2019-04-10 19:31:18 -07:00
Mike Kolupaev	120bc4715b	Add DBOptions. avoid_unnecessary_blocking_io to defer file deletions (#5043 ) Summary: Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator. In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option. (EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.) I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043 Differential Revision: D14339233 Pulled By: al13n321 fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8	2019-04-01 17:10:40 -07:00
Siying Dong	a98317f555	Option string/map can set merge operator from object registry (#5123 ) Summary: Allow customized merge operator to be loaded from option file/map/string by allowing users to pre-regiester merge operators to object registry. Also update HISTORY.md and header files for the same feature for comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5123 Differential Revision: D14658488 Pulled By: siying fbshipit-source-id: 86ea2fbd2a0a04632d8ea9fceaffefd041f6ae61	2019-03-28 14:54:29 -07:00
Siying Dong	4774a9409b	Allow option string to get comparator from object registry (#5106 ) Summary: Even customized ldb may not be able to read data from some databases if comparator is not standard. We modify option helper to get comparator from object registry so that we can use customized ldb to read non-standard comparator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5106 Differential Revision: D14622107 Pulled By: siying fbshipit-source-id: 151dcb295a35a4c7d54f919cd4e322a89dc601c9	2019-03-26 14:23:51 -07:00
Shobhit Dayal	b45b1cde3e	Feature for sampling and reporting compressibility (#4842 ) Summary: This is a feature to sample data-block compressibility and and report them as stats. 1 in N (tunable) blocks is sampled for compressibility using two algorithms: 1. lz4 or snappy for fast compression 2. zstd or zlib for slow but higher compression. The stats are reported to the caller as raw-bytes and compressed-bytes. The block continues to be compressed for storage using the specified CompressionType. The db_bench_tool how has a command line option for specifying the sampling rate. It's default value is 0 (no sampling). To test the overhead for a certain value, users can compare the performance of db_bench_tool, varying the sampling rate. It is unlikely to have a noticeable impact for high values like 20. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4842 Differential Revision: D13629011 Pulled By: shobhitdayal fbshipit-source-id: 14ca668bcab6499b2a1734edf848eb62a4f4fafa	2019-03-18 12:15:34 -07:00
Zhongyi Xie	fdc72a5c5d	add OptionType kInt32T and kInt64T Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5061 Differential Revision: D14418581 Pulled By: miasantreble fbshipit-source-id: be7f90e16586666ddd0cce36971e403782ab0892	2019-03-12 13:49:52 -07:00
Zhongyi Xie	c4f5d0aa15	add GetStatsHistory to retrieve stats snapshots (#4748 ) Summary: This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0. (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748 Differential Revision: D13961471 Pulled By: miasantreble fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04	2019-02-20 15:52:54 -08:00
Zhongyi Xie	ed995c6a69	add whole key bloom filter support in memtables (#4985 ) Summary: MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985 Differential Revision: D14118257 Pulled By: miasantreble fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea	2019-02-19 12:15:39 -08:00
Aubin Sanyal	3231a2e581	Deprecate ttl option from CompactionOptionsFIFO (#4965 ) Summary: We introduced ttl option in CompactionOptionsFIFO when ttl-based file deletion (compaction) was supported only as part of FIFO Compaction. But with the extension of ttl semantics even to Level compaction, CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start using ColumnFamilyOptions.ttl for FIFO compaction as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965 Differential Revision: D14072960 Pulled By: sagar0 fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e	2019-02-15 09:51:41 -08:00
Yanqin Jin	05dec0c7c7	Remove redundant member var and set options (#4631 ) Summary: In the past, both `DBImpl::atomic_flush_` and `DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set `immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is set correctly. This does not lead to incorrect behavior, but is a duplicate of information. Since `immutable_db_options_` is always there and has `atomic_flush`, we should use it as source of truth and remove `DBImpl::atomic_flush_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631 Differential Revision: D12928371 Pulled By: riversand963 fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d	2018-11-12 12:24:26 -08:00
Bo Hou	cd9404bb77	xxhash 64 support Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4607 Reviewed By: siying Differential Revision: D12836696 Pulled By: jsjhoubo fbshipit-source-id: 7122ccb712d0b0f1cd998aa4477e0da1401bd870	2018-11-01 15:44:06 -07:00
Yanqin Jin	5b4c709fad	Enable atomic flush (#4023 ) Summary: Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023 Differential Revision: D8518381 Pulled By: riversand963 fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f	2018-10-26 15:08:43 -07:00
Fenggang Wu	19ec44fd39	Improve point-lookup performance using a data block hash index (#4174 ) Summary: Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.` The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174 Differential Revision: D8965914 Pulled By: fgwu fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198	2018-08-15 14:30:03 -07:00
Fenggang Wu	a11df583ec	Add DataBlockIndexType option in BlockBasedTableOptions (#4150 ) Summary: Added DataBlockIndexType option in BlockBasedTableOptions. ``` enum DataBlockIndexType : char { kDataBlockBinarySearch = 0, // traditional block type kDataBlockHashIndex = 1, // additional hash index appended to the end. }; ``` The default type is the traditional binary seek option: `kDataBlockBinarySearch`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4150 Differential Revision: D8895958 Pulled By: fgwu fbshipit-source-id: 480adef48104cf11d30db3bad9a73f98b4a80c10	2018-07-27 15:42:27 -07:00
Sagar Vemuri	991120fa10	Allow ttl to be changed dynamically (#4133 ) Summary: Allow ttl to be changed dynamically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4133 Differential Revision: D8845440 Pulled By: sagar0 fbshipit-source-id: c8c87ae643b3a8c4123e4c037c4645efc094a2d3	2018-07-16 14:27:53 -07:00
Zhichao Cao	1f6efabe23	Add bottommost_compression_opts to for bottommost_compression (#3985 ) Summary: …ression For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts Closes https://github.com/facebook/rocksdb/pull/3985 Reviewed By: riversand963 Differential Revision: D8385911 Pulled By: zhichao-cao fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc	2018-06-27 17:42:38 -07:00
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	2018-06-26 15:57:26 -07:00
Zhongyi Xie	c3ebc75843	Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c	2018-05-21 14:43:11 -07:00
Maysam Yabandeh	718c1c9c1f	Pass manual_wal_flush also to the first wal file Summary: Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests. This PR is built on top of https://github.com/facebook/rocksdb/pull/3756 Closes https://github.com/facebook/rocksdb/pull/3824 Differential Revision: D7909153 Pulled By: maysamyabandeh fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce	2018-05-14 10:57:56 -07:00
Huachao Huang	cee138c7d7	Add missing options in BuildColumnfamilyOptions Summary: soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit are added to BuildColumnfamilyOptions. Closes https://github.com/facebook/rocksdb/pull/3823 Differential Revision: D7909246 Pulled By: maysamyabandeh fbshipit-source-id: 89032efbf6b5bd302ea50cbd7a234977984a1fca	2018-05-08 12:13:18 -07:00
Andrew Kryczka	019d7894eb	fix calling SetOptions on deprecated options Summary: In `cf_options_type_info`, the deprecated options are all considered to have offset zero in the `MutableCFOptions` struct. Previously we weren't checking in `GetMutableOptionsFromStrings` whether the provided option was deprecated or not and simply writing the provided value to the offset specified by `cf_options_type_info`. That meant setting any deprecated option would overwrite the first element in the struct, which is `write_buffer_size`. `db_stress` hit this often since it calls `SetOptions` with `soft_rate_limit=0` and `hard_rate_limit=0`, which are both deprecated so cause `write_buffer_size` to be set to zero, which causes it to crash on the following assertion: ``` db_stress: db/memtable.cc:106: rocksdb::MemTable::MemTable(const rocksdb::InternalKeyComparator&, const rocksdb::ImmutableCFOptions&, const rocksdb::MutableCFOptions&, rocksdb::WriteBufferManager*, rocksdb::SequenceNumber, uint32_t): Assertion `!ShouldScheduleFlush()' failed. ``` We fix it by skipping deprecated options (and logging a warning) when users provide them to `SetOptions`. I didn't want to fail the call for compatibility reasons. Closes https://github.com/facebook/rocksdb/pull/3700 Differential Revision: D7572596 Pulled By: ajkr fbshipit-source-id: bd5d84e14c0c39f30c5d4c6df7c1503d2c28ecf1	2018-04-10 19:02:09 -07:00
Phani Shekhar Mantripragada	446b32cfc3	Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d	2018-04-05 19:58:20 -07:00
Sagar Vemuri	04c11b867d	Level Compaction with TTL Summary: Level Compaction with TTL. As of today, a file could exist in the LSM tree without going through the compaction process for a really long time if there are no updates to the data in the file's key range. For example, in certain use cases, the keys are not actually "deleted"; instead they are just set to empty values. There might not be any more writes to this "deleted" key range, and if so, such data could remain in the LSM for a really long time resulting in wasted space. Introducing a TTL could solve this problem. Files (and, in turn, data) older than TTL will be scheduled for compaction when there is no other background work. This will make the data go through the regular compaction process and get rid of old unwanted data. This also has the (good) side-effect of all the data in the non-bottommost level being newer than ttl, and all data in the bottommost level older than ttl. It could lead to more writes while reducing space. This functionality can be controlled by the newly introduced column family option -- ttl. TODO for later: - Make ttl mutable - Extend TTL to Universal compaction as well? (TTL is already supported in FIFO) - Maybe deprecate CompactionOptionsFIFO.ttl in favor of this new ttl option. Closes https://github.com/facebook/rocksdb/pull/3591 Differential Revision: D7275442 Pulled By: sagar0 fbshipit-source-id: dcba484717341200d419b0953dafcdf9eb2f0267	2018-04-02 22:14:28 -07:00
Andrew Kryczka	620823f88b	parse CompressionOptions::zstd_max_train_bytes in options string Summary: Closes https://github.com/facebook/rocksdb/pull/3588 Differential Revision: D7208087 Pulled By: ajkr fbshipit-source-id: 688f7a7c447cb17bee1b410d1fd891c0bf966617	2018-03-22 15:13:27 -07:00
Chinmay Kamat	7153153e4b	Fix enable_pipelined_write output in OPTIONS file Summary: enable_pipelined_write was not set in BuildDBOptions() causing its default value to be dumped in the OPTIONS file Closes https://github.com/facebook/rocksdb/pull/3585 Differential Revision: D7226395 Pulled By: yiwu-arbug fbshipit-source-id: 45a659a48d18103ac9ee74bb8805dd0a6ec12474	2018-03-13 11:59:02 -07:00
Yi Wu	edc258127e	DB::DumpSupportInfo should log all supported compression types Summary: DB::DumpSupportInfo should log all supported compression types. Closes #3146 Closes https://github.com/facebook/rocksdb/pull/3396 Differential Revision: D6777019 Pulled By: yiwu-arbug fbshipit-source-id: 5b17f1ffb2d71224e52f7d9c045434746c789fb0	2018-01-23 14:44:12 -08:00
Sagar Vemuri	0ea7170d7d	Remove old misleading comments Summary: FIFO and Universal compaction options were recently made dynamic, but I forgot to update these comments. These would mislead anyone who is reading the code. Closes https://github.com/facebook/rocksdb/pull/3399 Differential Revision: D6786358 Pulled By: sagar0 fbshipit-source-id: 57cfc412f63deaee29bbd82b863304821d60057d	2018-01-23 10:27:41 -08:00
Zhongyi Xie	fcc8a6574d	Make Universal compaction options dynamic Summary: Let me know if more test coverage is needed Closes https://github.com/facebook/rocksdb/pull/3213 Differential Revision: D6457165 Pulled By: miasantreble fbshipit-source-id: 3f944abff28aa7775237f1c4f61c64ccbad4eea9	2017-12-11 13:27:06 -08:00
Phani Shekhar Mantripragada	4b65cfc723	Support for block_cache num_shards and other config via option string. Summary: Problem: Option string accepts only cache_size as parameter for block_cache which is specified as "block_cache=1M". It doesn't accept other parameters like num_shards etc. Changes : 1) ParseBlockBasedTableOption in block_based_table_factory is edited to accept cache options in the format "block_cache=<cache_size>:<num_shard_bits>:<strict_capacity_limit>:<high_pri_pool_ratio>". Options other than cache_size are optional to maintain backward compatibility. The changes are valid for block_cache_compressed as well. For example, "block_cache=1M:6:true:0.5", "block_cache=1M:6:true", "block_cache=1M:6" and "block_cache=1M" are all valid option strings. 2) Corresponding unit tests are added. Closes https://github.com/facebook/rocksdb/pull/3108 Differential Revision: D6420997 Pulled By: sagar0 fbshipit-source-id: cdea8b785688d2802907974af27225ccc1c0cd43	2017-11-28 10:48:53 -08:00
Maysam Yabandeh	0213990b3a	Move static variables out of the header file Summary: Static variables in header files will be instantiated in every file that includes the header file. This patch moves some of them from options_helper.h to its .cc files. It also moves the static variable out of the offset_of since the template function could also lead to multiple instantiation perhaps due to inlining. Fixes https://github.com/facebook/rocksdb/issues/3176 Closes https://github.com/facebook/rocksdb/pull/3178 Differential Revision: D6363794 Pulled By: maysamyabandeh fbshipit-source-id: d0a07f061b4d992ab4e0de2706e622131d258fdd	2017-11-17 17:12:27 -08:00
Zhongyi Xie	32e31d49d1	Make DBOption compaction_readahead_size dynamic Summary: Closes https://github.com/facebook/rocksdb/pull/3004 Differential Revision: D6056141 Pulled By: miasantreble fbshipit-source-id: 56df1630f464fd56b07d25d38161f699e0528b7f	2017-11-16 17:57:25 -08:00
Mikhail Antonov	7fe3b32896	Added support for differential snapshots Summary: The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2). This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages. From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff". This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR. For now, what's done here according to initial discussions: Preserving deletes: - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion. - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum. - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum. Iterator changes: - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum. TableCache changes: - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span. What's left: - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type. Closes https://github.com/facebook/rocksdb/pull/2999 Differential Revision: D6175602 Pulled By: mikhail-antonov fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900	2017-11-01 18:56:43 -07:00
Shaohua Li	33c7d4ccd9	Make writable_file_max_buffer_size dynamic Summary: The DBOptions::writable_file_max_buffer_size can be changed dynamically. Closes https://github.com/facebook/rocksdb/pull/3053 Differential Revision: D6152720 Pulled By: shligit fbshipit-source-id: aa0c0cfcfae6a54eb17faadb148d904797c68681	2017-10-31 13:56:35 -07:00
Sagar Vemuri	f0804db7f7	Make FIFO compaction options dynamically configurable Summary: ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now. Some of the ways in which the fifo compaction options can be set are: - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})` - `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})` - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})` - `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})` Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes. Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`. The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now. Closes https://github.com/facebook/rocksdb/pull/3006 Differential Revision: D6058619 Pulled By: sagar0 fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92	2017-10-19 15:26:36 -07:00
Manuel Ung	88ed1f6ea6	Allow upgrades from nullptr to some merge operator Summary: Currently, RocksDB does not allow reopening a preexisting DB with no merge operator defined, with a merge operator defined. This means that if a DB ever want to add a merge operator, there's no way to do so currently. Fix this by adding a new verification type `kByNameAllowFromNull` which will allow old values to be nullptr, and new values to be non-nullptr. Closes https://github.com/facebook/rocksdb/pull/2958 Differential Revision: D5961131 Pulled By: lth fbshipit-source-id: 06179bebd0d90db3d43690b5eb7345e2d5bab1eb	2017-10-04 09:57:23 -07:00
Quinn Jarrell	6a541afcc4	Make bytes_per_sync and wal_bytes_per_sync mutable Summary: SUMMARY Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed. TEST PLAN ran make check all passed Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value. Closes https://github.com/facebook/rocksdb/pull/2893 Reviewed By: yiwu-arbug Differential Revision: D5845814 Pulled By: TheRushingWookie fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd	2017-09-27 17:49:45 -07:00
Siying Dong	21696ba502	Replace dynamic_cast<> Summary: Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available. Some nontrivial changes: 1. Add Comparator::GetRootComparator() to get around the internal comparator hack 2. Add the two experiemental functions to DB 3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string 4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric. Closes https://github.com/facebook/rocksdb/pull/2645 Differential Revision: D5502723 Pulled By: siying fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c	2017-07-28 16:27:16 -07:00
Sagar Vemuri	72502cf227	Revert "comment out unused parameters" Summary: This reverts the previous commit `1d7048c598`, which broke the build. Did a `git revert 1d7048c`. Closes https://github.com/facebook/rocksdb/pull/2627 Differential Revision: D5476473 Pulled By: sagar0 fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06	2017-07-21 18:26:26 -07:00
Victor Gao	1d7048c598	comment out unused parameters Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually. Reviewed By: igorsugak Differential Revision: D5454343 fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2	2017-07-21 14:57:44 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Sagar Vemuri	89ad9f3adb	Allow ignoring unknown options when loading options from a file Summary: Added a flag, `ignore_unknown_options`, to skip unknown options when loading an options file (using `LoadLatestOptions`/`LoadOptionsFromFile`) or while verifying options (using `CheckOptionsCompatibility`). This will help in downgrading the db to an older version. Also added `--ignore_unknown_options` flag to ldb Example Use case: In MyRocks, if copying from newer version to older version, it is often impossible to start because of new RocksDB options that don't exist in older version, even though data format is compatible. MyRocks uses these load and verify functions in [ha_rocksdb.cc::check_rocksdb_options_compatibility](`e004fd9f41/storage/rocksdb/ha_rocksdb.cc (L3348-L3401)`). Test Plan: Updated the unit tests. `make check` ldb: $ ./ldb --db=/tmp/test_db --create_if_missing put a1 b1 OK Now edit /tmp/test_db/<OPTIONS-file> and add an unknown option. Try loading the options now, and it fails: $ ./ldb --db=/tmp/test_db --try_load_options get a1 Failed: Invalid argument: Unrecognized option DBOptions:: abcd Passes with the new --ignore_unknown_options flag $ ./ldb --db=/tmp/test_db --try_load_options --ignore_unknown_options get a1 b1 Closes https://github.com/facebook/rocksdb/pull/2423 Differential Revision: D5212091 Pulled By: sagar0 fbshipit-source-id: 2ec17636feb47dc0351b53a77e5f15ef7cbf2ca7	2017-06-13 16:58:01 -07:00
Andrew Kryczka	bb01c1880c	Introduce max_background_jobs mutable option Summary: - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure. - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions. - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled. Closes https://github.com/facebook/rocksdb/pull/2205 Differential Revision: D4937461 Pulled By: ajkr fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9	2017-05-24 11:29:08 -07:00
Mikhail Antonov	ba685a472a	Support ingest_behind for IngestExternalFile Summary: First cut for early review; there are few conceptual points to answer and some code structure issues. For conceptual points - - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true \|\| disable_auto_compaction=false), the user is responsible to properly open and close DB with required params - we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise? - Modifying AssignLevelForIngestedFile seems the place we we'd handle that. On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites. Closes https://github.com/facebook/rocksdb/pull/2144 Differential Revision: D4873732 Pulled By: lightmark fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a	2017-05-17 11:42:42 -07:00
Leonidas Galanis	e7ae4a3a02	Max open files mutable Summary: Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache. Closes https://github.com/facebook/rocksdb/pull/2185 Differential Revision: D4979189 Pulled By: yiwu-arbug fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda	2017-05-03 21:13:14 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Tomas Kolda	04d58970cb	AIX and Solaris Sparc Support Summary: Replacement of #2147 The change was squashed due to a lot of conflicts. Closes https://github.com/facebook/rocksdb/pull/2194 Differential Revision: D4929799 Pulled By: siying fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581	2017-04-21 20:48:04 -07:00
Maysam Yabandeh	afff9951e2	Respect deprecated flag in table options Summary: Closes https://github.com/facebook/rocksdb/pull/2197 Differential Revision: D4932434 Pulled By: maysamyabandeh fbshipit-source-id: 6c83c12d6d47e3f0640ab84954944215968f266f	2017-04-21 17:42:10 -07:00
Aaron Gao	44fa8ece9b	change use_direct_writes to use_direct_io_for_flush_and_compaction Summary: Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get(). Closes https://github.com/facebook/rocksdb/pull/2117 Differential Revision: D4860912 Pulled By: lightmark fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19	2017-04-13 16:12:04 -07:00
Sagar Vemuri	343b59d6ee	Move various string utility functions into string_util Summary: This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper. Check the sub-commits for easier reviewing. Closes https://github.com/facebook/rocksdb/pull/2094 Differential Revision: D4837730 Pulled By: sagar0 fbshipit-source-id: 344278a	2017-04-06 14:54:12 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00

1 2 3

136 Commits