Summary:
1. The buckifier script assume each test "foo" comes with a .cc file of the same name (i.e. foo.cc). Update cassandra tests to follow this pattern so that the buckifier script can recognize them.
2. add blob_db_test
Closes https://github.com/facebook/rocksdb/pull/2506
Differential Revision: D5331517
Pulled By: yiwu-arbug
fbshipit-source-id: 86f3eba471fc621186ab44cbd073b6162cde8e57
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345
Differential Revision: D5210732
Pulled By: maysamyabandeh
fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
Summary:
With c7004840d2, it's safe to open a DB with different prefix extractor. So it's safe to skip prefix extractor check.
Closes https://github.com/facebook/rocksdb/pull/2474
Differential Revision: D5294700
Pulled By: siying
fbshipit-source-id: eeb500da795eecb29b8c9c56a14cfd4afda12ecc
Summary:
When we create a column based on the `string::c_str()`, we need to make sure that char array doesn't get deleted when calls to `string::append()` cause the string to expand.
Closes https://github.com/facebook/rocksdb/pull/2470
Differential Revision: D5285049
Pulled By: ajkr
fbshipit-source-id: f918dd426ff3c024e7a293dcb10448f10b6c98e8
Summary:
Make default impl return NoSupported so the db_blob
tests exist in a meaningful manner.
Replace std::thread to port::Thread
Closes https://github.com/facebook/rocksdb/pull/2465
Differential Revision: D5275563
Pulled By: yiwu-arbug
fbshipit-source-id: cedf1a18a2c05e20d768c1308b3f3224dbd70ab6
Summary:
`CompressBlock()` will return the uncompressed slice (i.e. `Slice(value_unc)`) if compression ratio is not good enough. This is undesired. We need to always assign the compressed slice to `value`.
Closes https://github.com/facebook/rocksdb/pull/2447
Differential Revision: D5244682
Pulled By: yiwu-arbug
fbshipit-source-id: 6989dd8852c9622822ba9acec9beea02007dff09
Summary:
I'm trying to improve unit test of blob db. I'm rewriting blob db test. In this patch:
* Rewrite tests of basic put/write/delete operations.
* Add disable_background_tasks to BlobDBOptionsImpl to allow me not running any background job for basic unit tests.
* Move DestroyBlobDB out from BlobDBImpl to be a standalone function.
* Remove all garbage collection related tests. Will rewrite them in following patch.
* Disabled compression test since it is failing. Will fix in a followup patch.
Closes https://github.com/facebook/rocksdb/pull/2446
Differential Revision: D5243306
Pulled By: yiwu-arbug
fbshipit-source-id: 157c71ad3b699307cb88baa3830e9b6e74f8e939
Summary:
Added a flag, `ignore_unknown_options`, to skip unknown options when loading an options file (using `LoadLatestOptions`/`LoadOptionsFromFile`) or while verifying options (using `CheckOptionsCompatibility`). This will help in downgrading the db to an older version.
Also added `--ignore_unknown_options` flag to ldb
**Example Use case:**
In MyRocks, if copying from newer version to older version, it is often impossible to start because of new RocksDB options that don't exist in older version, even though data format is compatible.
MyRocks uses these load and verify functions in [ha_rocksdb.cc::check_rocksdb_options_compatibility](e004fd9f41/storage/rocksdb/ha_rocksdb.cc (L3348-L3401)).
**Test Plan:**
Updated the unit tests.
`make check`
ldb:
$ ./ldb --db=/tmp/test_db --create_if_missing put a1 b1
OK
Now edit /tmp/test_db/<OPTIONS-file> and add an unknown option.
Try loading the options now, and it fails:
$ ./ldb --db=/tmp/test_db --try_load_options get a1
Failed: Invalid argument: Unrecognized option DBOptions:: abcd
Passes with the new --ignore_unknown_options flag
$ ./ldb --db=/tmp/test_db --try_load_options --ignore_unknown_options get a1
b1
Closes https://github.com/facebook/rocksdb/pull/2423
Differential Revision: D5212091
Pulled By: sagar0
fbshipit-source-id: 2ec17636feb47dc0351b53a77e5f15ef7cbf2ca7
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433
Differential Revision: D5216946
Pulled By: ajkr
fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
Summary:
At the beginning of write batch write, grab the latest sequence from base db and assume sequence number will increment by 1 for each put and delete, and write the exact sequence number with each put. This is assuming we are the only writer to increment sequence number (no external file ingestion, etc) and there should be no holes in the sequence number.
Also having some minor naming changes.
Closes https://github.com/facebook/rocksdb/pull/2402
Differential Revision: D5176134
Pulled By: yiwu-arbug
fbshipit-source-id: cb4712ee44478d5a2e5951213a10b72f08fe8c88
Summary:
USE_CLANG=1 make -j32 analyze
The two errors would disappear after the assertion.
Closes https://github.com/facebook/rocksdb/pull/2416
Differential Revision: D5193526
Pulled By: maysamyabandeh
fbshipit-source-id: 16a21f18f68023f862764dd3ab9e00ca60b0eefa
Summary:
… headers
https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB.
We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term.
make check -j64
Closes https://github.com/facebook/rocksdb/pull/2380
Differential Revision: D5177896
Pulled By: lightmark
fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
Summary:
Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch.
Stacking on #2375.
Closes https://github.com/facebook/rocksdb/pull/2385
Differential Revision: D5148358
Pulled By: yiwu-arbug
fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
Summary:
Re-enable blob_db_test with some update:
* Commented out delay at the end of GC tests. Will update the logic later with sync point to properly trigger GC.
* Added some helper functions.
Also update make files to include blob_dump tool.
Closes https://github.com/facebook/rocksdb/pull/2375
Differential Revision: D5133793
Pulled By: yiwu-arbug
fbshipit-source-id: 95470b26d0c1f9592ba4b7637e027fdd263f425c
Summary:
Fixing the build errors seen with GCC 4.8.1.
```
Makefile:105: Warning: Compiling in debug mode. Don't use the resulting binary in production
utilities/blob_db/blob_dump_tool.cc: In member function ‘rocksdb::Status rocksdb::blob_db::BlobDumpTool::DumpBlobLogFooter(uint64_t, uint64_t*)’:
utilities/blob_db/blob_dump_tool.cc:149:42: error: expected ‘)’ before ‘PRIu64’
fprintf(stdout, " Blob count : %" PRIu64 "\n", footer.GetBlobCount());
^
utilities/blob_db/blob_dump_tool.cc:149:76: error: spurious trailing ‘%’ in format [-Werror=format=]
fprintf(stdout, " Blob count : %" PRIu64 "\n", footer.GetBlobCount());
^
utilities/blob_db/blob_dump_tool.cc:149:76: error: too many arguments for format [-Werror=format-extra-args]
utilities/blob_db/blob_dump_tool.cc: In member function ‘rocksdb::Status rocksdb::blob_db::BlobDumpTool::DumpRecord(rocksdb::blob_db::BlobDumpTool::DisplayType, rocksdb::blob_db::BlobDumpTool::DisplayType, uint64_t*)’:
utilities/blob_db/blob_dump_tool.cc:161:49: error: expected ‘)’ before ‘PRIx64’
fprintf(stdout, "Read record with offset 0x%" PRIx64 " (%" PRIu64 "):\n",
^
utilities/blob_db/blob_dump_tool.cc:162:27: error: spurious trailing ‘%’ in format [-Werror=format=]
*offset, *offset);
^
utilities/blob_db/blob_dump_tool.cc:162:27: error: too many arguments for format [-Werror=format-extra-args]
utilities/blob_db/blob_dump_tool.cc:176:38: error: expected ‘)’ before ‘PRIu64’
fprintf(stdout, " blob size : %" PRIu64 "\n", record.GetBlobSize());
^
utilities/blob_db/blob_dump_tool.cc:176:71: error: spurious trailing ‘%’ in format [-Werror=format=]
fprintf(stdout, " blob size : %" PRIu64 "\n", record.GetBlobSize());
^
utilities/blob_db/blob_dump_tool.cc:176:71: error: too many arguments for format [-Werror=format-extra-args]
utilities/blob_db/blob_dump_tool.cc:178:38: error: expected ‘)’ before ‘PRIu64’
fprintf(stdout, " time : %" PRIu64 "\n", record.GetTimeVal());
^
utilities/blob_db/blob_dump_tool.cc:178:70: error: spurious trailing ‘%’ in format [-Werror=format=]
fprintf(stdout, " time : %" PRIu64 "\n", record.GetTimeVal());
^
utilities/blob_db/blob_dump_tool.cc:178:70: error: too many arguments for format [-Werror=format-extra-args]
utilities/blob_db/blob_dump_tool.cc:214:38: error: expected ‘)’ before ‘PRIu64’
fprintf(stdout, " sequence : %" PRIu64 "\n", record.GetSN());
^
utilities/blob_db/blob_dump_tool.cc:214:65: error: spurious trailing ‘%’ in format [-Werror=format=]
fprintf(stdout, " sequence : %" PRIu64 "\n", record.GetSN());
```
Closes https://github.com/facebook/rocksdb/pull/2359
Differential Revision: D5117684
Pulled By: sagar0
fbshipit-source-id: 7480346bcd96205fcae890927c5e68cf004e87be
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337
Differential Revision: D5100261
Pulled By: lightmark
fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
Summary:
Fixing two types of clang-analyzer false positives:
* db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated.
* ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running.
Closes https://github.com/facebook/rocksdb/pull/2334
Differential Revision: D5093801
Pulled By: yiwu-arbug
fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d
Summary:
Looks like std::snprintf is not available on all platforms (e.g. MSVC 2010). Change it back to snprintf, where we have a macro in port.h to workaround compatibility.
Closes https://github.com/facebook/rocksdb/pull/2308
Differential Revision: D5070988
Pulled By: yiwu-arbug
fbshipit-source-id: bedfc1660bab0431c583ad434b7e68265e1211b1
Summary:
snprintf is in <stdio.h> and not in namespace std.
Closes https://github.com/facebook/rocksdb/pull/2287
Reviewed By: anirbanr-fb
Differential Revision: D5054752
Pulled By: yiwu-arbug
fbshipit-source-id: 356807ec38f3c7d95951cdb41f31a3d3ae0714d4
Summary:
- Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
- Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type.
- Migrated the "ldb idump" subcommand to use this function
- The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
Closes https://github.com/facebook/rocksdb/pull/2232
Differential Revision: D4976007
Pulled By: ajkr
fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
Summary:
Remove double buffering on RandomRead on Windows.
With more logic appear in file reader/write Read no longer
obeys forwarding calls to Windows implementation.
Previously direct_io (unbuffered) was only available on Windows
but now is supported as generic.
We remove intermediate buffering on Windows.
Remove random_access_max_buffer_size option which was windows specific.
Non-zero values for that opton introduced unnecessary lock contention.
Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are
no longer necessary.
Add aligned buffer reads for cases when requested reads exceed read ahead size.
Closes https://github.com/facebook/rocksdb/pull/2105
Differential Revision: D4847770
Pulled By: siying
fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f
Summary:
These code paths forked when checkpoint was introduced by copy/pasting the core backup logic. Over time they diverged and bug fixes were sometimes applied to one but not the other (like fix to include all relevant WALs for 2PC), or it required extra effort to fix both (like fix to forge CURRENT file). This diff reunites the code paths by extracting the core logic into a function, CreateCustomCheckpoint(), that is customizable via callbacks to implement both checkpoint and backup.
Related changes:
- flush_before_backup is now forcibly enabled when 2PC is enabled
- Extracted CheckpointImpl class definition into a header file. This is so the function, CreateCustomCheckpoint(), can be called by internal rocksdb code but not exposed to users.
- Implemented more functions in DummyDB/DummyLogFile (in backupable_db_test.cc) that are used by CreateCustomCheckpoint().
Closes https://github.com/facebook/rocksdb/pull/1932
Differential Revision: D4622986
Pulled By: ajkr
fbshipit-source-id: 157723884236ee3999a682673b64f7457a7a0d87
Summary:
This is useful when we put the entries in the block cache for accounting
purposes and do not expect it to be used after it is released. If the cache does not
erase the item in such cases not only the performance of cache is
negatively affected but the item's destructor not being called at the
time of release might violate the assumptions about the lifetime of the
object.
The new change adds a force_erase option to the Release method and
returns a boolean to indicate whehter the item is successfully deleted.
Closes https://github.com/facebook/rocksdb/pull/2180
Differential Revision: D4916032
Pulled By: maysamyabandeh
fbshipit-source-id: 94409a346069923cac9de8e57adc313b4ed46f28
Summary:
Replacement of #2147
The change was squashed due to a lot of conflicts.
Closes https://github.com/facebook/rocksdb/pull/2194
Differential Revision: D4929799
Pulled By: siying
fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581
Summary:
This was requested by a customer who wants to proactively monitor whether any valid backups are available. The existing performance was poor because Open() serially reads every small meta-file (one per backup), which was slow on HDFS.
Now we only read the minimum number of meta-files to find `max_valid_backups_to_open` valid backups. The customer mentioned above can just set it to one.
Closes https://github.com/facebook/rocksdb/pull/2151
Differential Revision: D4882564
Pulled By: ajkr
fbshipit-source-id: cb0edf9e8ac693e4d5f24902e725a011ed8c0c2f
Summary:
Upgrading a shared lock was silently succeeding because the actual locking code was skipped. This is because if the keys are tracked, it is assumed that they are already locked and do not require locking. Fix this by recording in tracked keys whether the key was locked exclusively or not.
Note that lock downgrades are impossible, which is the behaviour we expect.
This fixesfacebook/mysql-5.6#587.
Closes https://github.com/facebook/rocksdb/pull/2122
Differential Revision: D4861489
Pulled By: IslamAbdelRahman
fbshipit-source-id: 58c7ebe7af098bf01b9774b666d3e9867747d8fd
Summary:
Extend TransactionOptions to include max_write_batch_size which determines the maximum size of the writebatch representation. If memory limit is exceeded, the operation will abort with subcode kMemoryLimit.
Closes https://github.com/facebook/rocksdb/pull/2124
Differential Revision: D4861842
Pulled By: lth
fbshipit-source-id: 46fd172ea67cc90bbba829bf0d70cfab2261c161
Summary:
DBIter, and in-turn NewDBIterator and NewArenaWrappedDBIterator, take a bunch of params. They can be reduced by passing in ReadOptions directly instead of passing in every new param separately. It also seems much cleaner as a bunch of the params towards the end seem to be optional.
(Recently I introduced max_skippable_internal_keys, which added one more to the already huge count).
Idea courtesy IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/2116
Differential Revision: D4857128
Pulled By: sagar0
fbshipit-source-id: 7d239df094b94bd9ea79d145cdf825478ac037a8
Summary:
This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper.
Check the sub-commits for easier reviewing.
Closes https://github.com/facebook/rocksdb/pull/2094
Differential Revision: D4837730
Pulled By: sagar0
fbshipit-source-id: 344278a
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach.
- L0->L0 is triggered when base level is unavailable due to pending compactions
- L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes.
- Subcompactions are disabled for L0->L0 since we want to output one file.
- Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0.
Closes https://github.com/facebook/rocksdb/pull/2027
Differential Revision: D4760318
Pulled By: ajkr
fbshipit-source-id: 9d07183
Summary:
I've needed Env timing measurements a few times now, so finally built something for it.
Closes https://github.com/facebook/rocksdb/pull/2073
Differential Revision: D4811231
Pulled By: ajkr
fbshipit-source-id: 218a249
Summary:
Refactor WriteImpl() so when I plug-in the pipeline write code (which is
an alternative approach for WriteThread), some of the logic can be
reuse. I split out the following methods from WriteImpl():
* PreprocessWrite()
* HandleWALFull() (previous MaybeFlushColumnFamilies())
* HandleWriteBufferFull()
* WriteToWAL()
Also adding a constructor to WriteThread::Writer, and move WriteContext into db_impl.h.
No real logic change in this patch.
Closes https://github.com/facebook/rocksdb/pull/2042
Differential Revision: D4781014
Pulled By: yiwu-arbug
fbshipit-source-id: d45ca18
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080
Differential Revision: D4821141
Pulled By: siying
fbshipit-source-id: ca7d768
Summary:
some fbcode services override it, we need to keep it virtual.
original change: #1756
Closes https://github.com/facebook/rocksdb/pull/2065
Differential Revision: D4808123
Pulled By: ajkr
fbshipit-source-id: 5eaeea7
Summary:
There still are many warnings (most of them about invalid printf format
for long long), but it builds if FAIL_ON_WARNINGS is disabled.
Closes https://github.com/facebook/rocksdb/pull/2052
Differential Revision: D4807355
Pulled By: siying
fbshipit-source-id: ef03786
Summary:
previously we only cleaned up .tmp files under "shared/" and "private/" directories in case the previous backup failed. we need to do the same for "shared_checksum/"; otherwise, the subsequent backup will fail if it tries to backup at least one of the same files.
Closes https://github.com/facebook/rocksdb/pull/2062
Differential Revision: D4805599
Pulled By: ajkr
fbshipit-source-id: eaa6088
Summary:
Add a parameter to Checkpoint::CreateCheckpoint() so that flush can be skipped if total log file size is within a threshold.
Closes https://github.com/facebook/rocksdb/pull/1993
Differential Revision: D4719842
Pulled By: siying
fbshipit-source-id: 4f9d9e1
Summary:
PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the
user. This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756
Differential Revision: D4391738
Pulled By: maysamyabandeh
fbshipit-source-id: 6f3edd3
Summary:
This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case.
The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor.
Closes https://github.com/facebook/rocksdb/pull/1946
Differential Revision: D4646489
Pulled By: ajkr
fbshipit-source-id: d8e0161
Summary:
InsertPathnameToSizeBytes() is called on shared/ and shared_checksum/ directories, which only exist for certain configurations. If we try to list a non-existent directory's contents, some Envs will dump an error message. Let's avoid this by checking whether the directory exists before listing its contents.
Closes https://github.com/facebook/rocksdb/pull/1895
Differential Revision: D4596301
Pulled By: ajkr
fbshipit-source-id: c809679
Summary:
`WriteBatchWithIndex` has an incorrect implicitly-generated move constructor (it will copy the pointer causing a double-free on destruction). Just switch to `unique_ptr` so we get correct move semantics for free.
Closes https://github.com/facebook/rocksdb/pull/1899
Differential Revision: D4598896
Pulled By: ajkr
fbshipit-source-id: 2373d47
Summary:
As the last step in backup creation, the .tmp directory is renamed omitting the .tmp suffix. In case the process terminates before this, the .tmp directory will be left behind. Even if this happens, we want future backups to succeed, so I added some checks/cleanup for this case.
Closes https://github.com/facebook/rocksdb/pull/1896
Differential Revision: D4597323
Pulled By: ajkr
fbshipit-source-id: 48900d8
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859
Differential Revision: D4541292
Pulled By: sagar0
fbshipit-source-id: 5b3a6ca
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823
Differential Revision: D4492902
Pulled By: siying
fbshipit-source-id: c74cb11
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836
Differential Revision: D4505357
Pulled By: IslamAbdelRahman
fbshipit-source-id: 07b28d8
Summary:
The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config.
Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file.
This diff generalizes the Env registration logic to work with arbitrary types.
- Generalized registration and instantiation code by templating them
- The entire implementation is in a header file as that's Google style guide's recommendation for template definitions
- Pattern match with std::regex_match rather than checking prefix, which was the previous behavior
- Rename functions/files to be non-Env-specific
Closes https://github.com/facebook/rocksdb/pull/1776
Differential Revision: D4421933
Pulled By: ajkr
fbshipit-source-id: 34647d1
Summary:
using ~0UL for mask uses a uint32_t at least in MSVC, but a uint64_t is required for it to work properly
Closes https://github.com/facebook/rocksdb/pull/1777
Differential Revision: D4444004
Pulled By: yiwu-arbug
fbshipit-source-id: 057cc42
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*
To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768
Differential Revision: D4403265
Pulled By: reidHoruff
fbshipit-source-id: ce800ff
Summary:
also change variable name `direct_io_` to `use_direct_io_` in WritableFile to make it consistent with read path.
Closes https://github.com/facebook/rocksdb/pull/1770
Differential Revision: D4416435
Pulled By: lightmark
fbshipit-source-id: 4143c53
Summary:
Previously the only way to increment a handle's refcount was to invoke Lookup(), which (1) did hash table lookup to get cache handle, (2) incremented that handle's refcount. For a future DeleteRange optimization, I added a function, Ref(), for when the caller already has a cache handle and only needs to do (2).
Closes https://github.com/facebook/rocksdb/pull/1761
Differential Revision: D4397114
Pulled By: ajkr
fbshipit-source-id: 9addbe5
Summary:
Enable directIO on WritableFileImpl::Append
with offset being current length of the file.
Enable UniqueID tests on Windows, disable others but
leeting them to compile. Unique tests are valuable to
detect failures on different filesystems and upcoming
ReFS.
Clear output in WinEnv Getchildren.This is different from
previous strategy, do not touch output on failure.
Make sure DBTest.OpenWhenOpen works with windows error message
Closes https://github.com/facebook/rocksdb/pull/1746
Differential Revision: D4385681
Pulled By: IslamAbdelRahman
fbshipit-source-id: c07b702
Summary:
In the test the last change to AAAZZZ in handles[1] is deleting it. The
result of the get must be NotFound then. Previosuly the test did not
check for the return value of Get and assumed that the status is ok. It
then move ahead asserting the returned value. The passed-by-reference
string value however was not changed (since the key was not found) and
the asserted value is what it contained before doing the Get.
Closes https://github.com/facebook/rocksdb/pull/1753
Differential Revision: D4390982
Pulled By: maysamyabandeh
fbshipit-source-id: dd55a34
Summary:
Fix some memory leaks in the test. Also rename the test class name from DBTest to CheckpointTest to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/1752
Differential Revision: D4390355
Pulled By: siying
fbshipit-source-id: 0fa388a
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
1. value 100 byte: 1.8% regular, 1.2% merge values
2. value 1k byte: 11.5% regular, 7.5% merge values
3. value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732
Differential Revision: D4374613
Pulled By: maysamyabandeh
fbshipit-source-id: a077f1a
Summary:
#1733 started using SizeFileBytes(), so our dummy log file implementation should stop asserting that this function isn't called.
Closes https://github.com/facebook/rocksdb/pull/1740
Differential Revision: D4376055
Pulled By: ajkr
fbshipit-source-id: 2854d89
Summary:
Since the backup work as snapshot, we should only copy
the bytes of the wal while we get the alive files.
Closes https://github.com/facebook/rocksdb/pull/1733
Differential Revision: D4373457
Pulled By: ajkr
fbshipit-source-id: 389318f
Summary:
File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large.
Closes https://github.com/facebook/rocksdb/pull/1728
Differential Revision: D4371083
Pulled By: siying
fbshipit-source-id: 579e14c
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724
Differential Revision: D4368319
Pulled By: siying
fbshipit-source-id: cc2c746
Summary:
Improve cache options logging to info log.
Also print the value of
cache_index_and_filter_blocks_with_high_priority.
Closes https://github.com/facebook/rocksdb/pull/1709
Differential Revision: D4358776
Pulled By: yiwu-arbug
fbshipit-source-id: 8f030a0
Summary:
In persistent_cache/block_cache_tier.cc, timers are never restarted, so the latency measured is not correct.
Closes https://github.com/facebook/rocksdb/pull/1707
Differential Revision: D4355828
Pulled By: siying
fbshipit-source-id: cd5f9e1
Summary:
Fixes compile error:
In file included from ./util/statistics.h:17:0,
from ./util/stop_watch.h:8,
from ./util/perf_step_timer.h:9,
from ./util/iostats_context_imp.h:8,
from ./util/posix_logger.h:27,
from ./port/util_logger.h:18,
from ./db/auto_roll_logger.h:15,
from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656
Differential Revision: D4318702
Pulled By: yiwu-arbug
fbshipit-source-id: 8c5d17a
Summary:
hopefully the last of the gcc-7 compile errors
Closes https://github.com/facebook/rocksdb/pull/1675
Differential Revision: D4332106
Pulled By: IslamAbdelRahman
fbshipit-source-id: 139448c
Summary:
We used to treat any failure to read a backup's meta-file as if the backup were corrupted; however, we should distinguish corruption errors from errors in the backup Env. This fixes an issue where callers would get inconsistent results from GetBackupInfo() if they called it on an engine that encountered Env error during initialization. Now we fail Initialize() in this case so callers cannot invoke GetBackupInfo() on such engines.
Closes https://github.com/facebook/rocksdb/pull/1654
Differential Revision: D4318573
Pulled By: ajkr
fbshipit-source-id: f7a7c54
Summary:
The two tests keep failing in travis. Disable them and will fix later.
Closes https://github.com/facebook/rocksdb/pull/1648
Differential Revision: D4316389
Pulled By: yiwu-arbug
fbshipit-source-id: 0a370e7
Summary:
Some users are assuming NotFound means the backup does not
exist at the provided path, which is a reasonable assumption. We need to
stop returning NotFound for system errors.
Depends on #1644
Closes https://github.com/facebook/rocksdb/pull/1645
Differential Revision: D4312233
Pulled By: ajkr
fbshipit-source-id: 5343c10
Summary:
This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (ie. it's possible that request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold an set a transaction ids, instead of just one, and adding a flag specifying whether this lock is currently held with exclusive access or not.
Some implementation notes:
- Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`.
- Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit.
- Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now.
Closes https://github.com/facebook/rocksdb/pull/1573
Differential Revision: D4239097
Pulled By: lth
fbshipit-source-id: da7c074
Summary:
Now that we have userspace persisted cache, we don't need flashcache anymore.
Closes https://github.com/facebook/rocksdb/pull/1588
Differential Revision: D4245114
Pulled By: igorcanadi
fbshipit-source-id: e2c1c72
Summary:
disable UBSAN for functions with intentional left shift on -ve number / overflow
These functions are
rocksdb:: Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest:: Key
Closes https://github.com/facebook/rocksdb/pull/1577
Differential Revision: D4240801
Pulled By: IslamAbdelRahman
fbshipit-source-id: 3e1caf6
Summary:
The persistent cache is designed to hop over errors and return key not found. So far, it has shown resilience to write errors, encoding errors, data corruption etc. It is not resilient against disappearing files/directories. This was exposed during testing when multiple instances of persistence cache was started sharing the same directory simulating an unpredictable filesystem environment.
This patch
- makes the write code path more resilient to errors while creating files
- makes the read code path more resilient to handle situation where files are not found
- added a test that does negative write/read testing by removing the directory while writes are in progress
Closes https://github.com/facebook/rocksdb/pull/1472
Differential Revision: D4143413
Pulled By: kradhakrishnan
fbshipit-source-id: fd25e9b
Summary:
Exposing persistent cache stats (counters) to the user via public API.
Closes https://github.com/facebook/rocksdb/pull/1485
Differential Revision: D4155274
Pulled By: siying
fbshipit-source-id: 30a9f50
Summary:
Currently, deadlock cycles are held in std::unordered_map. The problem with it is that it allocates/deallocates memory on every insertion/deletion. This limits throughput since we're doing this expensive operation while holding a global mutex. Fix this by using a vector which caches memory instead.
Running the deadlock stress test, this change increased throughput from 39k txns/s -> 49k txns/s. The effect is more noticeable in MyRocks.
Closes https://github.com/facebook/rocksdb/pull/1545
Differential Revision: D4205662
Pulled By: lth
fbshipit-source-id: ff990e4
Summary:
Dont use c_str() of temp std::string in RocksLuaCompactionFilter::Name()
Closes https://github.com/facebook/rocksdb/pull/1535
Differential Revision: D4199094
Pulled By: IslamAbdelRahman
fbshipit-source-id: e56ce62
Summary:
This diff includes an implementation of CompactionFilter that allows
users to write CompactionFilter in Lua. With this ability, users can
dynamically change compaction filter logic without requiring building
the rocksdb binary and restarting the database.
To compile, WITH_LUA_PATH must be specified to the base directory
of lua.
Closes https://github.com/facebook/rocksdb/pull/1478
Differential Revision: D4150138
Pulled By: yhchiang
fbshipit-source-id: ed84222
Summary:
When option_change_migration_test decides to go with a full compaction, we don't force a compaction but allow trivial move. This can cause assert failure if the destination is level 0. Fix it by forcing the full compaction to skip trivial move if the destination level is L0.
Closes https://github.com/facebook/rocksdb/pull/1518
Differential Revision: D4183610
Pulled By: siying
fbshipit-source-id: dea482b
Summary:
Originally sequence ids were calculated, in recovery, based off of the first seqid found if the first log recovered. The working seqid was then incremented from that value based on every insertion that took place. This was faulty because of the potential for missing log files or inserts that skipped the WAL. The current recovery scheme grabs sequence from current recovering batch and increments using memtableinserter to track how many actual inserts take place. This works for 2PC batches as well scenarios where some logs are missing or inserts that skip the WAL.
Closes https://github.com/facebook/rocksdb/pull/1486
Differential Revision: D4156064
Pulled By: reidHoruff
fbshipit-source-id: a6da8d9
Summary:
copied from: 5ebfd2623a
Opening existing RocksDB attempts recovery from log files, which uses
wrong sequence number to create the memtable. This is a regression
introduced in change a400336.
This change includes a test demonstrating the problem, without the fix
the test fails with "Operation failed. Try again.: Transaction could not
check for conflicts for operation at SequenceNumber 1 as the MemTable
only contains changes newer than SequenceNumber 2. Increasing the value
of the max_write_buffer_number_to_maintain option could reduce the
frequency of this error"
This change is a joint effort by Peter 'Stig' Edwards thatsafunnyname
and me.
Closes https://github.com/facebook/rocksdb/pull/1458
Differential Revision: D4143791
Pulled By: reidHoruff
fbshipit-source-id: 5a25033
Summary:
The general convention in RocksDB is to use GFLAGS instead of google. Fixing the anomaly.
Closes https://github.com/facebook/rocksdb/pull/1470
Differential Revision: D4149213
Pulled By: kradhakrishnan
fbshipit-source-id: 2dafa53
Summary:
Note: reviewed in https://reviews.facebook.net/D65115
- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464
Differential Revision: D4131753
Pulled By: ajkr
fbshipit-source-id: be86559
Summary: OptionChangeMigration() to support FIFO compaction. If the DB before migration is using FIFO compaction, nothing should be done. If the desitnation option is FIFO options, compact to one single L0 file if the source has more than one levels.
Test Plan: Run option_change_migration_test
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65289
Summary: Make `IsDeadlockDetect()` virtual member of base class `Transaction` for ease of use in MyRocks
Test Plan: compiles. compiles into MyRocks call-site.
Reviewers: mung
Reviewed By: mung
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65385
Summary: Implement deadlock detection. This is done by maintaining a TxnID -> TxnID map which represents the edges in the wait for graph (this is named `wait_txn_map_`).
Test Plan: transaction_test
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64491
Summary: Auto-compactions will change memory usage of DB but memory_test
didn't take it into account. This PR disable auto compactions in the
test and hopefully it fixes its flakyness.
Test Plan:
UBSAN build used to catch the flakyness. Run `make ubsan_check` and it
passes.
Summary:
reland https://reviews.facebook.net/D62523
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: sdong, yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65211
Summary:
MyRocks hit a regression, @mung generated perf reports showing that the reason is the cost of calling `GetDBOptions()` inside `GetFromBatchAndDB()`
This diff avoid calling `GetDBOptions` and use the `ImmutableDBOptions` instead
Test Plan: make check -j64
Reviewers: sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, mung
Differential Revision: https://reviews.facebook.net/D65151
Summary: Modifies the lock info export test to test multiple column families after I was experiencing a bug while developing the MyRocks front-end for this.
Test Plan: is test.
Reviewers: mung
Reviewed By: mung
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64725
Summary:
This exposes a transactions state through a public api rather than through a public member variable. I also do some name refactoring.
ExecutionStatus => TransactionState
exec_status_ => trx_state_
Test Plan: It compiles and transaction_test passes.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, mung, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D64689
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.
```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628
RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```
Test Plan: Two unit tests added.
Reviewers: sdong, yiwu, IslamAbdelRahman
Reviewed By: yiwu
Subscribers: hermanlee4, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D64599
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: andrewkr, yhchiang, yiwu, sdong
Reviewed By: sdong
Subscribers: hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62523
Summary:
Currently there is no mechanism to create persistent cache from
headers. Adding a simple factory method to create a simple persistent cache with
default or NVM optimized settings.
note: Any idea to test this factory is appreciated.
Test Plan: None
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64527
Summary:
This diff does 3 things:
Expose TransactionID so that we can identify transactions when we retrieve locking and lock wait information. This is exposed as `Transaction::GetID`.
Expose lock state information by locking all stripes in all column families and copying their contents to a data structure. This is exposed as `TransactionDB::GetLockStatusData`.
Adds support for tracking the transaction and the key being waited on, and exposes this as `Transaction::GetWaitingTxn`.
Test Plan: unit tests
Reviewers: horuff, sdong
Reviewed By: sdong
Subscribers: vasilep, hermanlee4, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D64413
* enable cmake to work on linux and osx also
* port part of build_detect_platform not covered by thirdparty.inc
to cmake.
- detect fallocate()
- detect malloc_usable_size()
- detect JeMalloc
- detect snappy
* check for asan,tsan,ubsan
* create 'build_version.cc' in build directory.
* add `check` target to support 'make check'.
* add `tools` target to match its counterpart in Makefile.
* use `date` on non-win32 platforms.
* pass different cflags on non-win32 platforms
* detect pthead library using FindThread cmake module.
* enable CMP0042 to silence the cmake warning on osx
* reorder the linked libraries. because testutillib references gtest, to
enable the linker to find the referenced symbols, we need to put gtest
after testutillib.
Signed-off-by: Marcus Watts <mwatts@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
* hash_table_bench.cc: fix build without gflags
Signed-off-by: Kefu Chai <kchai@redhat.com>
* remove gtest from librocksdb linkage
testharness.cc is included in librocksdb sources, and it uses gtest. but
gtest is not supposed to be part of the public API of librocksdb. so, in
this change, the testharness.cc is moved out out librocksdb, and is
built as an object target, then linked with the tools and tests instead.
Signed-off-by: Marcus Watts <mwatts@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.
Test Plan: make all check
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62193
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.
Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.
In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.
Test Plan: unit test.
Reviewers: IslamAbdelRahman, sdong, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D61977
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader
Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )
Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61473
Summary:
Add API to WriteBatch to store range deletions in its buffer
which are later added to memtable. In the WriteBatch buffer, a range
deletion is encoded as "<optype><CF ID (optional)><begin key><end key>".
With this diff, the range tombstones are stored inline with the data in
the memtable. It's useful for now because the test cases rely on the
data being accessible via memtable. My next step is to store range
tombstones in a separate area in the memtable.
Test Plan: unit tests
Reviewers: IslamAbdelRahman, sdong, wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61401
Summary: make transactionDB working with StackableDB
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60705
Summary:
Travis times out with persistent_cache_test, there is no way to increase the
timeout, so lower the test case for Travis
Test Plan: Run on travis and sandcastle
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D61863
Summary:
add SIM_BLOCK_CACHE_HIT and SIM_BLOCK_CACHE_MISS tickers.
maybe can be combined with Histograms like DB_GET to evaluate the current setting of the size of block cache.
Test Plan: make all check
Reviewers: sdong, andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61803
Summary:
This is a proof of concept of a RocksDB blob log file. The actual value of the Put() is appended to a blob log using normal data block format, and the handle of the block is written as the value of the key in RocksDB.
The prototype only supports Put() and Get(). It doesn't support DB restart, garbage collection, Write() call, iterator, snapshots, etc.
Test Plan: Add unit tests.
Reviewers: arahut
Reviewed By: arahut
Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61485
Summary:
Travis provides a very low capability resource for testing and the
persistent cache tests are too aggressive. It causes the kernel to not have
enough time to do file-system meta data sync operation, causing open file
descriptors to increase and available space to decrease.
This change skips certain tests
Test Plan: Run on travis
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D61755
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).
Test Plan: Add `date_tiered_test`.
Reviewers: dhruba, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61653
Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options.
Test Plan: Add unit tests for it.
Reviewers: andrewkr, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60783
Summary: Fix failed persistent cache test introduced by recent PR to support cross-platform build.
Test Plan: Run persistent_cache_test (on multiple platforms)
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61467
Summary: Fix problem when casting `size_t` to `uint64_t`. Avoid usage of `size_t` when possible.
Test Plan: Pass related test on Travis CI.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61443
Summary: MyRocks is adding support for the user of the SstFileWriter which needs a comparator. It would be more convenient to get the comparator from the column family (which already has to have it) than to have caller keep track of it.
Test Plan: Standard tests (adding one for the new method)
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61155
Summary: Regex support for c++ is very inconsistent across compilers, converting
the logic to simple string manipulation.
Test Plan: Local test
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61377
Summary:
Fixing build break on Mac
(1) uint64_t fix
(2) O_DIRECT works only for Linux
Test Plan: Build and test on Mac and Unix
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61353
Summary:
Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format.
There is still on-going work on this diff. More refactoring is necessary.
Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added.
Reviewers: sdong
Reviewed By: sdong
Subscribers: arahut, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60027
Summary:
The patch is a continuation of part 5. It glues the abstraction for
file layout and metadata, and flush out the implementation of the API. It
adds unit tests for the implementation.
Test Plan: Run unit tests
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57549
EnvLibrados is a customized RocksDB Env to use RADOS as the backend file system of RocksDB. It overrides all file system related API of default Env. The easiest way to use it is just like following:
std::string db_name = "test_db";
std::string config_path = "path/to/ceph/config";
DB* db;
Options options;
options.env = EnvLibrados(db_name, config_path);
Status s = DB::Open(options, kDBPath, &db);
Then EnvLibrados will forward all file read/write operation to the RADOS cluster assigned by config_path. Default pool is db_name+"_pool".
There are some options that users could set for EnvLibrados.
- write_buffer_size. This variable is the max buffer size for WritableFile. After reaching the buffer_max_size, EnvLibrados will sync buffer content to RADOS, then clear buffer.
- db_pool. Rather than using default pool, users could set their own db pool name
- wal_dir. The dir for WAL files. Because RocksDB only has 2-level structure (dir_name/file_name), the format of wal_dir is "/dir_name"(CAN'T be "/dir1/dir2"). Default wal_dir is "/wal".
- wal_pool. Corresponding pool name for WAL files. Default value is db_name+"_wal_pool"
The example of setting options looks like following:
db_name = "test_db";
db_pool = db_name+"_pool";
wal_dir = "/wal";
wal_pool = db_name+"_wal_pool";
write_buffer_size = 1 << 20;
env_ = new EnvLibrados(db_name, config, db_pool, wal_dir, wal_pool, write_buffer_size);
DB* db;
Options options;
options.env = env_;
// The last level dir name should match the dir name in prefix_pool_map
options.wal_dir = "/tmp/wal";
// open DB
Status s = DB::Open(options, kDBPath, &db);
Librados is required to compile EnvLibrados. Then use "$make LIBRADOS=1" to compile RocksDB. If you want to only compile EnvLibrados test, just run "$ make env_librados_test LIBRADOS=1". To run env_librados_test, you need to have a running RADOS cluster with the configure file located in "../ceph/src/ceph.conf" related to "rocksdb/".