648 Commits

Author SHA1 Message Date
Peter Dillinger
00d58a370e Abandon use of folly::Optional (#6036)
Summary:
Had complications with LITE build and valgrind test.
Reverts/fixes small parts of PR https://github.com/facebook/rocksdb/issues/6007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6036

Test Plan:
make LITE=1 all check
and
ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 make -j24 db_bloom_filter_test && ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 ./db_bloom_filter_test

Differential Revision: D18512238

Pulled By: pdillinger

fbshipit-source-id: 37213cf0d309edf11c483fb4b2fb6c02c2cf2b28
2019-11-14 14:04:15 -08:00
Peter Dillinger
f059c7d9b9 New Bloom filter implementation for full and partitioned filters (#6007)
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.

Speed

The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
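
As a rough illustration of the first bullet, the sketch below maps a 32-bit hash onto a range with a multiply and a shift instead of `%`. The helper name `FastRange32` is chosen here for illustration and is not claimed to match the code in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Maps a 32-bit hash onto [0, range) without a division.
// Illustrative helper; the name is an assumption, not taken from the PR.
inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
  assert(range > 0);
  // The 64-bit product is strictly below range * 2^32, so its upper
  // 32 bits are always a valid index below range.
  return static_cast<uint32_t>((static_cast<uint64_t>(hash) * range) >> 32);
}
```

Unlike modulo, this uses the high bits of the hash, so it relies on the hash being well mixed across all 32 bits.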

Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):

$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
  Single filter net ns/op: 26.2825
  Random filter net ns/op: 150.459
    Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
  Single filter net ns/op: 63.2978
  Random filter net ns/op: 188.038
    Average FP rate %: 1.13823

Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.

The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
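
The "about two probes on average" figure can be checked with a small calculation. The sketch below assumes each probed bit is set independently with the same probability, which is a simplification; with about half the bits set and 6 probes it gives just under 2.

```cpp
#include <cassert>

// Expected number of probes examined by a short-circuiting query, assuming
// each probed bit is set independently with probability `fill`. The query
// stops at the first unset bit or after num_probes probes.
// fill = 1.0 models an "inside" query, which always examines every probe.
inline double ExpectedProbes(double fill, int num_probes) {
  assert(fill >= 0.0 && fill <= 1.0 && num_probes >= 1);
  double expected = 0.0;
  double reach = 1.0;  // probability that probe i is reached at all
  for (int i = 0; i < num_probes; ++i) {
    expected += reach;
    reach *= fill;
  }
  return expected;
}
```

For example, `ExpectedProbes(0.5, 6)` is 1.96875, while `ExpectedProbes(1.0, 6)` is 6.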

Accuracy

The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120). The improved accuracy for millions of keys in a single filter comes from using a 64-bit hash function (XXH3p). Design details are in code comments.

Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)

Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120

Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
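
To make the idea of a cache-local filter concrete, here is a minimal sketch in which every key touches exactly one 64-byte (512-bit) block. It is not the FastLocalBloom probing scheme from this PR: the class name, the stand-in hash (std::hash plus a mixing step instead of XXH3p), and the multiplicative probe sequence are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical cache-local Bloom filter: all probes for a key fall inside
// one 512-bit block. The vector is not forced to 64-byte alignment here,
// which a real implementation would want.
class CacheLocalBloomSketch {
 public:
  CacheLocalBloomSketch(uint32_t num_lines, int num_probes)
      : num_lines_(num_lines),
        num_probes_(num_probes),
        words_(static_cast<size_t>(num_lines) * kWordsPerLine) {
    assert(num_lines > 0 && num_probes > 0);
  }

  void Add(const std::string& key) {
    const uint64_t h = Hash64(key);
    uint64_t* line = &words_[LineStart(h)];
    uint32_t probe = static_cast<uint32_t>(h);
    for (int i = 0; i < num_probes_; ++i) {
      const uint32_t bit = probe >> 23;  // top 9 bits select one of 512 bits
      line[bit >> 6] |= uint64_t{1} << (bit & 63);
      probe *= 0x9e3779b9u;  // odd multiplier remixes for the next probe
    }
  }

  bool MayContain(const std::string& key) const {
    const uint64_t h = Hash64(key);
    const uint64_t* line = &words_[LineStart(h)];
    uint32_t probe = static_cast<uint32_t>(h);
    for (int i = 0; i < num_probes_; ++i) {
      const uint32_t bit = probe >> 23;
      if ((line[bit >> 6] & (uint64_t{1} << (bit & 63))) == 0) {
        return false;  // short-circuit on the first unset bit
      }
      probe *= 0x9e3779b9u;
    }
    return true;
  }

 private:
  static constexpr size_t kWordsPerLine = 8;  // 8 x 64 bits = 64 bytes

  // Stand-in 64-bit hash (assumption): std::hash followed by a mixing step.
  static uint64_t Hash64(const std::string& key) {
    uint64_t h = std::hash<std::string>{}(key);
    h ^= h >> 30;
    h *= 0xbf58476d1ce4e5b9ULL;
    h ^= h >> 27;
    h *= 0x94d049bb133111ebULL;
    h ^= h >> 31;
    return h;
  }

  // The upper 32 hash bits pick the block with a multiply and shift.
  size_t LineStart(uint64_t h) const {
    return static_cast<size_t>(((h >> 32) * num_lines_) >> 32) * kWordsPerLine;
  }

  uint32_t num_lines_;
  int num_probes_;
  std::vector<uint64_t> words_;
};
```

Because the upper hash bits choose the block and the lower bits drive the probes, a query touches a single cache line regardless of the number of probes.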

Compatibility

Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007

Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).

Differential Revision: D18294749

Pulled By: pdillinger

fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
2019-11-13 16:44:01 -08:00
Yun Tang
07a0ad3c29 Download bzip2 packages from sourceforge (#5995)
Summary:
According to bzip2's official [download page](http://www.bzip.org/downloads.html), bzip2 can be downloaded from SourceForge. This source is more credible than the web archive used previously.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5995

Differential Revision: D18377662

fbshipit-source-id: e8353f83d5d6ea6067f78208b7bfb7f0d5b49c05
2019-11-07 12:51:06 -08:00
Yanqin Jin
925250f42f Include db_stress_tool in rocksdb tools lib (#5950)
Summary:
include db_stress_tool in rocksdb tools lib

Test Plan (on devserver):
```
$make db_stress
$./db_stress
$make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5950

Differential Revision: D18044399

Pulled By: riversand963

fbshipit-source-id: 895585abbbdfd8b954965921dba4b1400b7af1b1
2019-10-21 19:40:35 -07:00
Yanqin Jin
e60cc0925c Expose db stress tests (#5937)
Summary:
Expose the db stress test by providing db_stress_tool.h as a public header.
This PR does the following:
- adds a new header, db_stress_tool.h, in include/rocksdb/
- renames db_stress.cc to db_stress_tool.cc
- adds a db_stress.cc which simply invokes a test function
- updates the Makefile accordingly

Test Plan (dev server):
```
make db_stress
./db_stress
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5937

Differential Revision: D17997647

Pulled By: riversand963

fbshipit-source-id: 1a8d9994f89ce198935566756947c518f0052410
2019-10-18 09:46:44 -07:00
Peter Dillinger
46ca51d430 filter_bench - a prelim tool for SST filter benchmarking (#5825)
Summary:
Example: using the tool before and after PR https://github.com/facebook/rocksdb/issues/5784 shows that
the refactoring, presumed performance-neutral, actually sped up SST
filters by about 3% to 8% (repeatable result):

Before:
-  Dry run ns/op: 22.4725
-  Single filter ns/op: 51.1078
-  Random filter ns/op: 120.133

After:
+  Dry run ns/op: 22.2301
+  Single filter run ns/op: 47.4313
+  Random filter ns/op: 115.9

Only tests filters for the block-based table (full filters and
partitioned filters - same implementation; not block-based filters),
which seems to be the recommended format/implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5825

Differential Revision: D17804987

Pulled By: pdillinger

fbshipit-source-id: 0f18a9c254c57f7866030d03e7fa4ba503bac3c5
2019-10-07 20:10:53 -07:00
Yanqin Jin
a9c5e8e944 Refactor deletefile_test.cc (#5822)
Summary:
Make DeleteFileTest inherit DBTestBase to avoid code duplication.

Test Plan (on devserver)
```
$make deletefile_test
$./deletefile_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5822

Differential Revision: D17456750

Pulled By: riversand963

fbshipit-source-id: 224e97967da7b98838a98981cd5095d3230a814f
2019-09-18 16:58:21 -07:00
Yanqin Jin
6a279037cf Refactor ObsoleteFilesTest to inherit from DBTestBase (#5820)
Summary:
Make class ObsoleteFilesTest inherit from DBTestBase.

Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make obsolete_files_test
$./obsolete_files_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5820

Differential Revision: D17452348

Pulled By: riversand963

fbshipit-source-id: b09f4581a18022ca2bfd79f2836c0bf7083f5f25
2019-09-18 11:52:17 -07:00
Peter Dillinger
68626249c3 Refactor/consolidate legacy Bloom implementation details (#5784)
Summary:
Refactoring to consolidate implementation details of legacy
Bloom filters. This helps to organize and document some related,
obscure code.

Also added make/cpp var TEST_CACHE_LINE_SIZE so that it's easy to
compile and run unit tests for non-native cache line size. (Fixed a
related test failure in db_properties_test.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5784

Test Plan:
make check, including recently added Bloom schema unit tests
(in ./plain_table_db_test && ./bloom_test), and including with
TEST_CACHE_LINE_SIZE=128U and TEST_CACHE_LINE_SIZE=256U. Tested the
schema tests with temporary fault injection into new implementations.

Some performance testing with modified unit tests suggests a small to moderate
improvement in speed.

Differential Revision: D17381384

Pulled By: pdillinger

fbshipit-source-id: ee42586da996798910fc45ac0b6289147f16d8df
2019-09-16 16:17:09 -07:00
Peter Dillinger
d3a6726f02 Revert changes from PR#5784 accidentally in PR#5780 (#5810)
Summary:
This will allow us to fix history by having the code changes for PR#5784 properly attributed to it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5810

Differential Revision: D17400231

Pulled By: pdillinger

fbshipit-source-id: 2da8b1cdf2533cfedb35b5526eadefb38c291f09
2019-09-16 11:38:53 -07:00
Peter Dillinger
aa2486b23c Refactor some confusing logic in PlainTableReader
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5780

Test Plan: existing plain table unit test

Differential Revision: D17368629

Pulled By: pdillinger

fbshipit-source-id: f25409cdc2f39ebe8d5cbb599cf820270e6b5d26
2019-09-13 10:26:36 -07:00
Wilfried Goesgens
fbab9913e2 upgrade gtest 1.7.0 => 1.8.1 for json result writing
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332

Differential Revision: D17242232

fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64
2019-09-09 11:24:11 -07:00
ENDOH takanao
3f2723a81b fix checking the '-march' flag (#5766)
Summary:
Hi! guys,

I got errors on the ARM machine.

before:

```console
$ make static_lib
...
g++: error: unrecognized argument in option '-march=armv8-a+crc+crypto'
g++: note: valid arguments to '-march=' are: armv2 armv2a armv3 armv3m armv4 armv4t armv5 armv5e armv5t armv5te armv6 armv6-m armv6j armv6k armv6kz armv6s-m armv6t2 armv6z armv6zk armv7 armv7-a armv7-m armv7-r armv7e-m armv7ve armv8-a armv8-a+crc armv8.1-a armv8.1-a+crc iwmmxt iwmmxt2 native
```

Thanks!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5766

Differential Revision: D17191117

fbshipit-source-id: 7a61e3a2a4a06f37faeb8429bd7314da54ec5868
2019-09-04 14:34:28 -07:00
sdong
d8a27d9331 Atomic Flush Crash Test also covers the case that WAL is enabled. (#5729)
Summary:
AtomicFlushStressTest is a powerful test, but right now we only run it for atomic_flush=true + disable_wal=true. We further extend it to the case where atomic_flush=false + disable_wal=false. All the workload generation and validation can stay the same.
The atomic flush crash test is also changed to switch between the two test scenarios. This makes the name "atomic flush crash test" out of sync with what it really does. We leave it as is to avoid trouble with the continuous test set-up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5729

Test Plan: Run "CRASH_TEST_KILL_ODD=188 TEST_TMPDIR=/dev/shm/ USE_CLANG=1 make whitebox_crash_test_with_atomic_flush", observe the settings used and see it passed.

Differential Revision: D16969791

fbshipit-source-id: 56e37487000ae631e31b0100acd7bdc441c04163
2019-08-22 16:32:55 -07:00
sdong
e1c468d16f Do readahead in VerifyChecksum() (#5713)
Summary:
Right now VerifyChecksum() doesn't do readahead. In some use cases, users won't be able to achieve good performance. With this change, RocksDB will do readahead with a default size, and users will be able to override the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713

Test Plan: Add a new unit test.

Differential Revision: D16860874

fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
2019-08-16 16:42:56 -07:00
Adam Retter
f2bf0b2d1e Fixes for building RocksJava releases on arm64v8
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5674

Differential Revision: D16870338

fbshipit-source-id: c8dac644b1479fa734b491f3a8d50151772290f7
2019-08-16 16:27:50 -07:00
Aaryaman Sagar
77273d4137 Fix TSAN failures in DistributedMutex tests (#5684)
Summary:
TSAN was not able to correctly instrument atomic bts and btr instructions, so
when TSAN is enabled, implement those with std::atomic::fetch_or and
std::atomic::fetch_and. Also disable tests that fail on TSAN with false
positives (we know these are false positives because this other verifiably
correct program fails with the same TSAN error <link>)
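
A minimal sketch of the substitution described above: setting or clearing a single bit and returning its previous value with std::atomic::fetch_or and std::atomic::fetch_and, which TSAN can instrument. The function names are hypothetical and the DistributedMutex code itself may be structured differently.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically sets bit `bit` and returns its previous value, the portable
// counterpart of an x86 `lock bts`. Hypothetical helper name.
inline bool AtomicBitTestAndSet(
    std::atomic<uint64_t>& word, unsigned bit,
    std::memory_order order = std::memory_order_seq_cst) {
  assert(bit < 64);
  const uint64_t mask = uint64_t{1} << bit;
  return (word.fetch_or(mask, order) & mask) != 0;
}

// Atomically clears bit `bit` and returns its previous value, the portable
// counterpart of an x86 `lock btr`. Hypothetical helper name.
inline bool AtomicBitTestAndReset(
    std::atomic<uint64_t>& word, unsigned bit,
    std::memory_order order = std::memory_order_seq_cst) {
  assert(bit < 64);
  const uint64_t mask = uint64_t{1} << bit;
  return (word.fetch_and(~mask, order) & mask) != 0;
}
```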

```
make clean
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g COMPILE_WITH_TSAN=1 make J=1 -j56 folly_synchronization_distributed_mutex_test
```

This is the code that fails with the same false positive under TSAN:
```
namespace {
class ExceptionWithConstructionTrack : public std::exception {
 public:
  explicit ExceptionWithConstructionTrack(int id)
      : id_{folly::to<std::string>(id)}, constructionTrack_{id} {}

  const char* what() const noexcept override {
    return id_.c_str();
  }

 private:
  std::string id_;
  TestConstruction constructionTrack_;
};

template <typename Storage, typename Atomic>
void transferCurrentException(Storage& storage, Atomic& produced) {
  assert(std::current_exception());
  new (&storage) std::exception_ptr(std::current_exception());
  produced->store(true, std::memory_order_release);
}

void concurrentExceptionPropagationStress(
    int numThreads,
    std::chrono::milliseconds milliseconds) {
  auto&& stop = std::atomic<bool>{false};
  auto&& exceptions = std::vector<std::aligned_storage<48, 8>::type>{};
  auto&& produced = std::vector<std::unique_ptr<std::atomic<bool>>>{};
  auto&& consumed = std::vector<std::unique_ptr<std::atomic<bool>>>{};
  auto&& consumers = std::vector<std::thread>{};
  for (auto i = 0; i < numThreads; ++i) {
    produced.emplace_back(new std::atomic<bool>{false});
    consumed.emplace_back(new std::atomic<bool>{false});
    exceptions.push_back({});
  }

  auto producer = std::thread{[&]() {
    auto counter = std::vector<int>(numThreads, 0);
    for (auto i = 0; true; i = ((i + 1) % numThreads)) {
      try {
        throw ExceptionWithConstructionTrack{counter.at(i)++};
      } catch (...) {
        transferCurrentException(exceptions.at(i), produced.at(i));
      }

      while (!consumed.at(i)->load(std::memory_order_acquire)) {
        if (stop.load(std::memory_order_acquire)) {
          return;
        }
      }

      consumed.at(i)->store(false, std::memory_order_release);
    }
  }};

  for (auto i = 0; i < numThreads; ++i) {
    consumers.emplace_back([&, i]() {
      auto counter = 0;
      while (true) {
        while (!produced.at(i)->load(std::memory_order_acquire)) {
          if (stop.load(std::memory_order_acquire)) {
            return;
          }
        }
        produced.at(i)->store(false, std::memory_order_release);

        try {
          auto storage = &exceptions.at(i);
          auto exc = folly::launder(
            reinterpret_cast<std::exception_ptr*>(storage));
          auto copy = std::move(*exc);
          exc->std::exception_ptr::~exception_ptr();
          std::rethrow_exception(std::move(copy));
        } catch (std::exception& exc) {
          auto value = std::stoi(exc.what());
          EXPECT_EQ(value, counter++);
        }

        consumed.at(i)->store(true, std::memory_order_release);
      }
    });
  }

  std::this_thread::sleep_for(milliseconds);
  stop.store(true);
  producer.join();
  for (auto& thread : consumers) {
    thread.join();
  }
}
} // namespace
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5684

Differential Revision: D16746077

Pulled By: miasantreble

fbshipit-source-id: 8af88dcf9161c05daec1a76290f577918638f79d
2019-08-14 17:01:31 -07:00
Aaryaman Sagar
38b03c840e Port folly/synchronization/DistributedMutex to rocksdb (#5642)
Summary:
This ports `folly::DistributedMutex` into RocksDB. The PR includes everything else needed to compile and use DistributedMutex as a component within folly. Most files are unchanged except for some portability stuff and includes.

For now, I've put this under `rocksdb/third-party`, but if there is a better folder to put this under, let me know. I also am not sure how or where to put unit tests for third-party stuff like this. It seems like gtest is included already, but I need to link with it from another third-party folder.

This also includes some other common components from folly

- folly/Optional
- folly/ScopeGuard (In particular `SCOPE_EXIT`)
- folly/synchronization/ParkingLot (A portable futex-like interface)
- folly/synchronization/AtomicNotification (The standard C++ interface for futexes)
- folly/Indestructible (For singletons that don't get destroyed without allocations)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5642

Differential Revision: D16544439

fbshipit-source-id: 179b98b5dcddc3075926d31a30f92fd064245731
2019-08-07 14:34:19 -07:00
Vijay Nadimpalli
d150e01474 New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support performance-sensitive use cases where doing a full online merge is not necessary. Example use cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL table in which a row is encoded as a K/V pair (as is done in MyRocks). If there are many columns and users update only one of them, we can use a merge operator to reduce write amplification. When a read query touches only one or two columns, this feature avoids a full merge of the whole row and saves some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands)

Example usage :
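int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
// Tell the API how many slots the caller allocated.
merge_operands_info.expected_max_number_of_operands = size;
// The options parameter is a pointer, so pass the struct's address.
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), &merge_operands_info, &number_of_operands);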

Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than get_merge_operands_options->expected_max_number_of_operands, no merge operands are returned and the status is Incomplete. Merge operands are returned in the order of insertion.
merge_operands: points to an array of at least get_merge_operands_options->expected_max_number_of_operands elements; the caller is responsible for allocating it. If the status returned is Incomplete, number_of_operands will contain the total number of merge operands found in DB for the key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604

Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist

Differential Revision: D16657366

Pulled By: vjnadimpalli

fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
2019-08-06 14:26:44 -07:00
Yanqin Jin
b1a02ffeab Fix make target 'all' and 'check' (#5672)
Summary:
If a test is one of parallel tests, then it should also be one of the 'tests'.
Otherwise, `make all` won't build the binaries. For example,
```
$COMPILE_WITH_ASAN=1 make -j32 all
```
Then if you do
```
$make check
```
The second command will invoke the compilation and building for db_bloom_test
and file_reader_writer_test **without** the `COMPILE_WITH_ASAN=1`, causing the
command to fail.

Test plan (on devserver):
```
$make -j32 all
```
Verify all binaries are built so that `make check` won't have to compile
anything.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5672

Differential Revision: D16655834

Pulled By: riversand963

fbshipit-source-id: 050131412b5313496f85ae3deeeeb8d28af75746
2019-08-05 15:45:56 -07:00
haoyuhuang
70c7302fb5 Block cache simulator: Add pysim to simulate caches using reinforcement learning. (#5610)
Summary:
This PR implements cache eviction using reinforcement learning. It includes two implementations:
1. An implementation of Thompson Sampling for the Bernoulli Bandit [1].
2. An implementation of LinUCB with disjoint linear models [2].

The idea is that a cache uses multiple eviction policies, e.g., MRU, LRU, and LFU. The cache learns which eviction policy is the best and uses it upon a cache miss.
Thompson Sampling is contextless and does not include any features.
LinUCB includes features such as level, block type, caller, column family id to decide which eviction policy to use.
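
The Thompson Sampling variant can be sketched as follows. This is an illustration in C++ rather than the pysim code: each arm stands for one eviction policy, and the reward definition (for example, whether following the chosen policy later turned out well) is left to the caller as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Thompson Sampling for a Bernoulli bandit with a Beta(1, 1) prior per arm.
// Hypothetical class; arms would map to eviction policies such as MRU, LRU, LFU.
class BernoulliThompsonSampler {
 public:
  explicit BernoulliThompsonSampler(size_t num_arms, unsigned seed = 42)
      : successes_(num_arms, 1.0), failures_(num_arms, 1.0), rng_(seed) {
    assert(num_arms > 0);
  }

  // Draws one sample from each arm's Beta posterior and returns the arm
  // with the largest draw.
  size_t ChooseArm() {
    size_t best = 0;
    double best_draw = -1.0;
    for (size_t arm = 0; arm < successes_.size(); ++arm) {
      const double draw = SampleBeta(successes_[arm], failures_[arm]);
      if (draw > best_draw) {
        best_draw = draw;
        best = arm;
      }
    }
    return best;
  }

  // Records the observed 0/1 reward for the arm that was played.
  void Update(size_t arm, bool reward) {
    (reward ? successes_ : failures_)[arm] += 1.0;
  }

 private:
  // Beta(a, b) sampled as X / (X + Y) with X ~ Gamma(a), Y ~ Gamma(b).
  double SampleBeta(double a, double b) {
    std::gamma_distribution<double> gamma_a(a, 1.0);
    std::gamma_distribution<double> gamma_b(b, 1.0);
    const double x = gamma_a(rng_);
    const double y = gamma_b(rng_);
    return x / (x + y);
  }

  std::vector<double> successes_;
  std::vector<double> failures_;
  std::mt19937 rng_;
};
```

On each cache miss a caller would invoke `ChooseArm()`, evict according to that policy, and later call `Update()` with the observed reward. No context features are involved, which is what distinguishes this from LinUCB.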

[1] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. 2018. A Tutorial on Thompson Sampling. Found. Trends Mach. Learn. 11, 1 (July 2018), 1-96. DOI: https://doi.org/10.1561/2200000070
[2] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (WWW '10). ACM, New York, NY, USA, 661-670. DOI=http://dx.doi.org/10.1145/1772690.1772758
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5610

Differential Revision: D16435067

Pulled By: HaoyuHuang

fbshipit-source-id: 6549239ae14115c01cb1e70548af9e46d8dc21bb
2019-07-26 14:41:13 -07:00
Levi Tamasi
3617287e0e Parallelize db_bloom_filter_test (#5632)
Summary:
This test frequently times out under TSAN; parallelizing it should fix
this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5632

Test Plan:
make check
buck test mode/dev-tsan internal_repo_rocksdb/repo:db_bloom_filter_test

Differential Revision: D16519399

Pulled By: ltamasi

fbshipit-source-id: 66e05a644d6f79c6d544255ffcf6de195d2d62fe
2019-07-26 11:48:17 -07:00
Yanqin Jin
74782cec32 Fix target 'clean' to include parallel test binaries (#5629)
Summary:
The current `clean` target in the Makefile does not remove parallel test
binaries. Fix this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5629

Test Plan:
(on devserver)
Take file_reader_writer_test for instance.
```
$make -j32 file_reader_writer_test
$make clean
```
Verify that binary file 'file_reader_writer_test' is deleted by `make clean`.

Differential Revision: D16513176

Pulled By: riversand963

fbshipit-source-id: 70acb9f56c928a494964121b86aacc0090f31ff6
2019-07-26 09:56:09 -07:00
anand76
112702ac6c Parallelize file_reader_writer_test in order to reduce timeouts
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5608

Test Plan:
make check
buck test mode/dev-tsan internal_repo_rocksdb/repo:file_reader_writer_test -- --run-disabled

Differential Revision: D16441796

Pulled By: anand1976

fbshipit-source-id: afbb88a9fcb1c0ba22215118767e8eab3d1d6a4a
2019-07-23 11:50:10 -07:00
Venki Pallipadi
22ce462450 Export Import sst files (#5495)
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135

This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"

We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.

(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
//   is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
                                  const std::string& export_dir,
                                  ExportImportFilesMetaData** metadata);

Internally, the API will call DisableFileDeletions() and GetColumnFamilyMetaData(), parse through the
metadata, create links/copies of all the sst files, call EnableFileDeletions(), and complete the call by
returning the list of file metadata.

(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
//     an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
    const ColumnFamilyOptions& options, const std::string& column_family_name,
    const ImportColumnFamilyOptions& import_options,
    const ExportImportFilesMetaData& metadata,
    ColumnFamilyHandle** handle);

Internally, this API creates a new CF, parses all the sst files and adds them to the specified column family, at the same level and with the same sequence number as in the metadata. It also performs safety checks with respect to overlaps between the sst files being imported.

If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.

Note, as the sst files are being moved across Column Families, the Column Family name in an sst file
will no longer match the actual column family on the destination DB. The API does not modify the Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495

Differential Revision: D16018881

fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
2019-07-17 12:27:14 -07:00
Yuqi Gu
a3c1832e86 Arm64 CRC32 parallel computation optimization for RocksDB (#5494)
Summary:
Crc32c Parallel computation optimization:
Algorithm comes from Intel whitepaper: [crc-iscsi-polynomial-crc32-instruction-paper](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf)
- Input data is divided into three equal-sized blocks
- Three parallel blocks (crc0, crc1, crc2) for 1024 bytes
- One block: 42 (BLK_LENGTH) * 8 (step length: crc32c_u64) bytes

1. crc32c_test:
```
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CRC
[ RUN      ] CRC.StandardResults
[       OK ] CRC.StandardResults (1 ms)
[ RUN      ] CRC.Values
[       OK ] CRC.Values (0 ms)
[ RUN      ] CRC.Extend
[       OK ] CRC.Extend (0 ms)
[ RUN      ] CRC.Mask
[       OK ] CRC.Mask (0 ms)
[----------] 4 tests from CRC (1 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (1 ms total)
[  PASSED  ] 4 tests.
```

2. RocksDB benchmark: db_bench --benchmarks="crc32c"

```
Linear Arm crc32c:
  crc32c: 1.005 micros/op 995133 ops/sec; 3887.2 MB/s (4096 per op)
```

```
Parallel optimization with Armv8 crypto extension:
  crc32c: 0.419 micros/op 2385078 ops/sec; 9316.7 MB/s (4096 per op)
```

It gets ~2.4x speedup compared to linear Arm crc32c instructions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5494

Differential Revision: D16340806

fbshipit-source-id: 95dae9a5b646fd20a8303671d82f17b2e162e945
2019-07-17 11:22:38 -07:00
haoyuhuang
1a59b6e2a9 Cache simulator: Add a ghost cache for admission control and a hybrid row-block cache. (#5534)
Summary:
This PR adds a ghost cache for admission control. Specifically, it admits an entry on its second access.
It also adds a hybrid row-block cache that caches the referenced key-value pairs of a Get/MultiGet request instead of its blocks.
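
The admit-on-second-access rule can be sketched as below. The class name, the bounded LRU bookkeeping, and the choice to forget a key once it has been admitted are assumptions for illustration; the simulator in this PR may differ in all three.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical ghost cache: remembers recently seen keys (no values) and
// tells the caller to admit a key only when it is seen a second time.
class GhostAdmissionFilter {
 public:
  explicit GhostAdmissionFilter(size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Returns true if the key should be admitted into the real cache.
  bool Admit(const std::string& key) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      // Second access: drop the ghost entry and admit.
      order_.erase(it->second);
      index_.erase(it);
      return true;
    }
    // First access: remember the key, evicting the oldest ghost if full.
    order_.push_front(key);
    index_[key] = order_.begin();
    if (index_.size() > capacity_) {
      index_.erase(order_.back());
      order_.pop_back();
    }
    return false;
  }

 private:
  size_t capacity_;
  std::list<std::string> order_;  // most recently seen key at the front
  std::unordered_map<std::string, std::list<std::string>::iterator> index_;
};
```

Since only keys are stored, the ghost cache can track many more entries than the real cache for the same memory, which filters out blocks that are touched once and never again.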
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5534

Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32

Differential Revision: D16101124

Pulled By: HaoyuHuang

fbshipit-source-id: b99edda6418a888e94eb40f71ece45d375e234b1
2019-07-11 12:43:29 -07:00
ggaurav28
60d8b19836 Implemented a file logger that uses WritableFileWriter (#5491)
Summary:
Current PosixLogger performs IO operations using posix calls. Thus the
current implementation will not work for non-posix env. Created a new
logger class EnvLogger that uses env specific WritableFileWriter for IO operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5491

Test Plan: make check

Differential Revision: D15909002

Pulled By: ggaurav28

fbshipit-source-id: 13a8105176e8e42db0c59798d48cb6a0dbccc965
2019-07-09 16:27:22 -07:00
Adam Retter
5dc9fbd117 Update the version of ZStd for the Rocks Java static build
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5228

Differential Revision: D15880451

Pulled By: sagar0

fbshipit-source-id: 84da6f42cac15367d95bffa5336ebd002e7c3308
2019-06-18 11:57:01 -07:00
haoyuhuang
2d1dd5bce7 Support computing miss ratio curves using sim_cache. (#5449)
Summary:
This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G".

When we replay the trace, we also perform lookups and inserts on the simulated caches.
In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in an output file.
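
As an illustration of the configuration format, the sketch below parses a line such as "lru, 1, 1K, 2K, 4M, 4G" into a name, a shard-bit count, and a list of capacities in bytes. The struct and function names are hypothetical, and treating K/M/G as powers of 1024 is an assumption; the analyzer's own parser may behave differently.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of one cache configuration line.
struct CacheConfigSketch {
  std::string cache_name;
  int num_shard_bits = 0;
  std::vector<uint64_t> cache_capacities;  // in bytes
};

// Strips leading and trailing spaces and tabs.
inline std::string Trim(const std::string& s) {
  const auto begin = s.find_first_not_of(" \t");
  if (begin == std::string::npos) return "";
  const auto end = s.find_last_not_of(" \t");
  return s.substr(begin, end - begin + 1);
}

// Parses "4M" style tokens; K/M/G are assumed to mean 2^10, 2^20, 2^30.
inline uint64_t ParseCapacity(const std::string& token) {
  assert(!token.empty());
  uint64_t multiplier = 1;
  std::string digits = token;
  switch (token.back()) {
    case 'K': multiplier = uint64_t{1} << 10; digits.pop_back(); break;
    case 'M': multiplier = uint64_t{1} << 20; digits.pop_back(); break;
    case 'G': multiplier = uint64_t{1} << 30; digits.pop_back(); break;
    default: break;
  }
  return std::stoull(digits) * multiplier;
}

// Field 0 is the cache name, field 1 the shard bits, the rest capacities.
inline CacheConfigSketch ParseCacheConfig(const std::string& line) {
  CacheConfigSketch config;
  std::stringstream stream(line);
  std::string token;
  int field = 0;
  while (std::getline(stream, token, ',')) {
    token = Trim(token);
    if (field == 0) {
      config.cache_name = token;
    } else if (field == 1) {
      config.num_shard_bits = std::stoi(token);
    } else {
      config.cache_capacities.push_back(ParseCapacity(token));
    }
    ++field;
  }
  return config;
}
```

Each capacity in the list would then correspond to one simulated cache, giving one point on the miss ratio curve.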

This PR also adds a main source file, block_cache_trace_analyzer, so that we can run the analyzer from the command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449

Test Plan:
Added tests for block_cache_trace_analyzer.
COMPILE_WITH_ASAN=1 make check -j32.

Differential Revision: D15797073

Pulled By: HaoyuHuang

fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408
2019-06-17 16:41:12 -07:00
Zhongyi Xie
671d15cbdd Persistent Stats: persist stats history to disk (#5046)
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics are enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from the in-memory data structure or from the hidden column family on disk.
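
The key layout described above can be sketched in a few lines of Python. This is only an illustration of the "(timestamp in microseconds)#(stats name)" form: the helper names are hypothetical, and the fixed-width zero padding of the timestamp is an assumption made here so that lexicographic key order matches chronological order. The summary does not state the exact encoding.

```python
from typing import Tuple

# Assumption: timestamps are zero-padded to a fixed width so that
# byte-wise key ordering equals time ordering.
TIMESTAMP_WIDTH = 20


def encode_stats_key(timestamp_micros: int, stats_name: str) -> str:
    """Build a key of the form <timestamp in microseconds>#<stats name>."""
    return f"{timestamp_micros:0{TIMESTAMP_WIDTH}d}#{stats_name}"


def decode_stats_key(key: str) -> Tuple[int, str]:
    """Split a key back into (timestamp_micros, stats_name)."""
    timestamp_text, stats_name = key.split("#", 1)
    return int(timestamp_text), stats_name
```

Because the timestamp comes first, a range scan over such keys would return all stats for a time window in order, which suits a history-style lookup.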
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046

Differential Revision: D15863138

Pulled By: miasantreble

fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
2019-06-17 15:21:50 -07:00
Patrick Zhang
5c76ba9dc4 Support rocksdbjava aarch64 build and test (#5258)
Summary:
Verified with an Ampere Computing eMAG aarch64 system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5258

Differential Revision: D15807309

Pulled By: maysamyabandeh

fbshipit-source-id: ab85d2fd3fe40e6094430ab0eba557b1e979510d
2019-06-13 11:48:10 -07:00
haoyuhuang
9bbccda01e First commit for block cache trace analyzer (#5425)
Summary:
This PR contains the first commit for block cache trace analyzer. It reads a block cache trace file and prints statistics of the traces.

We will extend this class to provide more functionalities.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5425

Differential Revision: D15709580

Pulled By: HaoyuHuang

fbshipit-source-id: 2f43bd2311f460ab569880819d95eeae217c20bb
2019-06-11 12:22:44 -07:00
haoyuhuang
aa71718ac3 Add block cache tracer. (#5410)
Summary:
This PR adds a helper class, the block cache tracer, to read/write block cache accesses. It uses the trace reader/writer to perform this task.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5410

Differential Revision: D15612843

Pulled By: HaoyuHuang

fbshipit-source-id: f30fd1e1524355ca87db5d533a5c086728b141ea
2019-06-06 11:24:39 -07:00
Siying Dong
000b9ec217 Move some logging related files to logging/ (#5387)
Summary:
Many logging related source files are under util/. It will be more structured if they are together.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387

Differential Revision: D15579036

Pulled By: siying

fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f
2019-05-31 17:23:59 -07:00
Vijay Nadimpalli
49c5a12dbe Organizing rocksdb/db directory
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390

Differential Revision: D15579388

Pulled By: vjnadimpalli

fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366
2019-05-31 11:57:01 -07:00
Siying Dong
8843129ece Move some memory related files from util/ to memory/ (#5382)
Summary:
Move arena, allocator, and memory tools under util to a separate memory/ directory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382

Differential Revision: D15564655

Pulled By: siying

fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5
2019-05-30 17:44:09 -07:00
Vijay Nadimpalli
50e470791d Organizing rocksdb/table directory by format
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5373

Differential Revision: D15559425

Pulled By: vjnadimpalli

fbshipit-source-id: 5d6d6d615582bedd96a4b879bb25d429a6de8b55
2019-05-30 14:51:11 -07:00
Siying Dong
e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong there or are just loosely related. Move them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Siying Dong
545d206040 Move some file related files outside util/ (#5375)
Summary:
util/ is meant for lower level libraries, so it's a good idea to move out the files which require knowledge of the DB. Create a file/ directory and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375

Differential Revision: D15550935

Pulled By: siying

fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
2019-05-29 20:47:06 -07:00
Adam Retter
5882e847aa Allow builds of RocksJava debug releases (#5274)
Summary:
This allows debug releases of RocksJava to be built with the Docker release targets.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5274

Differential Revision: D15185067

Pulled By: sagar0

fbshipit-source-id: f3988e472f281f5844d9a07098344a827b1e7eb1
2019-05-02 14:27:20 -07:00
Yuqi Gu
03c7ae24c2 RocksDB CRC32c optimization with ARMv8 Intrinsic (#5221)
Summary:
1. Add Arm linear crc32c implementation for RocksDB.
2. Arm runtime check for crc32
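
For readers unfamiliar with the checksum being accelerated, here is a minimal bit-by-bit software reference for CRC32C (the Castagnoli polynomial) in Python. It is not the ARMv8 implementation from this commit, only a slow sketch of the value that a hardware-assisted routine is expected to reproduce; the function name `crc32c` is chosen here for illustration.

```python
# Reflected form of the Castagnoli polynomial 0x1EDC6F41.
CRC32C_POLY_REFLECTED = 0x82F63B78


def crc32c(data: bytes, crc: int = 0) -> int:
    """Compute CRC32C one bit at a time; pass a previous result to continue."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF
```

The standard check input `b"123456789"` gives `0xE3069283`, which is a convenient value for comparing an optimized path against a fallback.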
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5221

Differential Revision: D15013685

Pulled By: siying

fbshipit-source-id: 2c2983743d26656d93f212dc7c1a3cf66a1acf12
2019-04-30 10:59:05 -07:00
Yanqin Jin
9358178edc Support for single-primary, multi-secondary instances (#4899)
Summary:
This PR allows RocksDB to run in single-primary, multi-secondary process mode.
The writer is a regular RocksDB (e.g. a `DBImpl`) instance playing the role of a primary.
Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`.

This PR has several components:
1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary.

2. (Similar to #4602). Add `FragmentBufferedReader` to handle a partially-read, trailing record at the end of a log, from which a future read can continue.

3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`.
3.1 Tail the primary's MANIFEST during recovery.
3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`.
3.3 Tailing WAL will be in a future PR.

4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code.
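
The idea behind component 2, keeping a partially-read trailing record so that a later read can finish it, can be sketched in Python as follows. This is a toy model and not `FragmentBufferedReader`: the class name `TailingRecordReader`, the helper `frame`, and the 4-byte little-endian length prefix are all assumptions made for illustration and do not reflect the RocksDB log format.

```python
import struct
from typing import List

# Assumption: each record is a 4-byte little-endian length followed by the payload.
_HEADER = struct.Struct("<I")


def frame(payload: bytes) -> bytes:
    """Encode one record the way a (hypothetical) writer would append it."""
    return _HEADER.pack(len(payload)) + payload


class TailingRecordReader:
    """Returns complete records and buffers any incomplete tail for later."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, data: bytes) -> List[bytes]:
        """Add newly read bytes; return every record that is now complete."""
        self._buffer.extend(data)
        records = []
        offset = 0
        while len(self._buffer) - offset >= _HEADER.size:
            (length,) = _HEADER.unpack_from(self._buffer, offset)
            end = offset + _HEADER.size + length
            if end > len(self._buffer):
                break  # trailing record is incomplete; wait for more bytes
            records.append(bytes(self._buffer[offset + _HEADER.size:end]))
            offset = end
        # Drop consumed bytes but keep the partial tail for the next call.
        del self._buffer[:offset]
        return records
```

A reader that tails a file being appended to by another process would call `feed` after each read; a record cut off mid-write is simply returned by a later call once the rest arrives.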
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899

Differential Revision: D14510945

Pulled By: riversand963

fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
2019-03-26 16:45:31 -07:00
Adam Retter
bb474e9a02 Add missing functionality to RocksJava (#4833)
Summary:
This is my latest round of changes to add missing items to RocksJava. More to come in future PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4833

Differential Revision: D14152266

Pulled By: sagar0

fbshipit-source-id: d6cff67e26da06c131491b5cf6911a8cd0db0775
2019-02-22 14:46:46 -08:00
Yanqin Jin
7d23210226 Separate crash test with atomic flush (#4945)
Summary:
Currently crash test covers cases with and without atomic flush, but takes too
long to finish. Therefore it may be a better idea to put crash test with atomic
flush in a separate set of tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4945

Differential Revision: D13947548

Pulled By: riversand963

fbshipit-source-id: 177c6de865290fd650b0103408339eaa3f801d8c
2019-02-19 14:08:39 -08:00
Yanqin Jin
5af9446ee6 Remove Lua compaction filter from RocksDB main repo (#4971)
Summary:
as title. For people who continue to need Lua compaction filter, you
can copy the include/rocksdb/utilities/rocks_lua/lua_compaction_filter.h and
utilities/lua/rocks_lua_compaction_filter.cc to your own codebase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4971

Differential Revision: D14047468

Pulled By: riversand963

fbshipit-source-id: 9ad1a6484a7c94e478f1e108127a3184e4069f70
2019-02-13 12:42:44 -08:00
Yanqin Jin
95604d13e9 Change the command to invoke parallel tests (#4922)
Summary:
We used to call `printf $(t_run)` and later feed the result to GNU parallel in the recipe of target `check_0`. However, this approach is problematic when the length of $(t_run) exceeds the
maximum length of a command and the `printf` command cannot be executed. Instead we use 'find -print' to avoid generating an overly long command.

**This PR is actually the last commit of #4916. Prefer to merge this PR separately.**
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4922

Differential Revision: D13845883

Pulled By: riversand963

fbshipit-source-id: b56de7f7af43337c6ec89b931de843c9667cb679
2019-01-28 15:02:26 -08:00
Yanqin Jin
e1de88c8c7 Escape '.' by adding a '\' to avoid matching any char
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4912
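
The effect named in the title can be reproduced with any regex engine. The sketch below uses Python's `re` module as a stand-in for whatever tool the Makefile invokes, and the patterns are hypothetical rather than copied from the Makefile: an unescaped '.' matches any character, while '\.' matches only a literal dot.

```python
import re

# Hypothetical patterns for illustration only.
UNESCAPED = re.compile(r"^./path/to/file$")  # '.' matches any single character
ESCAPED = re.compile(r"^\./path/to/file$")   # '\.' matches only a literal '.'


def matches_unescaped(path: str) -> bool:
    return UNESCAPED.match(path) is not None


def matches_escaped(path: str) -> bool:
    return ESCAPED.match(path) is not None
```

Both patterns accept "./path/to/file", but only the unescaped one also accepts an unintended string such as "x/path/to/file".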

Differential Revision: D13789449

Pulled By: riversand963

fbshipit-source-id: 0639dae82049b7ac977c8f81851f1c9fdc346705
2019-01-24 11:25:27 -08:00
Remington Brasga
1eded07f00 Bug in Regular Expression in Makefile (#4682)
Summary:
False-negative about path not existing. The regex is ignoring the "." in front of a path.
Example: "./path/to/file"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4682

Differential Revision: D13777110

Pulled By: sagar0

fbshipit-source-id: 9f8173b7581407555fdc055580732aeab37d4ade
2019-01-23 10:24:10 -08:00
Varadharajan
349c7cceff Fix downloaded filename of snappy (#4870)
Summary:
Build failing due to incorrect filename.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4870

Differential Revision: D13637205

Pulled By: sagar0

fbshipit-source-id: 72da45d51b49bce32f696532ba0656ee0dc2b89f
2019-01-11 10:29:40 -08:00