Commit Graph

2221 Commits

Author SHA1 Message Date
Dhruba Borthakur
1a2cc27e01 ColumnFamilyOptions SanitizeOptions is buggy on 32-bit platforms.
Summary:
The pre-existing code is trying to clamp between 65,536 and 0,
resulting in clamping to 65,536, resulting in very small buffers,
resulting in ShouldFlushNow() being true quite easily,
resulting in assertion failing and database performance
being "not what it should be".
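
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the RocksDB code: `ClipToRange` and `SanitizeBufferSize32` are hypothetical names, `uint32_t` stands in for a 32-bit `size_t`, and the upper bound is assumed to have been written as a 64 GiB shift, which wraps to 0 in 32 bits.

```cpp
#include <cstdint>

// Hypothetical clamp helper: apply the upper bound first, then the lower bound.
template <class T>
void ClipToRange(T* value, T min_value, T max_value) {
  if (*value > max_value) *value = max_value;
  if (*value < min_value) *value = min_value;
}

// Mimics sanitizing a buffer size on a platform where size_t is 32 bits wide.
// uint32_t stands in for that 32-bit size_t (an assumption of this sketch).
uint32_t SanitizeBufferSize32(uint32_t requested) {
  const uint32_t lower = uint32_t{64} << 10;  // 65,536
  // 64 GiB is 2^36, which does not fit in 32 bits and truncates to 0.
  const uint32_t upper = static_cast<uint32_t>(uint64_t{64} << 30);
  ClipToRange(&requested, lower, upper);
  return requested;
}
```

With an upper bound of 0, every request is first cut down to 0 and then raised to the lower bound, so a 64 MB request comes back as 65,536 bytes.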

https://github.com/facebook/rocksdb/issues/1018

Test Plan: make check

Reviewers: sdong, andrewkr, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55455
2016-03-14 16:21:54 -07:00
sdong
b2ae5950ba Index Reader should not be reused after DB restart
Summary:
In the block-based table reader, we now put the index reader in the block cache, from which it can be retrieved after a DB restart. However, the index reader may reference the internal comparator, which can be destroyed when the DB restarts, causing problems.
Fix it by making the cache key unique per table reader.

Test Plan: Add a new test which failed without the commit but now passes.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: maro, yhchiang, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55287
2016-03-14 10:04:09 -07:00
Islam AbdelRahman
580fede347 Aggregate hot Iterator counters in LocalStatistics (DBIter::Next perf regression)
Summary:
This patch bumps the counters on the hot code path DBIter::Next() / DBIter::Prev() in local data members and sends them to Statistics when the iterator is destroyed.
A better solution would be a thread_local implementation of Statistics.
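
The idea can be sketched as follows. This is an illustration under assumptions, not the patch itself: `Statistics` and `CountingIterator` are hypothetical stand-ins, and the shared counters are assumed to be atomics whose per-call updates are the cost being avoided.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical shared statistics object; every update is an atomic operation.
struct Statistics {
  std::atomic<uint64_t> next_count{0};
  std::atomic<uint64_t> prev_count{0};
};

// Hypothetical iterator that counts locally and publishes once on destruction.
class CountingIterator {
 public:
  explicit CountingIterator(Statistics* stats) : stats_(stats) {}

  ~CountingIterator() {
    // One atomic update per counter instead of one per Next()/Prev() call.
    stats_->next_count.fetch_add(local_next_, std::memory_order_relaxed);
    stats_->prev_count.fetch_add(local_prev_, std::memory_order_relaxed);
  }

  void Next() { ++local_next_; }  // plain increment on the hot path
  void Prev() { ++local_prev_; }

 private:
  Statistics* stats_;
  uint64_t local_next_ = 0;
  uint64_t local_prev_ = 0;
};
```

The trade-off is that the shared counters lag behind until the iterator is destroyed, which matters for long-lived iterators.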

New performance
```
readseq      :       0.035 micros/op 28597881 ops/sec; 3163.7 MB/s
     1,851,568,819      stalled-cycles-frontend   #   31.29% frontend cycles idle    [49.86%]
       884,929,823      stalled-cycles-backend    #   14.95% backend  cycles idle    [50.21%]
readreverse  :       0.071 micros/op 14077393 ops/sec; 1557.3 MB/s
     3,239,575,993      stalled-cycles-frontend   #   27.36% frontend cycles idle    [49.96%]
     1,558,253,983      stalled-cycles-backend    #   13.16% backend  cycles idle    [50.14%]

```

Existing performance

```
readreverse  :       0.174 micros/op 5732342 ops/sec;  634.1 MB/s
    20,570,209,389      stalled-cycles-frontend   #   70.71% frontend cycles idle    [50.01%]
    18,422,816,837      stalled-cycles-backend    #   63.33% backend  cycles idle    [50.04%]

readseq      :       0.119 micros/op 8400537 ops/sec;  929.3 MB/s
    15,634,225,844      stalled-cycles-frontend   #   79.07% frontend cycles idle    [49.96%]
    14,227,427,453      stalled-cycles-backend    #   71.95% backend  cycles idle    [50.09%]
```

Test Plan: unit tests

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55107
2016-03-11 19:01:12 -08:00
Baris Yazici
e8e6cf0173 fix: handle_fatal_signal (sig=6) in std::vector<std::string, std::allocator<std::string> >::_M_range_check | c++/4.8.2/bits/stl_vector.h:794 #174
Summary:
Fix for https://github.com/facebook/mysql-5.6/issues/174

When there are no old files to purge, the vector.at(i) call was crashing.

if (old_info_log_file_count != 0 &&
      old_info_log_file_count >= db_options_.keep_log_file_num) {
    std::sort(old_info_log_files.begin(), old_info_log_files.end());
    size_t end = old_info_log_file_count - db_options_.keep_log_file_num;
    for (unsigned int i = 0; i <= end; i++) {
      std::string& to_delete = old_info_log_files.at(i);

Added a check that old_info_log_file_count is nonzero.

Test Plan: run existing tests

Reviewers: gunnarku, vasilep, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, webscalesql-eng, dhruba

Differential Revision: https://reviews.facebook.net/D55245
2016-03-11 11:11:45 -08:00
Andrew Kryczka
d9620239d2 Cleanup stale manifests outside of full purge
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge

Test Plan:
  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation

Reviewers: IslamAbdelRahman, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55269
2016-03-10 18:16:21 -08:00
Yi Wu
f71fc77b7c Cache to have an option to fail Cache::Insert() when full
Summary:
Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error.

I am not sure what the correct behavior is at each call site when it encounters an error. Please let me know if you see something wrong or if more unit tests are needed.
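
A minimal sketch of the option, using hypothetical types rather than the RocksDB `Cache` and `Status` classes. Eviction is left out to keep it short, so the cache simply grows past its capacity when the option is off.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical status type; not the RocksDB Status class.
enum class InsertStatus { kOk, kCacheFull };

// Hypothetical cache charged by value size, with no eviction.
class TinyCache {
 public:
  TinyCache(size_t capacity, bool fail_when_full)
      : capacity_(capacity), fail_when_full_(fail_when_full) {}

  InsertStatus Insert(const std::string& key, const std::string& value) {
    auto it = entries_.find(key);
    const size_t old_charge = (it == entries_.end()) ? 0 : it->second.size();
    const size_t new_usage = usage_ - old_charge + value.size();
    if (fail_when_full_ && new_usage > capacity_) {
      return InsertStatus::kCacheFull;  // the call site must handle this
    }
    entries_[key] = value;
    usage_ = new_usage;
    return InsertStatus::kOk;
  }

  size_t usage() const { return usage_; }

 private:
  size_t capacity_;
  bool fail_when_full_;
  size_t usage_ = 0;
  std::unordered_map<std::string, std::string> entries_;
};
```

A call site that receives `kCacheFull` still owns the value and can use it uncached or report the error upward. Which of the two is right depends on the call site.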

Test Plan: make check -j32, see tests pass.

Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54705
2016-03-10 17:35:19 -08:00
Yueh-Hsuan Chiang
765597fa78 Update compaction score right after CompactFiles forms a compaction
Summary:
This is a follow-up patch of https://reviews.facebook.net/D54891.
As the information about files being compacted will also be used
when making compaction decision, it is necessary to update the compaction
score when a compaction plan has been made but not yet executed.

This patch adds a missing call to update the compaction score in
CompactFiles().

Test Plan: compact_files_test

Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55227
2016-03-10 14:34:28 -08:00
Yueh-Hsuan Chiang
aa3f02d50c Improve comment in compaction.h and compaction_picker.h
Summary:
ReleaseCompactionFiles must be called when DB mutex is held,
but the documentation is missing.

Test Plan: no code change

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54987
2016-03-08 16:46:41 -08:00
sdong
294bdf9ee2 Change Property name from "rocksdb.current_version_number" to "rocksdb.current-super-version-number"
Summary: I realized I was wrong again about the naming convention. Let me change it to the correct one.

Test Plan: Run unit tests.

Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55041
2016-03-04 18:15:29 -08:00
Yueh-Hsuan Chiang
a7d4eb2f34 Fix a bug where flush does not happen when a manual compaction is running
Summary:
Currently, when rocksdb tries to run manual compaction to refit data into a level,
there's a ReFitLevel() process that requires no bg work is currently running.
When RocksDB plans to ReFitLevel(), it will do the following:

 1. pause scheduling new bg work.
 2. wait until all bg work finished
 3. do the ReFitLevel()
 4. unpause scheduling new bg work.

However, because it pauses scheduling new bg work in step 1 and waits for all bg
work to finish in step 2, RocksDB stops flushing until all bg work is done (which
could take a long time).

This patch fixes the issue by changing the way ReFitLevel() pauses background work:

1. pause scheduling compaction.
2. wait until all bg work finished.
3. pause scheduling flush
4. do ReFitLevel()
5. unpause both flush and compaction.

The major difference is that we only pause scheduling compaction in step 1 and wait
for all bg work to finish in step 2.  This prevents flush from being blocked for a long
time. In a very rare case ReFitLevel() might starve in step 2, but this is unlikely
because flush typically finishes very quickly.

Test Plan: existing test.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55029
2016-03-04 14:24:52 -08:00
Islam AbdelRahman
dfe96c72c3 Fix WriteLevel0TableForRecovery file delete protection
Summary:
The call to

```
CaptureCurrentFileNumberInPendingOutputs()
```

should be before

```
versions_->NewFileNumber()
```
Right now we are not actually protecting the file from being deleted
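
Why the order matters can be shown with a toy model. The sketch below is hypothetical and assumes that a file counts as protected when its number is at or above the smallest captured number. `FileNumbers` and its members are illustrative names, not the DBImpl code.

```cpp
#include <cstdint>
#include <set>

// Hypothetical model of the file-number bookkeeping.
class FileNumbers {
 public:
  // Records the next file number that will be handed out.
  void CaptureCurrentFileNumber() {
    pending_outputs_.insert(next_file_number_);
  }

  // Hands out a file number for a new output file.
  uint64_t NewFileNumber() { return next_file_number_++; }

  // Assumed rule: files numbered at or above the smallest captured value
  // are not eligible for deletion.
  bool IsProtected(uint64_t file_number) const {
    return !pending_outputs_.empty() &&
           file_number >= *pending_outputs_.begin();
  }

 private:
  uint64_t next_file_number_ = 10;
  std::set<uint64_t> pending_outputs_;
};
```

Capturing first records 10 and then hands out 10, so the new file is covered. Allocating first hands out 10 and then records 11, which leaves file 10 unprotected.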

Test Plan: make check

Reviewers: sdong, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54645
2016-03-03 18:25:07 -08:00
sdong
ef204df7ef Compaction always needs to be removed from level0_compactions_in_progress_ for universal compaction
Summary: We always put compaction to level0_compactions_in_progress_ for universal compaction, so we should also remove it. The bug causes assert failure when running manual compaction.

Test Plan:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom,compact --subcompactions=16 --compaction_style=1
always fails on my host. After the fix, it doesn't fail any more.

Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55017
2016-03-02 21:23:28 -08:00
sdong
e79ad9e184 Add Iterator Property rocksdb.iterator.version_number
Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add an iterator property that returns the version number.

Test Plan: Add two unit tests for it.

Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54921
2016-03-02 16:23:59 -08:00
sdong
19ea40f8b6 Subcompaction boundary keys should not terminate after an empty level
Summary: Currently we stop adding boundary keys to subcompaction candidates once we see an empty level. This almost disables subcompaction for universal compaction. We should consider all files instead.

Test Plan: Run existing tests.

Reviewers: IslamAbdelRahman, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55005
2016-03-02 15:45:07 -08:00
sdong
74b660702e Rename iterator property "rocksdb.iterator.is.key.pinned" => "rocksdb.iterator.is-key-pinned"
Summary: Rename iterator property to follow the property naming convention.

Test Plan: Run all existing tests.

Reviewers: andrewkr, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54957
2016-03-01 13:47:12 -08:00
Islam AbdelRahman
6743135ea1 Fix DB::AddFile() issue when PurgeObsoleteFiles() is called
Summary:
In some situations the DB will scan all existing files in the DB path and delete the ones that are obsolete.
If this happens while an external SST file is being added, the file could be deleted while we are adding it.
This diff fixes the issue.

Test Plan:
unit test to reproduce the bug
existing unit tests

Reviewers: sdong, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54627
2016-03-01 12:05:29 -08:00
sdong
432f3adf2c Add DB Property "rocksdb.current_version_number"
Summary: Add a DB Property "rocksdb.current_version_number" for users to monitor version changes and stale iterators.

Test Plan: Add a unit test.

Reviewers: andrewkr, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54927
2016-03-01 10:55:40 -08:00
sdong
b5b1db167a Recompute compaction score after scheduling manual compaction
Summary: After we made manual compaction runnable concurrently with automatic compaction, we need to run ComputeCompactionScore() to prepare for an upcoming compaction picking call before the compaction finishes.

Test Plan: Run existing tests.

Reviewers: yhchiang, IslamAbdelRahman, andrewkr, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54891
2016-02-29 17:17:51 -08:00
sdong
1f5954147b Introduce Iterator::GetProperty() and replace Iterator::IsKeyPinned()
Summary:
Add Iterator::GetProperty(), a way for users to communicate with the iterator, and replace Iterator::IsKeyPinned() with it.
As a follow-up, I'll add a property for the version number attached to the iterator.

Test Plan: Rerun existing tests and add a negative test case.

Reviewers: yhchiang, andrewkr, kradhakrishnan, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54783
2016-02-29 14:01:31 -08:00
Andrew Kryczka
69c471bd9b Handle concurrent manifest update and backup creation
Summary:
Fixed two related race conditions in backup creation.

(1) CreateNewBackup() uses DB::DisableFileDeletions() to prevent table files
from being deleted while it is copying; however, the MANIFEST file could still
rotate during this time. The fix is to stop deleting the old manifest in the
rotation logic. It will be deleted safely later when PurgeObsoleteFiles() runs
(can only happen when file deletions are enabled).

(2) CreateNewBackup() did not account for the CURRENT file being mutable.
This is significant because the files returned by GetLiveFiles() contain a
particular manifest filename, but the manifest to which CURRENT refers can
change at any time. This causes problems when CURRENT changes between the call
to GetLiveFiles() and when it's copied to the backup directory. To work around this, I
manually forge a CURRENT file referring to the manifest filename returned in
GetLiveFiles().

(2) also applies to the checkpointing code, so let me know if this approach is
good and I'll make the same change there.
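
The workaround in (2) can be sketched as below. This is an illustration only: `ForgeCurrentContents` is a hypothetical helper, and it assumes that CURRENT holds the manifest file name followed by a newline and that live-file paths look like "/MANIFEST-000001".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds the contents of a forged CURRENT file from a live-file listing:
// the listed manifest's file name followed by a newline.
// Returns an empty string if the listing contains no manifest.
std::string ForgeCurrentContents(const std::vector<std::string>& live_files) {
  const std::string prefix = "MANIFEST-";
  for (const std::string& path : live_files) {
    // Strip any leading directory component such as "/".
    const size_t slash = path.find_last_of('/');
    const std::string name =
        (slash == std::string::npos) ? path : path.substr(slash + 1);
    if (name.compare(0, prefix.size(), prefix) == 0) {
      return name + "\n";
    }
  }
  return "";
}
```

Because the contents are derived from the same listing that names the manifest to copy, the backup's CURRENT and manifest cannot disagree, whatever the live CURRENT file says at copy time.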

Test Plan:
new test for roll manifest during backup creation.

running the test before this change:

  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
  ...
  IO error: /tmp/rocksdbtest-9383/backupable_db/MANIFEST-000001: No such file or directory

running the test after this change:

  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
  ...
  [ RUN      ] BackupableDBTest.ChangeManifestDuringBackupCreation
  [       OK ] BackupableDBTest.ChangeManifestDuringBackupCreation (2836 ms)

Reviewers: IslamAbdelRahman, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54711
2016-02-29 12:56:55 -08:00
sdong
8800975fb0 Make DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels more robust
Summary:
Depending on thread scheduling, DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels can fail to flush enough files to trigger the expected compactions. Fix it by waiting for flush after inserting each key.
A failure was reported:

db/db_universal_compaction_test.cc:1134: Failure
Expected: (NumTableFilesAtLevel(options.num_levels - 1, 1)) > (0), actual: 0 vs 0

but I can't repro it. This is an attempt to fix the bug; we will see whether the failure goes away.

Test Plan: Run the test multiple times.

Reviewers: IslamAbdelRahman, anthony, andrewkr, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54747
2016-02-26 11:59:31 -08:00
sdong
82f15fb15d Add test to make sure DropColumnFamily doesn't impact existing iterators
Summary: Add a test case in ColumnFamilyTest.ReadDroppedColumnFamily to make sure existing iterator is not impacted by column family dropping.

Test Plan: N/A

Reviewers: igor, yhchiang, anthony, andrewkr, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54657
2016-02-24 10:25:38 -08:00
sdong
38201b3599 Fix assert failure when DBImpl::SyncWAL() conflicts with log rolling
Summary: DBImpl::SyncWAL() releases the db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is zero or one outstanding log file. However, a memtable switch can happen in between, leaving two or more outstanding logs and failing the assert. The diff adds a unit test that repros the issue and fixes the assert so that the unit test passes.

Test Plan: Run the new tests.

Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54621
2016-02-23 11:42:15 -08:00
Andrew Kryczka
b046916656 Redo SyncPoints for flush while rolling test
Summary:
There was a race condition in the test where the rolling thread
acquired the mutex before the flush thread pinned the logger. Rather than add
more complicated synchronization to fix it, I followed Siying's suggestion to
use SyncPoint in the test code.

Comments in the LoadDependency() invocation explain the reason for each of the
sync points.

Test Plan:
Ran test 1000 times for tsan/asan. Will wait for all sandcastle tests
to finish before committing since this is a tricky test.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54615
2016-02-22 21:32:19 -08:00
Mike Kolupaev
eef63ef807 Fixed CompactFiles() spuriously failing or corrupting DB
Summary:
We started getting two kinds of crashes since we started using `DB::CompactFiles()`:
(1) `CompactFiles()` fails saying something like "/data/logdevice/4440/shard12/012302.sst: No such file or directory", and presumably makes DB read-only,
(2) DB fails to open saying "Corruption: Can't access /267000.sst: IO error: /data/logdevice/4440/shard1/267000.sst: No such file or directory".

AFAICT, both can be explained by background thread deleting compaction output as "obsolete" while it's being written, before it's committed to manifest. If it ends up committed to the manifest, we get (2); if compaction notices the disappearance and fails, we get (1). The internal tasks t10068021 and t10134177 have some details about the investigation that led to this.

Test Plan: `make -j check`; the new test fails to reopen the DB without the fix

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, sdong

Differential Revision: https://reviews.facebook.net/D54561
2016-02-22 13:54:58 -08:00
Dmitri Smirnov
d37d348da8 This addresses build issues on Windows
https://github.com/facebook/rocksdb/issues/1002
2016-02-19 12:29:54 -08:00
Andrew Kryczka
d825fc70d4 Use condition variable in log roller test
Summary:
Previously I just slept until the flush_thread was "probably" ready
since proper synchronization in test cases seemed like overkill. But then tsan
complained about it, so I did the synchronization (mostly) properly now.

Test Plan:
  $ COMPILE_WITH_TSAN=1 make -j32 auto_roll_logger_test
  $ ./auto_roll_logger_test

Reviewers: anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54399
2016-02-18 18:03:53 -08:00
Islam AbdelRahman
df9ba6df62 Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage
Summary:
Introduce SstFileManager::SetMaxAllowedSpaceUsage(), which can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded, WriteImpl() will fail and return Status::Aborted().
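
A minimal sketch of the behavior, with hypothetical types in place of SstFileManager and Status. It assumes that a limit of 0 means "no limit" and that the limit counts as exceeded once tracked usage reaches it.

```cpp
#include <cstdint>

// Hypothetical write result; not the RocksDB Status class.
enum class WriteStatus { kOk, kAborted };

// Hypothetical tracker of the total size of files on disk.
class SpaceTracker {
 public:
  // Assumption of this sketch: 0 disables the limit.
  void SetMaxAllowedSpaceUsage(uint64_t max_bytes) { max_allowed_ = max_bytes; }

  void OnFileAdded(uint64_t bytes) { total_size_ += bytes; }

  bool IsMaxAllowedSpaceReached() const {
    return max_allowed_ != 0 && total_size_ >= max_allowed_;
  }

 private:
  uint64_t max_allowed_ = 0;
  uint64_t total_size_ = 0;
};

// The write path checks the tracker before doing any work.
WriteStatus Write(const SpaceTracker& tracker) {
  return tracker.IsMaxAllowedSpaceReached() ? WriteStatus::kAborted
                                            : WriteStatus::kOk;
}
```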

Test Plan: unit testing

Reviewers: yhchiang, anthony, andrewkr, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53763
2016-02-17 15:20:23 -08:00
Andrew Kryczka
3943d16780 Fix race conditions in auto-rolling logger
Summary:
For GetLogFileSize() and Flush(), they previously did not follow the
synchronization pattern for accessing logger_. This meant ResetLogger() could
cause logger_ destruction while the unsynchronized functions were accessing it,
causing a segfault.

Also made the mutex instance variable mutable so we can preserve
GetLogFileSize()'s const-ness.
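
The pattern can be sketched as follows. The names are hypothetical, and pinning the logger through a `std::shared_ptr` copy taken under the mutex is an assumption of this sketch, not a description of the actual class.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the wrapped logger; immutable after construction.
struct InnerLogger {
  explicit InnerLogger(size_t s) : size(s) {}
  size_t GetLogFileSize() const { return size; }
  const size_t size;
};

class RollingLogger {
 public:
  explicit RollingLogger(size_t initial_size)
      : logger_(std::make_shared<InnerLogger>(initial_size)) {}

  // A const method that still needs the lock, so the mutex is mutable.
  size_t GetLogFileSize() const {
    std::shared_ptr<InnerLogger> pinned;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pinned = logger_;  // pin the current logger while holding the lock
    }
    return pinned->GetLogFileSize();
  }

  // Swaps in a fresh logger; a reader that pinned the old one keeps it alive.
  void ResetLogger(size_t new_size) {
    std::lock_guard<std::mutex> lock(mutex_);
    logger_ = std::make_shared<InnerLogger>(new_size);
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<InnerLogger> logger_;
};
```

Reading `logger_` without the lock is what would let `ResetLogger()` destroy the object under a concurrent reader.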

Test Plan:
new test case, it's quite ugly because both threads need to access
one of the functions with SyncPoints (PosixLogger::Flush()), and also special
handling is needed to prevent the mutex and sync points from conflicting.

Reviewers: kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54237
2016-02-17 12:06:45 -08:00
reid horuff
a7b6f0748a Improve write_with_callback_test to sync WAL
Summary: Currently write_with_callback_test does not test with WAL syncing enabled. This addresses that.

Test Plan: write_with_callback_test

Reviewers: anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D54255
2016-02-16 14:04:14 -08:00
reid horuff
5bcf952a87 Fix WriteImpl empty batch hanging issue
Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty.

Test Plan: DoubleEmptyBatch unit test in transaction_test.

Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54057
2016-02-16 12:21:33 -08:00
Mike Kolupaev
44371501f0 Fixed a segfault when compaction fails
Summary: We've hit it today.

Test Plan: `make -j check`; didn't reproduce the issue

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54219
2016-02-16 11:11:16 -08:00
Jonathan Wiepert
7bd284c374 Separate main from bench functionality to allow customizations
Summary: Isolate db_bench functionality from main so custom benchmark code can be written and managed

Test Plan:
Tested commands
./build_tools/regression_build_test.sh
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --reads=500 --writes=500
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --merge_keys=100 --numdistinct=100 --num_column_families=3 --num_hot_column_families=1
./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5

Test Results - https://phabricator.fb.com/P56130387

Additional tests for:
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 --key_size=8 --merge_operator=put
./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 --merge_operator=uint64add

Results: https://phabricator.fb.com/P56130607

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53991
2016-02-16 06:17:31 -08:00
sdong
92a9ccf1a6 Add a new compaction priority that picks file whose overlapping ratio is smallest
Summary:
Add a new compaction priority as following:
For every file, we calculate the total size of the files in the next level that overlap with it, divided by the file's own size. The file with the smallest ratio will be picked first.
My "db_bench --fillrandom" shows about 5% less compaction than kOldestSmallestSeqFirst when --hard_pending_compaction_bytes_limit is set to a value that keeps the LSM tree in shape. Without limiting hard_pending_compaction_bytes_limit, the improvement is only 1% or 2%.
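
The ratio described above can be sketched as follows. `FileMeta` and `PickMinOverlapRatio` are hypothetical, keys are plain integers with inclusive ranges, and every file is assumed to have a nonzero size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical file description: an inclusive key range and a size in bytes.
struct FileMeta {
  int smallest_key;
  int largest_key;
  uint64_t size;  // assumed nonzero
};

// Returns the index of the file in `level` with the smallest ratio of
// overlapping bytes in `next_level` to its own size, or -1 if `level` is empty.
int PickMinOverlapRatio(const std::vector<FileMeta>& level,
                        const std::vector<FileMeta>& next_level) {
  int best = -1;
  double best_ratio = 0.0;
  for (size_t i = 0; i < level.size(); ++i) {
    uint64_t overlapping_bytes = 0;
    for (const FileMeta& f : next_level) {
      const bool overlaps = f.smallest_key <= level[i].largest_key &&
                            f.largest_key >= level[i].smallest_key;
      if (overlaps) overlapping_bytes += f.size;
    }
    const double ratio = static_cast<double>(overlapping_bytes) /
                         static_cast<double>(level[i].size);
    if (best < 0 || ratio < best_ratio) {
      best = static_cast<int>(i);
      best_ratio = ratio;
    }
  }
  return best;
}
```

A low ratio means compacting that file rewrites few bytes in the next level per byte moved down, which is the intuition behind picking it first.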

Test Plan: Add a unit test

Reviewers: andrewkr, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54075
2016-02-11 15:59:19 -08:00
Peter Mattis
239aaf2fc0 Use user_comparator when comparing against iterate_upper_bound.
Fixes #983.
2016-02-11 08:47:16 -05:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format.
2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang
4a8cbf4e31 Allows Get and MultiGet to read directly from SST files.
Summary:
Add kSstFileTier to ReadTier, which allows Get and MultiGet to
read only directly from SST files and skip mem-tables.

    kSstFileTier = 0x2      // data in SST files.
                          // Note that this ReadTier currently only supports
                          // Get and MultiGet and does not support iterators.

Test Plan: add new test in db_test.

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: igor, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53511
2016-02-09 11:20:22 -08:00
sdong
a76e9093f0 Fix LITE db_test build broken by previous commit
Summary: Previous commit introduces a test that is not supported in LITE. Fix it.

Test Plan: Build the test with ROCKSDB_LITE.

Reviewers: kradhakrishnan, IslamAbdelRahman, anthony, yhchiang, andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53901
2016-02-05 14:29:09 -08:00
sdong
b1887c5dd9 Explicitly fail when memtable doesn't support concurrent insert
Summary: If users turn on concurrent insert but the memtable doesn't support it, they might see an unexpected crash. Fix it by failing explicitly.

Test Plan:
Run different settings of stress_test and make sure it fails correctly.
Will add a unit test too.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, andrewkr, ngbronson

Reviewed By: ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53895
2016-02-05 14:15:50 -08:00
reid horuff
6f71d3b68b Improve perf of Pessimistic Transaction expirations (and optimistic transactions)
Summary:
copy from task 8196669:

1) Optimistic transactions do not support batching writes from different threads.
2) Pessimistic transactions do not support batching writes if an expiration time is set.

In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write.  But we could support batching write operations with callbacks if we make sure to process the callbacks correctly.

To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together.  Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer.
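
The batching scheme described above can be sketched as follows. `Writer` and `WriteGroup` are hypothetical stand-ins, not the types in write_thread.cc, and the log write is reduced to string concatenation.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a queued writer and its optional callback.
struct Writer {
  std::string batch;               // payload to write
  std::function<bool()> callback;  // empty means "no callback"
  bool ok = false;                 // per-writer result, set by WriteGroup()
};

// Runs every callback in the group, merges only the batches whose callback
// allowed the write, and records a result on each writer.
// Returns the merged payload that would go out as a single write.
std::string WriteGroup(std::vector<Writer>* group) {
  std::string merged;
  for (Writer& w : *group) {
    w.ok = !w.callback || w.callback();
    if (w.ok) merged += w.batch;
  }
  return merged;
}
```

A writer whose callback aborts is skipped without failing the rest of the group, and each writer learns its own outcome from `ok`.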

Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly.

Reviewers: hermanlee4, anthony, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52863
2016-02-05 10:44:13 -08:00
Islam AbdelRahman
8e6172bc57 Add BlockBasedTableOptions::index_block_restart_interval
Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block

Test Plan: unit tests

Reviewers: yhchiang, anthony, andrewkr, sdong

Reviewed By: sdong

Subscribers: march, dhruba

Differential Revision: https://reviews.facebook.net/D53721
2016-02-05 10:22:37 -08:00
Nathan Bronson
2c1db5ea51 always invalidate sequential-insertion cache for concurrent skiplist adds
Summary:
InlineSkipList::InsertConcurrently should invalidate the
sequential-insertion cache prev_[] for all inserts of multi-level nodes,
not just those that increase the height of the skip list.  The invariant
for prev_ is that prev_[i] (i > 0) is supposed to be the predecessor of
prev_[0] at level i.  Before this diff InsertConcurrently could violate
this constraint when inserting a multi-level node after prev_[i] but
before prev_[0].

This diff also reenables kConcurrentSkipList as db_test's
MultiThreaded/MultiThreadedDBTest.MultiThreaded/29.

Test Plan:
1. unit tests
2. temporarily hack kConcurrentSkipList timing so that it is fast but has a 1.5% failure rate on my dev box (1ms stagger on thread launch, 1s test duration, failure rate baseline over 1000 runs)
3. observe 1000 passes post-fix

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D53751
2016-02-03 11:08:16 -08:00
Andrew Kryczka
284aa613a7 Eliminate duplicated property constants
Summary:
Before this diff, there were duplicated constants to refer to properties (user-
facing API had strings and InternalStats had an enum). I noticed these were
inconsistent in terms of which constants are provided, names of constants, and
documentation of constants. Overall it seemed annoying/error-prone to maintain
these duplicated constants.

So, this diff gets rid of InternalStats's constants and replaces them with a map
keyed on the user-facing constant. The value in that map contains a function
pointer to get the property value, so we don't need to do string matching while
holding db->mutex_. This approach has a side benefit of making many small
handler functions rather than a giant switch-statement.
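
The map-of-handlers approach can be sketched as follows. The stats members and handler functions are hypothetical, and the two property names only serve as examples of user-facing strings.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stats holder; member names are illustrative.
struct InternalStats {
  uint64_t num_immutable_memtables = 0;
  uint64_t estimated_num_keys = 0;
};

// Each property is served by one small handler function.
using PropertyHandler = bool (*)(const InternalStats&, uint64_t* value);

bool HandleNumImmutableMemTable(const InternalStats& s, uint64_t* value) {
  *value = s.num_immutable_memtables;
  return true;
}

bool HandleEstimateNumKeys(const InternalStats& s, uint64_t* value) {
  *value = s.estimated_num_keys;
  return true;
}

// Resolves a property name to its handler. The string lookup touches no
// shared state, so it can run before any mutex is taken.
PropertyHandler FindPropertyHandler(const std::string& name) {
  static const std::unordered_map<std::string, PropertyHandler> kHandlers = {
      {"rocksdb.num-immutable-mem-table", &HandleNumImmutableMemTable},
      {"rocksdb.estimate-num-keys", &HandleEstimateNumKeys},
  };
  auto it = kHandlers.find(name);
  return it == kHandlers.end() ? nullptr : it->second;
}
```

A caller would resolve the handler first, then take the lock and invoke only the function pointer.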

Test Plan: db_properties_test passes, running "make commit-prereq -j32"

Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53253
2016-02-02 19:14:56 -08:00
Nathan Bronson
5fcd1ba30a disable kConcurrentSkipList multithreaded test
Summary: Disable test that is intermittently failing

Test Plan: unit tests

Reviewers: igor, andrewkr, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53715
2016-02-02 18:24:47 -08:00
Tomas Kolda
a62c519bb6 RollLogFile tries to find non conflicting file until there is no conflict.
2016-02-02 10:33:49 +01:00
Tomas Kolda
57a95a7001 Making use of GetSystemTimePreciseAsFileTime dynamic - code review fixes
2016-02-02 10:23:56 +01:00
Tomas Kolda
502d41f150 Making use of GetSystemTimePreciseAsFileTime dynamic to not
break compatibility with Windows 7. The issue with rotated logs
was fixed another way.
2016-02-02 10:23:56 +01:00
Nathan Bronson
9c2cf9479b Fix for --allow_concurrent_memtable_write with batching
Summary:
Concurrent memtable adds were incorrectly computing
the last sequence number for a write batch group when the
write batches were not solitary.  This is the cause of
https://github.com/facebook/mysql-5.6/issues/155

Test Plan:
1. unit tests
2. new unit test
3. parallel db_bench stress tests with batch size of 10 and asserts enabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D53595
2016-02-01 20:41:57 -08:00
Siying Dong
7b943da1b2 Merge pull request #967 from SherlockNoMad/ValueSize
Add histogram for value size per operation
2016-02-01 17:59:08 -08:00
Islam AbdelRahman
1ad8182950 Fix WriteBatchTest.ManyUpdates, WriteBatchTest.LargeKeyValue under clang
Summary:
Fix current clang failure
https://ci-builds.fb.com/view/rocksdb/job/rocksdb_clang_build/1398/console

Test Plan:
make sure that both clang and g++ compilation succeed

USE_CLANG=1 make check -j64
make check -j64

Reviewers: anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53667
2016-02-01 16:07:53 -08:00