Commit Graph

2459 Commits

Author SHA1 Message Date
Gunnar Kudrjavets
b954847fca Fix release build for MyRocks by using debug-only code only in debug builds
Summary: MyRocks release integration build breaks because we treat warnings caused by unused variables as errors. Variable `edit` is only used in debug builds. Therefore we need to guard it using `#ifndef NDEBUG` check.

Test Plan:
- `[p]arc diff --preview` for the default validation.
- Verify that release build fails before this fix and passes after applying it.

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60423
2016-07-06 16:07:53 -07:00
sdong
a00bf1b3cf Add More Logging to track total_log_size
Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again.

Test Plan: Run the unit test and see the logging.

Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60081
2016-07-06 14:29:18 -07:00
sdong
32df9733d1 Add options.write_buffer_manager: control total memtable size across DB instances
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.

Test Plan: Add a new unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59925
2016-07-05 18:11:25 -07:00
Aaron Gao
5aaef91d4a group multiple batch of flush into one manifest file (one call to LogAndApply)
Summary: Currently, if several flush outputs are committed together, we issue each manifest write per batch (1 batch = 1 flush = 1 sst file = 1+ continuous memtables). Each manifest write requires one fsync and one fsync to parent directory. In some cases, it becomes the bottleneck of write. We should batch them and write in one manifest write when possible.

Test Plan:
` ./db_bench -benchmarks="fillseq" -max_write_buffer_number=16 -max_background_flushes=16 -disable_auto_compactions=true -min_write_buffer_number_to_merge=1 -write_buffer_size=65536 -level0_stop_writes_trigger=10000 -level0_slowdown_writes_trigger=10000`
**Before**
```
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 4.9
Date:       Fri Jul  1 15:38:17 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :     166.277 micros/op 6014 ops/sec;    0.7 MB/s
```
**After**
```
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 4.9
Date:       Fri Jul  1 15:35:05 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :      52.328 micros/op 19110 ops/sec;    2.1 MB/s
```

Reviewers: andrewkr, IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60075
2016-07-05 18:09:59 -07:00
omegaga
a45ee83181 Fix a bug that accesses invalid address in iterator cleanup function
Summary: Reported in T11889874. When registering the cleanup function we should copy the option so that we can still access it if ReadOptions is deleted.

Test Plan: Add a unit test to reproduce this bug.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60087
2016-07-05 11:57:14 -07:00
Gunnar Kudrjavets
bdb1d19a69 Fix UBSan build break caused by variable not initialized
Summary: UBSan is unhappy because `cfd` is not initialized. This breaks UBSan build which in turn breaks MyRocks continuous integration with RocksDB which in turns makes me unhappy :-) Fix this.

Test Plan:
- `[p]arc diff --preview` + Sandcastle.
- Verify that `COMPILE_WITH_UBSAN=1 OPT=-g make J=1 ubsan_check` gets past the break.

Reviewers: andrewkr, hermanlee4, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60117
2016-06-29 10:49:25 -07:00
sdong
c4cef07f1b Update DBTestUniversalCompaction.UniversalCompactionSingleSortedRun to use max_size_amplification_percent = 0
Summary: With max_size_amplification_percent = 0 to make sure that DBTestUniversalCompaction.UniversalCompactionSingleSortedRun tests the configuration to compact to one single sorted run.

Test Plan: Run all existing tests

Reviewers: yhchiang, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60021
2016-06-27 15:19:27 -07:00
charsyam
4f2b0946d1 fix simple typos (#1183) 2016-06-25 08:29:40 +01:00
Andrew Kryczka
3b7ed677de ColumnFamilyOptions API [CF + RepairDB part 3/3]
Summary:
Overload RepairDB to take vector-of-ColumnFamilyDescriptor, which tells
us CF name + options. Also takes a ColumnFamilyOptions for unspecified column
families encountered during the repair.

One potentially confusing thing is that we store options in the constructor and
don't invoke AddColumnFamily() until discovering the CF in ScanTable. This is
because we don't know the CF ID until we find a table belonging to that CF.

Depends on D59781.

Test Plan:
  $ ./repair_test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59853
2016-06-24 16:29:43 -07:00
Andrew Kryczka
56ac686292 Detect column family from properties [CF + RepairDB part 2/3]
Summary:
This diff uses the CF ID and CF name properties in the SST file
to associate recovered data with the proper column family. Depends on D59775.

- In ScanTable(), create column families in VersionSet each time a new one is discovered (via reading SST file properties)
- In ConvertLogToTable(), dump an SST file for every column family with data in the WAL
- In AddTables(), make a VersionEdit per-column family that adds all of that CF's tables

Test Plan:
  $ ./repair_test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59781
2016-06-24 13:12:13 -07:00
Andrew Kryczka
343507afb1 Refactor to use VersionSet [CF + RepairDB part 1/3]
Summary:
To support column families, it is easiest to use VersionSet to manage
our column families (if we don't have Versions then ColumnFamilyData always
behaves as a dummy column family). This diff only refactors the existing repair
logic to use VersionSet; the next two parts will add support for multiple
column families.

Test Plan:
  $ ./repair_test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59775
2016-06-24 11:19:40 -07:00
omegaga
c4e19b77e8 Add a read option to enable background purge when cleaning up iterators
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.

Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59499
2016-06-21 18:41:23 -07:00
Islam AbdelRahman
fa813f7478 Update DB::AddFile() to ingest the file to the lowest possible level
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level

Test Plan: unit tests

Reviewers: jkedgar, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D59637
2016-06-21 17:57:59 -07:00
sdong
7b79238b65 Deprectate filter_deletes
Summary: filter_deltes is not a frequently used feature. Remove it.

Test Plan: Run all test suites.

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59427
2016-06-17 10:30:47 -07:00
Islam AbdelRahman
30a24f2d3d Add InternalStats and logging for AddFile()
Summary:
We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect
Update InternalStats and add logging for AddFile()

Test Plan: Make sure the code compile and existing tests pass

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59763
2016-06-16 16:21:41 -07:00
sdong
249e796dfc Fix Flaky DBCompactionTest.SkipStatsUpdateTest
Summary: DBCompactionTest.SkipStatsUpdateTest sometimes fails. I don't see any verification related to the deletes issued. Remove them to avoid the uncertainty.

Test Plan: Run the test.

Reviewers: IslamAbdelRahman, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59613
2016-06-15 12:00:51 -07:00
Islam AbdelRahman
f5177c761f Remove wasteful instrumentation in FullMerge (stacked on D59577)
Summary:
[ This diff is stacked on top of D59577 ]

We keep calling timer.ElapsedNanos() on every call to MergeOperator::FullMerge even when statistics are disabled, this is wasteful.

I run the readseq benchmark on a DB containing 100K merge operands for 100K keys (1 operand per key) with 1GB block cache
I see slight performance improvment

Original results

```
$ ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=100000 --num=100000 --db="/dev/shm/100K_merge_compacted/" --cache_size=1073741824 --use_existing_db --disable_auto_compactions
------------------------------------------------
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.498 micros/op 2006597 ops/sec;  222.0 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.295 micros/op 3393627 ops/sec;  375.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.285 micros/op 3511155 ops/sec;  388.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.286 micros/op 3500470 ops/sec;  387.2 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.283 micros/op 3530751 ops/sec;  390.6 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.289 micros/op 3464811 ops/sec;  383.3 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.277 micros/op 3612814 ops/sec;  399.7 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.283 micros/op 3539640 ops/sec;  391.6 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.285 micros/op 3503766 ops/sec;  387.6 MB/s
```

After patch

```
$ ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=100000 --num=100000 --db="/dev/shm/100K_merge_compacted/" --cache_size=1073741824 --use_existing_db --disable_auto_compactions
------------------------------------------------
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.476 micros/op 2100119 ops/sec;  232.3 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.278 micros/op 3600887 ops/sec;  398.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.275 micros/op 3636698 ops/sec;  402.3 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.271 micros/op 3691661 ops/sec;  408.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.273 micros/op 3661534 ops/sec;  405.1 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.276 micros/op 3627106 ops/sec;  401.3 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.272 micros/op 3682635 ops/sec;  407.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.266 micros/op 3758331 ops/sec;  415.8 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.266 micros/op 3761907 ops/sec;  416.2 MB/s
```

Test Plan: make check -j64

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59583
2016-06-13 16:22:14 -07:00
Islam AbdelRahman
7c919deccc Reuse TimedFullMerge instead of FullMerge + instrumentation
Summary:
We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code
This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59577
2016-06-13 16:17:26 -07:00
Yi Wu
bc8af90e8c add option to not flush memtable on open()
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.

Test Plan: See unit test.

Reviewers: dhruba, horuff, sdong

Reviewed By: sdong

Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57813
2016-06-13 11:34:16 -07:00
sdong
6faddd7c55 Merge db/slice.cc into util/slice.cc
Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them.

Test Plan: Run existing tests

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59409
2016-06-10 16:37:36 -07:00
sdong
5009b5326b BlockBasedTable::FullFilterKeyMayMatch() Should skip prefix bloom if full key bloom exists
Summary: Currently, if users define both of full key bloom and prefix bloom in SST files. During Get(), if full key bloom shows the key may exist, we still go ahead and check prefix bloom. This is wasteful. If bloom filter for full keys exists, we should always ignore prefix bloom in Get().

Test Plan: Run existing tests

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57825
2016-06-10 16:27:56 -07:00
sdong
20699df843 memtable_prefix_bloom_bits -> memtable_prefix_bloom_bits_ratio and deprecate memtable_prefix_bloom_probes
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio

Test Plan: Run all existing tests

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59199
2016-06-10 12:12:10 -07:00
Wanning Jiang
56887f6cb8 Backup Options
Summary: Backup options file to private directory

Test Plan:
backupable_db_test.cc, BackupOptions
	   Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D59373
2016-06-09 19:03:10 -07:00
Anirban Rahut
a73b26f601 Adding test for contiguous WAL detection
Summary:
Add a test to detect that when WAL gets truncated,
seq no's are checked to be contiguous.

This test is put in ColumnFamilyTest as it has the necessary
infrastructure/functions for flushing column families, which
we use to ensure 2 active WAL files

Test Plan:
This is a test, no feature has been added.
This test fails today and hence disabled

Reviewers: sdong

Reviewed By: sdong

Subscribers: lgalanis, dhruba, andrewkr, pritamdamania

Differential Revision: https://reviews.facebook.net/D59253
2016-06-07 18:04:15 -07:00
Aaron Gao
e532877940 Add statistics field to show total size of index and filter blocks in block cache
Summary: With `table_options.cache_index_and_filter_blocks = true`, index and filter blocks are stored in block cache. Then people are curious how much of the block cache total size is used by indexes and bloom filters. It will be nice we have a way to report that. It can help people tune performance and plan for optimized hardware setting. We add several enum values for db Statistics. BLOCK_CACHE_INDEX/FILTER_BYTES_INSERT - BLOCK_CACHE_INDEX/FILTER_BYTES_ERASE = current INDEX/FILTER total block size in bytes.

Test Plan:
write a test case called `DBBlockCacheTest.IndexAndFilterBlocksStats`. The result is:
```
[gzh@dev9927.prn1 ~/local/rocksdb]  make db_block_cache_test -j64 && ./db_block_cache_test --gtest_filter=DBBlockCacheTest.IndexAndFilterBlocksStats
Makefile:101: Warning: Compiling in debug mode. Don't use the resulting binary in production
  GEN      util/build_version.cc
  make: `db_block_cache_test' is up to date.
  Note: Google Test filter = DBBlockCacheTest.IndexAndFilterBlocksStats
  [==========] Running 1 test from 1 test case.
  [----------] Global test environment set-up.
  [----------] 1 test from DBBlockCacheTest
  [ RUN      ] DBBlockCacheTest.IndexAndFilterBlocksStats
  [       OK ] DBBlockCacheTest.IndexAndFilterBlocksStats (689 ms)
  [----------] 1 test from DBBlockCacheTest (689 ms total)

  [----------] Global test environment tear-down
  [==========] 1 test from 1 test case ran. (689 ms total)
  [  PASSED  ] 1 test.
```

Reviewers: IslamAbdelRahman, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58677
2016-06-03 10:47:47 -07:00
Jan Doms
02ec8154e5 allow updating block cache capacity from C (#1149) 2016-06-03 14:04:51 +01:00
Andrew Kryczka
842958651f Fix race condition in SwitchMemtable
Summary:
MemTableList::current_ could be written by background flush thread and
simultaneously read in the user thread (NumNotFlushed() is used in
SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan.

Related: D58833

Test Plan:
  $ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test
  $ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59139
2016-06-02 17:11:45 -07:00
PraveenSinghRao
3a276b0cbe Add a callback for when memtable is moved to immutable (#1137)
* Create a callback for memtable becoming immutable

Create a callback for memtable becoming immutable

Create a callback for memtable becoming immutable

moved notification outside the lock

Move sealed notification to unlocked portion of SwitchMemtable

* fix lite build
2016-06-02 11:57:31 -07:00
Mike Kolupaev
936973d145 Small tweaks to logging to track the number of immutable memtables
Summary:
We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds:
- logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables,
- logging of number of remaining immutable memtables after a flush.

Test Plan: ran tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58833
2016-06-01 11:11:33 -07:00
siddontang
21c047ab49 add readahead size option (#1146) 2016-06-01 10:48:50 -07:00
Reid Horuff
5d85fdb2c5 add missing lock 2016-05-31 12:26:48 -07:00
sdong
345fd73faf Fix flaky DBTestDynamicLevel.DynamicLevelMaxBytesBase2
Summary: We added more table properties for each SST file, so when using 2KB SST file size, the estimated size of SST files is off by almost half, causing the LSM tree structure not as expected. Fix it by making file size 4x as previously, as well as LSM base size. Also avoid the sleeping based synchronization and turn to use sync points.

Test Plan: Run paralell unit tests multiple times and make sure they always pass.

Reviewers: IslamAbdelRahman, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58749
2016-05-26 10:13:24 -07:00
krad
8fc75de327 Minor fix to disable DynamicLevelMaxBytesBase2 2016-05-24 17:45:50 -07:00
Ashish Shenoy
99765ed855 Clean up the ComputeCompactionScore() API
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options

Test Plan: UT

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, lgalanis, dhruba

Differential Revision: https://reviews.facebook.net/D58653
2016-05-23 15:55:29 -07:00
Shen Li
def2f7bd0e Expose report_bg_io_stats option in the C API. (#1131) 2016-05-23 13:13:47 -07:00
siddontang
8f1214531e C API: Expose DeleteFileInRange (#1132) 2016-05-23 04:19:47 -07:00
Sage Weil
11f329bd40 db/db_impl: restrict WALRecoveryMode when using recycled log files
kPointInTimeRecovery is indistinguishable from
kTolerateCorruptedTailRecords in recycle mode since we define
the "end" of the log as the first corrupt record we encounter.

kAbsoluteConsistency doesn't make sense because even a clean
shutdown leaves old junk at the end of the log file.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Sage Weil
2b2a898e0b db/log_reader: combine kBadRecord{Len,Checksum} for readability
These vary only by the corruption string reported.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Sage Weil
34df1c94d5 db/log_reader: treat bad record length or checksum as EOF
If we are in kTolerateCorruptedTailRecords, treat these
errors as the end of the log.  This is particularly
important for recycled logs, where we will regularly see
corrupted headers (bad length or checksum) when replaying
a log.  If we are aligned with a block boundary or get lucky,
we will land on an old header and see the log number
mismatch, but more commonly we will land midway through
some previous block and record and effectively see noise.
These must be treated as the end of the log in order for
recycling to work.

This makes the LogTest.Recycle/1 test pass.

We also modify a number of existing tests because the
recycled log files behave fundamentally differently in that
they always stop when they reach the first bad record.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Sage Weil
7947aba68c db/log_reader: move kBadRecord{Len,Checksum} handling into ReadRecord
The behavior here needs to depend on the WAL recovery mode.  No functional
change in this patch.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Sage Weil
847e471db6 db/log_test: add recycle log test
This currently fails because we do not properly map a
corrupt header to the logical end of the log.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Aaron Orenstein
2073cf3775 Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions.
Summary: Reduce use of argument-dependent name lookup in RocksDB.

Test Plan: 'make check' passed.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58203
2016-05-20 07:42:18 -07:00
Richard Cairns Jr
f6e404c20a Added "number of merge operands" to statistics in ssts.
Summary:
A couple of notes from the diff:
  - The namespace block I added at the top of table_properties_collector.cc was in reaction to an issue i was having with PutVarint64 and reusing the "val" string.  I'm not sure this is the cleanest way of doing this, but abstracting this out at least results in the correct behavior.
  - I chose "rocksdb.merge.operands" as the property name.  I am open to suggestions for better names.
  - The change to sst_dump_tool.cc seems a bit inelegant to me.  Is there a better way to do the if-else block?

Test Plan:
I added a test case in table_properties_collector_test.cc.  It adds two merge operands and checks to make sure that both of them are reflected by GetMergeOperands.  It also checks to make sure the wasPropertyPresent bool is properly set in the method.

Running both of these tests should pass:
./table_properties_collector_test
./sst_dump_test

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58119
2016-05-19 14:24:48 -07:00
omegaga
3c69f77c67 Move IO failure test to separate file
Summary:
This is a part of effort to reduce the size of db_test.cc. We move the following tests to a separate file `db_io_failure_test.cc`:

* DropWrites
* DropWritesFlush
* NoSpaceCompactRange
* NonWritableFileSystem
* ManifestWriteError
* PutFailsParanoid

Test Plan: Run `make check` to see if the tests are working properly.

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58341
2016-05-18 17:09:20 -07:00
Islam AbdelRahman
c70a9335de Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles()
Summary:
NotifyOnCompactionCompleted can unlock the mutex.
That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles().

Test Plan:
added unittest
existing unittest

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58065
2016-05-18 14:56:30 -07:00
Reid Horuff
a6254f2bd4 Long outstanding prepare test
Summary: This tests that a prepared transaction is not lost after several crashes, restarts, and memtable flushes.

Test Plan: TwoPhaseLongPrepareTest

Reviewers: sdong

Subscribers: hermanlee4, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58185
2016-05-17 18:57:06 -07:00
Aaron Gao
43afd72bee [rocksdb] make more options dynamic
Summary:
make more ColumnFamilyOptions dynamic:
- compression
- soft_pending_compaction_bytes_limit
- hard_pending_compaction_bytes_limit
- min_partial_merge_operands
- report_bg_io_stats
- paranoid_file_checks

Test Plan:
Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit.
All passed.

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57519
2016-05-17 13:11:56 -07:00
Islam AbdelRahman
f6aedb62c0 Fix Transaction memory leak
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest

Test Plan: Run transaction_test under valgrind

Reviewers: sdong, andrewkr, yhchiang, horuff

Reviewed By: horuff

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58263
2016-05-16 16:32:55 -07:00
krad
a08c8c851a Added PersistentCache abstraction
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.

RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in  COMPRESSED or UNCOMPRESSED mode.

Blame Rev:

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55707
2016-05-15 22:17:18 -07:00
Reid Horuff
a400336398 TransactionLogIterator sequence gap fix
Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs.

Test Plan: db_log_iter_test

Reviewers: andrewkr

Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman

Differential Revision: https://reviews.facebook.net/D58053
2016-05-12 13:54:08 -07:00
Islam AbdelRahman
560358dc93 Fix data race in GetObsoleteFiles()
Summary:
GetObsoleteFiles() and LogAndApply() functions modify obsolete_manifests_ vector
we need to make sure that the mutex is held when we modify the obsolete_manifests_

Test Plan: run the test under TSAN

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58011
2016-05-10 19:30:09 -07:00
Reid Horuff
c27061dae7 [rocksdb] 2PC double recovery bug fix
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost

This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.

The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.

Test Plan: Provided test.

Reviewers: sdong

Subscribers: andrewkr, hermanlee4, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57729
2016-05-10 14:06:07 -07:00
Reid Horuff
a657ee9a9c [rocksdb] Recovery path sequence miscount fix
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]

The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.

Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.

We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))

So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?

Test Plan: provided test.

Reviewers: sdong

Subscribers: andrewkr, leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D57645
2016-05-10 14:06:07 -07:00
Reid Horuff
8a66c85e90 [rocksdb] Two Phase Transaction
Summary:
Two Phase Commit addition to RocksDB.

See wiki: https://github.com/facebook/rocksdb/wiki/Two-Phase-Commit-Implementation
Quip: https://fb.quip.com/pxZrAyrx53r3

Depends on:
WriteBatch modification: https://reviews.facebook.net/D54093
Memtable Log Referencing and Prepared Batch Recovery: https://reviews.facebook.net/D56919

Test Plan:
- SimpleTwoPhaseTransactionTest
- PersistentTwoPhaseTransactionTest.
- TwoPhaseRollbackTest
- TwoPhaseMultiThreadTest
- TwoPhaseLogRollingTest
- TwoPhaseEmptyWriteTest
- TwoPhaseExpirationTest

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: leveldb, hermanlee4, andrewkr, vasilep, dhruba, santoshb

Differential Revision: https://reviews.facebook.net/D56925
2016-05-10 14:06:07 -07:00
Reid Horuff
1b8a2e8fdd [rocksdb] Memtable Log Referencing and Prepared Batch Recovery
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.

modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
-  log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.

- Recovery/writepath is now aware of prepared batches and commit and rollback markers.

Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.

Reviewers: IslamAbdelRahman, sdong

Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D56919
2016-05-10 14:06:07 -07:00
Reid Horuff
0460e9dcce Modification of WriteBatch to support two phase commit
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.

Test Plan: single unit test in write_batch_test.

Reviewers: hermanlee4, sdong, anthony

Subscribers: leveldb, dhruba, vasilep, andrewkr

Differential Revision: https://reviews.facebook.net/D57867
2016-05-10 14:06:07 -07:00
Islam AbdelRahman
d86f9b9c3f Fix lite build
Summary: Fix lite build

Test Plan: run under lite

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57945
2016-05-09 16:08:30 -07:00
Islam AbdelRahman
4b31723433 Add bottommost_compression option
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.

I have also updated CompactionJobInfo to include the compression algorithm used in compaction

Test Plan:
added new unittest
existing unittests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: lightmark, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D57669
2016-05-09 15:57:19 -07:00
sdong
bfb6b1b8a8 Estimate pending compaction bytes more accurately
Summary: Currently we estimate bytes needed for compaction by assuming fanout value to be level multiplier. It overestimates when size of a level exceeds the target by large. We estimate by the ratio of actual sizes in levels instead.

Test Plan: Fix existing test cases and add a new one.

Reviewers: IslamAbdelRahman, igor, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57789
2016-05-09 15:30:02 -07:00
Yi Wu
730f7e2e21 Fix win build
Summary: Fixing error with win build where we compare int64_t with size_t.

Test Plan: make check

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57885
2016-05-09 11:52:28 -07:00
Andrew Kryczka
269f6b2e2d Revert "Modification of WriteBatch to support two phase commit"
Summary: Revert D54093 and D57453

Test Plan: running make check

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57819
2016-05-06 16:58:24 -07:00
sdong
7ccb8d6ef3 BlockBasedTable::Get() not to use prefix bloom if read_options.total_order_seek = true
Summary: This is to provide a way for users to skip prefix bloom in point look-up.

Test Plan: Add a new unit test scenario.

Reviewers: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57747
2016-05-06 10:16:11 -07:00
Islam AbdelRahman
967476eaee Fix valgrind (DBIteratorTest.ReadAhead)
Summary: This test is failing under valgrind because we dont delete the Env that we allocated

Test Plan: run the test under valgrind

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57693
2016-05-05 11:24:08 -07:00
Yi Wu
a4ea345b04 Fixing lite build
Summary: Fixing lite build broke in unit test. `FilesPerLevel()` depends on `DB::GetProperty()`, which lite build doesn't support.

Test Plan: OPT=-DROCKSDB_LITE make check -j64

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57651
2016-05-04 17:20:52 -07:00
Yi Wu
24a24f013d Enable configurable readahead for iterators
Summary:
Add an option `iterator_readahead_size` to `ReadOptions` to enable
configurable readahead for iterators similar to the corresponding
option for compaction.

Test Plan:
```
make commit_prereq
```

Reviewers: kumar.rangarajan, ott, igor, sdong

Reviewed By: sdong

Subscribers: yiwu, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55419
2016-05-04 15:25:58 -07:00
Islam AbdelRahman
ff4b3fb5b4 Fix Iterator::Prev memory pinning bug
Summary: We should not use IterKey::SetKey with copy = false except if we are pinning the iterator thru it's life time, otherwise we may release the temporarily pinned blocks and in this case the IterKey will be pointing to freed memory

Test Plan: added a new test

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57561
2016-05-03 16:50:01 -07:00
Islam AbdelRahman
6e801b0bd1 Eliminate memcpy in Iterator::Prev() by pinning blocks for keys spanning multiple blocks
Summary:
This diff is stacked on top of this diff https://reviews.facebook.net/D56493
The current Iterator::Prev() implementation need to copy every value since the underlying Iterator may move after reading the value.
This can be optimized by making sure that the block containing the value is pinned until the Iterator move. which will improve the throughput by up to 1.5X

master
```
==> 1000000_Keys_100Byte.txt <==
readreverse  :       0.449 micros/op 2225887 ops/sec;  246.2 MB/s
readreverse  :       0.433 micros/op 2311508 ops/sec;  255.7 MB/s
readreverse  :       0.436 micros/op 2294335 ops/sec;  253.8 MB/s
readreverse  :       0.471 micros/op 2121295 ops/sec;  234.7 MB/s
readreverse  :       0.465 micros/op 2152227 ops/sec;  238.1 MB/s
readreverse  :       0.454 micros/op 2203011 ops/sec;  243.7 MB/s
readreverse  :       0.451 micros/op 2216095 ops/sec;  245.2 MB/s
readreverse  :       0.462 micros/op 2162447 ops/sec;  239.2 MB/s
readreverse  :       0.476 micros/op 2099151 ops/sec;  232.2 MB/s
readreverse  :       0.472 micros/op 2120710 ops/sec;  234.6 MB/s

avg : 242.34 MB/s

==> 1000000_Keys_1KB.txt <==
readreverse  :       1.013 micros/op 986793 ops/sec;  978.7 MB/s
readreverse  :       0.942 micros/op 1061136 ops/sec; 1052.5 MB/s
readreverse  :       0.951 micros/op 1051901 ops/sec; 1043.3 MB/s
readreverse  :       0.932 micros/op 1072894 ops/sec; 1064.1 MB/s
readreverse  :       1.024 micros/op 976720 ops/sec;  968.7 MB/s
readreverse  :       0.935 micros/op 1069169 ops/sec; 1060.4 MB/s
readreverse  :       1.012 micros/op 988132 ops/sec;  980.1 MB/s
readreverse  :       0.962 micros/op 1039579 ops/sec; 1031.1 MB/s
readreverse  :       0.991 micros/op 1008924 ops/sec; 1000.7 MB/s
readreverse  :       1.004 micros/op 996144 ops/sec;  988.0 MB/s

avg : 1016.76 MB/s

==> 1000000_Keys_10KB.txt <==
readreverse  :       4.167 micros/op 239952 ops/sec; 2346.9 MB/s
readreverse  :       4.070 micros/op 245713 ops/sec; 2403.3 MB/s
readreverse  :       4.572 micros/op 218733 ops/sec; 2139.4 MB/s
readreverse  :       4.497 micros/op 222388 ops/sec; 2175.2 MB/s
readreverse  :       4.203 micros/op 237920 ops/sec; 2327.1 MB/s
readreverse  :       4.206 micros/op 237756 ops/sec; 2325.5 MB/s
readreverse  :       4.181 micros/op 239149 ops/sec; 2339.1 MB/s
readreverse  :       4.157 micros/op 240552 ops/sec; 2352.8 MB/s
readreverse  :       4.187 micros/op 238848 ops/sec; 2336.1 MB/s
readreverse  :       4.106 micros/op 243575 ops/sec; 2382.4 MB/s

avg : 2312.78 MB/s

==> 100000_Keys_100KB.txt <==
readreverse  :      41.281 micros/op 24224 ops/sec; 2366.0 MB/s
readreverse  :      39.722 micros/op 25175 ops/sec; 2458.9 MB/s
readreverse  :      40.319 micros/op 24802 ops/sec; 2422.5 MB/s
readreverse  :      39.762 micros/op 25149 ops/sec; 2456.4 MB/s
readreverse  :      40.916 micros/op 24440 ops/sec; 2387.1 MB/s
readreverse  :      41.188 micros/op 24278 ops/sec; 2371.4 MB/s
readreverse  :      40.061 micros/op 24962 ops/sec; 2438.1 MB/s
readreverse  :      40.221 micros/op 24862 ops/sec; 2428.4 MB/s
readreverse  :      40.084 micros/op 24947 ops/sec; 2436.7 MB/s
readreverse  :      40.655 micros/op 24597 ops/sec; 2402.4 MB/s

avg : 2416.79 MB/s

==> 10000_Keys_1MB.txt <==
readreverse  :     298.038 micros/op 3355 ops/sec; 3355.3 MB/s
readreverse  :     335.001 micros/op 2985 ops/sec; 2985.1 MB/s
readreverse  :     286.956 micros/op 3484 ops/sec; 3484.9 MB/s
readreverse  :     329.954 micros/op 3030 ops/sec; 3030.8 MB/s
readreverse  :     306.428 micros/op 3263 ops/sec; 3263.5 MB/s
readreverse  :     330.749 micros/op 3023 ops/sec; 3023.5 MB/s
readreverse  :     328.903 micros/op 3040 ops/sec; 3040.5 MB/s
readreverse  :     324.853 micros/op 3078 ops/sec; 3078.4 MB/s
readreverse  :     320.488 micros/op 3120 ops/sec; 3120.3 MB/s
readreverse  :     320.536 micros/op 3119 ops/sec; 3119.8 MB/s

avg : 3150.21 MB/s
```

After memcpy elimination
```

==> 1000000_Keys_100Byte.txt <==
readreverse  :       0.395 micros/op 2529890 ops/sec;  279.9 MB/s
readreverse  :       0.368 micros/op 2715922 ops/sec;  300.5 MB/s
readreverse  :       0.384 micros/op 2603929 ops/sec;  288.1 MB/s
readreverse  :       0.375 micros/op 2663286 ops/sec;  294.6 MB/s
readreverse  :       0.357 micros/op 2802180 ops/sec;  310.0 MB/s
readreverse  :       0.363 micros/op 2757684 ops/sec;  305.1 MB/s
readreverse  :       0.372 micros/op 2689603 ops/sec;  297.5 MB/s
readreverse  :       0.379 micros/op 2638599 ops/sec;  291.9 MB/s
readreverse  :       0.375 micros/op 2663803 ops/sec;  294.7 MB/s
readreverse  :       0.375 micros/op 2665579 ops/sec;  294.9 MB/s

avg: 295.72 MB/s (1.22 X)

==> 1000000_Keys_1KB.txt <==
readreverse  :       0.879 micros/op 1138112 ops/sec; 1128.8 MB/s
readreverse  :       0.842 micros/op 1187998 ops/sec; 1178.3 MB/s
readreverse  :       0.837 micros/op 1194915 ops/sec; 1185.1 MB/s
readreverse  :       0.845 micros/op 1182983 ops/sec; 1173.3 MB/s
readreverse  :       0.877 micros/op 1140308 ops/sec; 1131.0 MB/s
readreverse  :       0.849 micros/op 1177581 ops/sec; 1168.0 MB/s
readreverse  :       0.915 micros/op 1093284 ops/sec; 1084.3 MB/s
readreverse  :       0.863 micros/op 1159418 ops/sec; 1149.9 MB/s
readreverse  :       0.895 micros/op 1117670 ops/sec; 1108.5 MB/s
readreverse  :       0.852 micros/op 1174116 ops/sec; 1164.5 MB/s

avg: 1147.17 MB/s (1.12 X)

==> 1000000_Keys_10KB.txt <==
readreverse  :       3.870 micros/op 258386 ops/sec; 2527.2 MB/s
readreverse  :       3.568 micros/op 280296 ops/sec; 2741.5 MB/s
readreverse  :       4.005 micros/op 249694 ops/sec; 2442.2 MB/s
readreverse  :       3.550 micros/op 281719 ops/sec; 2755.5 MB/s
readreverse  :       3.562 micros/op 280758 ops/sec; 2746.1 MB/s
readreverse  :       3.507 micros/op 285125 ops/sec; 2788.8 MB/s
readreverse  :       3.463 micros/op 288739 ops/sec; 2824.1 MB/s
readreverse  :       3.428 micros/op 291734 ops/sec; 2853.4 MB/s
readreverse  :       3.553 micros/op 281491 ops/sec; 2753.2 MB/s
readreverse  :       3.535 micros/op 282885 ops/sec; 2766.9 MB/s

avg : 2719.89 MB/s (1.17 X)

==> 100000_Keys_100KB.txt <==
readreverse  :      22.815 micros/op 43830 ops/sec; 4281.0 MB/s
readreverse  :      29.957 micros/op 33381 ops/sec; 3260.4 MB/s
readreverse  :      25.334 micros/op 39473 ops/sec; 3855.4 MB/s
readreverse  :      23.037 micros/op 43409 ops/sec; 4239.8 MB/s
readreverse  :      27.810 micros/op 35958 ops/sec; 3512.1 MB/s
readreverse  :      30.327 micros/op 32973 ops/sec; 3220.6 MB/s
readreverse  :      29.704 micros/op 33665 ops/sec; 3288.2 MB/s
readreverse  :      29.423 micros/op 33987 ops/sec; 3319.6 MB/s
readreverse  :      23.334 micros/op 42856 ops/sec; 4185.9 MB/s
readreverse  :      29.969 micros/op 33368 ops/sec; 3259.1 MB/s

avg : 3642.21 MB/s (1.5 X)

==> 10000_Keys_1MB.txt <==
readreverse  :     244.748 micros/op 4085 ops/sec; 4085.9 MB/s
readreverse  :     230.208 micros/op 4343 ops/sec; 4344.0 MB/s
readreverse  :     235.655 micros/op 4243 ops/sec; 4243.6 MB/s
readreverse  :     235.730 micros/op 4242 ops/sec; 4242.2 MB/s
readreverse  :     237.346 micros/op 4213 ops/sec; 4213.3 MB/s
readreverse  :     227.306 micros/op 4399 ops/sec; 4399.4 MB/s
readreverse  :     194.957 micros/op 5129 ops/sec; 5129.4 MB/s
readreverse  :     238.359 micros/op 4195 ops/sec; 4195.4 MB/s
readreverse  :     221.588 micros/op 4512 ops/sec; 4513.0 MB/s
readreverse  :     235.911 micros/op 4238 ops/sec; 4239.0 MB/s

avg : 4360.52 MB/s (1.38 X)
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56511
2016-05-02 21:46:30 -07:00
Warren Falk
b8cf9130f8 Fix #1110, 32-bit build failure on Mac OSX (#1112)
Using explicit 64-bit type in conditional in platforms above 32-bits
This appears to be necessary on Mac OSX as std::conditional does not appear to short circuit and evaluates the third template arg
Making the third template arg be 64 bits explicitly works around this problem and will work on both 32 bit and 64+ bit platforms.
2016-05-02 10:04:37 -07:00
Islam AbdelRahman
21441c09bd Fix calling GetCurrentMutableCFOptions in CompactionJob::ProcessKeyValueCompaction()
Summary: GetCurrentMutableCFOptions() can only be called when DB mutex is held so we cannot call it in CompactionJob::ProcessKeyValueCompaction() since it's not holding the db mutex

Test Plan: make check -j64

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57471
2016-04-29 17:00:50 -07:00
Reid Horuff
6e56a114be Modification of WriteBatch to support two phase commit
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.

Test Plan: single unit test in write_batch_test.

Reviewers: hermanlee4, sdong, anthony

Subscribers: andrewkr, vasilep, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54093
2016-04-29 11:50:30 -07:00
Yi Wu
a92049e3e7 Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.

Test Plan: unit test.

Reviewers: dhruba, yhchiang, ott, sdong

Reviewed By: sdong

Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 11:35:00 -07:00
sdong
6a14f7a976 Change several option defaults
Summary:
Changing several option defaults:
 options.max_open_files changes from 5000 to -1
 options.base_background_compactions changes from max_background_compactions to 1
 options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords
 options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize

Test Plan: Write unit tests to see OldDefaults() works as expected.

Reviewers: IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56427
2016-04-28 17:50:58 -07:00
Peter Mattis
c6c770a1ac Use prefix_same_as_start to avoid iteration in FindNextUserEntryInternal. (#1102)
This avoids excessive iteration in tombstone fields.
2016-04-28 16:48:03 -07:00
sdong
992a8f83b7 Not enable jemalloc status printing if USE_CLANG=1
Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case.

Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57399
2016-04-28 16:16:14 -07:00
Andrew Kryczka
a06faa6327 Skip PresetCompressionDict test for lite
Summary:
This test relies on "rocksdb.num-files-at-levelN" property that isn't
implemented in rocksdb lite. So we will compile it only for non-lite builds.

Test Plan:
  $ make -j40 check 'OPT=-g -DROCKSDB_LITE'

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57387
2016-04-28 15:11:28 -07:00
Andrew Kryczka
0f428c5619 Fix compression dictionary clang osx error
Summary:
There was one narrowing conversion in D52287 that only showed up with
clang on osx.

Test Plan:
  $ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check

Reviewers: sdong, lightmark, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57357
2016-04-28 10:42:10 -07:00
Li Peng
6d4832a998 Merge pull request #1101 from flyd1005/wip-fix-typo
fix typos and remove duplicated words
2016-04-28 02:30:44 -07:00
Andrew Kryczka
54de13abac Fix compression dictionary clang errors
Summary: There were a few narrowing conversions that clang didn't like.

Test Plan:
  $ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57351
2016-04-27 18:30:04 -07:00
Islam AbdelRahman
0850bc5147 Fix build on machines without jemalloc
Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing

Test Plan:
check on my devserver
check on my mac

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57345
2016-04-27 18:25:19 -07:00
Andrew Kryczka
843d2e3137 Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.

During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.

On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.

Test Plan: new unit test

Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52287
2016-04-27 17:36:03 -07:00
Sergey Makarenko
1c80dfab24 Print memory allocation counters
Summary:
Introduced option to dump malloc statistics using new option flag.
    Added new command line option to db_bench tool to enable this
    funtionality.
    Also extended build to support environments with/without jemalloc.

Test Plan:
1) Build rocksdb using `make` command. Launch the following command
    `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
    --num=10000000` end verified that jemalloc dump is present in LOG file.
    2) Build rocksdb using `DISABLE_JEMALLOC=1  make db_bench -j32` and ran
    the same db_bench tool and found the following message in LOG file:
    "Please compile with jemalloc to enable malloc dump".
    3) Also built rocksdb using `make` command on MacOS to verify behavior
    in non-FB environment.
    Also to debug build configuration change temporary changed
    AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
    tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
    command line. For both 2) and 3) this flag was not present.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57321
2016-04-27 16:23:33 -07:00
sdong
ac0e54b4c6 CompactedDB should not be used if there is outstanding WAL files
Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files.

Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do.

Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57057
2016-04-26 14:22:07 -07:00
Islam AbdelRahman
d719b095dc Introduce PinnedIteratorsManager (Reduce PinData() overhead / Refactor PinData)
Summary:
While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times.
This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation

Test Plan:
make check -j64
COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56493
2016-04-26 12:41:07 -07:00
Islam AbdelRahman
7c14abf2c7 Improve BytewiseComparatorImpl::FindShortestSeparator
Summary:
The current implementation find the first different byte and try to increment it, if it cannot it return the original key
we can improve this by keep going after the first different byte to find the first non 0xFF byte and increment it

After trying this patch on some logdevice sst files I see decrease in there index block size by 8.5%

Test Plan: existing tests and updated test

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56241
2016-04-25 23:02:14 -07:00
Islam AbdelRahman
f3eb0b5b8c Make EventListenerTest.CompactionReasonLevel more deterministic
Summary:
In this test some times automatic compactions do everything and Manual compaction become a no-op.
Update the test to make sure manual compaction is not a no-op

Test Plan: run the test

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57189
2016-04-25 18:18:35 -07:00
sdong
7b78d623f7 Shouldn't report default column family's compaction stats as DB compaction stats
Summary:
Now we collect compaction stats per column family, but report default colum family's stat as compaction stats for DB.
Fix it by reporting compaction stats per column family instead.

Test Plan: Run db_bench with --num_column_families=4 and see the number fixed.

Reviewers: IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57063
2016-04-25 13:56:59 -07:00
Yueh-Hsuan Chiang
24110ce90c Correct Statistics FLUSH_WRITE_BYTES
Summary:
In https://reviews.facebook.net/D56271, we fixed an issue where
we consider flush as compaction.  However, that makes us mistakenly
count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.)

This patch removes the one incremented in db_impl.

Test Plan: db_test

Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57111
2016-04-25 12:01:01 -07:00
dx9
b71c4e613f Alpine Linux Build (#990)
* Musl libc does not provide adaptive mutex. Added feature test for PTHREAD_MUTEX_ADAPTIVE_NP.

* Musl libc does not provide backtrace(3). Added a feature check for backtrace(3).

* Fixed compiler error.

* Musl libc does not implement backtrace(3). Added platform check for libexecinfo.

* Alpine does not appear to support gcc -pg option. By default (gcc has PIE option enabled) it fails with:

gcc: error: -pie and -pg|p|profile are incompatible when linking

When -fno-PIE and -nopie are used it fails with:

/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find gcrt1.o: No such file or directory

Added gcc -pg platform test and output PROFILING_FLAGS accordingly. Replaced pg var in Makefile with PROFILING_FLAGS.

* fix segfault when TEST_IOCTL_FRIENDLY_TMPDIR is undefined and default candidates are not suitable

* use ASSERT_DOUBLE_EQ instead of ASSERT_EQ

* When compiled with ROCKSDB_MALLOC_USABLE_SIZE UniversalCompactionFourPaths and UniversalCompactionSecondPathRatio tests fail due to premature memtable flushes on systems with 16-byte alignment. Arena runs out of block space before GenerateNewFile() completes.

Increased options.write_buffer_size.
2016-04-22 16:49:12 -07:00
Naitik Shah
99a3bf8f62 Merge pull request #1068 from daaku/c-purge-old-backups
rocksdb_backup_engine_purge_old_backups for C libraries
2016-04-22 13:49:59 -07:00
Naitik Shah
c146c9be18 rocksdb_create_mem_env to allow C libraries to create mem env (#1066) 2016-04-22 13:25:05 -07:00
Naitik Shah
6da70c5815 expose more options in the c api (#1067) 2016-04-22 13:24:09 -07:00
Naitik Shah
6f01687aae C rocksdb_create_iterators to expose NewIterators (#1069) 2016-04-22 13:22:21 -07:00
Andrew Kryczka
73a847ef89 Add per-level compression ratio property
Summary:
This is needed so we can measure compression ratio improvements
achieved by D52287.

The property compares raw data size against the total file size for a given
level. If the level is empty it should return 0.0.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56967
2016-04-20 18:46:54 -07:00
Dmitri Smirnov
ee221d2de0 Introduce XPRESS compresssion on Windows. (#1081)
Comparable with Snappy on comp ratio.
  Implemented using Windows API, does not require external package.
  Avaiable since Windows 8 and server 2012.
  Use -DXPRESS=1 with CMake to enable.
2016-04-19 22:54:24 -07:00
Islam AbdelRahman
b95510ddf4 Fix DBTest.RateLimitedDelete flakiness
Summary: We need to enable sync_point processing before creating the SstFileManager to ensure that we are holding the bg delete scheduler thread from running

Test Plan:
run the test
debug using printf

Reviewers: sdong, yhchiang, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56871
2016-04-19 14:05:48 -07:00
Yi Wu
725184b04e Fix db_block_cache_test in lite build
Summary: D56715 move some of the tests from db_test to db_block_cache_test. Some of them should be disabled in lite build.

Test Plan:
    make check -j32
    OPT='-DROCKSDB_LITE' make check -j32

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56907
2016-04-18 11:34:11 -07:00
Yi Wu
290883d94a Fix lite build
Summary: Fix rocksdb lite build after D56715.

Test Plan:
  make -j40 'OPT=-g -DROCKSDB_LITE'

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56895
2016-04-18 10:47:10 -07:00
sdong
23089fd281 write_callback_test: clean test directory before running tests
Summary: write_callback_test fails if previous run didn't finish cleanly. Clean the DB before runing the test.

Test Plan: Run the test that see it doesn't fail any more.

Reviewers: andrewkr, yhchiang, yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56859
2016-04-18 10:18:41 -07:00
Yi Wu
792762c42c Split db_test.cc
Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase.

Test Plan: make check

Reviewers: sdong, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong

Differential Revision: https://reviews.facebook.net/D56715
2016-04-18 09:42:50 -07:00
Victor Tyutyunov
e5c614e1df Fixing snapshot 0 assertion
Summary:
Solution is not to change db sequence number to start from 1 because 0 value is used in multiple other places.
Fix covers only compact_iterator::findEarliestVisibleSnapshot with updated logic to support snapshot's numbering starting from 0.

Test Plan:
run:
  make all check

it should pass all tests

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: lgalanis, mgalushka, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56601
2016-04-16 01:47:15 -07:00
sdong
6d436a3f85 DBTest.HardLimit made more deterministic
Summary: In DBTest.HardLimit, multiple flushes may merge into one, based on thread scheduling. Avoid it by waiting each flush to finish before generating the next one.

Test Plan: Run test in parallel several times and see it doesn't fail any more.

Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yiwu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56853
2016-04-15 17:36:57 -07:00
sdong
9d35ae649e Make DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels more deterministic
Summary: DBTestUniversalCompaction, IncreaseUniversalCompactionNumLevels fails one in about 30 runs when running in parallel. We wait for compaction after each flush to make the compaction behavior deterministic.

Test Plan: Run the test 1000 times in parallel and it still passes.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56841
2016-04-15 16:16:53 -07:00
sdong
4b6833aec1 Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too.
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.

Test Plan: Try db_bench and see the stats are printed out.

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56769
2016-04-15 10:22:18 -07:00
Dmitri Smirnov
c9d668c584 Fix unit tests issues on Windows (#1078) 2016-04-14 17:33:53 -07:00
sdong
535af525d6 BlockBasedTable::PrefixMayMatch() to skip index checking if we can't find a filter block.
Summary:
In the case where we can't find a filter block, there is not much benefit of doing the binary search and see whether the index key has the prefix. With the change, we blindly return true if we can't get the filter.
It also fixes missing row cases for reverse comparator with full bloom.

Test Plan: Add a test case that used to fail.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, hermanlee4, yoshinorim, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56697
2016-04-13 19:06:48 -07:00
Islam AbdelRahman
19ef3de57e Fix ManualCompactionPartial test flakiness
Summary: The reason for this test flakiness is that we try to verify that number of files in L0 is 3 after flushing the 3rd file although we may have a compaction running in the background that may finish before we do the check and the 3 L0 files are converted to 1 L1 file

Test Plan: Run a modified version of the test that sleep before doing the check

Reviewers: sdong, andrewkr, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56643
2016-04-13 10:38:45 -07:00
sdong
dff4c48ede BlockBasedTable::PrefixMayMatch: no need to find data block after full bloom checking
Summary:
Full block checking should be a good enough indication of prefix existance. No need to further check data block.
This also fixes wrong results when using prefix bloom and reverse bitwise comparator.

Test Plan: Will add a unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: hermanlee4, yoshinorim, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56625
2016-04-12 16:25:54 -07:00
Yueh-Hsuan Chiang
13e6c8e97a Relax an assertion in Compaction::ShouldStopBefore
Summary:
In some case, it is possible to have two concesutive SST files might sharing
same boundary keys.  However, in the assertion in Compaction::ShouldStopBefore,
it exclude such possibility.

This patch fix this issue by relaxing the assertion to allow the equal case.

Test Plan: rocksdb tests

Reviewers: IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55875
2016-04-11 20:15:52 -07:00
Yueh-Hsuan Chiang
ae21d71e94 Fixed a bug in RocksDB Statistics where flush is considered as compaction
Summary: Fixed a bug in RocksDB Statistics where flush is considered as compaction

Test Plan: unit test

Reviewers: sdong, IslamAbdelRahman, rven, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56271
2016-04-11 19:59:25 -07:00
Jay Edgar
2448f80375 Make sure that if use_mmap_reads is on use_os_buffer is also on
Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on.  This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs.  Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads

Test Plan: New test: DBTest.MMapAndBufferOptions

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56397
2016-04-08 14:30:15 -07:00
Andrew Kryczka
2391ef7214 Embed column family name in SST file
Summary:
Added the column family name to the properties block. This property
is omitted only if the property is unavailable, such as when RepairDB()
writes SST files.

In a next diff, I will change RepairDB to use this new property for
deciding to which column family an existing SST file belongs. If this
property is missing, it will add it to the "unknown" column family (same
as its existing behavior).

Test Plan:
New unit test:

  $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55605
2016-04-06 23:10:32 -07:00
Igor Canadi
ab4c62332e Don't use version in the error message
Summary: We use object `v` in the error message, which is not initialized if the edit is column family manipulation. This doesn't provide much useful info, so this diff is removing it. Instead, it dumps actual VersionEdit contents.

Test Plan: compiles. would be great to get tests in version_set_test.cc that cover cases where a file write fails

Reviewers: sdong, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56349
2016-04-06 15:00:15 -07:00
Aaron Gao
cc87075d63 No need to limit to 20 files in UpdateAccumulatedStats() if options.max_open_files=-1
Summary:
There is a hardcoded constraint in our statistics collection that prevents reading properties from more than 20 SST files. This means our statistics will be very inaccurate for databases with > 20 files since additional files are just ignored. The purpose of constraining the number of files used is to bound the I/O performed during statistics collection, since these statistics need to be recomputed every time the database reopened.

However, this constraint doesn't take into account the case where option "max_open_files" is -1. In that case, all the file metadata has already been read, so MaybeInitializeFileMetaData() won't incur any I/O cost. so this diff gets rid of the 20-file constraint in case max_open_files == -1.

Test Plan:
write into unit test db/db_properties_test.cc - "ValidateSampleNumber".
We generate 20 files with 2 rows and 10 files with 1 row.
If max_open_files !=-1, the `rocksdb.estimate-num-keys` should be (10*1 + 10*2)/20 * 30 = 45. Otherwise, it should be the ground truth, 50.
{F1089153}

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56253
2016-04-01 16:19:12 -07:00
Islam AbdelRahman
8a1a603fdb Eliminate std::deque initialization while iterating over merge operands
Summary:
This patch is similar to D52563, When we iterate over a DB with merge operands we keep creating std::queue to store the operands, optimize this by reusing merge_operands_ data member

Before the patch

```
./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq" --db="/dev/shm/bench_merge_memcpy_on_the_fly/" --merge_operator="put" --merge_keys=10000 --num=10000

DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
mergerandom  :       3.757 micros/op 266141 ops/sec;   29.4 MB/s ( updates:10000)
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.413 micros/op 2423538 ops/sec;  268.1 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.451 micros/op 2219071 ops/sec;  245.5 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.420 micros/op 2382039 ops/sec;  263.5 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.408 micros/op 2452017 ops/sec;  271.3 MB/s

DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
mergerandom  :       3.947 micros/op 253376 ops/sec;   28.0 MB/s ( updates:10000)
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.441 micros/op 2266473 ops/sec;  250.7 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.471 micros/op 2122033 ops/sec;  234.8 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.440 micros/op 2271407 ops/sec;  251.3 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.429 micros/op 2331471 ops/sec;  257.9 MB/s
```

with the patch

```
./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq" --db="/dev/shm/bench_merge_memcpy_on_the_fly/" --merge_operator="put" --merge_keys=10000 --num=10000

DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
mergerandom  :       4.080 micros/op 245092 ops/sec;   27.1 MB/s ( updates:10000)
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.308 micros/op 3241843 ops/sec;  358.6 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.312 micros/op 3200408 ops/sec;  354.0 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.332 micros/op 3013962 ops/sec;  333.4 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.300 micros/op 3328017 ops/sec;  368.2 MB/s

DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
mergerandom  :       3.973 micros/op 251705 ops/sec;   27.8 MB/s ( updates:10000)
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.320 micros/op 3123752 ops/sec;  345.6 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.335 micros/op 2986641 ops/sec;  330.4 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.339 micros/op 2950047 ops/sec;  326.4 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.319 micros/op 3131565 ops/sec;  346.4 MB/s
```

Test Plan: make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56031
2016-04-01 15:48:55 -07:00
Islam AbdelRahman
f38540b12a WriteBatchWithIndex micro optimization
Summary:
  - Put key offset and key size in WriteBatchIndexEntry
  - Use vector for comparators in WriteBatchEntryComparator

I use a slightly modified version of @yoshinorim code to benchmark
https://gist.github.com/IslamAbdelRahman/b120f4fba8d6ff7d58d2

For Put I create a transaction that put a 1000000 keys and measure the time spent without commit.
For GetForUpdate I read the keys that I added in the Put transaction.

Original time:

```
 rm -rf /dev/shm/rocksdb-example/
 ./txn_bench put 1000000
 1000000 OK Ops | took      3.679 seconds
 ./txn_bench get_for_update 1000000
 1000000 OK Ops | took      3.940 seconds
```

New Time

```
  rm -rf /dev/shm/rocksdb-example/
 ./txn_bench put 1000000
 1000000 OK Ops | took      2.727 seconds
 ./txn_bench get_for_update 1000000
 1000000 OK Ops | took      3.880 seconds
```

It looks like there is no significant improvement in GetForUpdate() but we can see ~30% improvement in Put()

Test Plan: unittests

Reviewers: yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D55539
2016-04-01 15:23:46 -07:00
Marton Trencseni
9b51987521 Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.

Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 10:42:39 -07:00
sdong
2feafa3db9 Change some RocksDB default options
Summary: Change some RocksDB default options to make it more friendly to server workloads.

Test Plan: Run all existing tests

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55941
2016-03-31 17:12:18 -07:00
Sandeep Joshi
63e8f1b55b Formatted lines to adhere to 80 char limit 2016-03-31 08:26:55 +05:30
Sandeep Joshi
24420947d9 Replace kHeader by WriteBatchInternal::kHeader in few more places
kHeader was moved from write_batch.cc to header file because
it is being used wherever the number "12" was being used to
check for record size
2016-03-30 23:13:00 +05:30
Sandeep Joshi
3bdbe89614 Merge branch 'magic12' of https://github.com/dcengines/rocksdb into magic12 2016-03-30 23:09:39 +05:30
zensan
78711524b7 In all the places where log records are read, there was a check that
record.size() should not be less than 12.

This "magic number" seems to be the WriteBatch header (8 byte sequence
and 4 byte count).   Replaced all the places where "12" was used
by WriteBatchInternal::kHeader.
2016-03-30 23:05:22 +05:30
Islam AbdelRahman
99ffb3d533 Fix perf_context::merge_operator_time_nanos calculation
Summary: We were not measuring the time spent in merge_operator when called from Version::Get()

Test Plan: added a unittest

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55905
2016-03-25 18:29:43 -07:00
Yueh-Hsuan Chiang
ad2fdaa823 Correct a typo in a comment
Summary: Correct a typo in a comment

Test Plan: No code change.

Reviewers: sdong, kradhakrishnan, IslamAbdelRahman

Reviewed By: kradhakrishnan, IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55803
2016-03-24 19:39:13 -07:00
Yueh-Hsuan Chiang
be9816b3d9 Fix data race issue when sub-compaction is used in CompactionJob
Summary:
When subcompaction is used, all subcompactions share the same Compaction
pointer in CompactionJob while each subcompaction all keeps their mutable
stats in SubcompactionState.  However, there're still some mutable part
that is currently store in the shared Compaction pointer.

This patch makes two changes:

1. Make the shared Compaction pointer const so that it can never be modified
   during the compaction.
2. Move necessary states from Compaction to SubcompactionState.
3. Make functions of Compaction const if the function does not modify
   its internal state.

Test Plan: rocksdb and MyRocks test

Reviewers: sdong, kradhakrishnan, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, yoshinorim, gunnarku, leveldb

Differential Revision: https://reviews.facebook.net/D55923
2016-03-24 19:36:39 -07:00
Praveen Rao
583157f710 Avoid overloaded virtual function 2016-03-22 17:10:31 -07:00
Praveen Rao
136b8e0cad Merge from master 2016-03-22 12:38:44 -07:00
Praveen Rao
2dcbb3b4f3 Addressed review comments 2016-03-22 12:07:15 -07:00
sdong
b1fafcaca6 Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature."
This reverts commit 522de4f59e.

It has bug of index block cleaning up.
2016-03-21 11:50:42 -07:00
agiardullo
fbbb8a6144 Add test for Snapshot 0
Summary:
I ran into this assert when stress testing transactions.  It's pretty easy to repro.

Changing VersionSet::last_sequence_ to start at 1 seems pretty straightforward.  We would just need to change the 4 callers of SetLastSequence(), including recovery code.  I'd make this change myself, but I do not have enough time to test changes to recovery code-paths this week.  But checking in this test case (disabled) for future fixing.

Test Plan: n/a

Reviewers: yhchiang, kradhakrishnan, andrewkr, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55311
2016-03-18 16:16:20 -07:00
Andrew Kryczka
e182f03c1e Add unit tests for RepairDB
Summary:
Basic test cases:

- Manifest is lost or corrupt
- Manifest refers to too many or too few SST files
- SST file is corrupt
- Unflushed data is present when RepairDB is called

Depends on D55065 for its CreateFile() function in file_utils

Test Plan: Ran the tests.

Reviewers: IslamAbdelRahman, yhchiang, yoshinorim, sdong

Reviewed By: sdong

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55485
2016-03-18 15:18:42 -07:00
Praveen Rao
7d371863e5 travis build fixes 2016-03-18 14:43:22 -07:00
Praveen Rao
4f1c74a46e merge from master 2016-03-18 12:48:01 -07:00
Praveen Rao
f8c2189307 Publish log numbers for column family to wal_filter, and provide log number in the record callback 2016-03-18 12:32:15 -07:00
Marton Trencseni
44756260ae Reset block cache in failing unit test.
Test Plan: make -j40 check OPT=-g, on both /tmp and /dev/shm

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55701
2016-03-18 06:13:54 +00:00
Marton Trencseni
522de4f59e Adding pin_l0_filter_and_index_blocks_in_cache feature.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.

Test Plan:
Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).

DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
  Mac: OK.
  Linux: with D55287 patched in it's OK.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54801
2016-03-17 22:40:01 +00:00
dhruba borthakur
33d568611d Merge pull request #1040 from bureau14/master
Fixes warnings and ensure correct int behavior on 32-bit platforms.
2016-03-17 03:23:30 -07:00
Edouard A
02e62ebbc8 Fixes warnings and ensure correct int behavior on 32-bit platforms. 2016-03-16 22:57:57 +01:00
Islam AbdelRahman
3ff98bd209 Fix no compression test
Summary:
DBBlockCacheTest.TestWithCompressedBlockCache is depending on compression using snappy, so this test fail when snappy is not available
block this test when we don't have snappy

https://ci-builds.fb.com/view/rocksdb/job/rocksdb_no_compression/833/console

Test Plan: run the test when compression libraries are not avaliable

Reviewers: sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55413
2016-03-15 12:17:40 -07:00
Dhruba Borthakur
1a2cc27e01 ColumnFamilyOptions SanitizeOptions is buggy on 32-bit platforms.
Summary:
The pre-existing code is trying to clamp between 65,536 and 0,
resulting in clamping to 65,536, resulting in very small buffers,
resulting in ShouldFlushNow() being true quite easily,
resulting in assertion failing and database performance
being "not what it should be".

https://github.com/facebook/rocksdb/issues/1018

Test Plan: make check

Reviewers: sdong, andrewkr, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55455
2016-03-14 16:21:54 -07:00
sdong
b2ae5950ba Index Reader should not be reused after DB restart
Summary:
In block based table reader, wow we put index reader to block cache, which can be retrieved after DB restart. However, index reader may reference internal comparator, which can be destroyed after DB restarts, causing problems.
Fix it by making cache key identical per table reader.

Test Plan: Add a new test which failed with out the commit but now pass.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: maro, yhchiang, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55287
2016-03-14 10:04:09 -07:00
Islam AbdelRahman
580fede347 Aggregate hot Iterator counters in LocalStatistics (DBIter::Next perf regression)
Summary:
This patch bump the counters in the frequent code path DBIter::Next() / DBIter::Prev() in a local data members and send them to Statistics when the iterator is destroyed
A better solution will be to have thread_local implementation for Statistics

New performance
```
readseq      :       0.035 micros/op 28597881 ops/sec; 3163.7 MB/s
     1,851,568,819      stalled-cycles-frontend   #   31.29% frontend cycles idle    [49.86%]
       884,929,823      stalled-cycles-backend    #   14.95% backend  cycles idle    [50.21%]
readreverse  :       0.071 micros/op 14077393 ops/sec; 1557.3 MB/s
     3,239,575,993      stalled-cycles-frontend   #   27.36% frontend cycles idle    [49.96%]
     1,558,253,983      stalled-cycles-backend    #   13.16% backend  cycles idle    [50.14%]

```

Existing performance

```
readreverse  :       0.174 micros/op 5732342 ops/sec;  634.1 MB/s
    20,570,209,389      stalled-cycles-frontend   #   70.71% frontend cycles idle    [50.01%]
    18,422,816,837      stalled-cycles-backend    #   63.33% backend  cycles idle    [50.04%]

readseq      :       0.119 micros/op 8400537 ops/sec;  929.3 MB/s
    15,634,225,844      stalled-cycles-frontend   #   79.07% frontend cycles idle    [49.96%]
    14,227,427,453      stalled-cycles-backend    #   71.95% backend  cycles idle    [50.09%]
```

Test Plan: unit tests

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55107
2016-03-11 19:01:12 -08:00
Baris Yazici
e8e6cf0173 fix: handle_fatal_signal (sig=6) in std::vector<std::string, std::allocator<std::string> >::_M_range_check | c++/4.8.2/bits/stl_vector.h:794 #174
Summary:
Fix for https://github.com/facebook/mysql-5.6/issues/174

When there is no old files to purge, vector.at(i) function was crashing

if (old_info_log_file_count != 0 &&
      old_info_log_file_count >= db_options_.keep_log_file_num) {
    std::sort(old_info_log_files.begin(), old_info_log_files.end());
    size_t end = old_info_log_file_count - db_options_.keep_log_file_num;
    for (unsigned int i = 0; i <= end; i++) {
      std::string& to_delete = old_info_log_files.at(i);

Added check to old_info_log_file_count be non zero.

Test Plan: run existing tests

Reviewers: gunnarku, vasilep, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, webscalesql-eng, dhruba

Differential Revision: https://reviews.facebook.net/D55245
2016-03-11 11:11:45 -08:00
Andrew Kryczka
d9620239d2 Cleanup stale manifests outside of full purge
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge

Test Plan:
  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation

Reviewers: IslamAbdelRahman, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55269
2016-03-10 18:16:21 -08:00
Yi Wu
f71fc77b7c Cache to have an option to fail Cache::Insert() when full
Summary:
Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error.

I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed.

Test Plan: make check -j32, see tests pass.

Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54705
2016-03-10 17:35:19 -08:00
Yueh-Hsuan Chiang
765597fa78 Update compaction score right after CompactFiles forms a compaction
Summary:
This is a follow-up patch of https://reviews.facebook.net/D54891.
As the information about files being compacted will also be used
when making compaction decision, it is necessary to update the compaction
score when a compaction plan has been made but not yet execute.

This patch adds a missing call to update the compaction score in
CompactFiles().

Test Plan: compact_files_test

Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55227
2016-03-10 14:34:28 -08:00
Yueh-Hsuan Chiang
aa3f02d50c Improve comment in compaction.h and compaction_picker.h
Summary:
ReleaseCompactionFiles must be called when DB mutex is held,
but the documentation is mission.

Test Plan: no code change

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54987
2016-03-08 16:46:41 -08:00
sdong
294bdf9ee2 Change Property name from "rocksdb.current_version_number" to "rocksdb.current-super-version-number"
Summary: I realized I again is wrong about the naming convention. Let me change it to the correct one.

Test Plan: Run unit tests.

Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55041
2016-03-04 18:15:29 -08:00
Yueh-Hsuan Chiang
a7d4eb2f34 Fix a bug where flush does not happen when a manual compaction is running
Summary:
Currently, when rocksdb tries to run manual compaction to refit data into a level,
there's a ReFitLevel() process that requires no bg work is currently running.
When RocksDB plans to ReFitLevel(), it will do the following:

 1. pause scheduling new bg work.
 2. wait until all bg work finished
 3. do the ReFitLevel()
 4. unpause scheduling new bg work.

However, as it pause scheduling new bg work at step one and waiting for all bg work
finished in step 2, RocksDB will stop flushing until all bg work is done (which
could take a long time.)

This patch fix this issue by changing the way ReFitLevel() pause the background work:

1. pause scheduling compaction.
2. wait until all bg work finished.
3. pause scheduling flush
4. do ReFitLevel()
5. unpause both flush and compaction.

The major difference is that.  We only pause scheduling compaction in step 1 and wait
for all bg work finished in step 2.  This prevent flush being blocked for a long time.
Although there's a very rare case that ReFitLevel() might be in starvation in step 2,
but it's less likely the case as flush typically finish very fast.

Test Plan: existing test.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55029
2016-03-04 14:24:52 -08:00
Islam AbdelRahman
dfe96c72c3 Fix WriteLevel0TableForRecovery file delete protection
Summary:
The call to

```
CaptureCurrentFileNumberInPendingOutputs()
```

should be before

```
versions_->NewFileNumber()
```
Right now we are not actually protecting the file from being deleted

Test Plan: make check

Reviewers: sdong, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54645
2016-03-03 18:25:07 -08:00
sdong
ef204df7ef Compaction always needs to be removed from level0_compactions_in_progress_ for universal compaction
Summary: We always put compaction to level0_compactions_in_progress_ for universal compaction, so we should also remove it. The bug causes assert failure when running manual compaction.

Test Plan:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom,compact --subcompactions=16 --compaction_style=1
always fails on my host. After the fix, it doesn't fail any more.

Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55017
2016-03-02 21:23:28 -08:00