Commit Graph

3203 Commits

Author SHA1 Message Date
sdong
c4cef07f1b Update DBTestUniversalCompaction.UniversalCompactionSingleSortedRun to use max_size_amplification_percent = 0
Summary: With max_size_amplification_percent = 0 to make sure that DBTestUniversalCompaction.UniversalCompactionSingleSortedRun tests the configuration to compact to one single sorted run.

Test Plan: Run all existing tests

Reviewers: yhchiang, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60021
2016-06-27 15:19:27 -07:00
charsyam
4f2b0946d1 fix simple typos (#1183) 2016-06-25 08:29:40 +01:00
Andrew Kryczka
3b7ed677de ColumnFamilyOptions API [CF + RepairDB part 3/3]
Summary:
Overload RepairDB to take vector-of-ColumnFamilyDescriptor, which tells
us CF name + options. Also takes a ColumnFamilyOptions for unspecified column
families encountered during the repair.

One potentially confusing thing is that we store options in the constructor and
don't invoke AddColumnFamily() until discovering the CF in ScanTable. This is
because we don't know the CF ID until we find a table belonging to that CF.

Depends on D59781.

Test Plan:
  $ ./repair_test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59853
2016-06-24 16:29:43 -07:00
Andrew Kryczka
56ac686292 Detect column family from properties [CF + RepairDB part 2/3]
Summary:
This diff uses the CF ID and CF name properties in the SST file
to associate recovered data with the proper column family. Depends on D59775.

- In ScanTable(), create column families in VersionSet each time a new one is discovered (via reading SST file properties)
- In ConvertLogToTable(), dump an SST file for every column family with data in the WAL
- In AddTables(), make a VersionEdit per-column family that adds all of that CF's tables

Test Plan:
  $ ./repair_test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59781
2016-06-24 13:12:13 -07:00
Andrew Kryczka
343507afb1 Refactor to use VersionSet [CF + RepairDB part 1/3]
Summary:
To support column families, it is easiest to use VersionSet to manage
our column families (if we don't have Versions then ColumnFamilyData always
behaves as a dummy column family). This diff only refactors the existing repair
logic to use VersionSet; the next two parts will add support for multiple
column families.

Test Plan:
  $ ./repair_test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59775
2016-06-24 11:19:40 -07:00
omegaga
c4e19b77e8 Add a read option to enable background purge when cleaning up iterators
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.

Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59499
2016-06-21 18:41:23 -07:00
Islam AbdelRahman
fa813f7478 Update DB::AddFile() to ingest the file to the lowest possible level
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level

Test Plan: unit tests

Reviewers: jkedgar, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D59637
2016-06-21 17:57:59 -07:00
sdong
7b79238b65 Deprectate filter_deletes
Summary: filter_deltes is not a frequently used feature. Remove it.

Test Plan: Run all test suites.

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59427
2016-06-17 10:30:47 -07:00
Islam AbdelRahman
30a24f2d3d Add InternalStats and logging for AddFile()
Summary:
We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect
Update InternalStats and add logging for AddFile()

Test Plan: Make sure the code compile and existing tests pass

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59763
2016-06-16 16:21:41 -07:00
sdong
249e796dfc Fix Flaky DBCompactionTest.SkipStatsUpdateTest
Summary: DBCompactionTest.SkipStatsUpdateTest sometimes fails. I don't see any verification related to the deletes issued. Remove them to avoid the uncertainty.

Test Plan: Run the test.

Reviewers: IslamAbdelRahman, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59613
2016-06-15 12:00:51 -07:00
Islam AbdelRahman
f5177c761f Remove wasteful instrumentation in FullMerge (stacked on D59577)
Summary:
[ This diff is stacked on top of D59577 ]

We keep calling timer.ElapsedNanos() on every call to MergeOperator::FullMerge even when statistics are disabled, this is wasteful.

I run the readseq benchmark on a DB containing 100K merge operands for 100K keys (1 operand per key) with 1GB block cache
I see slight performance improvment

Original results

```
$ ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=100000 --num=100000 --db="/dev/shm/100K_merge_compacted/" --cache_size=1073741824 --use_existing_db --disable_auto_compactions
------------------------------------------------
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.498 micros/op 2006597 ops/sec;  222.0 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.295 micros/op 3393627 ops/sec;  375.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.285 micros/op 3511155 ops/sec;  388.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.286 micros/op 3500470 ops/sec;  387.2 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.283 micros/op 3530751 ops/sec;  390.6 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.289 micros/op 3464811 ops/sec;  383.3 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.277 micros/op 3612814 ops/sec;  399.7 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.283 micros/op 3539640 ops/sec;  391.6 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.285 micros/op 3503766 ops/sec;  387.6 MB/s
```

After patch

```
$ ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=100000 --num=100000 --db="/dev/shm/100K_merge_compacted/" --cache_size=1073741824 --use_existing_db --disable_auto_compactions
------------------------------------------------
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.476 micros/op 2100119 ops/sec;  232.3 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.278 micros/op 3600887 ops/sec;  398.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.275 micros/op 3636698 ops/sec;  402.3 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.271 micros/op 3691661 ops/sec;  408.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.273 micros/op 3661534 ops/sec;  405.1 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.276 micros/op 3627106 ops/sec;  401.3 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.272 micros/op 3682635 ops/sec;  407.4 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.266 micros/op 3758331 ops/sec;  415.8 MB/s
DB path: [/dev/shm/100K_merge_compacted/]
readseq      :       0.266 micros/op 3761907 ops/sec;  416.2 MB/s
```

Test Plan: make check -j64

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59583
2016-06-13 16:22:14 -07:00
Islam AbdelRahman
7c919deccc Reuse TimedFullMerge instead of FullMerge + instrumentation
Summary:
We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code
This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59577
2016-06-13 16:17:26 -07:00
Yi Wu
bc8af90e8c add option to not flush memtable on open()
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.

Test Plan: See unit test.

Reviewers: dhruba, horuff, sdong

Reviewed By: sdong

Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57813
2016-06-13 11:34:16 -07:00
sdong
6faddd7c55 Merge db/slice.cc into util/slice.cc
Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them.

Test Plan: Run existing tests

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59409
2016-06-10 16:37:36 -07:00
sdong
5009b5326b BlockBasedTable::FullFilterKeyMayMatch() Should skip prefix bloom if full key bloom exists
Summary: Currently, if users define both of full key bloom and prefix bloom in SST files. During Get(), if full key bloom shows the key may exist, we still go ahead and check prefix bloom. This is wasteful. If bloom filter for full keys exists, we should always ignore prefix bloom in Get().

Test Plan: Run existing tests

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57825
2016-06-10 16:27:56 -07:00
sdong
20699df843 memtable_prefix_bloom_bits -> memtable_prefix_bloom_bits_ratio and deprecate memtable_prefix_bloom_probes
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio

Test Plan: Run all existing tests

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59199
2016-06-10 12:12:10 -07:00
Wanning Jiang
56887f6cb8 Backup Options
Summary: Backup options file to private directory

Test Plan:
backupable_db_test.cc, BackupOptions
	   Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D59373
2016-06-09 19:03:10 -07:00
Anirban Rahut
a73b26f601 Adding test for contiguous WAL detection
Summary:
Add a test to detect that when WAL gets truncated,
seq no's are checked to be contiguous.

This test is put in ColumnFamilyTest as it has the necessary
infrastructure/functions for flushing column families, which
we use to ensure 2 active WAL files

Test Plan:
This is a test, no feature has been added.
This test fails today and hence disabled

Reviewers: sdong

Reviewed By: sdong

Subscribers: lgalanis, dhruba, andrewkr, pritamdamania

Differential Revision: https://reviews.facebook.net/D59253
2016-06-07 18:04:15 -07:00
Aaron Gao
e532877940 Add statistics field to show total size of index and filter blocks in block cache
Summary: With `table_options.cache_index_and_filter_blocks = true`, index and filter blocks are stored in block cache. Then people are curious how much of the block cache total size is used by indexes and bloom filters. It will be nice we have a way to report that. It can help people tune performance and plan for optimized hardware setting. We add several enum values for db Statistics. BLOCK_CACHE_INDEX/FILTER_BYTES_INSERT - BLOCK_CACHE_INDEX/FILTER_BYTES_ERASE = current INDEX/FILTER total block size in bytes.

Test Plan:
write a test case called `DBBlockCacheTest.IndexAndFilterBlocksStats`. The result is:
```
[gzh@dev9927.prn1 ~/local/rocksdb]  make db_block_cache_test -j64 && ./db_block_cache_test --gtest_filter=DBBlockCacheTest.IndexAndFilterBlocksStats
Makefile:101: Warning: Compiling in debug mode. Don't use the resulting binary in production
  GEN      util/build_version.cc
  make: `db_block_cache_test' is up to date.
  Note: Google Test filter = DBBlockCacheTest.IndexAndFilterBlocksStats
  [==========] Running 1 test from 1 test case.
  [----------] Global test environment set-up.
  [----------] 1 test from DBBlockCacheTest
  [ RUN      ] DBBlockCacheTest.IndexAndFilterBlocksStats
  [       OK ] DBBlockCacheTest.IndexAndFilterBlocksStats (689 ms)
  [----------] 1 test from DBBlockCacheTest (689 ms total)

  [----------] Global test environment tear-down
  [==========] 1 test from 1 test case ran. (689 ms total)
  [  PASSED  ] 1 test.
```

Reviewers: IslamAbdelRahman, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58677
2016-06-03 10:47:47 -07:00
Jan Doms
02ec8154e5 allow updating block cache capacity from C (#1149) 2016-06-03 14:04:51 +01:00
Andrew Kryczka
842958651f Fix race condition in SwitchMemtable
Summary:
MemTableList::current_ could be written by background flush thread and
simultaneously read in the user thread (NumNotFlushed() is used in
SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan.

Related: D58833

Test Plan:
  $ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test
  $ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59139
2016-06-02 17:11:45 -07:00
PraveenSinghRao
3a276b0cbe Add a callback for when memtable is moved to immutable (#1137)
* Create a callback for memtable becoming immutable

Create a callback for memtable becoming immutable

Create a callback for memtable becoming immutable

moved notification outside the lock

Move sealed notification to unlocked portion of SwitchMemtable

* fix lite build
2016-06-02 11:57:31 -07:00
Mike Kolupaev
936973d145 Small tweaks to logging to track the number of immutable memtables
Summary:
We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds:
- logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables,
- logging of number of remaining immutable memtables after a flush.

Test Plan: ran tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58833
2016-06-01 11:11:33 -07:00
siddontang
21c047ab49 add readahead size option (#1146) 2016-06-01 10:48:50 -07:00
Reid Horuff
5d85fdb2c5 add missing lock 2016-05-31 12:26:48 -07:00
sdong
345fd73faf Fix flaky DBTestDynamicLevel.DynamicLevelMaxBytesBase2
Summary: We added more table properties for each SST file, so when using 2KB SST file size, the estimated size of SST files is off by almost half, causing the LSM tree structure not as expected. Fix it by making file size 4x as previously, as well as LSM base size. Also avoid the sleeping based synchronization and turn to use sync points.

Test Plan: Run paralell unit tests multiple times and make sure they always pass.

Reviewers: IslamAbdelRahman, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58749
2016-05-26 10:13:24 -07:00
krad
8fc75de327 Minor fix to disable DynamicLevelMaxBytesBase2 2016-05-24 17:45:50 -07:00
Ashish Shenoy
99765ed855 Clean up the ComputeCompactionScore() API
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options

Test Plan: UT

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, lgalanis, dhruba

Differential Revision: https://reviews.facebook.net/D58653
2016-05-23 15:55:29 -07:00
Shen Li
def2f7bd0e Expose report_bg_io_stats option in the C API. (#1131) 2016-05-23 13:13:47 -07:00
siddontang
8f1214531e C API: Expose DeleteFileInRange (#1132) 2016-05-23 04:19:47 -07:00
Sage Weil
11f329bd40 db/db_impl: restrict WALRecoveryMode when using recycled log files
kPointInTimeRecovery is indistinguishable from
kTolerateCorruptedTailRecords in recycle mode since we define
the "end" of the log as the first corrupt record we encounter.

kAbsoluteConsistency doesn't make sense because even a clean
shutdown leaves old junk at the end of the log file.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Sage Weil
2b2a898e0b db/log_reader: combine kBadRecord{Len,Checksum} for readability
These vary only by the corruption string reported.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Sage Weil
34df1c94d5 db/log_reader: treat bad record length or checksum as EOF
If we are in kTolerateCorruptedTailRecords, treat these
errors as the end of the log.  This is particularly
important for recycled logs, where we will regularly see
corrupted headers (bad length or checksum) when replaying
a log.  If we are aligned with a block boundary or get lucky,
we will land on an old header and see the log number
mismatch, but more commonly we will land midway through
some previous block and record and effectively see noise.
These must be treated as the end of the log in order for
recycling to work.

This makes the LogTest.Recycle/1 test pass.

We also modify a number of existing tests because the
recycled log files behave fundamentally differently in that
they always stop when they reach the first bad record.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Sage Weil
7947aba68c db/log_reader: move kBadRecord{Len,Checksum} handling into ReadRecord
The behavior here needs to depend on the WAL recovery mode.  No functional
change in this patch.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Sage Weil
847e471db6 db/log_test: add recycle log test
This currently fails because we do not properly map a
corrupt header to the logical end of the log.

Signed-off-by: Sage Weil <sage@redhat.com>
2016-05-22 22:00:15 -07:00
Aaron Orenstein
2073cf3775 Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions.
Summary: Reduce use of argument-dependent name lookup in RocksDB.

Test Plan: 'make check' passed.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58203
2016-05-20 07:42:18 -07:00
Richard Cairns Jr
f6e404c20a Added "number of merge operands" to statistics in ssts.
Summary:
A couple of notes from the diff:
  - The namespace block I added at the top of table_properties_collector.cc was in reaction to an issue i was having with PutVarint64 and reusing the "val" string.  I'm not sure this is the cleanest way of doing this, but abstracting this out at least results in the correct behavior.
  - I chose "rocksdb.merge.operands" as the property name.  I am open to suggestions for better names.
  - The change to sst_dump_tool.cc seems a bit inelegant to me.  Is there a better way to do the if-else block?

Test Plan:
I added a test case in table_properties_collector_test.cc.  It adds two merge operands and checks to make sure that both of them are reflected by GetMergeOperands.  It also checks to make sure the wasPropertyPresent bool is properly set in the method.

Running both of these tests should pass:
./table_properties_collector_test
./sst_dump_test

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58119
2016-05-19 14:24:48 -07:00
omegaga
3c69f77c67 Move IO failure test to separate file
Summary:
This is a part of effort to reduce the size of db_test.cc. We move the following tests to a separate file `db_io_failure_test.cc`:

* DropWrites
* DropWritesFlush
* NoSpaceCompactRange
* NonWritableFileSystem
* ManifestWriteError
* PutFailsParanoid

Test Plan: Run `make check` to see if the tests are working properly.

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58341
2016-05-18 17:09:20 -07:00
Islam AbdelRahman
c70a9335de Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles()
Summary:
NotifyOnCompactionCompleted can unlock the mutex.
That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles().

Test Plan:
added unittest
existing unittest

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58065
2016-05-18 14:56:30 -07:00
Reid Horuff
a6254f2bd4 Long outstanding prepare test
Summary: This tests that a prepared transaction is not lost after several crashes, restarts, and memtable flushes.

Test Plan: TwoPhaseLongPrepareTest

Reviewers: sdong

Subscribers: hermanlee4, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58185
2016-05-17 18:57:06 -07:00
Aaron Gao
43afd72bee [rocksdb] make more options dynamic
Summary:
make more ColumnFamilyOptions dynamic:
- compression
- soft_pending_compaction_bytes_limit
- hard_pending_compaction_bytes_limit
- min_partial_merge_operands
- report_bg_io_stats
- paranoid_file_checks

Test Plan:
Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit.
All passed.

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57519
2016-05-17 13:11:56 -07:00
Islam AbdelRahman
f6aedb62c0 Fix Transaction memory leak
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest

Test Plan: Run transaction_test under valgrind

Reviewers: sdong, andrewkr, yhchiang, horuff

Reviewed By: horuff

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58263
2016-05-16 16:32:55 -07:00
krad
a08c8c851a Added PersistentCache abstraction
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.

RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in  COMPRESSED or UNCOMPRESSED mode.

Blame Rev:

Test Plan: Run unit tests

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55707
2016-05-15 22:17:18 -07:00
Reid Horuff
a400336398 TransactionLogIterator sequence gap fix
Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs.

Test Plan: db_log_iter_test

Reviewers: andrewkr

Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman

Differential Revision: https://reviews.facebook.net/D58053
2016-05-12 13:54:08 -07:00
Islam AbdelRahman
560358dc93 Fix data race in GetObsoleteFiles()
Summary:
GetObsoleteFiles() and LogAndApply() functions modify obsolete_manifests_ vector
we need to make sure that the mutex is held when we modify the obsolete_manifests_

Test Plan: run the test under TSAN

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58011
2016-05-10 19:30:09 -07:00
Reid Horuff
c27061dae7 [rocksdb] 2PC double recovery bug fix
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost

This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.

The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.

Test Plan: Provided test.

Reviewers: sdong

Subscribers: andrewkr, hermanlee4, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57729
2016-05-10 14:06:07 -07:00
Reid Horuff
a657ee9a9c [rocksdb] Recovery path sequence miscount fix
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]

The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.

Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.

We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))

So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?

Test Plan: provided test.

Reviewers: sdong

Subscribers: andrewkr, leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D57645
2016-05-10 14:06:07 -07:00
Reid Horuff
8a66c85e90 [rocksdb] Two Phase Transaction
Summary:
Two Phase Commit addition to RocksDB.

See wiki: https://github.com/facebook/rocksdb/wiki/Two-Phase-Commit-Implementation
Quip: https://fb.quip.com/pxZrAyrx53r3

Depends on:
WriteBatch modification: https://reviews.facebook.net/D54093
Memtable Log Referencing and Prepared Batch Recovery: https://reviews.facebook.net/D56919

Test Plan:
- SimpleTwoPhaseTransactionTest
- PersistentTwoPhaseTransactionTest.
- TwoPhaseRollbackTest
- TwoPhaseMultiThreadTest
- TwoPhaseLogRollingTest
- TwoPhaseEmptyWriteTest
- TwoPhaseExpirationTest

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: leveldb, hermanlee4, andrewkr, vasilep, dhruba, santoshb

Differential Revision: https://reviews.facebook.net/D56925
2016-05-10 14:06:07 -07:00
Reid Horuff
1b8a2e8fdd [rocksdb] Memtable Log Referencing and Prepared Batch Recovery
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.

modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
-  log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.

- Recovery/writepath is now aware of prepared batches and commit and rollback markers.

Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.

Reviewers: IslamAbdelRahman, sdong

Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D56919
2016-05-10 14:06:07 -07:00
Reid Horuff
0460e9dcce Modification of WriteBatch to support two phase commit
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.

Test Plan: single unit test in write_batch_test.

Reviewers: hermanlee4, sdong, anthony

Subscribers: leveldb, dhruba, vasilep, andrewkr

Differential Revision: https://reviews.facebook.net/D57867
2016-05-10 14:06:07 -07:00
Islam AbdelRahman
d86f9b9c3f Fix lite build
Summary: Fix lite build

Test Plan: run under lite

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57945
2016-05-09 16:08:30 -07:00
Islam AbdelRahman
4b31723433 Add bottommost_compression option
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.

I have also updated CompactionJobInfo to include the compression algorithm used in compaction

Test Plan:
added new unittest
existing unittests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: lightmark, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D57669
2016-05-09 15:57:19 -07:00
sdong
bfb6b1b8a8 Estimate pending compaction bytes more accurately
Summary: Currently we estimate bytes needed for compaction by assuming fanout value to be level multiplier. It overestimates when size of a level exceeds the target by large. We estimate by the ratio of actual sizes in levels instead.

Test Plan: Fix existing test cases and add a new one.

Reviewers: IslamAbdelRahman, igor, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57789
2016-05-09 15:30:02 -07:00
Yi Wu
730f7e2e21 Fix win build
Summary: Fixing error with win build where we compare int64_t with size_t.

Test Plan: make check

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57885
2016-05-09 11:52:28 -07:00
Andrew Kryczka
269f6b2e2d Revert "Modification of WriteBatch to support two phase commit"
Summary: Revert D54093 and D57453

Test Plan: running make check

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57819
2016-05-06 16:58:24 -07:00
sdong
7ccb8d6ef3 BlockBasedTable::Get() not to use prefix bloom if read_options.total_order_seek = true
Summary: This is to provide a way for users to skip prefix bloom in point look-up.

Test Plan: Add a new unit test scenario.

Reviewers: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57747
2016-05-06 10:16:11 -07:00
Islam AbdelRahman
967476eaee Fix valgrind (DBIteratorTest.ReadAhead)
Summary: This test is failing under valgrind because we dont delete the Env that we allocated

Test Plan: run the test under valgrind

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57693
2016-05-05 11:24:08 -07:00
Yi Wu
a4ea345b04 Fixing lite build
Summary: Fixing lite build broke in unit test. `FilesPerLevel()` depends on `DB::GetProperty()`, which lite build doesn't support.

Test Plan: OPT=-DROCKSDB_LITE make check -j64

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57651
2016-05-04 17:20:52 -07:00
Yi Wu
24a24f013d Enable configurable readahead for iterators
Summary:
Add an option `iterator_readahead_size` to `ReadOptions` to enable
configurable readahead for iterators similar to the corresponding
option for compaction.

Test Plan:
```
make commit_prereq
```

Reviewers: kumar.rangarajan, ott, igor, sdong

Reviewed By: sdong

Subscribers: yiwu, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55419
2016-05-04 15:25:58 -07:00
Islam AbdelRahman
ff4b3fb5b4 Fix Iterator::Prev memory pinning bug
Summary: We should not use IterKey::SetKey with copy = false except if we are pinning the iterator thru it's life time, otherwise we may release the temporarily pinned blocks and in this case the IterKey will be pointing to freed memory

Test Plan: added a new test

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57561
2016-05-03 16:50:01 -07:00
Islam AbdelRahman
6e801b0bd1 Eliminate memcpy in Iterator::Prev() by pinning blocks for keys spanning multiple blocks
Summary:
This diff is stacked on top of this diff https://reviews.facebook.net/D56493
The current Iterator::Prev() implementation need to copy every value since the underlying Iterator may move after reading the value.
This can be optimized by making sure that the block containing the value is pinned until the Iterator move. which will improve the throughput by up to 1.5X

master
```
==> 1000000_Keys_100Byte.txt <==
readreverse  :       0.449 micros/op 2225887 ops/sec;  246.2 MB/s
readreverse  :       0.433 micros/op 2311508 ops/sec;  255.7 MB/s
readreverse  :       0.436 micros/op 2294335 ops/sec;  253.8 MB/s
readreverse  :       0.471 micros/op 2121295 ops/sec;  234.7 MB/s
readreverse  :       0.465 micros/op 2152227 ops/sec;  238.1 MB/s
readreverse  :       0.454 micros/op 2203011 ops/sec;  243.7 MB/s
readreverse  :       0.451 micros/op 2216095 ops/sec;  245.2 MB/s
readreverse  :       0.462 micros/op 2162447 ops/sec;  239.2 MB/s
readreverse  :       0.476 micros/op 2099151 ops/sec;  232.2 MB/s
readreverse  :       0.472 micros/op 2120710 ops/sec;  234.6 MB/s

avg : 242.34 MB/s

==> 1000000_Keys_1KB.txt <==
readreverse  :       1.013 micros/op 986793 ops/sec;  978.7 MB/s
readreverse  :       0.942 micros/op 1061136 ops/sec; 1052.5 MB/s
readreverse  :       0.951 micros/op 1051901 ops/sec; 1043.3 MB/s
readreverse  :       0.932 micros/op 1072894 ops/sec; 1064.1 MB/s
readreverse  :       1.024 micros/op 976720 ops/sec;  968.7 MB/s
readreverse  :       0.935 micros/op 1069169 ops/sec; 1060.4 MB/s
readreverse  :       1.012 micros/op 988132 ops/sec;  980.1 MB/s
readreverse  :       0.962 micros/op 1039579 ops/sec; 1031.1 MB/s
readreverse  :       0.991 micros/op 1008924 ops/sec; 1000.7 MB/s
readreverse  :       1.004 micros/op 996144 ops/sec;  988.0 MB/s

avg : 1016.76 MB/s

==> 1000000_Keys_10KB.txt <==
readreverse  :       4.167 micros/op 239952 ops/sec; 2346.9 MB/s
readreverse  :       4.070 micros/op 245713 ops/sec; 2403.3 MB/s
readreverse  :       4.572 micros/op 218733 ops/sec; 2139.4 MB/s
readreverse  :       4.497 micros/op 222388 ops/sec; 2175.2 MB/s
readreverse  :       4.203 micros/op 237920 ops/sec; 2327.1 MB/s
readreverse  :       4.206 micros/op 237756 ops/sec; 2325.5 MB/s
readreverse  :       4.181 micros/op 239149 ops/sec; 2339.1 MB/s
readreverse  :       4.157 micros/op 240552 ops/sec; 2352.8 MB/s
readreverse  :       4.187 micros/op 238848 ops/sec; 2336.1 MB/s
readreverse  :       4.106 micros/op 243575 ops/sec; 2382.4 MB/s

avg : 2312.78 MB/s

==> 100000_Keys_100KB.txt <==
readreverse  :      41.281 micros/op 24224 ops/sec; 2366.0 MB/s
readreverse  :      39.722 micros/op 25175 ops/sec; 2458.9 MB/s
readreverse  :      40.319 micros/op 24802 ops/sec; 2422.5 MB/s
readreverse  :      39.762 micros/op 25149 ops/sec; 2456.4 MB/s
readreverse  :      40.916 micros/op 24440 ops/sec; 2387.1 MB/s
readreverse  :      41.188 micros/op 24278 ops/sec; 2371.4 MB/s
readreverse  :      40.061 micros/op 24962 ops/sec; 2438.1 MB/s
readreverse  :      40.221 micros/op 24862 ops/sec; 2428.4 MB/s
readreverse  :      40.084 micros/op 24947 ops/sec; 2436.7 MB/s
readreverse  :      40.655 micros/op 24597 ops/sec; 2402.4 MB/s

avg : 2416.79 MB/s

==> 10000_Keys_1MB.txt <==
readreverse  :     298.038 micros/op 3355 ops/sec; 3355.3 MB/s
readreverse  :     335.001 micros/op 2985 ops/sec; 2985.1 MB/s
readreverse  :     286.956 micros/op 3484 ops/sec; 3484.9 MB/s
readreverse  :     329.954 micros/op 3030 ops/sec; 3030.8 MB/s
readreverse  :     306.428 micros/op 3263 ops/sec; 3263.5 MB/s
readreverse  :     330.749 micros/op 3023 ops/sec; 3023.5 MB/s
readreverse  :     328.903 micros/op 3040 ops/sec; 3040.5 MB/s
readreverse  :     324.853 micros/op 3078 ops/sec; 3078.4 MB/s
readreverse  :     320.488 micros/op 3120 ops/sec; 3120.3 MB/s
readreverse  :     320.536 micros/op 3119 ops/sec; 3119.8 MB/s

avg : 3150.21 MB/s
```

After memcpy elimination
```

==> 1000000_Keys_100Byte.txt <==
readreverse  :       0.395 micros/op 2529890 ops/sec;  279.9 MB/s
readreverse  :       0.368 micros/op 2715922 ops/sec;  300.5 MB/s
readreverse  :       0.384 micros/op 2603929 ops/sec;  288.1 MB/s
readreverse  :       0.375 micros/op 2663286 ops/sec;  294.6 MB/s
readreverse  :       0.357 micros/op 2802180 ops/sec;  310.0 MB/s
readreverse  :       0.363 micros/op 2757684 ops/sec;  305.1 MB/s
readreverse  :       0.372 micros/op 2689603 ops/sec;  297.5 MB/s
readreverse  :       0.379 micros/op 2638599 ops/sec;  291.9 MB/s
readreverse  :       0.375 micros/op 2663803 ops/sec;  294.7 MB/s
readreverse  :       0.375 micros/op 2665579 ops/sec;  294.9 MB/s

avg: 295.72 MB/s (1.22 X)

==> 1000000_Keys_1KB.txt <==
readreverse  :       0.879 micros/op 1138112 ops/sec; 1128.8 MB/s
readreverse  :       0.842 micros/op 1187998 ops/sec; 1178.3 MB/s
readreverse  :       0.837 micros/op 1194915 ops/sec; 1185.1 MB/s
readreverse  :       0.845 micros/op 1182983 ops/sec; 1173.3 MB/s
readreverse  :       0.877 micros/op 1140308 ops/sec; 1131.0 MB/s
readreverse  :       0.849 micros/op 1177581 ops/sec; 1168.0 MB/s
readreverse  :       0.915 micros/op 1093284 ops/sec; 1084.3 MB/s
readreverse  :       0.863 micros/op 1159418 ops/sec; 1149.9 MB/s
readreverse  :       0.895 micros/op 1117670 ops/sec; 1108.5 MB/s
readreverse  :       0.852 micros/op 1174116 ops/sec; 1164.5 MB/s

avg: 1147.17 MB/s (1.12 X)

==> 1000000_Keys_10KB.txt <==
readreverse  :       3.870 micros/op 258386 ops/sec; 2527.2 MB/s
readreverse  :       3.568 micros/op 280296 ops/sec; 2741.5 MB/s
readreverse  :       4.005 micros/op 249694 ops/sec; 2442.2 MB/s
readreverse  :       3.550 micros/op 281719 ops/sec; 2755.5 MB/s
readreverse  :       3.562 micros/op 280758 ops/sec; 2746.1 MB/s
readreverse  :       3.507 micros/op 285125 ops/sec; 2788.8 MB/s
readreverse  :       3.463 micros/op 288739 ops/sec; 2824.1 MB/s
readreverse  :       3.428 micros/op 291734 ops/sec; 2853.4 MB/s
readreverse  :       3.553 micros/op 281491 ops/sec; 2753.2 MB/s
readreverse  :       3.535 micros/op 282885 ops/sec; 2766.9 MB/s

avg : 2719.89 MB/s (1.17 X)

==> 100000_Keys_100KB.txt <==
readreverse  :      22.815 micros/op 43830 ops/sec; 4281.0 MB/s
readreverse  :      29.957 micros/op 33381 ops/sec; 3260.4 MB/s
readreverse  :      25.334 micros/op 39473 ops/sec; 3855.4 MB/s
readreverse  :      23.037 micros/op 43409 ops/sec; 4239.8 MB/s
readreverse  :      27.810 micros/op 35958 ops/sec; 3512.1 MB/s
readreverse  :      30.327 micros/op 32973 ops/sec; 3220.6 MB/s
readreverse  :      29.704 micros/op 33665 ops/sec; 3288.2 MB/s
readreverse  :      29.423 micros/op 33987 ops/sec; 3319.6 MB/s
readreverse  :      23.334 micros/op 42856 ops/sec; 4185.9 MB/s
readreverse  :      29.969 micros/op 33368 ops/sec; 3259.1 MB/s

avg : 3642.21 MB/s (1.5 X)

==> 10000_Keys_1MB.txt <==
readreverse  :     244.748 micros/op 4085 ops/sec; 4085.9 MB/s
readreverse  :     230.208 micros/op 4343 ops/sec; 4344.0 MB/s
readreverse  :     235.655 micros/op 4243 ops/sec; 4243.6 MB/s
readreverse  :     235.730 micros/op 4242 ops/sec; 4242.2 MB/s
readreverse  :     237.346 micros/op 4213 ops/sec; 4213.3 MB/s
readreverse  :     227.306 micros/op 4399 ops/sec; 4399.4 MB/s
readreverse  :     194.957 micros/op 5129 ops/sec; 5129.4 MB/s
readreverse  :     238.359 micros/op 4195 ops/sec; 4195.4 MB/s
readreverse  :     221.588 micros/op 4512 ops/sec; 4513.0 MB/s
readreverse  :     235.911 micros/op 4238 ops/sec; 4239.0 MB/s

avg : 4360.52 MB/s (1.38 X)
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56511
2016-05-02 21:46:30 -07:00
Warren Falk
b8cf9130f8 Fix #1110, 32-bit build failure on Mac OSX (#1112)
Using explicit 64-bit type in conditional in platforms above 32-bits
This appears to be necessary on Mac OSX as std::conditional does not appear to short circuit and evaluates the third template arg
Making the third template arg be 64 bits explicitly works around this problem and will work on both 32 bit and 64+ bit platforms.
2016-05-02 10:04:37 -07:00
Islam AbdelRahman
21441c09bd Fix calling GetCurrentMutableCFOptions in CompactionJob::ProcessKeyValueCompaction()
Summary: GetCurrentMutableCFOptions() can only be called when DB mutex is held so we cannot call it in CompactionJob::ProcessKeyValueCompaction() since it's not holding the db mutex

Test Plan: make check -j64

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57471
2016-04-29 17:00:50 -07:00
Reid Horuff
6e56a114be Modification of WriteBatch to support two phase commit
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.

Test Plan: single unit test in write_batch_test.

Reviewers: hermanlee4, sdong, anthony

Subscribers: andrewkr, vasilep, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54093
2016-04-29 11:50:30 -07:00
Yi Wu
a92049e3e7 Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.

Test Plan: unit test.

Reviewers: dhruba, yhchiang, ott, sdong

Reviewed By: sdong

Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 11:35:00 -07:00
sdong
6a14f7a976 Change several option defaults
Summary:
Changing several option defaults:
 options.max_open_files changes from 5000 to -1
 options.base_background_compactions changes from max_background_compactions to 1
 options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords
 options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize

Test Plan: Write unit tests to see OldDefaults() works as expected.

Reviewers: IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56427
2016-04-28 17:50:58 -07:00
Peter Mattis
c6c770a1ac Use prefix_same_as_start to avoid iteration in FindNextUserEntryInternal. (#1102)
This avoids excessive iteration in tombstone fields.
2016-04-28 16:48:03 -07:00
sdong
992a8f83b7 Not enable jemalloc status printing if USE_CLANG=1
Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case.

Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57399
2016-04-28 16:16:14 -07:00
Andrew Kryczka
a06faa6327 Skip PresetCompressionDict test for lite
Summary:
This test relies on "rocksdb.num-files-at-levelN" property that isn't
implemented in rocksdb lite. So we will compile it only for non-lite builds.

Test Plan:
  $ make -j40 check 'OPT=-g -DROCKSDB_LITE'

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57387
2016-04-28 15:11:28 -07:00
Andrew Kryczka
0f428c5619 Fix compression dictionary clang osx error
Summary:
There was one narrowing conversion in D52287 that only showed up with
clang on osx.

Test Plan:
  $ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check

Reviewers: sdong, lightmark, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57357
2016-04-28 10:42:10 -07:00
Li Peng
6d4832a998 Merge pull request #1101 from flyd1005/wip-fix-typo
fix typos and remove duplicated words
2016-04-28 02:30:44 -07:00
Andrew Kryczka
54de13abac Fix compression dictionary clang errors
Summary: There were a few narrowing conversions that clang didn't like.

Test Plan:
  $ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57351
2016-04-27 18:30:04 -07:00
Islam AbdelRahman
0850bc5147 Fix build on machines without jemalloc
Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing

Test Plan:
check on my devserver
check on my mac

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57345
2016-04-27 18:25:19 -07:00
Andrew Kryczka
843d2e3137 Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.

During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.

On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.

Test Plan: new unit test

Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52287
2016-04-27 17:36:03 -07:00
Sergey Makarenko
1c80dfab24 Print memory allocation counters
Summary:
Introduced option to dump malloc statistics using new option flag.
    Added new command line option to db_bench tool to enable this
    funtionality.
    Also extended build to support environments with/without jemalloc.

Test Plan:
1) Build rocksdb using `make` command. Launch the following command
    `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
    --num=10000000` end verified that jemalloc dump is present in LOG file.
    2) Build rocksdb using `DISABLE_JEMALLOC=1  make db_bench -j32` and ran
    the same db_bench tool and found the following message in LOG file:
    "Please compile with jemalloc to enable malloc dump".
    3) Also built rocksdb using `make` command on MacOS to verify behavior
    in non-FB environment.
    Also to debug build configuration change temporary changed
    AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
    tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
    command line. For both 2) and 3) this flag was not present.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57321
2016-04-27 16:23:33 -07:00
sdong
ac0e54b4c6 CompactedDB should not be used if there is outstanding WAL files
Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files.

Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do.

Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57057
2016-04-26 14:22:07 -07:00
Islam AbdelRahman
d719b095dc Introduce PinnedIteratorsManager (Reduce PinData() overhead / Refactor PinData)
Summary:
While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times.
This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation

Test Plan:
make check -j64
COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56493
2016-04-26 12:41:07 -07:00
Islam AbdelRahman
7c14abf2c7 Improve BytewiseComparatorImpl::FindShortestSeparator
Summary:
The current implementation find the first different byte and try to increment it, if it cannot it return the original key
we can improve this by keep going after the first different byte to find the first non 0xFF byte and increment it

After trying this patch on some logdevice sst files I see decrease in there index block size by 8.5%

Test Plan: existing tests and updated test

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56241
2016-04-25 23:02:14 -07:00
Islam AbdelRahman
f3eb0b5b8c Make EventListenerTest.CompactionReasonLevel more deterministic
Summary:
In this test some times automatic compactions do everything and Manual compaction become a no-op.
Update the test to make sure manual compaction is not a no-op

Test Plan: run the test

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57189
2016-04-25 18:18:35 -07:00
sdong
7b78d623f7 Shouldn't report default column family's compaction stats as DB compaction stats
Summary:
Now we collect compaction stats per column family, but report default colum family's stat as compaction stats for DB.
Fix it by reporting compaction stats per column family instead.

Test Plan: Run db_bench with --num_column_families=4 and see the number fixed.

Reviewers: IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57063
2016-04-25 13:56:59 -07:00
Yueh-Hsuan Chiang
24110ce90c Correct Statistics FLUSH_WRITE_BYTES
Summary:
In https://reviews.facebook.net/D56271, we fixed an issue where
we consider flush as compaction.  However, that makes us mistakenly
count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.)

This patch removes the one incremented in db_impl.

Test Plan: db_test

Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57111
2016-04-25 12:01:01 -07:00
dx9
b71c4e613f Alpine Linux Build (#990)
* Musl libc does not provide adaptive mutex. Added feature test for PTHREAD_MUTEX_ADAPTIVE_NP.

* Musl libc does not provide backtrace(3). Added a feature check for backtrace(3).

* Fixed compiler error.

* Musl libc does not implement backtrace(3). Added platform check for libexecinfo.

* Alpine does not appear to support gcc -pg option. By default (gcc has PIE option enabled) it fails with:

gcc: error: -pie and -pg|p|profile are incompatible when linking

When -fno-PIE and -nopie are used it fails with:

/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find gcrt1.o: No such file or directory

Added gcc -pg platform test and output PROFILING_FLAGS accordingly. Replaced pg var in Makefile with PROFILING_FLAGS.

* fix segfault when TEST_IOCTL_FRIENDLY_TMPDIR is undefined and default candidates are not suitable

* use ASSERT_DOUBLE_EQ instead of ASSERT_EQ

* When compiled with ROCKSDB_MALLOC_USABLE_SIZE UniversalCompactionFourPaths and UniversalCompactionSecondPathRatio tests fail due to premature memtable flushes on systems with 16-byte alignment. Arena runs out of block space before GenerateNewFile() completes.

Increased options.write_buffer_size.
2016-04-22 16:49:12 -07:00
Naitik Shah
99a3bf8f62 Merge pull request #1068 from daaku/c-purge-old-backups
rocksdb_backup_engine_purge_old_backups for C libraries
2016-04-22 13:49:59 -07:00
Naitik Shah
c146c9be18 rocksdb_create_mem_env to allow C libraries to create mem env (#1066) 2016-04-22 13:25:05 -07:00
Naitik Shah
6da70c5815 expose more options in the c api (#1067) 2016-04-22 13:24:09 -07:00
Naitik Shah
6f01687aae C rocksdb_create_iterators to expose NewIterators (#1069) 2016-04-22 13:22:21 -07:00
Andrew Kryczka
73a847ef89 Add per-level compression ratio property
Summary:
This is needed so we can measure compression ratio improvements
achieved by D52287.

The property compares raw data size against the total file size for a given
level. If the level is empty it should return 0.0.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56967
2016-04-20 18:46:54 -07:00
Dmitri Smirnov
ee221d2de0 Introduce XPRESS compresssion on Windows. (#1081)
Comparable with Snappy on comp ratio.
  Implemented using Windows API, does not require external package.
  Avaiable since Windows 8 and server 2012.
  Use -DXPRESS=1 with CMake to enable.
2016-04-19 22:54:24 -07:00
Islam AbdelRahman
b95510ddf4 Fix DBTest.RateLimitedDelete flakiness
Summary: We need to enable sync_point processing before creating the SstFileManager to ensure that we are holding the bg delete scheduler thread from running

Test Plan:
run the test
debug using printf

Reviewers: sdong, yhchiang, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56871
2016-04-19 14:05:48 -07:00
Yi Wu
725184b04e Fix db_block_cache_test in lite build
Summary: D56715 move some of the tests from db_test to db_block_cache_test. Some of them should be disabled in lite build.

Test Plan:
    make check -j32
    OPT='-DROCKSDB_LITE' make check -j32

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56907
2016-04-18 11:34:11 -07:00
Yi Wu
290883d94a Fix lite build
Summary: Fix rocksdb lite build after D56715.

Test Plan:
  make -j40 'OPT=-g -DROCKSDB_LITE'

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56895
2016-04-18 10:47:10 -07:00
sdong
23089fd281 write_callback_test: clean test directory before running tests
Summary: write_callback_test fails if previous run didn't finish cleanly. Clean the DB before runing the test.

Test Plan: Run the test that see it doesn't fail any more.

Reviewers: andrewkr, yhchiang, yiwu, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56859
2016-04-18 10:18:41 -07:00
Yi Wu
792762c42c Split db_test.cc
Summary: Split db_test.cc into several files. Moving several helper functions into DBTestBase.

Test Plan: make check

Reviewers: sdong, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, andrewkr, kradhakrishnan, yhchiang, leveldb, sdong

Differential Revision: https://reviews.facebook.net/D56715
2016-04-18 09:42:50 -07:00
Victor Tyutyunov
e5c614e1df Fixing snapshot 0 assertion
Summary:
Solution is not to change db sequence number to start from 1 because 0 value is used in multiple other places.
Fix covers only compact_iterator::findEarliestVisibleSnapshot with updated logic to support snapshot's numbering starting from 0.

Test Plan:
run:
  make all check

it should pass all tests

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: lgalanis, mgalushka, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56601
2016-04-16 01:47:15 -07:00
sdong
6d436a3f85 DBTest.HardLimit made more deterministic
Summary: In DBTest.HardLimit, multiple flushes may merge into one, based on thread scheduling. Avoid it by waiting each flush to finish before generating the next one.

Test Plan: Run test in parallel several times and see it doesn't fail any more.

Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yiwu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56853
2016-04-15 17:36:57 -07:00
sdong
9d35ae649e Make DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels more deterministic
Summary: DBTestUniversalCompaction, IncreaseUniversalCompactionNumLevels fails one in about 30 runs when running in parallel. We wait for compaction after each flush to make the compaction behavior deterministic.

Test Plan: Run the test 1000 times in parallel and it still passes.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56841
2016-04-15 16:16:53 -07:00
sdong
4b6833aec1 Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too.
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.

Test Plan: Try db_bench and see the stats are printed out.

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56769
2016-04-15 10:22:18 -07:00
Dmitri Smirnov
c9d668c584 Fix unit tests issues on Windows (#1078) 2016-04-14 17:33:53 -07:00
sdong
535af525d6 BlockBasedTable::PrefixMayMatch() to skip index checking if we can't find a filter block.
Summary:
In the case where we can't find a filter block, there is not much benefit of doing the binary search and see whether the index key has the prefix. With the change, we blindly return true if we can't get the filter.
It also fixes missing row cases for reverse comparator with full bloom.

Test Plan: Add a test case that used to fail.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: kradhakrishnan, yiwu, hermanlee4, yoshinorim, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56697
2016-04-13 19:06:48 -07:00
Islam AbdelRahman
19ef3de57e Fix ManualCompactionPartial test flakiness
Summary: The reason for this test flakiness is that we try to verify that number of files in L0 is 3 after flushing the 3rd file although we may have a compaction running in the background that may finish before we do the check and the 3 L0 files are converted to 1 L1 file

Test Plan: Run a modified version of the test that sleep before doing the check

Reviewers: sdong, andrewkr, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56643
2016-04-13 10:38:45 -07:00
sdong
dff4c48ede BlockBasedTable::PrefixMayMatch: no need to find data block after full bloom checking
Summary:
Full block checking should be a good enough indication of prefix existance. No need to further check data block.
This also fixes wrong results when using prefix bloom and reverse bitwise comparator.

Test Plan: Will add a unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: hermanlee4, yoshinorim, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56625
2016-04-12 16:25:54 -07:00
Yueh-Hsuan Chiang
13e6c8e97a Relax an assertion in Compaction::ShouldStopBefore
Summary:
In some case, it is possible to have two concesutive SST files might sharing
same boundary keys.  However, in the assertion in Compaction::ShouldStopBefore,
it exclude such possibility.

This patch fix this issue by relaxing the assertion to allow the equal case.

Test Plan: rocksdb tests

Reviewers: IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55875
2016-04-11 20:15:52 -07:00
Yueh-Hsuan Chiang
ae21d71e94 Fixed a bug in RocksDB Statistics where flush is considered as compaction
Summary: Fixed a bug in RocksDB Statistics where flush is considered as compaction

Test Plan: unit test

Reviewers: sdong, IslamAbdelRahman, rven, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56271
2016-04-11 19:59:25 -07:00
Jay Edgar
2448f80375 Make sure that if use_mmap_reads is on use_os_buffer is also on
Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on.  This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs.  Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads

Test Plan: New test: DBTest.MMapAndBufferOptions

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56397
2016-04-08 14:30:15 -07:00
Andrew Kryczka
2391ef7214 Embed column family name in SST file
Summary:
Added the column family name to the properties block. This property
is omitted only if the property is unavailable, such as when RepairDB()
writes SST files.

In a next diff, I will change RepairDB to use this new property for
deciding to which column family an existing SST file belongs. If this
property is missing, it will add it to the "unknown" column family (same
as its existing behavior).

Test Plan:
New unit test:

  $ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55605
2016-04-06 23:10:32 -07:00
Igor Canadi
ab4c62332e Don't use version in the error message
Summary: We use object `v` in the error message, which is not initialized if the edit is column family manipulation. This doesn't provide much useful info, so this diff is removing it. Instead, it dumps actual VersionEdit contents.

Test Plan: compiles. would be great to get tests in version_set_test.cc that cover cases where a file write fails

Reviewers: sdong, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56349
2016-04-06 15:00:15 -07:00
Aaron Gao
cc87075d63 No need to limit to 20 files in UpdateAccumulatedStats() if options.max_open_files=-1
Summary:
There is a hardcoded constraint in our statistics collection that prevents reading properties from more than 20 SST files. This means our statistics will be very inaccurate for databases with > 20 files since additional files are just ignored. The purpose of constraining the number of files used is to bound the I/O performed during statistics collection, since these statistics need to be recomputed every time the database reopened.

However, this constraint doesn't take into account the case where option "max_open_files" is -1. In that case, all the file metadata has already been read, so MaybeInitializeFileMetaData() won't incur any I/O cost. so this diff gets rid of the 20-file constraint in case max_open_files == -1.

Test Plan:
write into unit test db/db_properties_test.cc - "ValidateSampleNumber".
We generate 20 files with 2 rows and 10 files with 1 row.
If max_open_files !=-1, the `rocksdb.estimate-num-keys` should be (10*1 + 10*2)/20 * 30 = 45. Otherwise, it should be the ground truth, 50.
{F1089153}

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56253
2016-04-01 16:19:12 -07:00
Islam AbdelRahman
8a1a603fdb Eliminate std::deque initialization while iterating over merge operands
Summary:
This patch is similar to D52563, When we iterate over a DB with merge operands we keep creating std::queue to store the operands, optimize this by reusing merge_operands_ data member

Before the patch

```
./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq" --db="/dev/shm/bench_merge_memcpy_on_the_fly/" --merge_operator="put" --merge_keys=10000 --num=10000

DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
mergerandom  :       3.757 micros/op 266141 ops/sec;   29.4 MB/s ( updates:10000)
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.413 micros/op 2423538 ops/sec;  268.1 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.451 micros/op 2219071 ops/sec;  245.5 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.420 micros/op 2382039 ops/sec;  263.5 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.408 micros/op 2452017 ops/sec;  271.3 MB/s

DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
mergerandom  :       3.947 micros/op 253376 ops/sec;   28.0 MB/s ( updates:10000)
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.441 micros/op 2266473 ops/sec;  250.7 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.471 micros/op 2122033 ops/sec;  234.8 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.440 micros/op 2271407 ops/sec;  251.3 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.429 micros/op 2331471 ops/sec;  257.9 MB/s
```

with the patch

```
./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq" --db="/dev/shm/bench_merge_memcpy_on_the_fly/" --merge_operator="put" --merge_keys=10000 --num=10000

DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
mergerandom  :       4.080 micros/op 245092 ops/sec;   27.1 MB/s ( updates:10000)
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.308 micros/op 3241843 ops/sec;  358.6 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.312 micros/op 3200408 ops/sec;  354.0 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.332 micros/op 3013962 ops/sec;  333.4 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.300 micros/op 3328017 ops/sec;  368.2 MB/s

DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
mergerandom  :       3.973 micros/op 251705 ops/sec;   27.8 MB/s ( updates:10000)
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.320 micros/op 3123752 ops/sec;  345.6 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.335 micros/op 2986641 ops/sec;  330.4 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.339 micros/op 2950047 ops/sec;  326.4 MB/s
DB path: [/dev/shm/bench_merge_memcpy_on_the_fly/]
readseq      :       0.319 micros/op 3131565 ops/sec;  346.4 MB/s
```

Test Plan: make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56031
2016-04-01 15:48:55 -07:00
Islam AbdelRahman
f38540b12a WriteBatchWithIndex micro optimization
Summary:
  - Put key offset and key size in WriteBatchIndexEntry
  - Use vector for comparators in WriteBatchEntryComparator

I use a slightly modified version of @yoshinorim code to benchmark
https://gist.github.com/IslamAbdelRahman/b120f4fba8d6ff7d58d2

For Put I create a transaction that put a 1000000 keys and measure the time spent without commit.
For GetForUpdate I read the keys that I added in the Put transaction.

Original time:

```
 rm -rf /dev/shm/rocksdb-example/
 ./txn_bench put 1000000
 1000000 OK Ops | took      3.679 seconds
 ./txn_bench get_for_update 1000000
 1000000 OK Ops | took      3.940 seconds
```

New Time

```
  rm -rf /dev/shm/rocksdb-example/
 ./txn_bench put 1000000
 1000000 OK Ops | took      2.727 seconds
 ./txn_bench get_for_update 1000000
 1000000 OK Ops | took      3.880 seconds
```

It looks like there is no significant improvement in GetForUpdate() but we can see ~30% improvement in Put()

Test Plan: unittests

Reviewers: yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D55539
2016-04-01 15:23:46 -07:00
Marton Trencseni
9b51987521 Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.

Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56133
2016-04-01 10:42:39 -07:00
sdong
2feafa3db9 Change some RocksDB default options
Summary: Change some RocksDB default options to make it more friendly to server workloads.

Test Plan: Run all existing tests

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55941
2016-03-31 17:12:18 -07:00
Sandeep Joshi
63e8f1b55b Formatted lines to adhere to 80 char limit 2016-03-31 08:26:55 +05:30
Sandeep Joshi
24420947d9 Replace kHeader by WriteBatchInternal::kHeader in few more places
kHeader was moved from write_batch.cc to header file because
it is being used wherever the number "12" was being used to
check for record size
2016-03-30 23:13:00 +05:30
Sandeep Joshi
3bdbe89614 Merge branch 'magic12' of https://github.com/dcengines/rocksdb into magic12 2016-03-30 23:09:39 +05:30
zensan
78711524b7 In all the places where log records are read, there was a check that
record.size() should not be less than 12.

This "magic number" seems to be the WriteBatch header (8 byte sequence
and 4 byte count).   Replaced all the places where "12" was used
by WriteBatchInternal::kHeader.
2016-03-30 23:05:22 +05:30
Islam AbdelRahman
99ffb3d533 Fix perf_context::merge_operator_time_nanos calculation
Summary: We were not measuring the time spent in merge_operator when called from Version::Get()

Test Plan: added a unittest

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55905
2016-03-25 18:29:43 -07:00
Yueh-Hsuan Chiang
ad2fdaa823 Correct a typo in a comment
Summary: Correct a typo in a comment

Test Plan: No code change.

Reviewers: sdong, kradhakrishnan, IslamAbdelRahman

Reviewed By: kradhakrishnan, IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55803
2016-03-24 19:39:13 -07:00
Yueh-Hsuan Chiang
be9816b3d9 Fix data race issue when sub-compaction is used in CompactionJob
Summary:
When subcompaction is used, all subcompactions share the same Compaction
pointer in CompactionJob while each subcompaction all keeps their mutable
stats in SubcompactionState.  However, there're still some mutable part
that is currently store in the shared Compaction pointer.

This patch makes two changes:

1. Make the shared Compaction pointer const so that it can never be modified
   during the compaction.
2. Move necessary states from Compaction to SubcompactionState.
3. Make functions of Compaction const if the function does not modify
   its internal state.

Test Plan: rocksdb and MyRocks test

Reviewers: sdong, kradhakrishnan, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, yoshinorim, gunnarku, leveldb

Differential Revision: https://reviews.facebook.net/D55923
2016-03-24 19:36:39 -07:00
Praveen Rao
583157f710 Avoid overloaded virtual function 2016-03-22 17:10:31 -07:00
Praveen Rao
136b8e0cad Merge from master 2016-03-22 12:38:44 -07:00
Praveen Rao
2dcbb3b4f3 Addressed review comments 2016-03-22 12:07:15 -07:00
sdong
b1fafcaca6 Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature."
This reverts commit 522de4f59e.

It has bug of index block cleaning up.
2016-03-21 11:50:42 -07:00
agiardullo
fbbb8a6144 Add test for Snapshot 0
Summary:
I ran into this assert when stress testing transactions.  It's pretty easy to repro.

Changing VersionSet::last_sequence_ to start at 1 seems pretty straightforward.  We would just need to change the 4 callers of SetLastSequence(), including recovery code.  I'd make this change myself, but I do not have enough time to test changes to recovery code-paths this week.  But checking in this test case (disabled) for future fixing.

Test Plan: n/a

Reviewers: yhchiang, kradhakrishnan, andrewkr, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55311
2016-03-18 16:16:20 -07:00
Andrew Kryczka
e182f03c1e Add unit tests for RepairDB
Summary:
Basic test cases:

- Manifest is lost or corrupt
- Manifest refers to too many or too few SST files
- SST file is corrupt
- Unflushed data is present when RepairDB is called

Depends on D55065 for its CreateFile() function in file_utils

Test Plan: Ran the tests.

Reviewers: IslamAbdelRahman, yhchiang, yoshinorim, sdong

Reviewed By: sdong

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55485
2016-03-18 15:18:42 -07:00
Praveen Rao
7d371863e5 travis build fixes 2016-03-18 14:43:22 -07:00
Praveen Rao
4f1c74a46e merge from master 2016-03-18 12:48:01 -07:00
Praveen Rao
f8c2189307 Publish log numbers for column family to wal_filter, and provide log number in the record callback 2016-03-18 12:32:15 -07:00
Marton Trencseni
44756260ae Reset block cache in failing unit test.
Test Plan: make -j40 check OPT=-g, on both /tmp and /dev/shm

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55701
2016-03-18 06:13:54 +00:00
Marton Trencseni
522de4f59e Adding pin_l0_filter_and_index_blocks_in_cache feature.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction.

Test Plan:
Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false).

DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32
  Mac: OK.
  Linux: with D55287 patched in it's OK.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54801
2016-03-17 22:40:01 +00:00
dhruba borthakur
33d568611d Merge pull request #1040 from bureau14/master
Fixes warnings and ensure correct int behavior on 32-bit platforms.
2016-03-17 03:23:30 -07:00
Edouard A
02e62ebbc8 Fixes warnings and ensure correct int behavior on 32-bit platforms. 2016-03-16 22:57:57 +01:00
Islam AbdelRahman
3ff98bd209 Fix no compression test
Summary:
DBBlockCacheTest.TestWithCompressedBlockCache is depending on compression using snappy, so this test fail when snappy is not available
block this test when we don't have snappy

https://ci-builds.fb.com/view/rocksdb/job/rocksdb_no_compression/833/console

Test Plan: run the test when compression libraries are not avaliable

Reviewers: sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55413
2016-03-15 12:17:40 -07:00
Dhruba Borthakur
1a2cc27e01 ColumnFamilyOptions SanitizeOptions is buggy on 32-bit platforms.
Summary:
The pre-existing code is trying to clamp between 65,536 and 0,
resulting in clamping to 65,536, resulting in very small buffers,
resulting in ShouldFlushNow() being true quite easily,
resulting in assertion failing and database performance
being "not what it should be".

https://github.com/facebook/rocksdb/issues/1018

Test Plan: make check

Reviewers: sdong, andrewkr, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55455
2016-03-14 16:21:54 -07:00
sdong
b2ae5950ba Index Reader should not be reused after DB restart
Summary:
In block based table reader, wow we put index reader to block cache, which can be retrieved after DB restart. However, index reader may reference internal comparator, which can be destroyed after DB restarts, causing problems.
Fix it by making cache key identical per table reader.

Test Plan: Add a new test which failed with out the commit but now pass.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: maro, yhchiang, kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55287
2016-03-14 10:04:09 -07:00
Islam AbdelRahman
580fede347 Aggregate hot Iterator counters in LocalStatistics (DBIter::Next perf regression)
Summary:
This patch bump the counters in the frequent code path DBIter::Next() / DBIter::Prev() in a local data members and send them to Statistics when the iterator is destroyed
A better solution will be to have thread_local implementation for Statistics

New performance
```
readseq      :       0.035 micros/op 28597881 ops/sec; 3163.7 MB/s
     1,851,568,819      stalled-cycles-frontend   #   31.29% frontend cycles idle    [49.86%]
       884,929,823      stalled-cycles-backend    #   14.95% backend  cycles idle    [50.21%]
readreverse  :       0.071 micros/op 14077393 ops/sec; 1557.3 MB/s
     3,239,575,993      stalled-cycles-frontend   #   27.36% frontend cycles idle    [49.96%]
     1,558,253,983      stalled-cycles-backend    #   13.16% backend  cycles idle    [50.14%]

```

Existing performance

```
readreverse  :       0.174 micros/op 5732342 ops/sec;  634.1 MB/s
    20,570,209,389      stalled-cycles-frontend   #   70.71% frontend cycles idle    [50.01%]
    18,422,816,837      stalled-cycles-backend    #   63.33% backend  cycles idle    [50.04%]

readseq      :       0.119 micros/op 8400537 ops/sec;  929.3 MB/s
    15,634,225,844      stalled-cycles-frontend   #   79.07% frontend cycles idle    [49.96%]
    14,227,427,453      stalled-cycles-backend    #   71.95% backend  cycles idle    [50.09%]
```

Test Plan: unit tests

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55107
2016-03-11 19:01:12 -08:00
Baris Yazici
e8e6cf0173 fix: handle_fatal_signal (sig=6) in std::vector<std::string, std::allocator<std::string> >::_M_range_check | c++/4.8.2/bits/stl_vector.h:794 #174
Summary:
Fix for https://github.com/facebook/mysql-5.6/issues/174

When there is no old files to purge, vector.at(i) function was crashing

if (old_info_log_file_count != 0 &&
      old_info_log_file_count >= db_options_.keep_log_file_num) {
    std::sort(old_info_log_files.begin(), old_info_log_files.end());
    size_t end = old_info_log_file_count - db_options_.keep_log_file_num;
    for (unsigned int i = 0; i <= end; i++) {
      std::string& to_delete = old_info_log_files.at(i);

Added check to old_info_log_file_count be non zero.

Test Plan: run existing tests

Reviewers: gunnarku, vasilep, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, webscalesql-eng, dhruba

Differential Revision: https://reviews.facebook.net/D55245
2016-03-11 11:11:45 -08:00
Andrew Kryczka
d9620239d2 Cleanup stale manifests outside of full purge
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge

Test Plan:
  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation

Reviewers: IslamAbdelRahman, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55269
2016-03-10 18:16:21 -08:00
Yi Wu
f71fc77b7c Cache to have an option to fail Cache::Insert() when full
Summary:
Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error.

I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed.

Test Plan: make check -j32, see tests pass.

Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54705
2016-03-10 17:35:19 -08:00
Yueh-Hsuan Chiang
765597fa78 Update compaction score right after CompactFiles forms a compaction
Summary:
This is a follow-up patch of https://reviews.facebook.net/D54891.
As the information about files being compacted will also be used
when making compaction decision, it is necessary to update the compaction
score when a compaction plan has been made but not yet execute.

This patch adds a missing call to update the compaction score in
CompactFiles().

Test Plan: compact_files_test

Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55227
2016-03-10 14:34:28 -08:00
Yueh-Hsuan Chiang
aa3f02d50c Improve comment in compaction.h and compaction_picker.h
Summary:
ReleaseCompactionFiles must be called when DB mutex is held,
but the documentation is mission.

Test Plan: no code change

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54987
2016-03-08 16:46:41 -08:00
sdong
294bdf9ee2 Change Property name from "rocksdb.current_version_number" to "rocksdb.current-super-version-number"
Summary: I realized I again is wrong about the naming convention. Let me change it to the correct one.

Test Plan: Run unit tests.

Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55041
2016-03-04 18:15:29 -08:00
Yueh-Hsuan Chiang
a7d4eb2f34 Fix a bug where flush does not happen when a manual compaction is running
Summary:
Currently, when rocksdb tries to run manual compaction to refit data into a level,
there's a ReFitLevel() process that requires no bg work is currently running.
When RocksDB plans to ReFitLevel(), it will do the following:

 1. pause scheduling new bg work.
 2. wait until all bg work finished
 3. do the ReFitLevel()
 4. unpause scheduling new bg work.

However, as it pause scheduling new bg work at step one and waiting for all bg work
finished in step 2, RocksDB will stop flushing until all bg work is done (which
could take a long time.)

This patch fix this issue by changing the way ReFitLevel() pause the background work:

1. pause scheduling compaction.
2. wait until all bg work finished.
3. pause scheduling flush
4. do ReFitLevel()
5. unpause both flush and compaction.

The major difference is that.  We only pause scheduling compaction in step 1 and wait
for all bg work finished in step 2.  This prevent flush being blocked for a long time.
Although there's a very rare case that ReFitLevel() might be in starvation in step 2,
but it's less likely the case as flush typically finish very fast.

Test Plan: existing test.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55029
2016-03-04 14:24:52 -08:00
Islam AbdelRahman
dfe96c72c3 Fix WriteLevel0TableForRecovery file delete protection
Summary:
The call to

```
CaptureCurrentFileNumberInPendingOutputs()
```

should be before

```
versions_->NewFileNumber()
```
Right now we are not actually protecting the file from being deleted

Test Plan: make check

Reviewers: sdong, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54645
2016-03-03 18:25:07 -08:00
sdong
ef204df7ef Compaction always needs to be removed from level0_compactions_in_progress_ for universal compaction
Summary: We always put compaction to level0_compactions_in_progress_ for universal compaction, so we should also remove it. The bug causes assert failure when running manual compaction.

Test Plan:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom,compact --subcompactions=16 --compaction_style=1
always fails on my host. After the fix, it doesn't fail any more.

Reviewers: IslamAbdelRahman, andrewkr, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55017
2016-03-02 21:23:28 -08:00
sdong
e79ad9e184 Add Iterator Property rocksdb.iterator.version_number
Summary: We want to provide a way to detect whether an iterator is stale and needs to be recreated. Add a iterator property to return version number.

Test Plan: Add two unit tests for it.

Reviewers: IslamAbdelRahman, yhchiang, anthony, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54921
2016-03-02 16:23:59 -08:00
sdong
19ea40f8b6 Subcompaction boundary keys should not terminate after an empty level
Summary: Now we skip to add boundary keys to subcompaction candidates since we see an empty level. This makes subcompaction almost disabled for universal compaction. We should consider all files instead.

Test Plan: Run existing tests.

Reviewers: IslamAbdelRahman, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55005
2016-03-02 15:45:07 -08:00
sdong
74b660702e Rename iterator property "rocksdb.iterator.is.key.pinned" => "rocksdb.iterator.is-key-pinned"
Summary: Rename iterator property to folow property naming convention.

Test Plan: Run all existing tests.

Reviewers: andrewkr, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54957
2016-03-01 13:47:12 -08:00
Islam AbdelRahman
6743135ea1 Fix DB::AddFile() issue when PurgeObsoleteFiles() is called
Summary:
In some situations the DB will scan all existing files in the DB path and delete the ones that are Obsolete.
If this happen during adding an external sst file. this could cause the file to be deleted while we are adding it.
This diff fix this issue

Test Plan:
unit test to reproduce the bug
existing unit tests

Reviewers: sdong, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54627
2016-03-01 12:05:29 -08:00
sdong
432f3adf2c Add DB Property "rocksdb.current_version_number"
Summary: Add a DB Property "rocksdb.current_version_number" for users to monitor version changes and stale iterators.

Test Plan: Add a unit test.

Reviewers: andrewkr, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54927
2016-03-01 10:55:40 -08:00
sdong
b5b1db167a Recompute compaction score after scheduling manual compaction
Summary: After we made manual compaction runnable concurrently with automaticallly compaction, we need to run ComputeCompactionScore() to prepare a coming compaction picking call before the compaction finishes.

Test Plan: Run existing tests.

Reviewers: yhchiang, IslamAbdelRahman, andrewkr, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54891
2016-02-29 17:17:51 -08:00
sdong
1f5954147b Introduce Iterator::GetProperty() and replace Iterator::IsKeyPinned()
Summary:
Add Iterator::GetProperty(), a way for users to communicate with iterator, and turn Iterator::IsKeyPinned() with it.
As a follow-up, I'll ask a property as the version number attached to the iterator

Test Plan: Rerun existing tests and add a negative test case.

Reviewers: yhchiang, andrewkr, kradhakrishnan, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54783
2016-02-29 14:01:31 -08:00
Andrew Kryczka
69c471bd9b Handle concurrent manifest update and backup creation
Summary:
Fixed two related race conditions in backup creation.

(1) CreateNewBackup() uses DB::DisableFileDeletions() to prevent table files
from being deleted while it is copying; however, the MANIFEST file could still
rotate during this time. The fix is to stop deleting the old manifest in the
rotation logic. It will be deleted safely later when PurgeObsoleteFiles() runs
(can only happen when file deletions are enabled).

(2) CreateNewBackup() did not account for the CURRENT file being mutable.
This is significant because the files returned by GetLiveFiles() contain a
particular manifest filename, but the manifest to which CURRENT refers can
change at any time. This causes problems when CURRENT changes between the call
to GetLiveFiles() and when it's copied to the backup directory. To workaround this, I
manually forge a CURRENT file referring to the manifest filename returned in
GetLiveFiles().

(2) also applies to the checkpointing code, so let me know if this approach is
good and I'll make the same change there.

Test Plan:
new test for roll manifest during backup creation.

running the test before this change:

  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
  ...
  IO error: /tmp/rocksdbtest-9383/backupable_db/MANIFEST-000001: No such file or directory

running the test after this change:

  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
  ...
  [ RUN      ] BackupableDBTest.ChangeManifestDuringBackupCreation
  [       OK ] BackupableDBTest.ChangeManifestDuringBackupCreation (2836 ms)

Reviewers: IslamAbdelRahman, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54711
2016-02-29 12:56:55 -08:00
sdong
8800975fb0 Make DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels more robust
Summary:
Based on thread scheduling, DBTestUniversalCompaction.IncreaseUniversalCompactionNumLevels can fail to flush enough files to trigger expected compactions. Fix it by waiting for flush after inserting each key.
There are failrue reported:

db/db_universal_compaction_test.cc:1134: Failure
Expected: (NumTableFilesAtLevel(options.num_levels - 1, 1)) > (0), actual: 0 vs 0

but I can't repro it. Try to fix the bug and see whether it goes away.

Test Plan: Run the test multiple time.

Reviewers: IslamAbdelRahman, anthony, andrewkr, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54747
2016-02-26 11:59:31 -08:00
sdong
82f15fb15d Add test to make sure DropColumnFamily doesn't impact existing iterators
Summary: Add a test case in ColumnFamilyTest.ReadDroppedColumnFamily to make sure existing iterator is not impacted by column family dropping.

Test Plan: N/A

Reviewers: igor, yhchiang, anthony, andrewkr, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54657
2016-02-24 10:25:38 -08:00
sdong
38201b3599 Fix assert failure when DBImpl::SyncWAL() conflicts with log rolling
Summary: DBImpl::SyncWAL() releases db mutex before calling DBImpl::MarkLogsSynced(), while inside DBImpl::MarkLogsSynced() we assert there is none or one outstanding log file. However, a memtable switch can happen in between and causing two or outstanding logs there, failing the assert. The diff adds a unit test that repros the issue and fix the assert so that the unit test passes.

Test Plan: Run the new tests.

Reviewers: anthony, kolmike, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54621
2016-02-23 11:42:15 -08:00
Andrew Kryczka
b046916656 Redo SyncPoints for flush while rolling test
Summary:
There was a race condition in the test where the rolling thread
acquired the mutex before the flush thread pinned the logger. Rather than add
more complicated synchronization to fix it, I followed Siying's suggestion to
use SyncPoint in the test code.

Comments in the LoadDependency() invocation explain the reason for each of the
sync points.

Test Plan:
Ran test 1000 times for tsan/asan. Will wait for all sandcastle tests
to finish before committing since this is a tricky test.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54615
2016-02-22 21:32:19 -08:00
Mike Kolupaev
eef63ef807 Fixed CompactFiles() spuriously failing or corrupting DB
Summary:
We started getting two kinds of crashes since we started using `DB::CompactFiles()`:
(1) `CompactFiles()` fails saying something like "/data/logdevice/4440/shard12/012302.sst: No such file or directory", and presumably makes DB read-only,
(2) DB fails to open saying "Corruption: Can't access /267000.sst: IO error: /data/logdevice/4440/shard1/267000.sst: No such file or directory".

AFAICT, both can be explained by background thread deleting compaction output as "obsolete" while it's being written, before it's committed to manifest. If it ends up committed to the manifest, we get (2); if compaction notices the disappearance and fails, we get (1). The internal tasks t10068021 and t10134177 have some details about the investigation that led to this.

Test Plan: `make -j check`; the new test fails to reopen the DB without the fix

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, sdong

Differential Revision: https://reviews.facebook.net/D54561
2016-02-22 13:54:58 -08:00
Dmitri Smirnov
d37d348da8 This addresses build issues on Windows
https://github.com/facebook/rocksdb/issues/1002
2016-02-19 12:29:54 -08:00
Andrew Kryczka
d825fc70d4 Use condition variable in log roller test
Summary:
Previously I just slept until the flush_thread was "probably" ready
since proper synchronization in test cases seemed like overkill. But then tsan
complained about it, so I did the synchronization (mostly) properly now.

Test Plan:
  $ COMPILE_WITH_TSAN=1 make -j32 auto_roll_logger_test
  $ ./auto_roll_logger_test

Reviewers: anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54399
2016-02-18 18:03:53 -08:00
Islam AbdelRahman
df9ba6df62 Introduce SstFileManager::SetMaxAllowedSpaceUsage() to cap disk space usage
Summary:
Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded WriteImpl() will fail and return Status::Aborted()

Test Plan: unit testing

Reviewers: yhchiang, anthony, andrewkr, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53763
2016-02-17 15:20:23 -08:00
Andrew Kryczka
3943d16780 Fix race conditions in auto-rolling logger
Summary:
For GetLogFileSize() and Flush(), they previously did not follow the
synchronization pattern for accessing logger_. This meant ResetLogger() could
cause logger_ destruction while the unsynchronized functions were accessing it,
causing a segfault.

Also made the mutex instance variable mutable so we can preserve
GetLogFileSize()'s const-ness.

Test Plan:
new test case, it's quite ugly because both threads need to access
one of the functions with SyncPoints (PosixLogger::Flush()), and also special
handling is needed to prevent the mutex and sync points from conflicting.

Reviewers: kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54237
2016-02-17 12:06:45 -08:00
reid horuff
a7b6f0748a Improve write_with_callback_test to sync WAL
Summary: Currently write_with_callback_test does not test with WAL syncing enabled. This addresses that.

Test Plan: write_with_callback_test

Reviewers: anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D54255
2016-02-16 14:04:14 -08:00
reid horuff
5bcf952a87 Fix WriteImpl empty batch hanging issue
Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty.

Test Plan: DoubleEmptyBatch unit test in transaction_test.

Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54057
2016-02-16 12:21:33 -08:00
Mike Kolupaev
44371501f0 Fixed a segfault when compaction fails
Summary: We've hit it today.

Test Plan: `make -j check`; didn't reproduce the issue

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D54219
2016-02-16 11:11:16 -08:00
Jonathan Wiepert
7bd284c374 Separeate main from bench functionality to allow cusomizations
Summary: Isolate db_bench functionality from main so custom benchmark code can be written and managed

Test Plan:
Tested commands
./build_tools/regression_build_test.sh
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --reads=500 --writes=500
./db_bench --db=/tmp/rocksdbtest-12321/dbbench --stats_interval_seconds=1 --num=1000 --merge_keys=100 --numdistinct=100 --num_column_families=3 --num_hot_column_families=1
./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --usee_uint64_comparator=true --batch-size=5

Test Results - https://phabricator.fb.com/P56130387

Additional tests for:
./db_bench --duration=60 --value_size=50 --seek_nexts=10 --reverse_iterator=true --use_uint64_comparator=true --batch_size=5 --key_size=8 --merge_operator=put
./db_bench --stats_interval_seconds=1 --num=1000 --bloom_locality=1 --seed=5 --threads=5 --merge_operator=uint64add

Results: https://phabricator.fb.com/P56130607

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53991
2016-02-16 06:17:31 -08:00
sdong
92a9ccf1a6 Add a new compaction priority that picks file whose overlapping ratio is smallest
Summary:
Add a new compaction priority as following:
For every file, we calculate total size of files overalapping with the file in the next level, over the file's size itself. The file with smallest ratio will be picked first.
My "db_bench --fillrandom" shows about 5% less compaction than kOldestSmallestSeqFirst if --hard_pending_compaction_bytes_limit value to keep LSM tree in shape. If not limiting hard_pending_compaction_bytes_limit, improvement is only 1% or 2%.

Test Plan: Add a unit test

Reviewers: andrewkr, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54075
2016-02-11 15:59:19 -08:00
Peter Mattis
239aaf2fc0 Use user_comparator when comparing against iterate_upper_bound.
Fixes #983.
2016-02-11 08:47:16 -05:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Yueh-Hsuan Chiang
4a8cbf4e31 Allows Get and MultiGet to read directly from SST files.
Summary:
Add kSstFileTier to ReadTier, which allows Get and MultiGet to
read only directly from SST files and skip mem-tables.

    kSstFileTier = 0x2      // data in SST files.
                          // Note that this ReadTier currently only supports
                          // Get and MultiGet and does not support iterators.

Test Plan: add new test in db_test.

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: igor, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53511
2016-02-09 11:20:22 -08:00
sdong
a76e9093f0 Fix LITE db_test build broken by previous commit
Summary: Previous commit introduces a test that is not supported in LITE. Fix it.

Test Plan: Build the test with ROCKSDB_LITE.

Reviewers: kradhakrishnan, IslamAbdelRahman, anthony, yhchiang, andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53901
2016-02-05 14:29:09 -08:00
sdong
b1887c5dd9 Explictly fail when memtable doesn't support concurrent insert
Summary: If users turn on concurrent insert but the memtable doesn't support it, they might see unexcepted crash. Fix it by explicitly fail.

Test Plan:
Run different setting of stress_test and make sure it fails correctly.
Will add a unit test too.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, andrewkr, ngbronson

Reviewed By: ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53895
2016-02-05 14:15:50 -08:00
reid horuff
6f71d3b68b Improve perf of Pessimistic Transaction expirations (and optimistic transactions)
Summary:
copy from task 8196669:

1) Optimistic transactions do not support batching writes from different threads.
2) Pessimistic transactions do not support batching writes if an expiration time is set.

In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write.  But we could support batching write operations with callbacks if we make sure to process the callbacks correctly.

To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together.  Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer.

Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly.

Reviewers: hermanlee4, anthony, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52863
2016-02-05 10:44:13 -08:00
Islam AbdelRahman
8e6172bc57 Add BlockBasedTableOptions::index_block_restart_interval
Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block

Test Plan: unit tests

Reviewers: yhchiang, anthony, andrewkr, sdong

Reviewed By: sdong

Subscribers: march, dhruba

Differential Revision: https://reviews.facebook.net/D53721
2016-02-05 10:22:37 -08:00
Nathan Bronson
2c1db5ea51 always invalidate sequential-insertion cache for concurrent skiplist adds
Summary:
InlineSkipList::InsertConcurrently should invalidate the
sequential-insertion cache prev_[] for all inserts of multi-level nodes,
not just those that increase the height of the skip list.  The invariant
for prev_ is that prev_[i] (i > 0) is supposed to be the predecessor of
prev_[0] at level i.  Before this diff InsertConcurrently could violate
this constraint when inserting a multi-level node after prev_[i] but
before prev_[0].

This diff also reenables kConcurrentSkipList as db_test's
MultiThreaded/MultiThreadedDBTest.MultiThreaded/29.

Test Plan:
1. unit tests
2. temporarily hack kConcurrentSkipList timing so that it is fast but has a 1.5% failure rate on my dev box (1ms stagger on thread launch, 1s test duration, failure rate baseline over 1000 runs)
3. observe 1000 passes post-fix

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D53751
2016-02-03 11:08:16 -08:00
Andrew Kryczka
284aa613a7 Eliminate duplicated property constants
Summary:
Before this diff, there were duplicated constants to refer to properties (user-
facing API had strings and InternalStats had an enum). I noticed these were
inconsistent in terms of which constants are provided, names of constants, and
documentation of constants. Overall it seemed annoying/error-prone to maintain
these duplicated constants.

So, this diff gets rid of InternalStats's constants and replaces them with a map
keyed on the user-facing constant. The value in that map contains a function
pointer to get the property value, so we don't need to do string matching while
holding db->mutex_. This approach has a side benefit of making many small
handler functions rather than a giant switch-statement.

Test Plan: db_properties_test passes, running "make commit-prereq -j32"

Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53253
2016-02-02 19:14:56 -08:00
Nathan Bronson
5fcd1ba30a disable kConcurrentSkipList multithreaded test
Summary: Disable test that is intermittently failing

Test Plan: unit tests

Reviewers: igor, andrewkr, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53715
2016-02-02 18:24:47 -08:00
Tomas Kolda
a62c519bb6 RollLogFile tries to find non conflicting file until there is no conflict. 2016-02-02 10:33:49 +01:00
Tomas Kolda
57a95a7001 Making use of GetSystemTimePreciseAsFileTime dynamic - code review fixes 2016-02-02 10:23:56 +01:00
Tomas Kolda
502d41f150 Making use of GetSystemTimePreciseAsFileTime dynamic to not
break compatibility with Windows 7. The issue with rotated logs
was fixed other way.
2016-02-02 10:23:56 +01:00
Nathan Bronson
9c2cf9479b Fix for --allow_concurrent_memtable_write with batching
Summary:
Concurrent memtable adds were incorrectly computing
the last sequence number for a write batch group when the
write batches were not solitary.  This is the cause of
https://github.com/facebook/mysql-5.6/issues/155

Test Plan:
1. unit tests
2. new unit test
3. parallel db_bench stress tests with batch size of 10 and asserts enabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D53595
2016-02-01 20:41:57 -08:00
Siying Dong
7b943da1b2 Merge pull request #967 from SherlockNoMad/ValueSize
Add histogram for value size per operation
2016-02-01 17:59:08 -08:00
Islam AbdelRahman
1ad8182950 Fix WriteBatchTest.ManyUpdates, WriteBatchTest.LargeKeyValue under clang
Summary:
Fix current clang failure
https://ci-builds.fb.com/view/rocksdb/job/rocksdb_clang_build/1398/console

Test Plan:
make sure that both clang and g++ compilation succeed

USE_CLANG=1 make check -j64
make check -j64

Reviewers: anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53667
2016-02-01 16:07:53 -08:00
sdong
ad7ecca72d Add unit tests to verify large key/value
Summary:
Add unit tests:
(1) insert entries of 8MB key and 3GB value to DB
(2) insert entry of 3GB key and 3GB value into write batch and make sure we can read it.
(3) insert 3 billions of key-value pairs into write batch and make sure we can read it.
Disable them because not all platform can run it.

Test Plan: Run the tests

Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, andrewkr, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53619
2016-02-01 15:09:50 -08:00
Andrew Kryczka
fdd70d1495 Skip filters for last L0 file if hit-optimized
Summary:
Following up on D53493, we can still enable the filter-skipping
optimization for last file in L0. It's correct to assume the key will be present
in the last L0 file when we're hit-optimized and L0 is deepest.

The FilePicker encapsulates the state for traversing each level's files, so I
needed to make it expose whether the returned file is last in its level.

Test Plan:
verified below test fails before this patch and passes afterwards.
The change to how the test memtable is populated is needed so file 1 has keys
(0, 30, 60), file 2 has keys (10, 40, 70), etc.

  $ ./db_universal_compaction_test --gtest_filter=UniversalCompactionNumLevels/DBTestUniversalCompaction.OptimizeFiltersForHits/*

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53583
2016-02-01 14:58:46 -08:00
Dmytro Ivchenko
aa5e3b7c04 PerfContext::ToString() add option to exclude zero counters
Test Plan: Added unit test to check w/ w/o zeros scenarios

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: sdong, dhruba

Differential Revision: https://reviews.facebook.net/D52809
2016-02-01 13:41:13 -08:00
Yueh-Hsuan Chiang
1d854fa3d4 Fixed the asan error on column_family_test
Summary:
Fixed the asan error on column_family_test caused by not disabling
SyncPoint.

Test Plan: column_family_test

Reviewers: anthony, rven, kradhakrishnan, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53505
2016-02-01 12:45:45 -08:00
SherlockNoMad
37159a6448 Add histogram for value size per operation 2016-01-31 18:09:24 -08:00
Venkatesh Radhakrishnan
3b2a1ddd2e Add options.base_background_compactions as a number of compaction threads for low compaction debt
Summary:
If options.base_background_compactions is given, we try to schedule number of compactions not existing this number, only when L0 files increase to certain number, or pending compaction bytes more than certain threshold, we schedule compactions based on options.max_background_compactions.

The watermarks are calculated based on slowdown thresholds.

Test Plan:
Add new test cases in column_family_test.
Adding more unit tests.

Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, rven, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D53409
2016-01-29 16:15:53 -08:00
sdong
6ee38bb15c Slowdown of writing to the last memtable should not override stopping
Summary: Now slowing down for the last mem table takes priority against some stopping conditions. This is logically confusing. Fix it.

Test Plan: Run all existing tests.

Reviewers: yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53529
2016-01-28 21:52:29 -08:00
Islam AbdelRahman
d6c838f1e1 Add SstFileManager (component tracking all SST file in DBs and control the deletion rate)
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size

Test Plan: unit tests

Reviewers: rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, lovro, march, dhruba

Differential Revision: https://reviews.facebook.net/D50469
2016-01-28 18:35:01 -08:00
sdong
4b50f13540 Should not skip bloom filter for L0 during the query.
Summary: It's a regression bug caused by e089db40f9. With the change, if options.optimize_filters_for_hits=true and there are only L0 files (like single level universal compaction), we skip all the files in L0, which is more than necessary. Fix it by always trying to query bloom filter for files in level 0.

Test Plan: Add a unit test for it.

Reviewers: anthony, rven, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53493
2016-01-27 16:16:39 -08:00
sdong
d20915d52a Disable stats about mutex duration by default
Summary: Measuring mutex duration will measure time inside DB mutex, which breaks our best practice. Add a stat level in Statistics class. By default, disable to measure the mutex operations.

Test Plan: Add a unit test to make sure it is off by default.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53367
2016-01-26 14:56:55 -08:00
Islam AbdelRahman
0c433cd1eb Fix issue in Iterator::Seek when using Block based filter block with prefix_extractor
Summary: Similar to D53385 we need to check InDomain before checking the filter block.

Test Plan: unit tests

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53421
2016-01-26 14:47:42 -08:00
Andrew Kryczka
035857a312 Fix RocksDB lite build
Summary:
NewMemEnv() is defined in rocksdb lite but just returns nullptr --
would it be better to just not define it so we can catch issues like this at
compile-time?

Test Plan:
  $ make clean && OPT="-DTRAVIS -DROCKSDB_LITE" V=1 make -j32 db_test
  $ ./db_test --gtest_filter='DBTest.MemEnvTest'
  ...
  [  PASSED  ] 0 tests.

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53427
2016-01-26 13:15:36 -08:00
Yueh-Hsuan Chiang
955ecf8b49 Fix an ASAN error in compact_files_test
Summary:
compact_files_test enables SyncPoint but never disable it before
the test terminates.  As a result, it might cause heap-use-after-free
error when some code path trying to access the static variable of
SyncPoint when it has already gone out of scope after the main thread
dies.

Test Plan:
COMPILE_WITH_ASAN=1 make compact_files_test -j32
./compact_files_test

Reviewers: sdong, anthony, kradhakrishnan, rven, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53379
2016-01-26 11:30:30 -08:00
Islam AbdelRahman
b0afcdeeac Fix bug in block based tables with full filter block and prefix_extractor
Summary:
Right now when we are creating a BlockBasedTable with fill filter block
we add to the filter all the prefixes that are InDomain() based on the prefix_extractor

the problem is that when we read a key from the file, we check the filter block for the prefix whether or not it's InDomain()

Test Plan: unit tests

Reviewers: yhchiang, rven, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53385
2016-01-26 11:07:08 -08:00
Andrew Kryczka
167bd8856d [directory includes cleanup] Finish removing util->db dependencies 2016-01-26 10:49:24 -08:00
Andrew Kryczka
acd7d58695 [directory includes cleanup] Remove util->db dependency for ThreadStatusUtil
Summary:
We can avoid the dependency by forward-declaring ColumnFamilyData and
then treating it as a black box. That means callers of ThreadStatusUtil need to
explicitly provide more options, even if they can be derived from the
ColumnFamilyData, since ThreadStatusUtil doesn't include the definition.

This is part of a series of diffs to eliminate circular dependencies between
directories (e.g., db/* files depending on util/* files and vice-versa).

Test Plan:
  $ ./db_test --gtest_filter=DBTest.GetThreadStatus
  $ make -j32 commit-prereq

Reviewers: sdong, yhchiang, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53361
2016-01-26 10:49:16 -08:00
Andrew Kryczka
46f9cd46af [directory includes cleanup] Move cross-function test points
Summary:
I split the db-specific test points out into a separate file under db/
directory. There were also a few bugs to fix in xfunc.{h,cc} that prevented it
from compiling previously; see https://reviews.facebook.net/D36825.

Test Plan:
compilation works now, below command works, will also run "make xfunc".

  $ make check ROCKSDB_XFUNC_TEST='managed_new' tests-regexp='DBTest' -j32

Reviewers: sdong, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D53343
2016-01-26 10:49:05 -08:00
Venkatesh Radhakrishnan
b7ecf3d214 Fix intermittent hang in ColumnFamilyTest.FlushAndDropRaceCondition
Summary:
ColumnFamilyTest.FlushAndDropRaceCondition sometimes
hangs because the sync point, "FlushJob::InstallResults", sleeps
holding the DB mutex. Fixing it by releasing the mutex before sleeping.

Test Plan:
seq 1000 |parallel --gnu --eta 't=/dev/shm/rdb-{}; rm -rf $t;
mkdir $t && export TEST_TMPDIR=$t; ./column_family_test
-gtest_filter=*FlushAndDropRaceCondition* > $t/log-{}'

Reviewers: IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53349
2016-01-26 09:12:20 -08:00
sdong
da33dfe188 Parameterize DBTest.Randomized
Summary: Break down DBTest.Randomized to multiple gtest tests based on config type

Test Plan: Run the test and all tests. Make sure configurations are correctly set

Reviewers: yhchiang, IslamAbdelRahman, rven, kradhakrishnan, andrewkr, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53247
2016-01-25 14:49:08 -08:00
sdong
fb9811ee9b Add a perf context level that doesn't measure time for mutex operations
Summary: Timing mutex operations can impact scalability of the system. Add a new perf context level that can measure time counters except for mutex.

Test Plan: Add a new unit test case to make sure it is not set.

Reviewers: IslamAbdelRahman, rven, kradhakrishnan, yhchiang, anthony

Reviewed By: anthony

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53199
2016-01-25 14:45:03 -08:00
Andrew Kryczka
3e9209a078 Updated GetProperty documentation
Summary: As titled. Also added the kBaseLevel string, which was missing earlier.

Test Plan: built

Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53271
2016-01-25 14:04:55 -08:00
Islam AbdelRahman
2fbc59a348 Disallow SstFileWriter from creating empty sst files
Summary:
SstFileWriter may create an sst file with no entries
Right now this will fail when being ingested using DB::AddFile() saying that the keys are corrupted

Test Plan: make check

Reviewers: yhchiang, rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52815
2016-01-25 13:47:07 -08:00
agiardullo
8019aa9b55 improve test for manifest write failure
Summary: Improve testing per discussion in D52989

Test Plan: ran test

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53211
2016-01-22 11:47:59 -08:00
agiardullo
bcd4ccbc33 Revert D7809
Summary: Revert the functionaility of D7809 (but I'm keeping the logging and test code).  We decided it was dangerous to ignore sync failures based on attempting to read the data written.  The read does not tell us whether the data was synced.

Test Plan: There was no test for the particular functionaility that was reverted.  Keeping the test code from D7809 that tests whether we set the DB to be readonly when paranoid checks are enabled.

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52989
2016-01-21 17:14:43 -08:00
Andrew Kryczka
bb2888738c Cleanup property-related variable names
Summary:
I noticed these names were quite confusing while updating GetProperty
documentation.

Test Plan: running "make commit-prereq -j32"

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53073
2016-01-21 11:38:15 -08:00
Andrew Kryczka
29289333d0 Add named constants for remaining properties
Summary:
There were just these two properties that didn't have any named
constant.

Test Plan:
build and below test

  $ ./db_properties_test --gtest_filter=DBPropertiesTest.NumImmutableMemTable

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53103
2016-01-21 10:59:36 -08:00
Andrew Kryczka
eceb5cb1b7 Split db_test.cc (part 1: properties)
Summary:
Moved all the tests that verify property correctness into a separate
file. The goal is to reduce compile time and complexity of db_test. I didn't
add parallelism for db_properties_test, even though these tests were
parallelized in db_test, since the file is small enough that it won't matter.

Some of these moves may be controversial since it's hard to say whether the
test is "verifying property correctness," or "using properties to verify
rocksdb's correctness." I'm interested in any opinions.

Test Plan: ran db_properties_test, also waiting on "make commit-prereq -j32"

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52995
2016-01-20 15:17:52 -08:00
sdong
94918ae84b db_bench: explicitly clear buffer in compress benchmark
Summary: It is reported that in compress benchmark in db_bench, zlib will cause an OOM. The suggestd fix was to clear the buffer.

Test Plan: Build and run compress benchmark.

Reviewers: IslamAbdelRahman, yhchiang, rven, andrewkr, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52857
2016-01-19 18:11:46 -08:00
Siying Dong
39c3e94ff4 Merge pull request #954 from davidbernard/solaris_build
Solaris Build
2016-01-19 18:05:02 -08:00
Mike Kolupaev
34704d5c7b [easy] Fixed a crash in LogAndApply() when CF creation failed
Summary: That line used to dereference `column_family_data`, which is nullptr if we're creating a column family.

Test Plan: `make -j check`

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52881
2016-01-19 11:46:52 -08:00
David Bernard
3f12e16f27 Make alloca.h optional 2016-01-19 06:17:31 +00:00
David Bernard
d78c6b28c4 Changes for build on solaris
Makefile adjust paths for solaris build
Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available
db_compaction_test.cc Initialise a variable to avoid a compilation error
db_impl.cc Include <alloca.h>
db_test.cc Include <alloca.h>
Environment.java recognise solaris envrionment
options_bulder.cc Make log unambiguous
geodb_impl.cc Make log and floor unambiguous
2016-01-19 04:45:21 +00:00
Dmitri Smirnov
20d7902df9 Fix compile error.
Use constructor style initialization instead of a cast for
  simplicity.
2016-01-11 16:10:48 -08:00
sdong
235b162be1 Not scheduling more L1->L2 compaction if L0->L1 is pending with higher priority
Summary: When L0->L1 is pending, there may be one L1->L2 compaction going on which prevents the L0->L1 compaction from happening. If L1 needs more data to be moved to L2, then we may continue scheduling more L1->L2 compactions. The end result may be that L0->L1 compaction will not happen until L1 size drops to below target size. We can reduce the stalling because of number of L0 files by stopping schedling new L1->L2 compaction when L0's score is higher than L1.

Test Plan: Run all existing tests.

Reviewers: yhchiang, MarkCallaghan, rven, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52401
2016-01-08 13:56:57 -08:00
sdong
9a8e3f73ed plain table reader: non-mmap mode to keep two recent buffers
Summary: In plain table reader's non-mmap mode, we only keep the most recent read buffer. However, for binary search, it is likely we come back to a location to read. To avoid one pread in such a case, we keep two read buffers. It should cover most of the cases.

Test Plan:
1. run tests
2. check the optimization works through strace when running
./table_reader_bench -mmap_read=false --num_keys2=1 -num_keys1=5000 -table_factory=plain_table --iterator --through_db

Reviewers: anthony, rven, kradhakrishnan, igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51171
2016-01-08 10:53:57 -08:00
Venkatesh Radhakrishnan
7ece10ecb6 DeleteFilesInRange: Mark files to be deleted as being compacted before applying change
Summary:
While running the myrocks regression suite, I found that while
dropping a table soon after inserting rows into it resulted in an
assertion failure in CheckConsistencyForDeletes for not finding
a file which was recently added or moved. Marking the files to be
deleted as being compacted before calling LogAndApplyChange
fixed the assertion failures.

Test Plan: DBCompactionTest.DeleteFileRange

Reviewers: IslamAbdelRahman, anthony, yhchiang, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, yoshinorim, leveldb

Differential Revision: https://reviews.facebook.net/D52599
2016-01-07 14:48:45 -08:00
Islam AbdelRahman
8c71eb5afc Optimize DBIter::Prev() by reducing stack overhead
Summary:
It looks like we are spending significant amount of time creating std::deque<std::string> every time we do Iterator::Prev()

{F921567}

By using merge_operands_ as a DBIter data member w create it once and reduce this overhead and see ~30% performance improvement when using Iterator::Prev() on hot data

Orignal performance

```
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions
readreverse  :       0.713 micros/op 1402219 ops/sec;  155.1 MB/s
readreverse  :       0.609 micros/op 1641386 ops/sec;  181.6 MB/s
readreverse  :       0.684 micros/op 1461150 ops/sec;  161.6 MB/s
readreverse  :       0.629 micros/op 1589842 ops/sec;  175.9 MB/s
readreverse  :       0.647 micros/op 1544530 ops/sec;  170.9 MB/s
```

After optimization

```
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions
readreverse  :       0.488 micros/op 2051189 ops/sec;  226.9 MB/s
readreverse  :       0.505 micros/op 1980892 ops/sec;  219.1 MB/s
readreverse  :       0.541 micros/op 1846971 ops/sec;  204.3 MB/s
readreverse  :       0.497 micros/op 2013612 ops/sec;  222.8 MB/s
readreverse  :       0.480 micros/op 2082665 ops/sec;  230.4 MB/s
```

Test Plan: make check -j64

Reviewers: sdong, anthony, rven, igor, yhchiang

Reviewed By: yhchiang

Subscribers: jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D52563
2016-01-07 07:59:14 -08:00
Yueh-Hsuan Chiang
6935eb24e0 Add ColumnFamilyHandle::GetDescriptor()
Summary:
This patch addes ColumnFamilyHandle::GetDescriptor(), which allows
developers to obtain the CF options and names of the associated column
family given its handle.

  // Returns the up-to-date descriptor used by the current handle.  Since it
  // returns the up-to-date information, this call might internally locks
  // and releases DB mutex to access the up-to-date CF options.
  virtual ColumnFamilyDescriptor GetDescriptor() = 0;

Test Plan: augment column_family_test

Reviewers: sdong, yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51543
2016-01-06 18:14:01 -08:00
Reid Horuff
da032495d3 Optimize GetLatestSequenceForKey
Summary: DBImpl::GetLatestSequenceForKey() can do memcpy's to load a value that will never be used.  This can be optimized by changing all the Get() functions called to optionally not fetch the value (and only fetch the sequencenumber).

Test Plan: optimistic_transaction_test and transaction_test

Reviewers: anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D52227
2016-01-06 13:43:22 -08:00
agiardullo
51adc5457a fix sporadic failure in fault_injection_test
Summary: Need to make sure the background task gets scheduled before it goes out of scope.

Test Plan: ran test.  Will see if sporadic valgrind failures go away.

Reviewers: kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52575
2016-01-06 11:51:53 -08:00
agiardullo
a2422f0533 fix potential test SleepingTask race condition
Summary: Make sure SleepingTask has bene run before it goes out of scope.

Test Plan: run test

Reviewers: kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52581
2016-01-06 11:51:09 -08:00
Mark Callaghan
4041903ecd Enhance db_bench write rate limit
Summary:
1) changes tools/{benchmark,run_flash_bench}.sh to optionally use the write rate limit
2) removes code for --writes_per_second and switches the 'background' write rate limit
to use --benchmark_write_rate_limit

Replaces https://reviews.facebook.net/D49113

Task ID: #9555881

Blame Rev:

Test Plan:
tools/run_flash_bench.sh

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52485
2016-01-04 12:01:27 -08:00
Venkatesh Radhakrishnan
d74c9f0a57 DeleteFilesInRange: Clean job context if no files deleted
Summary:
We need to clean the job context if we end up not deleting any
files because no files are in the range specified.

Test Plan: DBCompactionTest.DeleteFileRange

Reviewers: sdong, anthony, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52467
2016-01-04 10:55:31 -08:00
Igor Canadi
1dec5b8f5d Merge pull request #916 from warrenfalk/capi_huge_page_option
Expose the option, memtable_prefix_bloom_huge_page_tlb_size, via C API
2016-01-04 09:24:30 -08:00
Warren Falk
0fde291abe expose memtable_prefix_bloom_huge_page_tlb_size option to C API 2015-12-31 09:27:03 -05:00
Warren Falk
7e81dba5cf Support creation of "full" format bloom filter from C API 2015-12-31 09:26:49 -05:00
Nathan Bronson
ac16663bd6 use -Werror=missing-field-initializers, to closer match MyRocks build
Summary:
myrocks seems to build rocksdb using
-Wmissing-field-initializers (and treats warnings as errors).  This diff
adds that flag to the rocksdb build, and fixes the compilation failures
that result.  I have not checked for any other differences in the build
flags for rocksdb build as part of myrocks.

Test Plan: make check

Reviewers: sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52443
2015-12-30 14:56:18 -08:00
Venkatesh Radhakrishnan
7238be090e Fix clang build in db_compaction_test
Summary:
Fix CLANG build error caused by type mismatch. Changed type to
size_t.

Test Plan: Clang build and make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52413
2015-12-29 16:44:12 -08:00
Venkatesh Radhakrishnan
63ddb783db Delete files in given key range
Summary:
This is an initial diff for providing the ability to delete
files which are completely within a given range of keys.

Test Plan: DBCompactionTest.DeleteRange

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52293
2015-12-29 13:22:13 -08:00
sdong
d8677a8d2c Upgrade internal CLANG version for FB-internal gcc 4.8.1
Summary: After removing two move operations, we can make CLANG 3.7 build pass under GCC 4.8.1.

Test Plan: USE_CLANG=1 ROCKSDB_FBCODE_BUILD_WITH_481=1 make all -j32

Reviewers: yhchiang, IslamAbdelRahman, rven, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52365
2015-12-29 10:33:23 -08:00
Siying Dong
22c0ed8a5f Disable Visual Studio Warning C4351
Currently Windows build is broken because of Warning C4351. Disable the warning before figuring out the right way to fix it.
2015-12-28 15:06:34 -08:00
sdong
fcafac053f Fix memory leak in ColumnFamilyTest.WriteStall*
Summary: ColumnFamilyTest.WriteStallSingleColumnFamily and ColumnFamilyTest.WriteStallTwoColumnFamilies didn't clean up test state cleanly, causing memory leak. Fix it.

Test Plan: Run the two tests in valgrind and make sure they now pass.

Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52347
2015-12-28 12:30:21 -08:00
sdong
11672df19a Fix CLANG errors introduced by 7d87f02799
Summary: Fix some CLANG errors introduced in 7d87f02799

Test Plan: Build with both of CLANG and gcc

Reviewers: rven, yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52329
2015-12-28 10:00:58 -08:00
Nathan Bronson
7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
sdong
5b2587b5cb DBTest.HardLimit use special memtable
Summary: DBTest.HardLimit fails in appveyor build. Use special mem table to make the test behavior depends less on platform

Test Plan: Run the test with JEMALLOC both on and off.

Reviewers: yhchiang, kradhakrishnan, rven, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52317
2015-12-25 10:25:34 -08:00
Siying Dong
298ba27ae2 Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata
Enable MS compiler warning c4244.
2015-12-23 22:59:42 -08:00
Siying Dong
7810aa802a Merge pull request #899 from zhipeng-jia/fix_clang_warning
Fix clang warnings
2015-12-23 22:58:52 -08:00
Siying Dong
4c5560d70a Merge pull request #895 from zhipeng-jia/develop
Fix computation of size of last sub-compaction
2015-12-23 22:45:03 -08:00
sdong
d43da8ae0d DBTest.DelayedWriteRate: fix assert of sign and unsign comparison
Summary: DBTest.DelayedWriteRate has sign and unsign comparisons that break Windows build. Fix it.

Test Plan: Build and run the test modified.

Reviewers: IslamAbdelRahman, rven, anthony, yhchiang, kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52311
2015-12-23 22:38:12 -08:00
sdong
15b8902264 Change default options.delayed_write_rate
Summary: We now have a mechanism to further slowdown writes. Double default options.delayed_write_rate to try to keep the default behavior closer to it used to be.

Test Plan: Run all tests.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52281
2015-12-23 14:51:55 -08:00
Zhipeng Jia
73b175a773 Fix clang warnings regarding unnecessary std::move 2015-12-24 04:10:00 +08:00
sdong
b9f77ba12b When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.

Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000

and make sure without the commit, write stop will happen, but with the commit, it will not happen.

Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52131
2015-12-23 11:33:15 -08:00
Andrew Kryczka
445d5b8c5c Fix clang build
Summary:
Missed this in https://reviews.facebook.net/D51633 because I didn't
wait for 'make commit-prereq' to finish

Test Plan: make clean && USE_CLANG=1 make -j32 all

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52275
2015-12-23 10:49:42 -08:00
Andrew Kryczka
e089db40f9 Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.

- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr

Test Plan:
updated unit test:

  $ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits

will also run 'make check'

Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D51633
2015-12-23 10:15:07 -08:00
Zhipeng Jia
aa515823bc Fix clang warning 2015-12-23 19:23:58 +08:00
Islam AbdelRahman
d005c66faf Report compaction reason in CompactionListener
Summary:
Add CompactionReason to CompactionJobInfo
This will allow users to understand why compaction started which will help options tuning

Test Plan:
added new tests
make check -j64

Reviewers: yhchiang, anthony, kradhakrishnan, sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51975
2015-12-22 11:37:19 -08:00
Zhipeng Jia
728f944f0d Fix computation of size of last sub-compaction 2015-12-22 18:37:51 +08:00
Zhipeng Jia
e0abec1580 Sorting std::vector instead of using std::set 2015-12-22 14:34:57 +08:00
Alex Yang
33e09c0e19 add call to install superversion and schedule work in enableautocompactions
Summary:
This patch fixes https://github.com/facebook/mysql-5.6/issues/121

There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the Column family is stalled because L0 is full, and no writes can go through, it is possible the column family may never have a new compaction request get scheduled. EnableAutoCompaction() should probably schedule an new flush and compaction event when it resets disable_auto_compaction.

Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush,
SchedulePendingCompaction, as well as MaybeScheduleFlushOrcompaction on all the
column families to avoid the situation above.

This is still a first pass for feedback.
Could also just call SchedePendingFlush and SchedulePendingCompaction directly.

Test Plan:
Run on Asan build
cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000

Ensure that it no longer hangs during the test.

Reviewers: hermanlee4, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D51747
2015-12-21 10:06:49 -08:00
Zhipeng Jia
24c7dae130 Fix clang warning regarding implicit conversion 2015-12-21 23:57:55 +08:00
Reid Horuff
97ea8afaaf compaction assertion triggering test fix for sequence zeroing assertion trip 2015-12-18 16:08:31 -08:00
Nathan Bronson
a48382399d Fix use-after free in db_bench
Test Plan: valgrind db_bench

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D52101
2015-12-18 06:42:57 -08:00
Zhipeng Jia
131f7ddf63 fix typo: sr to picking_sr 2015-12-18 17:02:36 +08:00
sdong
c37729a6a6 db_bench: --soft_pending_compaction_bytes_limit should set options.soft_pending_compaction_bytes_limit
Summary: Fix a bug that options.soft_pending_compaction_bytes_limit is not actually set with --soft_pending_compaction_bytes_limit

Test Plan: Run db_bench with this parameter and make sure the parameter is set correctly.

Reviewers: anthony, kradhakrishnan, yhchiang, IslamAbdelRahman, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52125
2015-12-17 18:28:56 -08:00
Venkatesh Radhakrishnan
7b12ae97d4 Add signalall after removing item from manual_compaction deque
Summary:
When there are waiting manual compactions, we need to signal
them after removing the current manual compaction from the deque.

Test Plan: ColumnFamilytTest.SameCFManualManualCommaction

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D52119
2015-12-17 16:59:00 -08:00
sdong
d72b31774e Slowdown when writing to the last write buffer
Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.

Test Plan:
1. Add a new unit test.
2. Run db_bench with

./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics

based on hard drive and see stopping is avoided with the commit.

Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52047
2015-12-17 10:49:08 -08:00
Islam AbdelRahman
aececc209e Introduce ReadOptions::pin_data (support zero copy for keys)
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted

ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted

Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.

Benchmark results (using https://phabricator.fb.com/P20083553)

```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G    /home/tec/local/normal.4K.Snappy/db10077

// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G    /home/tec/local/zero.8K.LZ4/db10077

// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
//      --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
//      --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"

// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                                 1.73s  576.97m
// BM_StringPiece                                   103.74%      1.67s  598.55m
// ============================================================================
// Match rate : 1000000 / 1000000

// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp          relative  time/iter  iters/s
// ============================================================================
// BM_StringCopy                                              611.99ms     1.63
// BM_StringPiece                                   203.76%   300.35ms     3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```

Test Plan: Unit tests

Reviewers: sdong, igor, anthony, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, lovro, adsharma

Differential Revision: https://reviews.facebook.net/D48999
2015-12-16 12:08:30 -08:00
Gunnar Kudrjavets
97265f5f14 Fix minor bugs in delete operator, snprintf, and size_t usage
Summary:
List of changes:

1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size.

2) Remove unnecessary checks before calling delete operator.

3) Increase code correctness by using size_t type when getting vector's size.

4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage.

5) Fix various lint errors pointed out by 'arc lint'.

Test Plan:
Code review and build:

git diff
make clean
make -j 32 commit-prereq
arc lint

Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51849
2015-12-15 15:26:20 -08:00
Zhipeng Jia
99ae549d37 Fix typo 2015-12-15 23:47:47 +08:00
Islam AbdelRahman
636cd3c714 Clean up listener_test (reuse db_test_util)
Summary: Reuse db_test_util in listener_test

Test Plan:
make listener_test -j64 && ./listener_test
USE_CLANG=1 make listener_test -j64 && ./listener_test

Reviewers: yhchiang, rven, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51939
2015-12-14 13:36:32 -08:00
Venkatesh Radhakrishnan
030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
Dmitri Smirnov
aca403d2b5 Fix another rebase problems. 2015-12-11 17:33:40 -08:00
Dmitri Smirnov
a6fbdd64e0 Fix rebase issues and new code warnings. 2015-12-11 16:56:24 -08:00
Dmitri Smirnov
3fa68af316 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:52:41 -08:00
Dmitri Smirnov
236fe21c92 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:47:34 -08:00
agiardullo
3bfd3d39a3 Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.

With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).

Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.

Test Plan: unit tests, db bench

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb, yoshinorim

Differential Revision: https://reviews.facebook.net/D50475
2015-12-11 12:34:11 -08:00
Yueh-Hsuan Chiang
f0a8e5a2d8 Fixed the valgrind error in ColumnFamilyTest::CreateAndDropRace
Summary: Fixed the valgrind error in ColumnFamilyTest::CreateAndDropRace

Test Plan: valgrind --error-exitcode=2 --leak-check=full ./column_family_test

Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51795
2015-12-10 11:53:53 -08:00
agiardullo
9e44629061 Change SingleDelete to support conflict checking
Summary: For Transactions, we want to start using the SST files to do write conflict checking.  To do this, we need to make sure that compaction never removes all writes if an earlier snapshot exists.  So I had to change the way we process SingleDeletes to sometimes leave a SingleDelete behind when we encounter a Put followed by a SingleDelete.  See the comments in this diff for a more detailed explanation.

Test Plan: added more unit tests

Reviewers: rven, igor, kradhakrishnan, IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50295
2015-12-10 11:35:38 -08:00
charsyam
c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
sdong
56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong
d6e1035a1f A new compaction picking priority that optimizes for write amplification for random updates.
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.

Test Plan: Add a unit test and run it in valgrind too.

Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51459
2015-12-09 18:13:03 -08:00
Yueh-Hsuan Chiang
0991cee6cd Merge pull request #815 from SherlockNoMad/CounterFix
Fix EstimateNumKeys Counter Inaccurate Issue
2015-12-09 14:10:49 -08:00
sdong
ac8e56f050 db_bench: in uncompress benchmark, get Snappy size from compressed stream
Summary: Now in benchmark "uncompress" in db_bench, we get size from compressed stream for all other compression types except Snappy, where we allocate memory based on parameter. Change it to match to behavior of other compression types.

Test Plan: Run ./db_bench --benchmarks=uncompress with snappy and other compression types.

Reviewers: yhchiang, kradhakrishnan, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51681
2015-12-08 18:11:58 -08:00
Siying Dong
fa3dbf203f Merge pull request #853 from Vaisman/enable_C4267_warning
Enable C4267 warning
2015-12-08 17:59:24 -08:00
Siying Dong
ad6aaf4fab Merge pull request #848 from SherlockNoMad/db_bench
Split histogram per OperationType in db_bench
2015-12-08 17:58:40 -08:00
Yueh-Hsuan Chiang
774b80e99e Resubmit the fix for a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51717
2015-12-08 17:01:02 -08:00
agiardullo
e5c5f23814 Support marking snapshots for write-conflict checking - Take 2
Summary:
D51183 was reverted due to breaking the LITE build.

This diff is the same as D51183 but with a fix for the LITE BUILD(D51693)

Test Plan: run all unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51711
2015-12-08 16:47:31 -08:00
Venkatesh Radhakrishnan
3d8bb2c890 Fix valgrind failure in IncreaseUniversalCompactionNumLevels
Summary:
Fixing a valgrind failure in DBTestUniversalCompaction
in the IncreaseUniversalCompactionNumLevels test. Using
SpecialSkipList with 10 rows per file.

Test Plan: Run valgrind and functional tests.

Reviewers: anthony, yhchiang, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51705
2015-12-08 11:45:29 -08:00
sdong
1d63c3d610 Revert "Support marking snapshots for write-conflict checking"
This reverts commit ec704aafdc for it broke RocksDB LITE build.
2015-12-08 09:27:17 -08:00
agiardullo
ec704aafdc Support marking snapshots for write-conflict checking
Summary:
D50475 enables using SST files for transaction write-conflict checking.  In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295).  If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization.

This diff allows Transactions to mark snapshots as being used for write-conflict checking.  Then, during compaction, we will be able to optimize SingleDeletes better in the future.

This diff adds a flag to SnapshotImpl which is used by Transactions.  This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator.  This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information).

Test Plan: no behavior change, ran existing tests

Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51183
2015-12-07 19:40:51 -08:00
sdong
770dea9325 Fix occasional failure of DBTest.DynamicCompactionOptions
Summary: DBTest.DynamicCompactionOptions ocasionally fails during valgrind run. We sent a sleeping task to block compaction thread pool but we don't wait it to run.

Test Plan: Run the test multiple times in an environment which can cause failure.

Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51687
2015-12-07 18:38:39 -08:00
SherlockNoMad
ebc2d490d1 Split histogram per OperationType in db_bench 2015-12-07 17:33:18 -08:00
sdong
f307036bde Revert "Fix a race condition in persisting options"
This reverts commit 2fa3ed5180. It breaks RocksDB lite build
2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang
2fa3ed5180 Fix a race condition in persisting options
Summary:
This patch fix a race condition in persisting options which will cause a crash when:

* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.

Test Plan: Add a test in column_family_test that can reproduce the crash

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51609
2015-12-07 15:25:12 -08:00
Venkatesh Radhakrishnan
f276c3a821 Fix valgrind failures in 3 tests in db_compaction_test due to new skiplist changes
Summary:
Several tests in db_compaction_test are failing with aborts in
valgrind. These are LevelCompactionThirdPath, LevelCompactionPathUse and
CompressLevelCompaction. We now use the SpecialSkipListFactory to make
them more deterministic

Test Plan: valgrind

Reviewers: anthony, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51663
2015-12-07 11:57:00 -08:00
sdong
291088ae4e Fix undeterministic failure of ColumnFamilyTest.DifferentWriteBufferSizes
Summary: After the skip list optimization, ColumnFamilyTest.DifferentWriteBufferSizes can occasionally fail with flush triggering of column family 3. Insert more data to it to make sure flush will trigger.

Test Plan: Run it multiple times with both of jemaloc on and off and see it always passes. (Without thd commit the run with jemalloc fails with chance of about one in two)

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51645
2015-12-07 10:53:29 -08:00
SherlockNoMad
355fa94365 EstimatedNumKeys Counter Inaccurate 2015-12-07 10:51:08 -08:00
Islam AbdelRahman
a9ca9107b9 Fix db_universal_compaction_test
Summary:
db_universal_compaction_test is still failing because of
UniversalCompactionNumLevels/DBTestUniversalCompaction.UniversalCompactionSecondPathRatio/0

https://travis-ci.org/facebook/rocksdb/jobs/94949919

Use same approach to fix other tests to fix this test

Test Plan: Run ./db_universal_compaction_test on mac and make sure all the tests pass

Reviewers: kradhakrishnan, yhchiang, rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51591
2015-12-04 13:27:56 -08:00
krad
d3bb572da6 Build break fix.
Summary: Skip list now cannot estimate memory across allocators
consistently and hence triggers flush at different time. This breaks certain
unit tests.

The fix is to adopt key count instead of size for flush.

Test Plan: Ran test on dev box and mac (where it used to fail)

Reviewers: sdong

CC: leveldb@

Task ID: #9273334

Blame Rev:
2015-12-04 11:45:51 -08:00
Alex Yang
e8180f9901 added public api to schedule flush/compaction, code to prevent race with db::open
Summary:
Fixes T8781168.

Added a new function EnableAutoCompactions in db.h to be publicly
avialable.  This allows compaction to be re-enabled after disabling it via
SetOptions

Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open
Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to
prevent race condition.

Test Plan:
Ran make all check

verified fix on myrocks side:
was able to reproduce the seg fault with
../tools/mysqltest.sh --mem --force rocksdb.drop_table

method was to manually sleep the thread after DB::Open but before TransactionDB ptr was
assigned in transaction_db_impl.cc:
  DB::Open(db_options, dbname, column_families_copy, handles, &db);
  clock_t goal = (60000 * 10) + clock();
  while (goal > clock());
  ...dbptr(aka rdb) gets assigned below

verified my changes fixed the issue.

Also added unit test 'ToggleAutoCompaction' in transaction_test.cc

Reviewers: hermanlee4, anthony

Reviewed By: anthony

Subscribers: alex, dhruba

Differential Revision: https://reviews.facebook.net/D51147
2015-12-03 22:59:44 -08:00
Islam AbdelRahman
19b1201b2b Merge pull request #865 from yuslepukhin/fix_db_table_properties_test
Avoid empty ranges vector with subsequent zero element access
2015-12-03 17:32:20 -08:00
yuslepukhin
e0de7ef87b Avoid empty ranges vector with subsequent zero element access 2015-12-02 14:50:33 -08:00
Yueh-Hsuan Chiang
a330f0b3bb Fix incorrect merge in db/db_compaction_test.cc
Summary: Fix incorrect merge in db/db_compaction_test.cc

Test Plan: db_compaction_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven, kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51531
2015-12-02 14:09:09 -08:00
Yueh-Hsuan Chiang
bd7a49d448 Make DBCompactionTestWithParam::CompactionTrigger more deterministic
Summary: Make DBCompactionTestWithParam::CompactionTrigger more deterministic

Test Plan: ./db_compaction_test

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51507
2015-12-02 14:06:33 -08:00
sdong
bcd7bd1229 Relax verification condition of DBTest.SuggestCompactRangeTest
Summary: Verifiction condition of DBTest.SuggestCompactRangeTest is too strict. Based on key distribution, we might have more small files in last level. Not check number of files in the last level.

Test Plan: Run DBTest.SuggestCompactRangeTest with both of jemalloc on and off.

Reviewers: rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51501
2015-12-01 21:12:24 -08:00
sdong
f9103d9a30 DBTest.DynamicCompactionOptions: More deterministic and readable
Summary: DBTest.DynamicCompactionOptions sometimes fails the assert but I can't repro it locally. Make it more deterministic and readable and see whether the problem is still there.

Test Plan: Run tht test and make sure it passes

Reviewers: kradhakrishnan, yhchiang, igor, rven, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51309
2015-12-01 16:49:47 -08:00
sdong
0ad68518bb Fix DBCompactionTestWithParam.CompactionTrigger in non-jemalloc build.
Summary: DBCompactionTestWithParam.CompactionTrigger fails in non-jemalloc build, after the skip list memtable change. Fix it by making mem table flush trigger by number of entries.

Test Plan: Run the test using both of jemalloc and non-jemalloc build.

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, igor, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51471
2015-12-01 12:25:22 -08:00
sdong
459c7fba36 Revert previous behavior of internal_key_skipped_count
Summary: With recent commit 33e0c93826, db iterator skips perf context counter internal_key_skipped_count when blindly issuing internal Next(). Now increment the counter by one when issuing this Next()

Test Plan: Run all existing tests

Reviewers: rven, yhchiang, IslamAbdelRahman, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51465
2015-11-30 21:55:05 -08:00
agiardullo
481f9edb15 Fix CLANG build
Summary: fix clang build

Test Plan: build

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51453
2015-11-30 20:02:13 -08:00
sdong
ef8ed3681c Fix DBTest.SuggestCompactRangeTest for disable jemalloc case
Summary: DBTest.SuggestCompactRangeTest fails for the case when jemalloc is disabled, including ASAN and valgrind builds. It is caused by the improvement of skip list, which allocates different size of nodes for a new records. Fix it by using a special mem table that triggers a flush by number of entries. In that way the behavior will be consistent for all allocators.

Test Plan: Run the test with both of DISABLE_JEMALLOC=1 and 0

Reviewers: anthony, rven, yhchiang, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51423
2015-11-30 16:40:47 -08:00
sdong
db320b1b82 DB to only flush the column family with the largest memtable while option.db_write_buffer_size is hit
Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memt table instead. In this way, we can avoid too many small files in some cases.

Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior

Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: march, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51291
2015-11-30 13:36:57 -08:00
sdong
33e0c93826 Reduce extra key comparision in DBIter::Next()
Summary: Now DBIter::Next() always compares with current key with itself first, which is unnecessary if the last key is not a merge key. I made the change and didn't see db_iter_test fails. Want to hear whether people have any idea what I miss.

Test Plan: Run all unit tests

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48279
2015-11-24 17:16:18 -08:00
Nathan Bronson
9a9d4759b2 InlineSkipList part 3/3 - new skiplist type that colocates key and node
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node.  This allows us to remove the pointer from the node
to the key.  As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.

For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList.  This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index).  Taking advantage of inline allocation for
those cases is left to future work.

The perf win is biggest for small values.  For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec.  For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51123
2015-11-24 15:16:02 -08:00
Nathan Bronson
5201729545 InlineSkipList - part 2/3
Summary:
This diff is 2/3 in a sequence that introduces a skip list optimized
for a key that is a freshly-allocated const char*.  The change is broken
into pieces to make it easier to review.  This piece removes the Key
template type, introduces the AllocateKey interface, and changes the
unit test from using uint64_t as the Key type to using pointers to an 8
byte blob.

Test Plan: unit test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51285
2015-11-24 14:30:56 -08:00
Nathan Bronson
78812ec6bf InlineSkipList - part 1/3
Summary:
This diff is 1/3 in a sequence that introduces a skip list optimized for
a key that is a freshly-allocated const char*.  The diff is broken into
pieces to make it easier to review.  This piece only introduces the new
type by copying the existing SkipList, with mechanical naming changes
and reformatting.

Test Plan: new unit test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51279
2015-11-24 14:30:22 -08:00
Vasili Svirski
41b32c6059 Enable C4267 warning
* conversion from 'size_t' to 'type', by add static_cast

Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
2015-11-24 16:33:09 +03:00
agiardullo
c5b467306d Fix race condition that causes valgrind failures
Summary: DBTest.DynamicLevelCompressionPerLevel2 sometimes fails during valgrind runs.  This causes our valgrind tests to fail.  Not sure what the best fix is for this test, but hopefully this simple change is sufficient.

Test Plan: run test

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51111
2015-11-20 18:26:48 -08:00
Siying Dong
efb01a055a Merge pull request #850 from yuslepukhin/enable_2015_build
Build on Visual Studio 2015 Update 1
2015-11-20 17:57:22 -08:00
Venkatesh Radhakrishnan
81be49c755 Have a way for compaction filter to ignore snapshots
Summary:
Provide an API for compaction filter to specify that it needs
to be applied even if there are snapshots.

Test Plan: DBTestCompactionFilter.CompactionFilterIgnoreSnapshot

Reviewers: yhchiang, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51087
2015-11-20 15:57:26 -08:00
yuslepukhin
047bd22aae Build on Visual Studio 2015 Update 1 2015-11-20 15:31:47 -08:00
sdong
189b3e03df Fix uninitilizeded SpecialEnv::time_elapse_only_sleep_
Summary: SpecialEnv::time_elapse_only_sleep_ is not initialized, which might cause some test failures. Fix it.

Test Plan: Run some unit tests. Since tests already broken. Might want to commit it sooner.

Reviewers: IslamAbdelRahman, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50937
2015-11-17 16:22:17 -08:00
sdong
d5540e18e6 DBTest.MergeTestTime to only use fake time to be determinstic
Summary: DBTest.MergeTestTime is a test verifying timing counters. Depending on real time may cause non-determinstic results. Change to fake time to be determinsitic.

Test Plan: Run the test and make sure it passes

Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50883
2015-11-17 14:40:23 -08:00
Islam AbdelRahman
605a24d94e Block forward_iterator_bench under MAC and Windows
Summary:
Travis is now failing because we cannot compile forward_iterator_bench under MAC
https://travis-ci.org/facebook/rocksdb/jobs/91524025

In forward_iterator_bench.cc we are using multiple functions that are not available in MAC like
htobe64
be64toh

Blocking forward_iterator_bench under MAC

Test Plan: compile under mac

Reviewers: rven, yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50889
2015-11-17 11:51:37 -08:00
Venkatesh Radhakrishnan
9b8c9be0b5 Fix forward_iterator allocation of vector.
Summary:
db_tailing_iter_test was failing on some platforms because of
an incorrect allocation and use. This diff fixes the issue.

Test Plan:
db_tailing_iter_test
Run valgrind for db_tailing_iter_test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50835
2015-11-17 10:27:51 -08:00
sdong
5cbb7e43e0 DBTest.MergeTestTime: relax counter upper bound verification
Summary: Timing counters' upper bounds depend on platform. It frequently fails in valgrind runs. Relax the upper bound.

Test Plan: Run the same valgrind test and make sure it passes.

Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50829
2015-11-16 19:47:07 -08:00
Reid Horuff
3381e2c3e7 Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork()
Summary: Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork()

Test Plan: rocksdb.information_schema handles this case.

Reviewers: igor

Reviewed By: igor

Subscribers: hermanlee4, jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D50781
2015-11-16 14:20:18 -08:00
Islam AbdelRahman
ca5566d209 Fix clang build
Summary: Fix clang

Test Plan:
USE_CLANG=1 make all -j64

Reviewers: sdong, yhchiang, anthony, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50793
2015-11-16 14:14:39 -08:00
Dmitri Smirnov
cb9459f85c Fix empty vector write in ForwardIterator 2015-11-16 13:58:10 -08:00
Islam AbdelRahman
a163cc2d5a Lint everything
Summary:
```
arc2 lint --everything
```

run the linter on the whole code repo to fix exisitng lint issues

Test Plan: make check -j64

Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50769
2015-11-16 12:56:21 -08:00
sdong
dac5b248b1 UniversalCompactionPicker::PickCompaction(): avoid to form compactions if there is no file
Summary:
Currently RocksDB may break in lines like this:

for (size_t i = sorted_runs.size() - 1; i >= first_index_after; i--) {

if options.level0_file_num_compaction_trigger=0.

Fix it by not executing the logic of picking compactions if there is no file (sorted_runs.size() = 0). Also internally set options.level0_file_num_compaction_trigger=1 if users give a 0. 0 is a value makes no sense in RocksDB.

Test Plan: Run all tests. Will add a unit test too.

Reviewers: yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50727
2015-11-16 10:32:45 -08:00
Venkatesh Radhakrishnan
d06b63e99f Fix Rocksdb lite build failure in forward_iterator_bench
Summary:
Fixed Rocksdb lite build failure in forward_iterator_bench by
defining main for the ROCKSDB_LITE case

Test Plan: build ROCKSDB_LITE

Reviewers: anthony, yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50733
2015-11-16 09:57:08 -08:00
Venkatesh Radhakrishnan
7824444bfc Reuse file iterators in tailing iterator when memtable is flushed
Summary:
Under a tailing workload, there were increased block cache
misses when a memtable was flushed because we were rebuilding iterators
in that case since the version set changed. This was exacerbated in the
case of iterate_upper_bound, since file iterators which were over the
iterate_upper_bound would have been deleted and are now brought back as
part of the Rebuild, only to be deleted again. We now renew the iterators
and only build iterators for files which are added and delete file
iterators for files which are deleted.
Refer to https://reviews.facebook.net/D50463 for previous version

Test Plan: DBTestTailingIterator.TailingIteratorTrimSeekToNext

Reviewers: anthony, IslamAbdelRahman, igor, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: yhchiang, march, dhruba, leveldb, lovro

Differential Revision: https://reviews.facebook.net/D50679
2015-11-13 15:50:59 -08:00
Venkatesh Radhakrishnan
2ae4d7d708 Make sure that CompactFiles does not run two parallel Level 0 compactions
Summary:
Since level 0 files can overlap, two level 0 compactions cannot
run in parallel. Compact files needs to check this before running a
compaction.

Test Plan: CompactFilesTest.L0ConflictsFiles

Reviewers: igor, IslamAbdelRahman, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50079
2015-11-13 12:01:00 -08:00
Nathan Bronson
6ce42dd075 Don't merge WriteBatch-es if WAL is disabled
Summary:
There's no need for WriteImpl to flatten the write batch group
into a single WriteBatch if the WAL is disabled.  This diff moves the
flattening into the WAL step, and skips flattening entirely if it isn't
needed.  It's good for about 5% speedup on a multi-threaded workload
with no WAL.

This diff also adds clarifying comments about the chance for partial
failure of WriteBatchInternal::InsertInto, and always sets bg_error_ if
the memtable state diverges from the logged state or if a WriteBatch
succeeds only partially.

Benchmark for speedup:
  db_bench -benchmarks=fillrandom -threads=16 -batch_size=1 -memtablerep=skip_list -value_size=0 --num=200000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

Test Plan: asserts + make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50583
2015-11-12 10:50:38 -08:00
Yueh-Hsuan Chiang
56245ddcf5 Fixed DBCompactionTest.SkipStatsUpdateTest
Summary:
DBCompactionTest.SkipStatsUpdateTest relies on the number
of files opened during the DB::Open process, but the persisting
options file support altered this number and thus makes
DBCompactionTest.SkipStatsUpdateTest in certain environment.

This patch fixed this test failure.

Test Plan: db_compaction_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50637
2015-11-12 07:45:53 -08:00
Yueh-Hsuan Chiang
e78389b554 Fixed build failure of RocksDBLite test on options_file_test.cc
Summary: Fixed build failure of RocksDBLite test

Test Plan: options_file_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50595
2015-11-10 23:23:36 -08:00
Yueh-Hsuan Chiang
e114f0abb8 Enable RocksDB to persist Options file.
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.

In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.

  // If true, then DB::Open / CreateColumnFamily / DropColumnFamily
  // / SetOptions will fail if options file is not detected or properly
  // persisted.
  //
  // DEFAULT: false
  bool fail_if_missing_options_file;

Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.

Test Plan:
Add options_file_test.

options_test
column_family_test

Reviewers: igor, IslamAbdelRahman, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48285
2015-11-10 22:58:01 -08:00
Nathan Bronson
631863c63b track WriteBatch contents
Summary:
Parallel writes will only be possible for certain combinations of
flags and WriteBatch contents.  Traversing the WriteBatch at write time
to check these conditions would be expensive, but it is very cheap to
keep track of when building WriteBatch-es.  When loading WriteBatch-es
during recovery, a deferred computation state is used so that the flags
never need to be computed.

Test Plan:
1. add asserts and EXPECT_EQ-s
2. make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50337
2015-11-10 16:56:06 -08:00
Nathan Bronson
b81b430987 Switch to thread-local random for skiplist
Summary:
Using a TLS random instance for skiplist makes it smaller
(useful for hash_skiplist_rep) and prepares skiplist for concurrent
adds.  This diff also modifies the branching factor math to avoid an
unnecessary division.

This diff has the effect of changing the sequence of skip list node
height choices made by tests, so it has the potential to cause unit
test failures for tests that implicitly rely on the exact structure
of the skip list.  Tests that try to exactly trigger a compaction are
likely suspects for this problem (these tests have always been brittle to
changes in the skiplist details).  I've minimizes this risk by reseeding
the main thread's Random at the beginning of each test, increasing the
universal compaction size_ratio limit from 101% to 105% for some tests,
and verifying that the tests pass many times.

Test Plan: for i in `seq 0 9`; do make check; done

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50439
2015-11-09 19:25:22 -08:00
Islam AbdelRahman
5b9ce1a323 Merge pull request #820 from yuslepukhin/enable_compiler_warnings
Enable Windows warnings C4307 C4309 C4512 C4701
2015-11-06 12:08:25 -08:00
Dmitri Smirnov
20f57b1715 Enable Windows warnings C4307 C4309 C4512 C4701
Enable C4307 'operator' : integral constant overflow
  Longs and ints on Windows are 32-bit hence the overflow
  Enable C4309 'conversion' : truncation of constant value
  Enable C4512 'class' : assignment operator could not be generated
  Enable C4701 Potentially uninitialized local variable 'name' used
2015-11-06 11:34:06 -08:00
Nathan Bronson
2b42000f43 incorrect batch group size computation for write throttling
Summary:
When a write batch can't join a batch group due to the total
size of the contained batches, the write controller's GetDelay is passed
a size value that includes the rejected batch.

Test Plan: make check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50343
2015-11-06 09:23:55 -08:00
Venkatesh Radhakrishnan
ae7940b628 Fix regression failure in PrefixTest.PrefixValid
Summary: Use IterKey to store prefix_start_ so that it doesn't get freed

Test Plan: PrefixTest.PrefixValid

Reviewers: anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D50289
2015-11-05 16:43:54 -08:00
Venkatesh Radhakrishnan
9d50afc3b9 Prefix-based iterating only shows keys in prefix
Summary:
MyRocks testing found an issue that while iterating over keys
that are outside the prefix, sometimes wrong results were seen for keys
outside the prefix. We now tighten the range of keys seen with a new
read option called prefix_seen_at_start. This remembers the starting
prefix and then compares it on a Next for equality of prefix. If they
are from a different prefix, it sets valid to false.

Test Plan: PrefixTest.PrefixValid

Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony

Reviewed By: anthony

Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50211
2015-11-05 13:24:05 -08:00
Yueh-Hsuan Chiang
7d7ee2b654 Add Memory Insight support to utilities
Summary:
This patch introduces utilities/memory, which currently includes
GetApproximateMemoryUsageByType that reports different types of
rocksdb memory usage given a list of input DBs.

The API also take care of the case where Cache could be shared
across multiple column families / multiple db instances.

Currently, it reports memory usage of memtable, table-readers
and cache.

Test Plan: utilities/memory/memory_test.cc

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49257
2015-11-03 17:52:17 -08:00
Yueh-Hsuan Chiang
3ecbab0040 Add GetAggregatedIntProperty(): returns the aggregated value from all CFs
Summary:
This patch adds GetAggregatedIntProperty() that returns the aggregated
value from all CFs

Test Plan: Added a test in db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49497
2015-11-03 15:54:18 -08:00
Islam AbdelRahman
f31442fb5c Merge pull request #803 from SherlockNoMad/SkipFlush
Add Option to Skip Flushing in TableBuilder
2015-11-02 14:56:11 -08:00
Igor Canadi
279c8e0cd8 Merge pull request #811 from OverlordQ/unused-variable-warning
Fix introduced in 2ab7065 was reverted by 18285c1.
2015-11-02 12:44:27 -08:00
Brent Garber
affd833690 Fix introduced in 2ab7065 was reverted by 18285c1.
Corrects:

db/memtablerep_bench.cc:135:22: error: ‘FLAGS_env’ defined but not used [-Werror=unused-variable]
 static rocksdb::Env* FLAGS_env = rocksdb::Env::Default();
                      ^
cc1plus: all warnings being treated as errors
Makefile:1147: recipe for target 'db/memtablerep_bench.o' failed
2015-11-02 15:35:45 -05:00
SherlockNoMad
ccc8c10c0c Move skip_table_builder_flush to BlockBasedTableOption 2015-10-30 18:33:01 -07:00
Dmitri Smirnov
eaaf081d16 Do not suppress C4018 'expression' : signed/unsigned mismatch
The code compiles cleanly for the most part. Fix db_test.
  Move debug file to testutil library.
2015-10-30 17:03:16 -07:00
Islam AbdelRahman
ff4499e297 Update DB::AddFile() to have less restrictions
Summary:
Update DB::AddFile() restrictions to be
  - Key range in loaded table file don't overlap with existing keys or tombstones in DB.
  - No other writes happen during AddFile call.

The updated AddFile() will verify that the file key range don't overlap with any keys or tombstones in the DB, and then add the file to L0

Test Plan: unit tests

Reviewers: igor, rven, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: adsharma, ameyag, dhruba

Differential Revision: https://reviews.facebook.net/D49233
2015-10-30 16:38:10 -07:00
sdong
11c71a365a db_bench: --compaction_pri default should be rocksdb::Options().compaction_pri
Summary: Currently db_bnech's --compaction_pri default is set to be rocksdb::Options().compaction_style. Change it to rocksdb::Options().compaction_pri. Although, for now both is 0.

Test Plan: Build db_bench

Reviewers: anthony, rven, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49773
2015-10-30 15:02:33 -07:00
SherlockNoMad
a6dd0831d5 Add Option to Skip Flushing in TableBuilder 2015-10-29 22:10:25 -07:00
Islam AbdelRahman
2872e0c8c2 Clean and expose CreateLoggerFromOptions
Summary:
CreateLoggerFromOptions have some parameters like  db_log_dir and env, these parameters are redundant since they already exist in DBOptions

this patch remove the redundant parameters and expose CreateLoggerFromOptions to users

Test Plan: make check

Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D49713
2015-10-29 18:07:37 -07:00
sdong
296c3a1f94 "make format" in some recent commits
Summary: Run "make format" for some recent commits.

Test Plan: Build and run tests

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49707
2015-10-29 17:11:14 -07:00
Siying Dong
6388e7f4e2 Merge pull request #798 from yuslepukhin/readahead_buffermanagement
Implement smart buffer management in Windows Env.
2015-10-29 15:02:59 -07:00
Dmitri Smirnov
1277a48f1b Fix 80 character limit issue. 2015-10-29 11:34:34 -07:00
Herman Lee
0d720dfc17 Use the correct variable when fetching table properties.
Summary:
An uninitialized parameter was being passed into the call to fetch the table
properties during the compaction notification callbacks.

Test Plan:
Build it with myrocks and verify unit test passed.
Run unit tests.

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D49635
2015-10-28 16:28:11 -07:00
Dmitri Smirnov
6fbc4f9f3e Implement smart buffer management.
introduce a new DBOption random_access_max_buffer_size to limit
  the size of the random access buffer used for unbuffered access.
  Implement read ahead buffering when enabled.
  To that effect propagate compaction_readahead_size and the new option
  to the env options to make it available for the implementation.
  Add Hint() override so SetupForCompaction() call would call Hint()
  readahead can now be setup from both Hint() and EnableReadAhead()
  Add new option random_access_max_buffer_size support
  db_bench, options_helper to make it string parsable
  and the unit test.
2015-10-27 14:44:16 -07:00
Praveen Rao
4ce117c4d5 Merge branch 'master' into wal_filter 2015-10-26 19:03:34 -07:00
Praveen Rao
32cdec634e Fail recovery if filter provides more records than original and corresponding unit-test, fix naming conventions 2015-10-26 18:11:18 -07:00
Siying Dong
138876a62c Merge pull request #746 from ceph/wip-recycle
Add Options.recycle_log_file_num for Recycling WAL Files
2015-10-26 15:01:28 -07:00
Dmitri Smirnov
3c750b59ae No need to #ifdef test only code on windows 2015-10-22 15:15:37 -07:00
sdong
e3d4e14075 DBCompactionTestWithParam.ManualCompaction to verify block cache is not filled in manual compaction
Summary: Manual compaction should not fill block cache. Add the verification in unit test

Test Plan: Run the test

Reviewers: yhchiang, kradhakrishnan, rven, IslamAbdelRahman, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49089
2015-10-20 10:36:49 -07:00
sdong
6d6776f6b8 Log more information for the add file with overlapping range failure
Summary: crash_test sometimes fails, hitting the add file overlapping assert. Add information in info logs help us to find the bug.

Test Plan: Run all test suites. Do some manual tests to make sure printing is correct.

Reviewers: kradhakrishnan, yhchiang, anthony, IslamAbdelRahman, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D49017
2015-10-19 17:31:13 -07:00
Praveen Rao
7951b9b079 make field order match initialization order 2015-10-19 17:03:01 -07:00
Praveen Rao
2938c5c137 merge upstream changes 2015-10-19 15:21:33 -07:00
Sage Weil
a7b2bedfb0 log_{reader,write}: recyclable record format
Introduce new tags for records that have a log_number.  This changes the
header size from 7 to 11 for these records, making this a
backward-incompatible change.

If we read a record that belongs to a different log_number (i.e., a
previous instantiation of this log file, before it was most recently
recycled), we return kOldRecord from ReadPhysicalRecord.  ReadRecord
will translate this into a kEof or kBadRecord depending on what the
WAL recovery mode is.

We make several adjustments to the log_test.cc tests to compensate for the
fact that the header size varies between the two modes.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-19 17:24:05 -04:00
Igor Canadi
4e07c99a9a Fix iOS build
Summary: We don't yet have a CI build for iOS, so our iOS compile gets broken sometimes. Most of the errors are from assumption that size_t is 64-bit, while it's actually 32-bit on some (all?) iOS platforms. This diff fixes the compile.

Test Plan:
TARGET_OS=IOS make static_lib

Observe there are no warnings

Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D49029
2015-10-19 13:40:44 -07:00
Praveen Rao
0c59691dde Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status 2015-10-19 13:27:40 -07:00
Dmitri Smirnov
2f680ed094 Make index same type as auto deduced uint32_t 2015-10-19 12:29:11 -07:00
Dmitri Smirnov
09f853550c uint is a not a datatype on windows. 2015-10-19 11:28:22 -07:00
Alexey Maykov
f18acd8875 Fixed the clang compilation failure
Summary: As above.

Test Plan: USE_CLANG=1 make check -j

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48981
2015-10-19 10:38:50 -07:00
Sage Weil
4104e9bb67 log_reader: introduce kBadHeader; drop wal mode from ReadPhysicalRecord
Move the WAL recovery mode logic out of ReadPhysicalRecord.  To do this we
introduce a new type indicating when we fail to read a valid header.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil
9c33f64d19 log_reader: pass in WALRecoveryMode instead of bool report_eof_inconsistency
Soon our behavior will depend on more than just whther we are in
kAbsoluteConsistency or not.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil
7188052107 db_test_util: add recycle_log_files to set of tested options
Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil
3ac13c99d1 log_reader: pass log_number and optional info_log to ctor
We will need the log number to validate the recycle-style CRCs.  The log
is helpful for debugging, but optional, as not all callers have it.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil
5830c699f2 log_writer: pass log number and whether recycling is enabled to ctor
When we recycle log files, we need to mix the log number into the CRC
for each record.  Note that for logs that don't get recycled (like the
manifest), we always pass a log_number of 0 and false.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil
666376150c db_impl: recycle log files
If log recycling is enabled, put old WAL files on a recycle queue instead of
deleting them.  When we need a new log file, take a recycled file off the
list if one is available.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:24:32 -04:00
Sage Weil
d666225a0a db_impl: disable recycle_log_files if WAL archive is enabled
We can't recycle the files if they are being archived.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Sage Weil
543c12ab06 options: add recycle_log_file_num option
Signed-off-by: Sage Weil <sage@redhat.com>
2015-10-18 21:21:24 -04:00
Alexey Maykov
e1a09a7703 Implementation for GetPropertiesOfTablesInRange
Summary: In MyRocks, it is sometimes important to get propeties only for the subset of the database. This diff implements the API in RocksDB.

Test Plan: ran the GetPropertiesOfTablesInRange

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48651
2015-10-17 13:34:43 -07:00
Yueh-Hsuan Chiang
ad471453e8 Allow GetProperty to report the number of currently running flushes / compactions.
Summary:
Add rocksdb.num-running-compactions and rocksdb.num-running-flushes
to GetIntProperty() that reports the number of currently running
compactions / flushes.

Test Plan: augmented existing tests in db_test

Reviewers: igor, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48693
2015-10-17 00:16:36 -07:00
sdong
277dea78f0 Add more kill points
Summary:
Add kill points in:
1. after creating a file
2. before writing a manifest record
3. before syncing manifest
4. before creating a new current file
5. after creating a new current file

Test Plan: Run all current tests.

Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48855
2015-10-16 14:35:12 -07:00
Venkatesh Radhakrishnan
a98fbacfa0 Moving memtable related files from util to a new directory memtable
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48915
2015-10-16 14:10:33 -07:00
Islam AbdelRahman
952ad994a9 Fix db_test under ROCKSDB_LITE
Summary:
This diff exclude alot of tests in db_test that are not compiling / failing under ROCKSD_LITE

Test Plan:
OPT=-DROCKSDB_LITE make check -j64
make check -j64

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48771
2015-10-15 10:59:31 -07:00
Islam AbdelRahman
6d730b4ae7 Block tests under ROCKSDB_LITE
Summary:
This patch will block all tests (not including db_test) that don't compile / fail under ROCKSDB_LITE

Test Plan:
OPT=-DROCKSDB_LITE make db_compaction_filter_test -j64 &&
OPT=-DROCKSDB_LITE make db_compaction_test -j64 &&
OPT=-DROCKSDB_LITE make db_dynamic_level_test -j64 &&
OPT=-DROCKSDB_LITE make db_log_iter_test -j64 &&
OPT=-DROCKSDB_LITE make db_tailing_iter_test -j64 &&
OPT=-DROCKSDB_LITE make db_universal_compaction_test -j64 &&
OPT=-DROCKSDB_LITE make ldb_cmd_test -j64

make clean

make db_compaction_filter_test -j64 &&
make db_compaction_test -j64 &&
make db_dynamic_level_test -j64 &&
make db_log_iter_test -j64 &&
make db_tailing_iter_test -j64 &&
make db_universal_compaction_test -j64 &&
make ldb_cmd_test -j64

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48723
2015-10-15 10:51:00 -07:00
sdong
dae49e829e Make DBTest.ReadLatencyHistogramByLevel more robust
Summary:
Two fixes:
1. Wait compaction after generating each L0 file so that we are sure there are one L0 file left.
2. https://reviews.facebook.net/D48423 increased from 500 keys to 700 keys but in verification phase we are still querying the first 500 keys. It is a bug to fix.

Test Plan: Run the test in the same environment that fails by chance of one in tens of times. It doesn't fail after 1000 times.

Reviewers: yhchiang, IslamAbdelRahman, igor, rven, kradhakrishnan

Reviewed By: rven, kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48759
2015-10-14 16:08:55 -07:00
Islam AbdelRahman
b81b2ec25d Fix benchmarks under ROCKSDB_LITE
Summary: Fix db_bench and memtablerep_bench under ROCKSDB_LITE

Test Plan:
OPT=-DROCKSDB_LITE make db_bench -j64
OPT=-DROCKSDB_LITE make memtablerep_bench -j64
make db_bench -j64
make memtablerep_bench -j64

Reviewers: yhchiang, anthony, rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48717
2015-10-14 12:43:00 -07:00
Venkatesh Radhakrishnan
e587dbe03a Move manual_compaction_test.cc from util to db
Summary: manual_compaction_test.cc incorrectly in util. Moved to db.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48687
2015-10-14 11:06:27 -07:00
Islam AbdelRahman
f55d3009c0 Make db_test_util compile under ROCKSDB_LITE
Summary: db_test_util is used in multiple test files but it dont compile under ROCKSDB_LITE

Test Plan:
make check
make static_lib
OPT=-DROCKSDB_LITE make db_wal_test

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48579
2015-10-13 17:33:23 -07:00
Yueh-Hsuan Chiang
2b925ccb43 Correct the comments in db/internal_stats.h
Summary: Correct the comments in db/internal_stats.h

Test Plan: no code change

Reviewers: igor, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48675
2015-10-13 17:23:40 -07:00
sdong
35ad531be3 Seperate InternalIterator from Iterator
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.

This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.

Test Plan: Run all existing tests.

Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48549
2015-10-13 15:32:13 -07:00
Praveen Rao
cc4d13e0a8 Put wal_filter under #ifndef ROCKSDB_LITE 2015-10-13 11:10:14 -07:00
Yueh-Hsuan Chiang
7062d0ea68 Make perf_context.db_mutex_lock_nanos and db_condition_wait_nanos only measures DB Mutex
Summary:
In the current implementation, perf_context.db_mutex_lock_nanos and
perf_context.db_condition_wait_nanos also include the mutex-wait time
other than DB Mutex.

This patch fix this issue by incrementing the counters only when it detects
a DB mutex.

Test Plan: perf_context_test

Reviewers: anthony, IslamAbdelRahman, sdong, igor

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48555
2015-10-13 10:41:48 -07:00
Islam AbdelRahman
1fe78a4073 Fix tests failing in ROCKSDB_LITE
Summary:
Fix tests that compile under ROCKSDB_LITE but currently failing.

table_test:
RandomizedLongDB test is using internal stats which is not supported in ROCKSDB_LITE

compaction_job_test:
Using CompactionJobStats which is not supported

perf_context_test:
KeyComparisonCount test try to open DB in ReadOnly mode which is not supported

Test Plan: run the tests under ROCKSDB_LITE

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48585
2015-10-13 10:32:05 -07:00
Praveen Rao
a6efefef79 Fix format specifiers 2015-10-12 19:20:09 -07:00
Praveen Rao
f7b2a7b40b Fix format specifiers 2015-10-12 18:55:46 -07:00
Islam AbdelRahman
7f58ff7c31 Remove db_impl_debug from release build
Summary:
Remove db_impl_debug from NDEBUG, but allow it in ROCKSDB_LITE
These functions by definition should not be included in NDEBUG and they are only used for testing
This is based on offline discussion with @yhchiang and @igor

Test Plan:
make static_lib
make check

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: igor, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D48573
2015-10-12 17:35:43 -07:00
Praveen Rao
eb24178553 merge from master 2015-10-12 17:24:21 -07:00
Islam AbdelRahman
c64ae05b1c Move TEST_NewInternalIterator to NewInternalIterator
Summary:
Long time ago we add InternalDumpCommand to ldb_tool https://reviews.facebook.net/D11517
This command is using TEST_NewInternalIterator although it's not a test. This patch move TEST_NewInternalIterator outside of db_impl_debug.cc

Test Plan:
make check
make static_lib

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48561
2015-10-12 17:22:37 -07:00
Praveen Rao
59a0c219bb Adding log filter to inspect and filter log records on recovery 2015-10-12 17:03:03 -07:00
Venkatesh Radhakrishnan
f1fdf5205b Clean up dependency: Move db_test_util.* to db directory
Summary:
As part of tech debt week, we are cleaning up dependencies.
This diff moves db_test_util.[h,cc] from util to db directory.

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48543
2015-10-12 13:05:42 -07:00
Alexey Maykov
fa4b5b3db8 Fix for the travis build caused by my previous commit
Summary: My previous commit ('Passing table properties to compaction callback') broke the clang build. Here is the fix.

Test Plan: USE_CLANG=1 make all -j

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48489
2015-10-09 19:37:51 -07:00
Alexey Maykov
3d07b815f6 Passing table properties to compaction callback
Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online.

Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats.

Reviewers: igor, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48267
2015-10-09 18:10:55 -07:00
sdong
776bd8d5eb Pass column family ID to table property collector
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.

Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48411
2015-10-09 14:36:51 -07:00
sdong
e61d9c1484 Make DBTest.AggregatedTableProperties more deterministic
Summary: Now based on environment, DBTest.AggregatedTableProperties has a possibility of issuing a L0->L1 compaction after reopening and the results are not what we expected. We tune the L0 compaction trigger to make it less likely to happen.

Test Plan: I can't repro the failure but I think the change is better. Just run the test and make sure it passes.

Reviewers: kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48423
2015-10-09 09:47:56 -07:00
sdong
b77eb16aba New Manifest format to allow customized fields in NewFile.
Summary: With this commit, we add a new format in manifest when adding a new file. Now path ID and need-compaction hint are first two customized fields.

Test Plan: Add a test case in version_edit_test to verify the encoding and decoding logic. Add a unit test in db_test to verify need compaction is persistent after DB restarting.

Reviewers: kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, rven, igor

Reviewed By: igor

Subscribers: javigon, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48123
2015-10-08 15:51:45 -07:00
sdong
000836a880 CompactionFilter::Context to contain column family ID
Summary: Add the column family ID to compaction filter context, so it is easier for compaction filter to apply different logic for different column families.

Test Plan: Add a unit test to verify the column family ID passed is correct.

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48357
2015-10-08 11:27:38 -07:00
Igor Canadi
9803e0d813 compaction_filter.h cleanup
Summary:
Two changes:
1. remove *V2 filter stuff. we deprecated that a while ago
2. clarify what happens when user sets max_subcompactions to bigger than 1

Test Plan: none

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47871
2015-10-08 09:32:50 -07:00
dyniusz
a065cdb388 bloom hit/miss stats for SST and memtable
Summary:
	hit and miss bloom filter stats for memtable and SST
	stats added to perf_context struct
	key matches and prefix matches combined into one stat

Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47859
2015-10-07 11:23:20 -07:00
Igor Canadi
77e4ad7ce2 Fix compile failure on Travis
Summary:
Travis is complaining against using {} to initialize KVMap: https://travis-ci.org/facebook/rocksdb/jobs/84132600

      db/compaction_job_test.cc:526:26: error: chosen constructor is explicit in copy-initialization
        RunCompaction({files}, {});

This diff should fix it

Test Plan: travis

Reviewers: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48309
2015-10-07 10:17:47 -07:00
Igor Canadi
d80ce7f99a Compaction filter on merge operands
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.

The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)

Do we also need a compaction filter to get called on merge results?

Test Plan: make && make check

Reviewers: lovro, tnovak, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: noetzli, kolmike, leveldb, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D47847
2015-10-07 09:30:03 -07:00
dyniusz
0267502655 Support for LevelDB SST with .ldb suffix
Summary:
	Handle SST files with both ".sst" and ".ldb" suffix.
	This enables user to migrate from leveldb to rocksdb.

Test Plan:
        Added unit test with DB operating on SSTs with names schema.
        See db/dc_test.cc:SSTsWithLdbSuffixHandling for details

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48003
2015-10-06 17:46:22 -07:00
Igor Canadi
eb5b637fb0 Fix condition for bottommost level
Summary:
The function GetBoundaryKeys() returns the smallest key from the first file and largest key from the last file. This is good for any level >0, but it's not correct for level 0. In level 0, files can overlap, so we need to check all files for boundary keys. This bug can cause wrong value for bottommost_level in compaction (value of true, although correct is false), which means we can set sequence numbers to 0 even if the key is not the oldest one in the database.

Herman reported corruption while testing MyRocks. Fortunately, the patch that added the bug was not released yet.

Test Plan: added a new test to compaction_picker_test.

Reviewers: hermanlee4, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48201
2015-10-05 17:40:18 -07:00
Igor Canadi
9eaff629e3 Make corruption_test more robust
Summary:
Latest travis failed because of corruption test TableFileIndexData: https://travis-ci.org/facebook/rocksdb/jobs/83732558

This diff makes the test more explicit:
1. create two files
2. corrupt the second's file index
3. expect to get only 5000 keys when range scanning

Test Plan: the test is still passing :)

Reviewers: sdong, rven, yhchiang, kradhakrishnan, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48183
2015-10-05 14:46:28 -07:00
Igor Canadi
bf19dbff44 Fix valgrind - Initialize done variable
Summary: Fixes the valgrind warning "Conditional jump or move depends on uninitialised value(s)"

Test Plan: valgrind test, no more warning

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48177
2015-10-05 10:10:11 -07:00
Igor Canadi
115427ef63 Add APIs PauseBackgroundWork() and ContinueBackgroundWork()
Summary:
To support a new MongoDB capability, we need to make sure that we don't do any IO for a short period of time. For background, see:
* https://jira.mongodb.org/browse/SERVER-20704
* https://jira.mongodb.org/browse/SERVER-18899

To implement that, I add a new API calls PauseBackgroundWork() and ContinueBackgroundWork() which reuse the capability we already have in place for RefitLevel() function.

Test Plan: Added a new test in db_test. Made sure that test fails when PauseBackgroundWork() is commented out.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47901
2015-10-02 13:17:34 -07:00
Islam AbdelRahman
c29af48d3e Add max_file_opening_threads to db_bench
Summary: Add an option to db_bench for max_file_opening_threads

Test Plan: compile and run db_bench

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, paultuckfield

Differential Revision: https://reviews.facebook.net/D47811
2015-09-30 09:51:31 -07:00
Yueh-Hsuan Chiang
30f74fa964 Make CompactionJobStatsTest.UniversalCompactionTest more robust
Summary:
CompactionJobStatsTest.UniversalCompactionTest assumes compaction
kicks in when the number of L0 files equals to the compaction trigger.
However, in some case, the compaction might not catch up the write
speed and thus compaction might not kick in until the number of L0 files
is GREATER than the compaction trigger.

This patch tries to fix this corner case by making the Put thread wait
for a potential compaction whenever it flushes.

Test Plan: ./compaction_job_stats_test

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D47589
2015-09-28 13:55:53 -07:00
Mike Lin
60fa9cf0b5 Override DBImplReadOnly::SyncWAL() to return NotSupported. Previously, calling it caused program abort. 2015-09-25 21:25:30 -07:00
Yueh-Hsuan Chiang
63e0f86797 Fixed a bug which causes rocksdb.flush.write.bytes stat is always zero
Summary: Fixed a bug which causes rocksdb.flush.write.bytes stat is always zero

Test Plan: augment existing db_test

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47595
2015-09-25 13:34:49 -07:00
Yueh-Hsuan Chiang
b6aa3f962d Fixed a memory leak issue in DBTest.UnremovableSingleDelete
Summary: Fixed a memory leak issue in DBTest.UnremovableSingleDelete

Test Plan: valgrind --error-exitcode=2 --leak-check=full ./db_test --gtest_filter="*UnremovableSingleDelete*"

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47583
2015-09-25 12:07:32 -07:00
Igor Canadi
7b7b5d9f18 [minor] Reuse SleepingBackgroundTask
Summary: As title

Test Plan: make check

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46983
2015-09-25 10:29:44 -07:00
Mayank Pundir
c58bac701c Fix valgrind failure due to memory leaks
Summary: Test cases for IsBottommostLevel function create FileMetaData objects which were not getting deleted in the destructor.

Test Plan: Valgrind check on compaction_picker_test

Reviewers: yhchiang, igor, sdong

Subscribers: rven, kradhakrishnan, IslamAbdelRahman, dhruba, anthony

Differential Revision: https://reviews.facebook.net/D47463
2015-09-23 17:41:42 -07:00
Islam AbdelRahman
f03b5c987b Add experimental DB::AddFile() to plug sst files into empty DB
Summary:
This is an initial version of bulk load feature

This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are
(1) Memtables are empty
(2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0
(3) Added sst files values are not overlapping

Test Plan: unit testing

Reviewers: igor, ott, sdong

Reviewed By: sdong

Subscribers: leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D39081
2015-09-23 12:42:43 -07:00
Yueh-Hsuan Chiang
3fdb6e5234 Fixed old lint errors in db/filename.cc
Summary: Fixed old lint errors in db/filename.cc

Test Plan: make

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47445
2015-09-23 12:39:16 -07:00
Yueh-Hsuan Chiang
b349d22786 Fixed old lint errors in db/filename.h
Summary: Fixed old lint errors in db/filename.h

Test Plan: make

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47439
2015-09-23 12:22:44 -07:00
sdong
df34aea331 PlainTableReader to support non-mmap mode
Summary:
PlainTableReader now only allows mmap-mode. Add the support to non-mmap mode for more flexibility.
Refactor the codes to move all logic of reading data to PlainTableKeyDecoder, and consolidate the calls to Read() call and ReadVarint32() call. Implement the calls for both of mmap and non-mmap case seperately. For non-mmap mode, make copy of keys in several places when we need to move the buffer after reading the keys.

Test Plan: Add the mode of non-mmap case in plain_table_db_test. Run it in valgrind mode too.

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D47187
2015-09-23 11:41:07 -07:00
sdong
d0c31641d2 Internal stats WAL file synced to match meaning of the stats of the same name
Summary: https://reviews.facebook.net/D23343 changed WAL sync bytes to extra fsync. This change does the same for internal stats.

Test Plan: Run all existing unit tests and verify results in db_bench.

Reviewers: anthony, rven, igor, MarkCallaghan, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D47349
2015-09-22 14:23:11 -07:00
Siying Dong
48b4497f75 Merge pull request #730 from yuslepukhin/fix_write_batch_win_const_expr
Fix Windows constexpr issue and '#ifdef' column_family_test in Release.
2015-09-22 11:08:10 -07:00
sdong
f1b9f804e9 Add a mode to always pick the oldest file to compact for each level
Summary:
Add options.compaction_pri, which specifies the policy about which file to compact first.
kCompactionPriByLargestSeq will compact oldest files first.
Verified the behavior in db_bench but did not write unit tests yet. Also need to make it settable through option string and dynamically changeable.

Test Plan: Will write unit tests

Reviewers: igor, rven, anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, MarkCallaghan

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45951
2015-09-21 17:21:59 -07:00
Dmitri Smirnov
2754ec9994 Fix Windows constexpr issue and '#ifdef' column_family_test in Release. 2015-09-21 16:21:01 -07:00
jsteemann
5ec129971b key_ cannot become nullptr, so no check is needed for that
(ignoring the unlikely case that some overrides
`operator new throw(std::bad_alloc)` with a function that returns a nullptr)
2015-09-18 20:15:20 +02:00
jsteemann
834b12a8d5 made Size() function const because it does not modify data 2015-09-18 20:10:00 +02:00
Andres Noetzli
014fd55adc Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).

In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.

Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.

Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
  deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
  this patch supported this so it could be resurrected if needed)

Test Plan: make all check

Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor

Reviewed By: igor

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
Venkatesh Radhakrishnan
51e1c11254 Do not flag error if file to be deleted does not exist
Summary:
Some users have observed errors in the log file when
the log file or sst file is already deleted.

Test Plan:
Make sure that the errors do not appear for already deleted
files.

Reviewers: sdong

Reviewed By: sdong

Subscribers: anthony, kradhakrishnan, yhchiang, rven, igor, IslamAbdelRahman, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47115
2015-09-17 10:21:34 -07:00
Mayank Pundir
a5e312a7a4 Improving condition for bottommost level during compaction
Summary: The diff modifies the condition checked to determine the bottommost level during compaction. Previously, absence of files in higher levels alone was used as the condition. Now, the function additionally evaluates if the higher levels have files which have non-overlapping key ranges, then the level can be safely considered as the bottommost level.

Test Plan: Unit test cases added and passing. However, unit tests of universal compaction are failing as a result of the changes made in this diff. Need to understand why that is happening.

Reviewers: igor

Subscribers: dhruba, sdong, lgalanis, meyering

Differential Revision: https://reviews.facebook.net/D46473
2015-09-16 17:47:50 -07:00
sdong
9aca7cd6d8 DB::Open() to flush info log after printing DB pointer
Summary: Now DB::Open() flushes info log before printing DB pointer, so it may not show up if no activity after DB open. Move log flushing from after printing options to printing DB pointer.

Test Plan: make commit-prereq

Reviewers: igor, IslamAbdelRahman, yhchiang, kradhakrishnan, anthony, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D47121
2015-09-16 16:33:39 -07:00
Yueh-Hsuan Chiang
f21c7415a7 Change the log level of DB start-up log from Warn to Header.
Summary: Change the log level of DB start-up log from Warn to Header.

Test Plan: db_bench and observe the LOG header

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47067
2015-09-16 11:31:45 -07:00
Alexey Maykov
3ebf11ed16 Adding the increment for a counter for a number of WAL syncs
Summary: This will unblock the corresponding change in MyRocks

Test Plan: ran rocksdb.write_sync test

Reviewers: sdong, kolmike

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46911
2015-09-16 11:00:49 -07:00
Igor Canadi
1b7ea8ce81 Skipped tests shouldn't be failures
Summary: If we skip a test, we shouldn't mark `make check` as failure. This fixes travis CI test.

Test Plan: Travis CI

Reviewers: noetzli, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47031
2015-09-15 18:10:36 -07:00
Ari Ekmekji
5ba3297d0d Add compaction time to log output
Summary:
Although compaction time is recorded in the statistics,
it is helpful to include this value in the log output corresponding
to the end of compaction.

Test Plan: make all && make check

Reviewers: yhchiang, sdong, igor, noetzli, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D47007
2015-09-15 17:11:44 -07:00
Igor Canadi
0e50a3fcc0 Merge issue with D46773
Summary: There was a merge issue with SleepingBackgroundTask

Test Plan: compiles now

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46977
2015-09-15 11:35:23 -07:00
Igor Canadi
a7e80379b0 LogAndApply() should fail if the column family has been dropped
Summary:
This patch finally fixes the ColumnFamilyTest.ReadDroppedColumnFamily test. The test has been failing very sporadically and it was hard to repro. However, I managed to write a new tests that reproes the failure deterministically.

Here's what happens:
1. We start the flush for the column family
2. We check if the column family was dropped here: a3fc49bfdd/db/flush_job.cc (L149)
3. This check goes through, ends up in InstallMemtableFlushResults() and it goes into LogAndApply()
4. At about this time, we start dropping the column family. Dropping the column family process gets to LogAndApply() at about the same time as LogAndApply() from flush process
5. Drop column family goes through LogAndApply() first, marking the column family as dropped.
6. Flush process gets woken up and gets a chance to write to the MANIFEST. However, this is where it gets stuck: a3fc49bfdd/db/version_set.cc (L1975)
7. We see that the column family was dropped, so there is no need to write to the MANIFEST. We return OK.
8. Flush gets OK back from LogAndApply() and it deletes the memtable, thinking that the data is now safely persisted to sst file.

The fix is pretty simple. Instead of OK, we return ShutdownInProgress. This is not really true, but we have been using this status code to also mean "this operation was canceled because the column family has been dropped".

The fix is only one LOC. All other code is related to tests. I added a new test that reproes the failure. I also moved SleepingBackgroundTask to util/testutil.h (because I needed it in column_family_test for my new test). There's plenty of other places where we reimplement SleepingBackgroundTask, but I'll address that in a separate commit.

Test Plan:
1. new test
2. make check
3. Make sure the ColumnFamilyTest.ReadDroppedColumnFamily doesn't fail on Travis: https://travis-ci.org/facebook/rocksdb/jobs/79952386

Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46773
2015-09-15 11:28:44 -07:00
Yoshinori Matsunobu
4886073174 Adding Slice::difference_offset() function
Summary:
There are some use cases in MyRocks to compare two slices
and to return the first byte where they differ. It may be
useful to add it as a RocksDB Slice function.

Test Plan: db_test

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: jkedgar, dhruba

Differential Revision: https://reviews.facebook.net/D46935
2015-09-15 10:32:42 -07:00
sdong
f3170b6f6c DBImpl::FindObsoleteFiles() shouldn't release mutex between getting min_pending_output and scanning files
Summary:
Releasing mutex between getting min_pending_output and scanning files may cause min_pending_output to be max but some non-final files are found in file scanning, ending up with deleting wrong files.
As a recent regression, mutex can be released while waiting for log sync. We move it to after file scanning.

Test Plan: Run all existing tests. Don't think it is easy to write a unit test. Maybe we should find a way to assert lock not released so that we can have some test verification for similar cases.

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, kolmike, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D46899
2015-09-14 23:39:30 -07:00
Islam AbdelRahman
7cb314b9e6 Skip some tests in ROCKSD_LITE
Summary:
Skip these tests under ROCKSDB_LITE

compaction_job_stats_test
corruption_test
transactions/transaction_test

Test Plan: compile using ROCKSDB_LITE

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46923
2015-09-14 16:44:35 -07:00
sdong
5de807ac16 Add options.hard_pending_compaction_bytes_limit to stop writes if compaction lagging behind
Summary: Add an option to stop writes if compaction lefts behind. If estimated pending compaction bytes is more than threshold specified by options.hard_pending_compaction_bytes_liimt, writes will stop until compactions are cleared to under the threshold.

Test Plan: Add unit test DBTest.HardLimit

Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45999
2015-09-14 12:51:16 -07:00
Siying Dong
592f6bf782 Merge pull request #716 from yuslepukhin/refactor_file_reader_writer_win
Refactor to support file_reader_writer on Windows.
2015-09-14 12:29:01 -07:00
Ari Ekmekji
03ddce9a01 Add counters for L0 stall while L0-L1 compaction is taking place
Summary:
Although there are currently counters to keep track of the
stall caused by having too many L0 files, there is no distinction as
to whether when that stall occurs either (A) L0-L1 compaction is taking
place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
at the moment. This diff adds a counter for (A) so that the nature of L0
stalls can be better understood.

Test Plan: make all && make check

Reviewers: sdong, igor, anthony, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D46749
2015-09-14 11:03:37 -07:00
Dmitri Smirnov
ddc8b44998 Address code review comments both GH and internal
Fix compilation issues on GCC/CLANG
 Address Windows Release test build issues due to Sync
2015-09-11 17:36:48 -07:00
Andres Noetzli
34cedaff66 Initialize variable to avoid warning
Summary:
RocksDB debug version failed to build under gcc-4.8.1 on sandcastle with the following error:

```
db/db_compaction_filter_test.cc:570:33: error: ‘snapshot’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
```

Test Plan: make db_compaction_filter_test && ./db_compaction_filter_test

Reviewers: rven, anthony, yhchiang, aekmekji, igor, sdong

Reviewed By: igor, sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46725
2015-09-11 12:07:54 -07:00
Manuel Ung
aeb4612685 Add counters for seek/next/prev
Summary:
There are currently no statistics on seeks, only on gets. This adds the following counters:

rocksdb.number.db.seek
rocksdb.number.db.next
rocksdb.number.db.prev
(number of calls)

rocksdb.db.iterate.bytes.read
(number of bytes read from key + value using seek/next/prev)

rocksdb.number.keys.seek.found
rocksdb.number.keys.next.found
rocksdb.number.keys.prev.found
(number of calls where seek/next/prev found a value)

Test Plan:
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5
./db_bench -statistics -benchmarks fillrandom,seekrandom -seek_nexts 5 -reverse_iterator

Reviewers: yhchiang, rven, kradhakrishnan, IslamAbdelRahman, MarkCallaghan, sdong, igor

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46605
2015-09-11 11:37:44 -07:00
Islam AbdelRahman
45e9e4f0bb Refactor NewTableReader to accept TableReaderOptions
Summary:
Refactoring NewTableReader to accept TableReaderOptions
This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071

Test Plan: run existing tests

Reviewers: igor, yhchiang, anthony, rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46179
2015-09-11 11:36:33 -07:00
Andres Noetzli
ddb950f83f Fixed bug in compaction iterator
Summary:
During the refactoring, the condition that makes sure that compaction
filters are only applied to records newer than the latest snapshot
got butchered. This patch fixes the condition and adds a test case.

Test Plan: make db_compaction_filter_test && ./db_compaction_filter_test

Reviewers: rven, anthony, yhchiang, sdong, aekmekji, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46707
2015-09-11 10:13:49 -07:00
Dmitri Smirnov
30e82d5c41 Refactor to support file_reader_writer on Windows.
Summary. A change https://reviews.facebook.net/differential/diff/224721/
  Has attempted to move common functionality out of platform dependent
  code to a new facility called file_reader_writer.
  This includes:
  - perf counters
  - Buffering
  - RateLimiting

  However, the change did not attempt to refactor Windows code.
  To mitigate, we introduce new quering interfaces such as UseOSBuffer(),
  GetRequiredBufferAlignment() and ReaderWriterForward()
  for pure forwarding where required.
  Introduce WritableFile got a new method Truncate(). This is to communicate
  to the file as to how much data it has on close.
   - When space is pre-allocated on Linux it is filled with zeros implicitly,
    no such thing exist on Windows so we must truncate file on close.
   - When operating in unbuffered mode the last page is filled with zeros but we still want to truncate.

   Previously, Close() would take care of it but now buffer management is shifted to the wrappers and the file has
   no idea about the file true size.

   This means that Close() on the wrapper level must always include
   Truncate() as well as wrapper __dtor should call Close() and
   against double Close().
   Move buffered/unbuffered write logic to the wrapper.
   Utilize Aligned buffer class.
   Adjust tests and implement Truncate() where necessary.
   Come up with reasonable defaults for new virtual interfaces.
   Forward calls for RandomAccessReadAhead class to avoid double
   buffering and locking (double locking in unbuffered mode on WIndows).
2015-09-11 09:57:02 -07:00
Andres Noetzli
c25f6a85bf Removed __unused__ attribute
Summary:
The current build is failing on some platforms due to an __unused__ attribute.
This patch prevents the problem by using a pattern similar to MergeHelper
(assert not on the variable but inside a condition that uses the variable). We
should have better error handling in both cases in the future.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, sdong, igor, aekmekji

Reviewed By: aekmekji

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46623
2015-09-10 15:16:32 -07:00
Ari Ekmekji
6db0a939d2 Fix DBCompactionTest failure with parallel L0-L1 compactions
Summary:
The test SuggestCompactRangeNoTwoLevel0Compactions in
DBCompactionTest fails when there are parallel L0-L1 compactions
taking place because the test makes sure that only one compaction
involving L0 takes place at any given time (since before having
parallel compactions this was impossible). I changed the test to only
run with DBOptions.max_subcompactions=1 so as to not hit this issue
which is not a correctness issue but just an inherent changing of
assumptions after introducing parallel compactions.

This failed after landing https://reviews.facebook.net/D43269#inline-321303
so now this should fix it

Test Plan: make all && make check

Reviewers: yhchiang, igor, anthony, noetzli, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46617
2015-09-10 14:37:00 -07:00
Andres Noetzli
8aa1f15197 Refactored common code of Builder/CompactionJob out into a CompactionIterator
Summary:
Builder and CompactionJob share a lot of fairly complex code. This patch
refactors this code into a separate class, the CompactionIterator. Because the
shared code is fairly complex, this patch hopefully improves maintainability.
While there are is a lot of potential for further improvements, the patch is
intentionally pretty close to the original structure because the change is
already complex enough.

Test Plan: make clean all check && ./db_stress

Reviewers: rven, anthony, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46197
2015-09-10 14:35:25 -07:00
Igor Canadi
95ffc5d2bc Correct ASSERT_OK() in ReadDroppedColumnFamily
Summary: ReadDroppedColumnFamily is consistently failing in Travis CI environment (can't repro locally). I suspect it might be failing with non-OK status. This diff will give us more info about the failure.

Test Plan: none

Reviewers: sdong, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46611
2015-09-10 14:17:12 -07:00
Ari Ekmekji
3c37b3cccd Determine boundaries of subcompactions
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.

The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.

These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.

Test Plan: make all && make check

Reviewers: yhchiang, anthony, igor, noetzli, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43269
2015-09-10 13:50:00 -07:00
krad
1126644082 Relaxing consistency detection to include errors while inserting to memtable as WAL recovery error.
Summary: The current code, considers data to be consistent if the record
checksum passes. We do have customer issues where the record checksum passed but
the data was incomprehensible. There is no way to get out of this error case
since all WAL recovery model will consider this error as unrelated to WAL.

Relaxing the definition and including errors while inserting to memtable as WAL
errors and handing them as per the recovery level.

Test Plan: Used customer dump to verify the fix for different level. The db
opens for kSkipAnyCorruptedRecords and kPointInTimeRecovery, but fails for
kAbsoluteConsistency and kTolerateCorruptedTailRecords.

Reviewers: sdon igor

CC: leveldb@

Task ID: #7918721

Blame Rev:
2015-09-10 12:56:17 -07:00
sdong
abc7f5fdb2 Make DBTest.ReadLatencyHistogramByLevel more robust
Summary: DBTest.ReadLatencyHistogramByLevel was not written as expected. After writes, reads aren't guaranteed to hit data written. It was not expected. Fix it.

Test Plan: Run the test multiple times

Reviewers: IslamAbdelRahman, rven, anthony, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D46587
2015-09-10 11:32:19 -07:00
Igor Canadi
ac9bcb55ce Set max_open_files based on ulimit
Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566

Test Plan:
make check

I will also verify that max_max_open_files is reasonable.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46551
2015-09-10 10:49:28 -07:00
agiardullo
b5b2b75e52 better tuning of arena block size
Summary: Currently, if users didn't set options.arena_block_size, we set "result.arena_block_size = result.write_buffer_size / 10". It makes result.arena_block_size not a multiplier of 4KB, even if options.write_buffer_size is a multiplier of MBs. When calling malloc to arena_block_size, we may waste a small amount of memory for it. We now make the default to be /8 or /16 and align it to 4KB.

Test Plan: unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46467
2015-09-08 20:53:32 -07:00
sdong
342ba80895 Make DBTest.OptimizeFiltersForHits more deterministic
Summary:
This commit makes DBTest.OptimizeFiltersForHits more deterministic by:
(1) make key inserts more random
(2) make sure L0 has one file
(3) make file size smaller compared to level target so L1 will cover more range.

Test Plan: Run the test many times.

Reviewers: rven, IslamAbdelRahman, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D46461
2015-09-08 19:31:34 -07:00
Andres Notzli
e17e92ea19 Relaxed assert in forward iterator
Summary:
It looks like in some cases an assert in SeekInternal failed when computing the
hints for the next level because user_key was the same as the largest key and
not strictly smaller. Relaxing the assert to expect smaller or equal keys.

Test Plan: make clean all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46443
2015-09-08 17:15:11 -07:00
Andres Noetzli
6bdc484fd8 Added Equal method to Comparator interface
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46233
2015-09-08 15:30:49 -07:00
Andres Noetzli
3a0df7f161 Fixed comparison in ForwardIterator when computing hint for GetNextLevelIndex()
Summary: When computing the hint for GetNextLevelIndex(), ForwardIterator was doing a redundant comparison. This patch fixes the comparison (using https://github.com/facebook/rocksdb/blob/master/db/version_set.cc#L158 as a reference) and moves it inside an assert because we expect `level_files[f_idx]` to contain the next key after Seek(), so user_key should always be smaller than the largest key.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: tnovak, sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46227
2015-09-08 09:47:54 -07:00
Venkatesh Radhakrishnan
91f3c90792 Fix case when forward iterator misses a new update
Summary:
This diff fixes a case when the forward iterator misses a new
insert when the mutable iterator is not current. The test is also
improved and the check for deleted iterators is made more informative.

Test Plan: DBTailingIteratorTest.*Trim

Reviewers: tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46167
2015-09-04 14:28:45 -07:00
Andres Noetzli
3c9cef1eed Unified maps with Comparator for sorting, other cleanup
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).

Test Plan: make clean check all

Reviewers: rven, anthony, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45993
2015-09-02 13:58:22 -07:00
sdong
3e0a672c50 Bug fix: table readers created by TableCache::Get() doesn't have latency histogram reported
Summary: TableCache::Get() puts parameters in the wrong places so that table readers created by Get() will not have the histogram updated.

Test Plan: Will write a unit test for that.

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D46035
2015-09-02 12:57:07 -07:00
Tomislav Novak
5508122ed6 Fix a perf regression in ForwardIterator
Summary:
I noticed that memtable iterator usually crosses the `iterate_upper_bound`
threshold when tailing. Changes introduced in D43833 made `NeedToSeekImmutable`
always return true in such case, even when `Seek()` only needs to rewind the
memtable iterator. In a test I ran, this caused the "tailing efficiency"
(ratio of calls to `Seek()` that only affect the memtable versus all seeks)
to drop almost to zero.

This diff attempts to fix the regression by using a different flag to indicate
that `current_` is over the limit instead of resetting `valid_` in
`UpdateCurrent()`.

Test Plan: `DBTestTailingIterator.TailingIteratorUpperBound`

Reviewers: sdong, rven

Reviewed By: rven

Subscribers: dhruba, march

Differential Revision: https://reviews.facebook.net/D45909
2015-09-01 09:54:30 -07:00
Andres Notzli
b722007778 Fix listener_test when using ROCKSDB_MALLOC_USABLE_SIZE
Summary:
Flushes in listener_test happened to early when ROCKSDB_MALLOC_USABLE_SIZE was
active (e.g. when compiling with ROCKSDB_FBCODE_BUILD_WITH_481=1) due to
malloc_usable_size() reporting a better estimate (similar to
https://reviews.facebook.net/D43317 ). This patch grows the write buffer size
slightly to compensate for this.

Test Plan: ROCKSDB_FBCODE_BUILD_WITH_481=1 make listener_test && ./listener_test

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45921
2015-08-31 23:11:12 -07:00
agiardullo
18db1e4695 better db_bench options for transactions
Summary:
Pessimistic Transaction expiration time checking currently causes a performace regression,  Lets disable it in db_bench by default.

Also, in order to be able to better tune how much contention we're simulating, added new optinos to set lock timeout and snapshot.

Test Plan: run db_bench randomtranansaction

Reviewers: sdong, igor, yhchiang, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45831
2015-08-31 15:56:07 -07:00
Ari Ekmekji
8b689546b6 Add Subcompactions to Universal Compaction Unit Tests
Summary:
Now that the approach to parallelizing L0-L1 level-based
compactions by breaking the compaction job into subcompactions is
being extended to apply to universal compactions as well, the unit
tests need to account for this and run the universal compaction
tests with subcompactions both enabled and disabled.

Test Plan: make all && make check

Reviewers: sdong, igor, noetzli, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45657
2015-08-31 12:59:02 -07:00
sdong
3d78eb66bb Arena usage to be calculated using malloc_usable_size()
Summary: malloc_usable_size() gets a better estimation of memory usage. It is already used to calculate block cache memory usage. Use it in arena too.

Test Plan: Run all unit tests

Reviewers: anthony, kradhakrishnan, rven, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43317
2015-08-31 09:39:27 -07:00
Andres Noetzli
effd9dd1e1 Fix deadlock in WAL sync
Summary:
MarkLogsSynced() was doing `logs_.erase(it++);`. The standard is saying:

```
all iterators and references are invalidated, unless the erased members are at an end (front or back) of the deque (in which case only iterators and references to the erased members are invalidated)
```

Because `it` is an iterator to the first element of the container, it is
invalidated, only one iteration is executed and `log.getting_synced = false;`
is not being done, so `while (logs_.front().getting_synced)` in `WriteImpl()`
is not terminating.

Test Plan: make db_bench && ./db_bench --benchmarks=fillsync

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong, tnovak

Reviewed By: tnovak

Subscribers: kolmike, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45807
2015-08-28 18:06:32 -07:00
Andres Noetzli
72a9b73c9e Removed unnecessary checks in DBTest.ApproximateMemoryUsage
Summary:
Just realized that after D45675, part of the code in
DBTest.ApproximateMemoryUsage, does not really test anything anymore, so I
removed it.

Test Plan: make clean all check

Reviewers: rven, igor, sdong, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45783
2015-08-28 11:13:20 -07:00
Venkatesh Radhakrishnan
cb164bfc48 Do not delete iterators for immutable memtables.
Summary:
The immutable memtable iterators are allocated from an arena and there
is no benefit from deleting these. Also the immutable memtables
themselves will continue to be in memory until the version set
containing it is alive. We will not remove immutable memtable iterators
over the upper bound. We now add immutable iterators to the test.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45597
2015-08-28 11:07:07 -07:00
sdong
7a0dbdf3ac Add ZSTD (not final format) compression type
Summary: Add ZSTD compression type. The same way as adding LZ4.

Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45747
2015-08-28 11:01:13 -07:00
Andres Noetzli
e853191c17 Fix DBTest.ApproximateMemoryUsage
Summary:
This patch fixes two issues in DBTest.ApproximateMemoryUsage:
- It was possible that a flush happened between getting the two properties in
  Phase 1, resulting in different numbers for the properties and failing the
  assertion. This is fixed by waiting for the flush to finish before getting
  the properties.
- There was a similar issue in Phase 2 and additionally there was an issue that
  rocksdb.size-all-mem-tables was not monotonically increasing because it was
  possible that a flush happened just after getting the properties and then
  another flush just before getting the properties in the next round. In this
  situation, the reported memory usage decreased. This is fixed by forcing a
  flush before getting the properties.

Note: during testing, I found that kFlushesPerRound does not seem very
accurate. I added a TODO for this and it would be great to get some input on
what to do there.

Test Plan:
The first issue can be made more likely to trigger by inserting a
`usleep(10000);` between the calls to GetIntProperty() in Phase 1.
The second issue can be made more likely to trigger by inserting a
`if (r != 0) usleep(10000);` before the calls to GetIntProperty() and a
`usleep(10000);` after the calls.
Then execute make db_test && ./db_test --gtest_filter=DBTest.ApproximateMemoryUsage

Reviewers: rven, yhchiang, igor, sdong, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45675
2015-08-27 16:17:08 -07:00
Yueh-Hsuan Chiang
8ef0144e2f Add argument --show_table_properties to db_bench
Summary:
Add argument --show_table_properties to db_bench

  -show_table_properties (If true, then per-level table properties will be
    printed on every stats-interval when stats_interval is set and
    stats_per_interval is on.) type: bool default: false

Test Plan:
./db_bench --show_table_properties=1 --stats_interval=100000 --stats_per_interval=1
./db_bench --show_table_properties=1 --stats_interval=100000 --stats_per_interval=1 --num_column_families=2

Sample Output:

    Compaction Stats [column_family_name_000001]
    Level    Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt)  KeyIn KeyDrop
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
      L0      3/0          5   0.8      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0     86.3         0        17    0.021          0       0      0
      L1      5/0          9   0.9      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000          0       0      0
      L2      9/0         16   0.2      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000          0       0      0
     Sum     17/0         31   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     86.3         0        17    0.021          0       0      0
     Int      0/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     83.9         0         2    0.022          0       0      0
    Flush(GB): cumulative 0.030, interval 0.004
    Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

    Level[0]: # data blocks=2571; # entries=84813; raw key size=2035512; raw average key size=24.000000; raw value size=8481300; raw average value size=100.000000; data block size=5690119; index block size=82415; filter block size=0; (estimated) table size=5772534; filter policy name=N/A;
    Level[1]: # data blocks=4285; # entries=141355; raw key size=3392520; raw average key size=24.000000; raw value size=14135500; raw average value size=100.000000; data block size=9487353; index block size=137377; filter block size=0; (estimated) table size=9624730; filter policy name=N/A;
    Level[2]: # data blocks=7713; # entries=254439; raw key size=6106536; raw average key size=24.000000; raw value size=25443900; raw average value size=100.000000; data block size=17077893; index block size=247269; filter block size=0; (estimated) table size=17325162; filter policy name=N/A;
    Level[3]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
    Level[4]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
    Level[5]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
    Level[6]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;

Reviewers: anthony, IslamAbdelRahman, MarkCallaghan, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45651
2015-08-26 18:27:23 -07:00
Igor Canadi
5f4166c90e ReadaheadRandomAccessFile -- userspace readahead
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.

We add ReadaheadRandomAccessFile layer only when file is read during compactions.

D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.

Test Plan: make check

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45123
2015-08-26 15:25:59 -07:00
sdong
d286b5df90 DBIter to out extra keys with higher sequence numbers when changing direction from forward to backward
Summary:
When DBIter changes iterating direction from forward to backward, it might see some much larger keys with higher sequence ID. With this commit, these rows will be actively filtered out. It should fix existing disabled tests in db_iter_test.

This may not be a perfect fix, but it introduces least impact on existing codes, in order to be safe.

Test Plan:
Enable existing tests and make sure they pass. Add a new test DBIterWithMergeIterTest.InnerMergeIteratorDataRace8.
Also run all existing tests.

Reviewers: yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45567
2015-08-26 13:01:39 -07:00
Andres Noetzli
3795449c9d Fix DBTest.GetProperty
Summary:
DBTest.GetProperty was failing occasionally (see task #8131266). The reason was
that the test closed the database before the compaction was done. When the test
reopened the database, RocksDB would schedule a compaction which in turn
created table readers and lead the test to fail the assertion that
rocksdb.estimate-table-readers-mem is 0. In most cases, GetIntProperty() of
rocksdb.estimate-table-readers-mem happened before the compaction created the
table readers, hiding the problem. This patch changes the
WaitForFlushMemTable() to WaitForCompact(). WaitForFlushMemTable() is not
necessary because it is already being called a couple of lines before without
any insertions in-between.

Test Plan:
Insert `usleep(10000);` just after `Reopen(options);` on line 2333 to make the issue more likely, then run:
make db_test && while ./db_test --gtest_filter=DBTest.GetProperty; do true; done

Reviewers: rven, yhchiang, anthony, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45603
2015-08-26 10:10:26 -07:00
Igor Canadi
a7834a1292 Merge pull request #698 from yuslepukhin/address_noexcept_windows
Address noexcept and const integer lambda capture on win
2015-08-25 17:15:23 -07:00
Dmitri Smirnov
6924d7582b Address noexcept and const integer lambda capture
VS 2013 does not support noexcept.
   Complains about usage of ineteger constant within lambda requiring explicit capture.
2015-08-25 15:17:14 -07:00
Ari Ekmekji
2f8d71ec05 Moving sequence number compaction variables from SubCompactionState to CompactionJob
Summary:
It was pointed out to me that the members of SubCompactionState
'earliest_snapshot', 'latest_snapshot' and 'visible_at_tip' are never
modified by the subcompactions, so they can stay as global varaibles
instead to make things simpler.

Test Plan: make all && make check

Reviewers: sdong, igor, noetzli, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45477
2015-08-25 14:03:10 -07:00
Venkatesh Radhakrishnan
bab9934d9e Fix build failure caused by bad merge.
Summary: There was a bad merge during refresh.

Test Plan: make -j all; make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45555
2015-08-25 14:02:03 -07:00
Venkatesh Radhakrishnan
4d28a7d8ab Add a whitebox test for deleted file iterators.
Summary:
We have earlier added a feature to delete file iterators when the
current key is over the iterate upper bound. We now add a whitebox test
to check if the file iterators were actually deleted.

Test Plan: Add check for a range which has deleted iterators.

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45321
2015-08-25 13:40:58 -07:00
Venkatesh Radhakrishnan
249fb4f881 Fix use of deleted file iterators with incomplete iterators
Summary:
After deleting file iterators which are over the iterate upper
bound, we also need to check for null pointers in
ResetIncompletIterators.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45525
2015-08-25 13:38:35 -07:00
Andres Notzli
09d982f9e0 Fix compact_files_example
Summary:
See task #7983654. The example was triggering an assert in compaction job
because the compaction was not marked as manual. With this patch,
CompactionPicker::FormCompaction() marks compactions as manual. This patch
also fixes a couple of typos, adds optimistic_transaction_example to
.gitignore and librocksdb as a dependency for examples. Adding librocksdb as
a dependency makes sure that the examples are built with the latest changes
in librocksdb.

Test Plan: make clean && cd examples && make all && ./compact_files_example

Reviewers: rven, sdong, anthony, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45117
2015-08-25 12:29:44 -07:00
Yueh-Hsuan Chiang
6996de87af Expose per-level aggregated table properties via GetProperty()
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.

Test Plan: Added tests in db_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45087
2015-08-25 12:03:54 -07:00
Andres Noetzli
2050832974 Fixing race condition in DBTest.DynamicMemtableOptions
Summary:
This patch fixes a race condition in DBTEst.DynamicMemtableOptions. In rare cases,
it was possible that the main thread would fill up both memtables before the flush
job acquired its work. Then, the flush job was flushing both memtables together,
producing only one L0 file while the test expected two. Now, the test waits for
flushes to finish earlier, to make sure that the memtables are flushed in separate
flush jobs.

Test Plan:
Insert "usleep(10000);" after "IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);" in BGWorkFlush()
to make the issue more likely. Then test with:
make db_test && time while ./db_test --gtest_filter=*DynamicMemtableOptions; do true; done

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45429
2015-08-24 17:04:18 -07:00
Igor Canadi
e46bcc08b9 Remove an extra 's' from cur-size-all-mem-tabless
Summary: As title

Test Plan: make check

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45447
2015-08-24 16:43:18 -07:00
Igor Canadi
4ab26c5ad1 Smarter purging during flush
Summary:
Currently, we only purge duplicate keys and deletions during flush if `earliest_seqno_in_memtable <= newest_snapshot`. This means that the newest snapshot happened before we first created the memtable. This is almost never true for MyRocks and MongoRocks.

This patch makes purging during flush able to understand snapshots. The main logic is copied from compaction_job.cc, although the logic over there is much more complicated and extensive. However, we should try to merge the common functionality at some point.

I need this patch to implement no_overwrite_i_promise functionality for flush. We'll also need this to support SingleDelete() during Flush(). @yoshinorim requested the feature.

Test Plan:
make check
I had to adjust some unit tests to understand this new behavior

Reviewers: yhchiang, yoshinorim, anthony, sdong, noetzli

Reviewed By: noetzli

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42087
2015-08-24 11:11:12 -07:00
Ari Ekmekji
b6def58f73 Changed 'num_subcompactions' to the more accurate 'max_subcompactions'
Summary:
Up until this point we had DbOptions.num_subcompactions, but
it is semantically more correct to call this max_subcompactions since
we will schedule *up to* DbOptions.max_subcompactions smaller compactions
at a time during a compaction job.

I also added a --subcompactions option to db_bench

Test Plan: make all   make check

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D45069
2015-08-21 14:25:34 -07:00
sdong
c852968465 db_iter_test: add more test cases for the data race bug
Summary: Add more test cases of data race causing wrong iterating results. Tag tests not passing as DISABLED_

Test Plan: Run the tests

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44907
2015-08-21 12:14:12 -07:00
sdong
9130873a13 Add options.new_table_reader_for_compaction_inputs
Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.

Test Plan: Add the option.

Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang

Reviewed By: igor

Subscribers: igor, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43311
2015-08-21 08:46:29 -07:00
sdong
07d2d34160 Add a counter about estimated pending compaction bytes
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use threshold of this counter to replace soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes will be easier for users to reason about when should slow down and stopping than more abstract soft and hard rate limits.

Test Plan: Add unit tests

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44205
2015-08-20 22:17:10 -07:00
Yueh-Hsuan Chiang
a203b913c1 Fixed a rare deadlock in DBTest.ThreadStatusFlush
Summary:
Currently, ThreadStatusFlush uses two sync-points to ensure
there's a flush currently running when calling GetThreadList().
However, one of the sync-point is inside db-mutex, which could
cause deadlock in case there's a DB::Get() call.

This patch fix this issue by moving the sync-point to a better
place where the flush job does not hold the mutex.

Test Plan: db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45045
2015-08-20 17:18:47 -07:00
Siying Dong
962aa64292 Merge pull request #695 from yuslepukhin/address_windows_build
Address windows build issues caused by introducing Subcompaction
2015-08-20 17:04:48 -07:00
Dmitri Smirnov
5bf8907622 More indent adjustment. 2015-08-20 14:14:02 -07:00
Dmitri Smirnov
e2a9f43d64 Adjust indent 2015-08-20 14:10:51 -07:00
Dmitri Smirnov
1cac89c9b1 Address windows build issues
Intro SubCompactionState move functionality
 =delete copy functionality
 #ifdef SyncPoint in tests for Windows Release builds
2015-08-20 14:08:24 -07:00
Islam AbdelRahman
027ca5b2cd Total SST files size DB Property
Summary: Add a new DB property that calculate the total size of files used by all RocksDB Versions

Test Plan: Unittests for the new property

Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44799
2015-08-20 11:47:19 -07:00
Andres Noetzli
b604d2562f Removing unused variables to fix build
Summary: Removing two unused variables that prevented compilation.

Test Plan: make all

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44991
2015-08-19 16:57:40 -07:00
Venkatesh Radhakrishnan
1b114eed4d Free file iterators for files which are above the iterate upper bound to Improve memory utilization
Summary:
This diff improves the memory utilization for tailing iterators RocksDB,
by freeing file iterators which are over the upper bound.
It is an updating on Siying's original diff for improving the memory usage for
tailing iterators. The changes for the seek and next path are now complete
and a test has been added to exercise these paths while deleting file iterators
which are above the upper bound.

Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext

Reviewers: march, tnovak, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43833
2015-08-19 16:05:51 -07:00
Islam AbdelRahman
3fd70b05b8 Rate limit deletes issued by DestroyDB
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately

Test Plan: added a unittest

Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong

Reviewed By: sdong

Subscribers: jeanxu2012, dhruba

Differential Revision: https://reviews.facebook.net/D44955
2015-08-19 15:02:17 -07:00
Yueh-Hsuan Chiang
df79eafcb3 Introduce GetIntProperty("rocksdb.size-all-mem-tables")
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage by those memtables which have not yet been flushed.

This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage by all the memtables, includes those
have been flushed but pinned by iterators.

Test Plan: Added a test in db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44229
2015-08-19 13:32:09 -07:00
sdong
888fbdc889 Remove the contstaint that iterator upper bound needs to be within a prefix
Summary: There is a check to fail the iterator if prefix extractor is specified but upper bound is out of the prefix for the seek key. Relax this constraint to allow users to set upper bound to the next prefix of the current one.

Test Plan: make commit-prereq

Reviewers: igor, anthony, kradhakrishnan, yhchiang, rven

Reviewed By: rven

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44949
2015-08-19 11:03:51 -07:00
Ari Ekmekji
137c376675 Removing variables used only in assertions to prevent build error
Summary:
A couple variables were declared but only used in assertions
which causes issues when building in fbcode.

Test Plan: make dbg  and   make release

Reviewers: yhchiang, sdong, igor, anthony, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44937
2015-08-19 08:52:22 -07:00
Ari Ekmekji
b47cc58516 Bounding Number of Subcompactions
Summary:
In D43239 (https://reviews.facebook.net/D43239) the number
of subcompactions is set based on the number of L1 files with
unique starting keys. In certain cases when this number is very large
this causes issues, particularly with the overlap between files since
very small output files can be generated. This diff bounds the number
of subcompactions to the user option DBOption.num_subcompactions.

Test Plan: ./db_test ./db_compaction_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44883
2015-08-18 14:56:31 -07:00
Venkatesh Radhakrishnan
e58e1b18e7 Make tailing iterator show new entries in memtable.
Summary:
Reseek mutable_iter if it is invalid in Next and immutable_iter
is invalid.

Test Plan: DBTestTailingIterator.TailingIteratorSeekToNext

Reviewers: tnovak, march, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D44865
2015-08-18 14:40:06 -07:00
Ari Ekmekji
601b1aaca0 Fixing Failed Assertion in Subcompaction State Diff
Summary:
In D43239 (https://reviews.facebook.net/D43239) there is an
assertion to make sure a subcompaction's output is never empty at the
end of execution. This assertion however breaks the build because some
tests lead to exactly that scenario. So instead I have altered the logic
to handle this case instead of just failing the assertion.

The reason that it is possible for a subcompaction's output to be empty is
that during a sequential execution of subcompactions, if a user aborts the
compaction job then some of the later subcompactions to be executed may
have yet to process any keys and therefore have yet to generate output files.
This becomes very rare once the subcompactions are executed in parallel,
but for now they are still sequential so the case is possible when there is an
early termination, as in some of the tests.

Test Plan: ./db_test  ./db_compaction_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D44877
2015-08-18 12:27:12 -07:00
Ari Ekmekji
f0da6977a3 [Parallel L0-L1 Compaction Prep]: Giving Subcompactions Their Own State
Summary:
In prepration for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.

Test Plan: ./db_test  ./db_compaction_test  ./compaction_job_test

Reviewers: sdong, anthony, igor, noetzli, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43239
2015-08-18 11:06:23 -07:00
Andres Notzli
f32a572099 Simplify querying of merge results
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.

Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentially producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.

Test Plan: make clean all check

Reviewers: rven, sdong, yhchiang, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43353
2015-08-17 17:34:38 -07:00
sdong
72613657f0 Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.

Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected

Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44193
2015-08-14 17:32:42 -07:00
Nathan Bronson
b7198c3afe reduce db mutex contention for write batch groups
Summary:
This diff allows a Writer to join the next write batch group
without acquiring any locks. Waiting is performed via a per-Writer mutex,
so all of the non-leader writers never need to acquire the db mutex.
It is now possible to join a write batch group after the leader has been
chosen but before the batch has been constructed. This diff doesn't
increase parallelism, but reduces synchronization overheads.

For some CPU-bound workloads (no WAL, RAM-sized working set) this can
substantially reduce contention on the db mutex in a multi-threaded
environment.  With T=8 N=500000 in a CPU-bound scenario (see the test
plan) this is good for a 33% perf win.  Not all scenarios see such a
win, but none show a loss.  This code is slightly faster even for the
single-threaded case (about 2% for the CPU-bound scenario below).

Test Plan:
1. unit tests
2. COMPILE_WITH_TSAN=1 make check
3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000

Reviewers: sdong, igor, rven, ljin, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43887
2015-08-14 10:55:43 -07:00
sdong
603b6da8b8 Add options.compaction_measure_io_stats to print write I/O stats in compactions
Summary:
Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs:

2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}

Add two more counters in iostats_context.

Also add a parameter of db_bench.

Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44115
2015-08-13 16:52:26 -07:00
sdong
4637207120 Add test case to repro the mispositional iterator in a low-chance data race case
Summary: Iterator has a bug: if a child iterator reaches its end, and user issues a Prev(), and just before SeekToLast() of the child iterator is called, some extra rows is added in the end, the position of iterator can be misplaced.

Test Plan: Run the tests with or without valgrind

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43671
2015-08-12 10:50:52 -07:00
agiardullo
0db807ec28 Transaction error statuses
Summary:
Based on feedback from spetrunia, we should better differentiate error statuses for transaction failures.

https://github.com/MySQLOnRocksDB/mysql-5.6/issues/86#issuecomment-124605954

Test Plan: unit tests

Reviewers: rven, kradhakrishnan, spetrunia, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43323
2015-08-11 17:52:56 -07:00
agiardullo
c2f2cb0214 Pessimistic Transactions
Summary:
Initial implementation of Pessimistic Transactions.  This diff contains the api changes discussed in D38913.  This diff is pretty large, so let me know if people would prefer to meet up to discuss it.

MyRocks folks:  please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.

Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint().  After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex.  We can then decide which route is preferable.

Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.

Test Plan: Unit tests, db_bench parallel testing.

Reviewers: igor, rven, sdong, yhchiang, yoshinorim

Reviewed By: sdong

Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40869
2015-08-11 17:52:23 -07:00
Islam AbdelRahman
c2868cbc52 Use manual_compaction for compaction_job_test
Summary:
Under certain conditions (disable compression) the compactions that are created in compaction_job_test will pass the trivial_move conditions
This will cause problems since we assert that we dont run a compaction if it's a trivial move
https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L144-L147

for example when we disable compression, compactions become a valid trivial move and the assert fails
https://ci-builds.fb.com/view/rocksdb/job/rocksdb_no_compression/180/console

Test Plan: compaction_job_test

Reviewers: sdong, yhchiang, noetzli, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43983
2015-08-11 14:47:14 -07:00
Islam AbdelRahman
cee1e8a080 Parallelize LoadTableHandlers
Summary: Add a new option that all LoadTableHandlers to use multiple threads to load files on DB Open and Recover

Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)

Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43755
2015-08-11 12:19:56 -07:00
Andres Notzli
4249f159d5 Removing duplicate code in db_bench/db_stress, fixing typos
Summary:
While working on single delete support for db_bench, I realized that
db_bench/db_stress contain a bunch of duplicate code related to
copmression and found some typos. This patch removes duplicate code,
typos and a redundant #ifndef in internal_stats.cc.

Test Plan: make db_stress && make db_bench && ./db_bench --benchmarks=compress,uncompress

Reviewers: yhchiang, sdong, rven, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43965
2015-08-11 11:46:15 -07:00
Nathan Bronson
1ae27113c7 reduce comparisons by skiplist
Summary:
Key comparison is the single largest CPU user for CPU-bound
workloads. This diff reduces the number of comparisons in two ways.

The first is that it moves predecessor array gathering from
FindGreaterOrEqual to FindLessThan, so that FindGreaterOrEqual can
return immediately if compare_ returns 0.  As part of this change I
moved the sequential insertion optimization into Insert, to remove the
undocumented (and smelly) requirement that prev must be equal to prev_
if it is non-null.

The second optimization is that all of the search functions skip calling
compare_ when moving to a lower level that has the same Next pointer.
With a branching factor of 4 we would expect this to happen 1/4 of
the time.

On a single-threaded CPU-bound workload (-benchmarks=fillrandom -threads=1
-batch_size=1 -memtablerep=skip_list -value_size=0 --num=1600000
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000)
on my dev server this is good for a 7% perf win.

Test Plan: unit tests

Reviewers: rven, ljin, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43233
2015-08-11 11:25:22 -07:00
Islam AbdelRahman
a9dcc0a638 Fix clang build
Summary:
https://ci-builds.fb.com/view/rocksdb/job/rocksdb_clang_build/893/console
Fixing clang build

Test Plan:
make clean
USE_CLANG=1 make all -j64

Reviewers: sdong, noetzli, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43959
2015-08-10 11:30:36 -07:00
Andres Notzli
68f934355a Better CompactionJob testing
Summary:
Changed compaction_job_test to support better/more thorough
tests and added two tests. Also changed MockFileContents
to order using InternalKeyComparator.

Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check

Reviewers: sdong, rven, igor, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42837
2015-08-07 21:59:51 -07:00
agiardullo
16ea1c7d1c simple ManagedSnapshot wrapper
Summary: Implemented this simple wrapper for something else I was working on.  Seemed like it makes sense to expose it instead of burying it in some random code.

Test Plan: added test

Reviewers: rven, kradhakrishnan, sdong, yhchiang

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43293
2015-08-06 17:59:05 -07:00
sdong
6a4aaadcd7 Avoid type unique_ptr in LogWriterNumber::writer for Windows build break
Summary:
Visual Studio complains about deque<LogWriterNumber> because LogWriterNumber is non-copyable for its unique_ptr member writer. Move away from it, and do explit free.
It is less safe but I can't think of a better way to unblock it.

Test Plan: valgrind check test

Reviewers: anthony, IslamAbdelRahman, kolmike, rven, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43647
2015-08-06 10:52:41 -07:00
Andres Noetzli
d7314ba759 Fixing endless loop if seeking to end of key with seq num 0
Summary:
When seeking to the last occurrence of a key with sequence number 0, db_iter
ends up in an endless loop because it seeks to type kValueTypeForSeek
which is larger than kTypeDeletion/kTypeValue. Added test case that triggers
the behavior.

Test Plan: make clean all check

Reviewers: igor, rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43653
2015-08-06 10:43:28 -07:00
Islam AbdelRahman
29b028b0ed Make DeleteScheduler tests more reliable
Summary: Update DeleteScheduler tests so that they verify the used penalties for waiting instead of measuring the time spent which is not reliable

Test Plan:
make -j64 delete_scheduler_test && ./delete_scheduler_test
COMPILE_WITH_TSAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test
COMPILE_WITH_ASAN=1 make -j64 delete_scheduler_test && ./delete_scheduler_test

make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
COMPILE_WITH_TSAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"
COMPILE_WITH_ASAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.RateLimitedDelete:DBTest.DeleteSchedulerMultipleDBPaths"

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43635
2015-08-05 19:16:52 -07:00
Poornima Chozhiyath Raman
7d364d0d94 Fix build failure
Summary: fix the build failure

Test Plan: make all

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43623
2015-08-05 16:38:12 -07:00
Poornima Chozhiyath Raman
960d936e83 Add function 'GetInfoLogList()'
Summary: The list of info log files of a db can be obtained using the new function.

Test Plan: New test in db_test.cc passed.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: IslamAbdelRahman, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41715
2015-08-05 16:16:46 -07:00
sdong
7ccd1c80a7 Add two unit tests for SyncWAL()
Summary:
Add two unit tests for SyncWAL(). One makes sure SyncWAL() doesn't block writes in the other thread. Another one makes sure SyncWAL() doesn't wait ongoing writes to finish before being executed.

Create a new test file db_wal_test and move two WAL related tests from db_test to here.

Test Plan: Run the new tests

Reviewers: IslamAbdelRahman, rven, kradhakrishnan, kolmike, tnovak, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43605
2015-08-05 14:27:02 -07:00
sdong
3ae386eafe Add statistic histogram "rocksdb.sst.read.micros"
Summary: Measure read latency histogram and put in statistics. Compaction inputs are excluded from it when possible (unfortunately usually no possible as we usually take table reader from table cache.

Test Plan:
Run db_bench and it shows the stats, like:

rocksdb.sst.read.micros statistics Percentiles :=> 50 : 1.238522 95 : 2.529740 99 : 3.912180

Reviewers: kradhakrishnan, rven, anthony, IslamAbdelRahman, MarkCallaghan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43275
2015-08-05 13:02:33 -07:00
Islam AbdelRahman
9aec75fbb9 Enable DBTest.FlushSchedule under TSAN
Summary: This patch will fix the false positive of DBTest.FlushSchedule under TSAN, we dont need to disable this test

Test Plan: COMPILE_WITH_TSAN=1 make -j64 db_test && ./db_test --gtest_filter="DBTest.FlushSchedule"

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43599
2015-08-05 11:47:07 -07:00
sdong
8e01bd1144 Fix misplaced position for reversing iterator direction while current key is a merge
Summary:
While doing forward iterating, if current key is merge, internal iterator position is placed to the next key. If Prev() is called now, needs to do extra Prev() to recover the location.
This is second attempt of fixing after reverting ec70fea4c4. This time shrink the fix to only merge key is the current key and avoid the reseeking logic for max_iterating skipping

Test Plan: enable the two disabled tests and make sure they pass

Reviewers: rven, IslamAbdelRahman, kradhakrishnan, tnovak, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43557
2015-08-05 11:08:50 -07:00
Andres Notzli
c465071029 Removing duplicate code
Summary:
While working on https://reviews.facebook.net/D43179 , I found
duplicate code in the tests. This patch removes it.

Test Plan: make clean all check

Reviewers: igor, sdong, rven, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43263
2015-08-05 07:33:27 -07:00
Mike Kolupaev
e06cf1a098 [wal changes 3/3] method in DB to sync WAL without blocking writers
Summary:
Subj. We really need this feature.

Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.

Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba

Differential Revision: https://reviews.facebook.net/D40905
2015-08-05 06:06:39 -07:00
Ari Ekmekji
5dc3e6881a Update Tests To Enable Subcompactions
Summary:
Updated DBTest DBCompactionTest and CompactionJobStatsTest
to run compaction-related tests once with subcompactions enabled and
once disabled using the TEST_P test type in the Google Test suite.

Test Plan: ./db_test  ./db_compaction-test  ./compaction_job_stats_test

Reviewers: sdong, igor, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43443
2015-08-04 22:19:07 -07:00
Islam AbdelRahman
c45a57b41e Support delete rate limiting
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.

I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile

Test Plan:
added delete_scheduler_test
existing unit tests

Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D43221
2015-08-04 20:45:27 -07:00
Yueh-Hsuan Chiang
241bb2aef3 Make DBCompactionTest.SkipStatsUpdateTest more stable.
Summary:
Make DBCompactionTest.SkipStatsUpdateTest more stable by
removing flaky but unnecessary assertion on the size of db
as simply checking the random file open count is suffice.

Test Plan: db_compaction_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43533
2015-08-04 15:47:05 -07:00
Yueh-Hsuan Chiang
14d0bfa429 Add DBOptions::skip_sats_update_on_db_open
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.

This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.

Test Plan: Add DBCompactionTest.SkipStatsUpdateTest

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42843
2015-08-04 13:48:16 -07:00
Venkatesh Radhakrishnan
20b244fcca Fix CompactFiles by adding all necessary files
Summary:
The compact files API had a bug where some overlapping files
are not added. These are files which overlap with files which were
added to the compaction input files, but not to the original set of
input files. This happens only when there are more than two levels
involved in the compaction. An example will illustrate this better.

Level 2 has 1 input file 1.sst which spans [20,30].

Level 3 has added file  2.sst which spans [10,25]

Level 4 has file 3.sst which spans [35,40] and
        input file 4.sst which spans [46,50].

The existing code would not add 3.sst to the set of input_files because
it only becomes an overlapping file in level 4 and it wasn't one in
level 3.

When installing the results of the compaction, 3.sst would overlap with
output file from the compact files and result in the assertion in
version_set.cc:1130

 // Must not overlap
   assert(level <= 0 || level_files->empty() ||
            internal_comparator_->Compare(
                (*level_files)[level_files->size() - 1]->largest, f->smallest) <
                0);
This change now adds overlapping files from the current level to the set
of input files also so that we don't hit the assertion above.

Test Plan:
d=/tmp/j; rm -rf $d; seq 1000 | parallel --gnu --eta
'd=/tmp/j/d-{}; mkdir -p $d; TEST_TMPDIR=$d ./db_compaction_test
--gtest_filter=*CompactilesOnLevel* --gtest_also_run_disabled_tests >&
'$d'/log-{}'

Reviewers: igor, yhchiang, sdong

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43437
2015-08-03 15:53:22 -07:00
Venkatesh Radhakrishnan
87df6295dd Make SuggestCompactRangeNoTwoLevel0Compactions deterministic
Summary:
Made SuggestCompactRangeNoTwoLevel0Compactions by forcing
a flush after generating a file and waiting for compaction at the end.

Test Plan: Run SuggestCompactRangeNoTwoLevel0Compactions

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43449
2015-08-03 15:52:52 -07:00
Ari Ekmekji
40c64434d4 Parallelize L0-L1 Compaction: Restructure Compaction Job
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.

This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.

The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are

  # Correct aggregation of compaction job statistics across multiple threads
  # Proper opening/closing of output files (make sure each thread's is unique)
  # Keys that span multiple L1 files
  # Skewed distributions of keys within L0 files

Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test

Reviewers: igor, noetzli, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42699
2015-08-03 11:32:14 -07:00
Andres Notzli
193dc977e7 Fixing dead code in table_properties_collector_test
Summary:
There was a bug in table_properties_collector_test that this patch
is fixing: `!backward_mode && !test_int_tbl_prop_collector` in
TestCustomizedTablePropertiesCollector was never true, so the code
in the if-block never got executed. The reason is that the
CustomizedTablePropertiesCollector test was skipping tests with
`!backward_mode_ && !encode_as_internal`. The reason for skipping
the tests is unknown.

Test Plan: make table_properties_collector_test && ./table_properties_collector_test

Reviewers: rven, igor, yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43281
2015-07-30 16:59:03 -07:00
agiardullo
8161bdb5a0 WriteBatch Save Points
Summary:
Support RollbackToSavePoint() in WriteBatch and WriteBatchWithIndex.  Support for partial transaction rollback is needed for MyRocks.

An alternate implementation of Transaction::RollbackToSavePoint() exists in D40869.  However, the other implementation is messier because it is implemented outside of WriteBatch.  This implementation is much cleaner and also exposes a potentially useful feature to WriteBatch.

Test Plan: Added unit tests

Reviewers: IslamAbdelRahman, kradhakrishnan, maykov, yoshinorim, hermanlee4, spetrunia, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42723
2015-07-29 16:54:23 -07:00
Andres Notzli
d06c82e477 Further cleanup of CompactionJob and MergeHelper
Summary:
Simplified logic in CompactionJob and removed unused parameter in
MergeHelper.

Test Plan: make && make check

Reviewers: rven, igor, sdong, yhchiang

Reviewed By: sdong

Subscribers: aekmekji, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42687
2015-07-28 19:21:55 -07:00
Andres Notzli
e95c59cd2f Count number of corrupt keys during compaction
Summary:
For task #7771355, we would like to log the number of corrupt keys
during a compaction. This patch implements and tests the count
as part of CompactionJobStats.

Test Plan: make && make check

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42921
2015-07-28 16:41:40 -07:00
Poornima Chozhiyath Raman
1bdfcef7bf Fix when output level is 0 of universal compaction with trivial move
Summary: Fix for universal compaction with trivial move, when the ouput level is 0. The tests where failing. Fixed by allowing normal compaction when output level is 0.

Test Plan: modified test cases run successfully.

Reviewers: sdong, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: anthony, kradhakrishnan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42933
2015-07-27 14:25:57 -07:00
sdong
82f148ef97 Fix test DBCompactionTest.PartialCompactionFailure undeterministic failure
Summary: DBCompactionTest.PartialCompactionFailure has a risk that one flush job writes out two mem tables into one file, so that the total files flushed are less than expected. Fix it by writing for flush to finish after every write.

Test Plan: Run the test

Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42831
2015-07-22 13:46:56 -07:00
Mike Kolupaev
4922af6f8d fixed DBTest.GetPropertiesOfAllTablesTest and DBTest.GetUserDefinedTablaProperties flakiness
Summary: These tests used to fail if a compaction happened between flushing tables and enumerating them to get properties.

Test Plan: this reports occasional failures without this diff and no failures with it: `for i in {1..10000}; do echo $i; done | parallel --gnu -j100 'TEST_TMPDIR=`TMPDIR=/dev/shm/rockstemp mktemp -d -t` ./db_test --gtest_filter=DBTest.GetUserDefinedTablaProperties >&/dev/null || echo {} failed'`

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42861
2015-07-22 12:37:49 -07:00
Mike Kolupaev
fe09a6dae3 [wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss
Summary:
I'll just copy internal task summary here:

"
This sequence will cause data loss in the middle after an sync write:

non-sync write key 1
flush triggered, not yet scheduled
sync write key 2
system crash

After rebooting, users might see key 2 but not key 1, which violates the API of sync write.

This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest.

One way to fix it is for a sync write, if there is outstanding unsynced log files, we need to syc them too.
"

This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler.

Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40899
2015-07-22 03:28:08 -07:00
Andres Notzli
06aebca592 Report live data size estimate
Summary:
Fixes T6548822. Added a new function for estimating the size of the live data
as proposed in the task. The value can be accessed through the property
rocksdb.estimate-live-data-size.

Test Plan:
There are two unit tests in version_set_test and a simple test in db_test.
make version_set_test && ./version_set_test;
make db_test && ./db_test gtest_filter=GetProperty

Reviewers: rven, igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41493
2015-07-21 21:33:20 -07:00
sdong
02b635fa38 Fix undeterministic failure of DBTest.GetPropertiesOfAllTablesTest
Summary: DBTest.GetPropertiesOfAllTablesTest generates four files and expects four files there, but a L0->L1 comapction can trigger to compact to one single file. Fix it by raising level 0 number of file compaction trigger

Test Plan: Run it many times and see it never fails.

Reviewers: kradhakrishnan, IslamAbdelRahman, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42789
2015-07-21 17:13:23 -07:00
Yueh-Hsuan Chiang
7219088cda Move general compaction tests from db_test.cc to db_compaction_test.cc
Summary: Move general compaction tests from db_test.cc to db_compaction_test.cc

Test Plan:
db_test
db_compaction_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42651
2015-07-21 03:05:57 -07:00
agiardullo
064294081b Improved FileExists API
Summary: Add new CheckFileExists method.  Considered changing the FileExists api but didn't want to break anyone's builds.

Test Plan: unit tests

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42003
2015-07-20 17:20:40 -07:00
Yueh-Hsuan Chiang
443c6646bc Move remaining universal compaction tests from db_test.cc to db_universal_compaction_test.cc
Summary: Move remaining universal compaction tests from db_test.cc to db_universal_compaction_test.cc

Test Plan:
db_test
db_universal_compaction_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42645
2015-07-20 16:07:54 -07:00
Yueh-Hsuan Chiang
7462286d33 Move in-place-update related tests from db_test.cc to db_inplace_update_test.cc
Summary: Move in-place-update related tests from db_test.cc to db_inplace_update_test.cc

Test Plan:
db_test
db_inplace_update_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42657
2015-07-20 16:05:28 -07:00
Yueh-Hsuan Chiang
331954ab83 Fixed DBTestUniversalManualCompactionOutputPathId test
Summary:
Fixed DBTestUniversalManualCompactionOutputPathId test
by changing the expected number of files when setting up
the test as flushes no-longer preempt compactions
in patch https://reviews.facebook.net/D41931.

Also, include db_universal_copaction_test in make all check.

Test Plan: db_universal_copaction_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42639
2015-07-20 13:53:41 -07:00
krad
a75f23eb87 Relax assertions in unit DropWrites to be more permissible
Summary: This unit test is blocking our release since it fails under certain
compiler versions. The failure is due to a race in the unit test and not the
core functionality.

Test Plan: Run locally

Reviewers: sdong

CC: leveldb

Task ID: #7760955

Blame Rev:
2015-07-20 12:37:54 -07:00
sdong
ee80432ff8 db_bench add an option of --universal_allow_trivial_move
Summary: Now we allow trivial move in universal compaction. Add a parameter in db_bench

Test Plan: Run db_bench with this option on and off and make sure the option is switched correctly.

Reviewers: yhchiang, igor, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41427
2015-07-20 11:56:12 -07:00
sdong
58b4209e05 Trigger non-trivial compaction in fault_injection_test
Summary: Now the major test cases of fault_injection_test only insert keys in sorted order so compactions will be trivial move. Add a new mode to insert in non-sequential order to trigger non-trivial compactions.

Test Plan: Run the test

Reviewers: kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42435
2015-07-20 11:51:57 -07:00
Islam AbdelRahman
aa8ac6445b Skip unsupported tests in ROCKSDB_LITE
Summary:
Skipping these tests in ROCKSDB_LITE since they are not supported
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test

Test Plan:
json_document_test
wal_manager_test
ttl_test
sst_dump_test
deletefile_test
compact_files_test
prefix_test
checkpoint_test

Reviewers: igor, sdong, yhchiang, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42573
2015-07-20 11:24:54 -07:00
Islam AbdelRahman
c06d1d8393 Make merge_test runnable in ROCKSDB_LITE
Summary: Make merge_test runnable in ROCKSDB_LITE

Test Plan: merge_test

Reviewers: igor, sdong, yhchiang, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42579
2015-07-20 11:17:52 -07:00
Islam AbdelRahman
0d1d9aeebe Block plain_table_db_test in ROCKSDB_LITE
Summary: Block plain_table_db_test in ROCKSDB_LITE since plain table is not supported in ROCKSDB_LITE

Test Plan: plain_table_db_test

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42159
2015-07-20 11:12:02 -07:00
Islam AbdelRahman
f0fe9126f2 Fix compile for write_callback_test in ROCKSDB_LITE
Summary: Add main for write_callback_test when compiled under ROCKSDB_LITE

Test Plan: write_callback_test

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42147
2015-07-20 10:54:15 -07:00
Islam AbdelRahman
cf6a7bebc8 Block cuckoo table tests in ROCKSDB_LITE
Summary: Cuckoo table is not supported in ROCKSDB_LITE, blocking it's tests

Test Plan:
cuckoo_table_builder_test
cuckoo_table_db_test
cuckoo_table_reader_test

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42141
2015-07-20 10:50:46 -07:00
Islam AbdelRahman
20922c4a5a Make compaction_picker_test runnable in ROCKSDB_LITE
Summary: Remove universal and fifo compaction tests from ROCKSDB_LITE since they are not supported

Test Plan: compaction_picker_test

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42129
2015-07-20 10:46:09 -07:00
sdong
6e9fbeb27c Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.

Test Plan: Run all existing unit tests.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42321
2015-07-17 16:58:18 -07:00
Igor Canadi
35ca59364c Don't let flushes preempt compactions
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.

We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.

Test Plan: make check (there might be some unit tests that depend on this behavior)

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41931
2015-07-17 12:02:52 -07:00
Ari Ekmekji
74c755c552 Added JSON manifest dump option to ldb command
Summary:
Added a new flag --json to the ldb manifest_dump command
that prints out the version edits as JSON objects for easier
reading and parsing of information.

Test Plan:
**Sample usage: **
```
./ldb manifest_dump --json --path=path/to/manifest/file
```

**Sample output:**
```
{"EditNumber": 0, "Comparator": "leveldb.BytewiseComparator", "ColumnFamily": 0}
{"EditNumber": 1, "LogNumber": 0, "ColumnFamily": 0}
{"EditNumber": 2, "LogNumber": 4, "PrevLogNumber": 0, "NextFileNumber": 7, "LastSeq": 35356, "AddedFiles": [{"Level": 0, "FileNumber": 5, "FileSize": 1949284, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0}
...
{"EditNumber": 13, "PrevLogNumber": 0, "NextFileNumber": 36, "LastSeq": 290994, "DeletedFiles": [{"Level": 0, "FileNumber": 17}, {"Level": 0, "FileNumber": 20}, {"Level": 0, "FileNumber": 22}, {"Level": 0, "FileNumber": 24}, {"Level": 1, "FileNumber": 13}, {"Level": 1, "FileNumber": 14}, {"Level": 1, "FileNumber": 15}, {"Level": 1, "FileNumber": 18}], "AddedFiles": [{"Level": 1, "FileNumber": 25, "FileSize": 2114340, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 26, "FileSize": 2115213, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 27, "FileSize": 2114807, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 30, "FileSize": 2115271, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 31, "FileSize": 2115165, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 32, "FileSize": 2114683, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 35, "FileSize": 1757512, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0}
...
```

Reviewers: sdong, anthony, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41727
2015-07-17 10:07:40 -07:00
Igor Canadi
a96fcd09b7 Deprecate CompactionFilterV2
Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature.

Test Plan: make check

Reviewers: noetzli, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42405
2015-07-17 18:59:11 +02:00
Andres Notzli
1d20fa9d0f Fixed and simplified merge_helper
Summary:
MergeUntil was not reporting a success when merging an operand with
a Value/Deletion despite the comments in MergeHelper and CompactionJob
indicating otherwise. This lead to operands being written to the compaction
output unnecessarily:

M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M2 M3 M4 M5 (before the diff)
M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M4 M5 (after the diff)

In addition, the code handling Values/Deletion was basically identical.
This patch unifies the code. Finally, this patch also adds testing for
merge_helper.

Test Plan: make && make check

Reviewers: sdong, rven, yhchiang, tnovak, igor

Reviewed By: igor

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42351
2015-07-17 09:27:24 -07:00
sdong
ac2b9367fe Fix a typo in variable
Summary: Typo seqeuntial -> sequential in db/fault_injection_test.cc

Test Plan: Build it

Reviewers: ott, igor, anthony, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42417
2015-07-16 12:17:24 -07:00
sdong
6c0c8dee7b Fix data loss after DB recovery by not allowing flush/compaction to be scheduled until DB opened
Summary:
Previous run may leave some SST files with higher file numbers than manifest indicates.
Compaction or flush may start to run while DB::Open() is still going on. SST file garbage collection may happen interleaving with compaction or flush, and overwrite files generated by compaction of flushes after they are generated. This might cause data loss. This possibility of interleaving is recently introduced.
Fix it by not allowing compaction or flush to be scheduled before DB::Open() finishes.

Test Plan: Add a unit test. This verification will have a chance to fail without the fix but doesn't fix without the fix.

Reviewers: kradhakrishnan, anthony, yhchiang, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42399
2015-07-16 10:57:41 -07:00
Andres Notzli
e4af3bfb27 Test for compaction of corrupted keys
Summary:
Fixes T7697334. Adds a simple test to check whether CompactionJob deals
with corrupted keys correctly. Right now, we preserve corrupted keys.
Note: depending on the type of corruption and options like comparators,
CompactionJob fails. This test just checks whether corrupted keys that
do not fail CompactionJob are preserved.

Test Plan:
`make compaction_job_test && ./compaction_job_test` -> Tests pass.
Add `input->Next(); continue;` in CompactionJob::ProcessKeyValueCompaction()
inside then-branch of `!ParseInternalKey(key, &ikey)` -> Tests fail.

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42237
2015-07-16 09:18:35 -07:00
agiardullo
81d072623c move convenience.h out of utilities
Summary: Moved convenience.h out of utilities to remove a dependency on utilities in db.

Test Plan: unit tests.  Also compiled a link to the old location to verify the _Pragma works.

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42201
2015-07-15 14:51:51 -07:00
Poornima Chozhiyath Raman
beb19ad0dd Fixing delete files in Trivial move of universal compaction
Summary:
Trvial move in universal compaction was failing when trying to move files from levels other than 0.
This was because the DeleteFile while trivially moving, was only deleting files of level 0 which caused duplication of same file in different levels.
This is fixed by passing the right level as argument in the call of DeleteFile while doing trivial move.

Test Plan: ./db_test ran successfully with the new test cases.

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42135
2015-07-15 12:28:22 -07:00
krad
c613960690 Build fix.
Summary: gcc-4.9-glibc-2.20 complains about uninitialized variable.

db/compaction_picker.cc: In member function 'bool
rocksdb::CompactionPicker::IsInputNonOverlapping(rocksdb::Compaction*)':
db/compaction_picker.cc:1174:17: error:
'prev.rocksdb::{anonymous}::InputFileInfo::f' may be used uninitialized in this
function [-Werror=maybe-uninitialized]
   InputFileInfo prev, curr, next;

Test Plan: pmake on local environment

Reviewers: sdong igor

CC: leveldb@

Task ID: #

Blame Rev:
2015-07-15 11:25:10 -07:00
Aaron Feldman
2c8de0ecae Update --help message in db_bench.
Summary:
Remove --help entry for readhot.

Update read_random_exp_range flag description: The distribution is num *
exp(-r), not num * exp(r).

Test Plan: Run ./db_bench --help

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42303
2015-07-15 10:21:09 -07:00
Andres Notzli
6b2d44b2ff Refactoring of writing key/value pairs
Summary:
Before, writing key/value pairs out to files was done inside
ProcessKeyValueCompaction(). To make ProcessKeyValueCompaction()
more understandable, this patch moves the writing part to a separate
function. This is intended to be a stepping stone for additional
changes.

Test Plan: make && make check

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42243
2015-07-15 09:55:45 -07:00
Igor Canadi
9a6a0bd8c9 Style fix in compaction_job.cc
Summary: I didn't know this can work :)

Test Plan: none

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42081
2015-07-15 09:36:47 +02:00
Andres Notzli
ddad40e930 Fixed nullptr deref and added assert
Summary:
Fixes two minor issues in CompactionJob.
CompactionJob::Run() dereferences log_buffer_ without a check, so
this patch adds an assert in the constructor where log_buffer_
is assigned. compaction_job_stats_ can be null but
ProcessKeyValueCompaction was dereferencing it without a check.

Test Plan: make && make check

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42231
2015-07-14 23:12:34 -07:00
Yueh-Hsuan Chiang
801df912af Move UniversalCompaction related db-tests to db_universal_compaction_test.cc
Summary: Move UniversalCompaction related db-tests to db_universal_compaction_test.cc

Test Plan:
db_test
db_universal_compaction_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42225
2015-07-14 18:24:45 -07:00
Yueh-Hsuan Chiang
3ca6b2541e Move TailingIterator tests from db_test.cc to db_test_tailing_iterator.cc
Summary: Move TailingIterator tests from db_test.cc to db_test_tailing_iterator.cc

Test Plan:
db_test
db_test_tailing_iterator

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42021
2015-07-14 16:41:08 -07:00
Yueh-Hsuan Chiang
ce829c77e3 Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc
Summary: Make TransactionLogIterator related tests from db_test.cc to db_log_iter_test.cc

Test Plan:
db_test
db_log_iter_test

Reviewers: sdong, IslamAbdelRahman, igor, anthony

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42045
2015-07-14 16:08:21 -07:00
Yueh-Hsuan Chiang
c3f98bb89b Move CompactionFilter tests in db_test.cc to db_compaction_filter_test.cc
Summary: Move CompactionFilter tests in db_test.cc to db_compaction_filter_test.cc

Test Plan:
db_test
db_compaction_filter_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42207
2015-07-14 16:03:47 -07:00
agiardullo
18d5e1bf88 Remove db_impl_readonly dependency on utilities
Summary: Seems like the cleanest way to resolve this is to move CompactedDBImpl into db/.  CompactedDBImpl should probably live in the same place as DBImplReadonly since Opening the latter could end up instantiating the former.  Both DBImplReadonly and CompactedDBImpl inherit from DBImpl access protected members of DBImpl( and the latter access friendly private methods).

Test Plan: unit tests

Reviewers: sdong, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42027
2015-07-14 11:32:54 -07:00
Islam AbdelRahman
49640bd82f Allow write_batch_test to run with ROCKSDB_LITE
Summary: Remove ColumnFamiliesBatchWithIndexTest from ROCKSDB_LITE since WriteBatchWithIndex is not supported

Test Plan: write_batch_test

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42033
2015-07-14 10:35:35 -07:00
Igor Canadi
a9c5109515 Deprecate purge_redundant_kvs_while_flush
Summary: This option is guarding the feature implemented 2 and a half years ago: D8991. The feature was enabled by default back then and has been running without issues. There is no reason why any client would turn this feature off. I found no reference in fbcode.

Test Plan: none

Reviewers: sdong, yhchiang, anthony, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42063
2015-07-14 13:07:02 +02:00
Igor Canadi
5aea98ddd8 Deprecate WriteOptions::timeout_hint_us
Summary:
In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments:
* Less code == better icache hit rate, smaller builds, simpler code
* The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future.
* Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users.

Test Plan: make check

Reviewers: anthony, kradhakrishnan, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41937
2015-07-14 09:35:48 +02:00
Andres Notzli
ae29495e4b Avoid manipulating const char* arrays
Summary:
We were manipulating `const char*` arrays in CompactionJob to
change the sequence number/types of keys. This patch changes
UpdateInternalKey() to use string methods to do the manipulation
and updates all calls accordingly.

Test Plan:
Added test case for UpdateInternalKey() in dbformat_test.
make && make check

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41985
2015-07-14 00:21:41 -07:00
Andres Notzli
ab137af4ba Partial cleanup of CompactionJob
Summary:
Logging, dealing with key prefix batches and updating stats
moved from CompactionJob::Run() into separate functions.

Test Plan: make all && make check

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41919
2015-07-14 00:09:20 -07:00
Yueh-Hsuan Chiang
b10cf4e2e0 Move DynamicLevel related db-tests to db_dynamic_level_test.cc
Summary: Move DynamicLevel related db-tests to db_dynamic_level_test.cc

Test Plan:
db_dynamic_level_test
db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42039
2015-07-13 19:00:30 -07:00
Islam AbdelRahman
e290f5d3ce Block reduce_levels_test in ROCKSDB_LITE
Summary: Block reduce_levels_test in ROCKSDB_LITE as LDBCommand is not supported

Test Plan:
make reduce_levels_test -j64
OPT=-DROCKSDB_LITE make reduce_levels_test -j64
make check -j64

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41967
2015-07-13 18:45:55 -07:00
Yueh-Hsuan Chiang
49f42ad032 Move global static functions in db_test_util to DBTestBase
Summary:
Move global static functions in db_test_util to DBTestBase.
This is to prevent unused function warning when decoupling
db_test.cc into multiple files.

Test Plan: db_test

Reviewers: igor, sdong, anthony, IslamAbdelRahman, kradhakrishnan

Reviewed By: kradhakrishnan

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42009
2015-07-13 17:41:41 -07:00
Yueh-Hsuan Chiang
625467a08a Move reusable part of db_test.cc to util/db_test_util.h
Summary:
Move reusable part of db_test.cc to util/db_test_util.h.
This makes it more possible to partition db_test.cc into
multiple smaller test files.

Also, fixed many old lint errors in db_test.

Test Plan: db_test

Reviewers: igor, anthony, IslamAbdelRahman, sdong, kradhakrishnan

Reviewed By: sdong, kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41973
2015-07-13 16:53:38 -07:00
Ari Ekmekji
8bca83e5dd Add tombstone information in CompactionJobStats
Summary:
Added new statistics in CompactionJobStats to keep track of
deletion entries and the expiration of those entries. Updated these
fields in compaction_job.cc as compaction took place and wrote a new
test in compaction_job_stats_test.cc to verify accuracy.

Test Plan:
Wrote new test DeletionStatsTest in
compaction_job_stats_test.cc to verify

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41355
2015-07-13 15:51:38 -07:00
sdong
f9728640f3 "make format" against last 10 commits
Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more.

Test Plan: Build it.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41961
2015-07-13 13:50:18 -07:00
sdong
76d3cd3286 Fix public API dependency on internal codes and dependency on MAX_INT32
Summary:
Public API depends on port/port.h which is wrong. Fix it.
Also with gcc 4.8.1 build was broken as MAX_INT32 was not recognized. Fix it by using ::max in linux.

Test Plan: Build it and try to build an external project on top of it.

Reviewers: anthony, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41745
2015-07-11 10:32:11 -07:00
sdong
5fd11853cb Print Fast CRC32 support information in DB LOG
Summary: Print whether fast CRC32 is supported in DB info LOG

Test Plan: Run db_bench and see it prints out correctly.

Reviewers: yhchiang, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41733
2015-07-10 17:59:36 -07:00
Siying Dong
e41cbd9c2f Merge pull request #646 from yuslepukhin/ms_win_port
Windows Port from Microsoft
2015-07-10 15:53:39 -07:00
sdong
041b6f95a2 perf_context: report time spent on reading index and bloom blocks
Summary: Add a perf context counter to help users figure out time spent on reading indexes and bloom filter blocks.

Test Plan: Will write a unit test

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41433
2015-07-10 14:45:42 -07:00
Dmitri Smirnov
c903ccc4c2 Merge from github/master 2015-07-09 18:01:08 -07:00
krad
7189e90c22 Fix a noisy unit test.
Summary: The t/DBTest.DropWrites test still fails under certain gcc version in release unit test.

I unfortunately cannot repro the failure (since the compilers have mapped library which I am not able to map to correctly). I am suspecting the clock skew.

Test Plan: Run make check

Reviewers:

CC: sdong igore

Task ID: #7312624

Blame Rev:
2015-07-09 13:43:52 -07:00
Aaron Feldman
1f4d565709 Add db_bench flag to set cache_index_and_filter_blocks
Summary:
The new flag --cache_index_and_filter_blocks sets
BlockBasedTableOptions.cache_index_and_filter_blocks

Test Plan: make db_bench. Working on benchmarks with the new flag.

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41481
2015-07-09 13:36:16 -07:00
Dmitri Smirnov
d8586ab22b All of these are in the new code added past 3.10
1) Crash in env_win.cc that prevented db_test run to completion and some new tests
     2) Fix new corruption tests in DBTest by allowing a shared trunction of files. Note that this is generally needed ONLY for tests.
     3) Close database so WAL is closed prior to inducing corruption similar to what we did within Corruption tests.
2015-07-08 16:56:45 -07:00
Poornima Chozhiyath Raman
4bed00a44b Fix function name format according to google style
Summary: Change the naming style of getter and setters according to Google C++ style in compaction.h file

Test Plan: Compilation success

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41265
2015-07-08 15:21:10 -07:00
krad
e2e3d84b2c Added multi WAL log testing to recovery tests.
Summary: Currently there is no test in the suite to test the case where
there are multiple WAL files and there is a corruption in one of them. We have
tests for single WAL file corruption scenarios. Added tests to mock
the scenarios for all combinations of recovery modes and corruption in
specified file locations.

Test Plan: Run make check

Reviewers: sdong igor

CC: leveldb@

Task ID: #7501229

Blame Rev:
2015-07-08 14:39:00 -07:00
Dmitri Smirnov
ef4b87f1b2 Commit both PR and internal code review changes 2015-07-07 16:58:20 -07:00
agiardullo
4f56632b16 Fix occasional failure in compaction_job_test
Summary: Coverage test has been occasionally failing due to this timing check.

Test Plan: run test

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41367
2015-07-07 16:10:23 -07:00
Poornima Chozhiyath Raman
411c8e3d19 Build fail fix
Summary: Build fail fix. Type cast issues.

Test Plan: compiled

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41349
2015-07-07 15:21:17 -07:00
Poornima Chozhiyath Raman
c0b23dd5b0 Enabling trivial move in universal compaction
Summary: This change enables trivial move if all the input files are non onverlapping while doing Universal Compaction.

Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases.

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40875
2015-07-07 14:18:55 -07:00
Yueh-Hsuan Chiang
d8e3e766f9 Fixed a bug in test ThreadStatusSingleCompaction
Summary:
Fixed a bug in test ThreadStatusSingleCompaction where
SyncPoint traces are not cleared before the test begins
its second iteration.

Test Plan: db_test

Reviewers: sdong, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41337
2015-07-07 14:00:03 -07:00
Yueh-Hsuan Chiang
4ce5be4255 fixed leaking log::Writers
Summary: Fixes valgrind errors in column_family_test.

Test Plan: `make check`, `make valgrind_check`

Reviewers: igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41181
2015-07-07 12:10:10 -07:00
Yueh-Hsuan Chiang
685582a0b4 Revert two diffs related to DBIter::FindPrevUserKey()
Summary:
This diff reverts the following two previous diffs related to
DBIter::FindPrevUserKey(), which makes db_stress unstable.
We should bake a better fix for this.

* "Fix a comparison in DBIter::FindPrevUserKey()"
  ec70fea4c4.

* "Fixed endless loop in DBIter::FindPrevUserKey()"
  acee2b08a2.

Test Plan: db_stress

Reviewers: anthony, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41301
2015-07-07 11:36:24 -07:00
Andres Notzli
58d7ab3c68 Added tests for ExpandWhileOverlapping()
Summary:
This patch adds three test cases for ExpandWhileOverlapping()
to the compaction_picker_test test suite.
ExpandWhileOverlapping() only has an effect if the comparison
function for the internal keys allows for overlapping user
keys in different SST files on the same level. Thus, this
patch adds a comparator based on sequence numbers to
compaction_picker_test for the new test cases.

Test Plan:
- make compaction_picker_test && ./compaction_picker_test
  -> All tests pass
- Replace body of ExpandWhileOverlapping() with `return true`
  -> Compile and run ./compaction_picker_test as before
  -> New tests fail

Reviewers: sdong, yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41277
2015-07-06 22:25:27 -07:00
Igor Canadi
155ce60daf Fix compaction_job_test
Summary:
Two issues:
* the input keys to the compaction don't include sequence number.
* sequence number is set to max(seq_num), but it should be set to max(seq_num)+1, because the condition here is strictly-larger (i.e. we will only zero-out sequence number if the DB's sequence number is strictly greater than the key's sequence number): https://github.com/facebook/rocksdb/blob/master/db/compaction_job.cc#L830

Test Plan: make compaction_job_test && ./compaction_job_test

Reviewers: sdong, lovro

Reviewed By: lovro

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41247
2015-07-06 11:14:08 -07:00
Dmitri Smirnov
d2f0912bd3 Merge the latest changes from github/master 2015-07-02 17:23:41 -07:00
Mike Kolupaev
218487d8dc [wal changes 1/3] fixed unbounded wal growth in some workloads
Summary:
This fixes the following scenario we've hit:
 - we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one,
 - before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_,
 - hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in,
 - a few more hours pass and we run out disk space because of one huge .log file.

Test Plan: `make check`; backported the new test, checked that it fails without this diff

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40893
2015-07-02 14:27:00 -07:00
Dmitri Smirnov
9dbde7277c Merge remote-tracking branch 'origin' into ms_win_port 2015-07-02 11:34:22 -07:00
Dmitri Smirnov
18285c1e2f Windows Port from Microsoft
Summary: Make RocksDb build and run on Windows to be functionally
 complete and performant. All existing test cases run with no
 regressions. Performance numbers are in the pull-request.

 Test plan: make all of the existing unit tests pass, obtain perf numbers.

 Co-authored-by: Praveen Rao praveensinghrao@outlook.com
 Co-authored-by: Sherlock Huang baihan.huang@gmail.com
 Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
 Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
2015-07-01 16:13:56 -07:00
sdong
05e2831966 Allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena
Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free.

Test Plan: valgrind check

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40929
2015-06-30 17:30:38 -07:00
krad
b0f1927dbb Increasing timeout for drop writes.
Summary: We have a race in the way test works. We avoided the race by adding the
wait to the counter. I thought 1s was eternity, but that is not true in some
scenarios. Increasing the timeout to 10s and adding warnings.

Also, adding nosleep to avoid the case where the wakeup thread is waiting behind
the sleeping thread for scheduling.

Test Plan: Run make check

Reviewers: siying igorcanadi

CC: leveldb@

Task ID: #7312624

Blame Rev:
2015-06-30 11:11:56 -07:00
Tomislav Novak
ec70fea4c4 Fix a comparison in DBIter::FindPrevUserKey()
Summary:
When seek target is a merge key (`kTypeMerge`), `DBIter::FindNextUserEntry()`
advances the underlying iterator _past_ the current key (`saved_key_`); see
`MergeValuesNewToOld()`. However, `FindPrevUserKey()` assumes that `iter_`
points to an entry with the same user key as `saved_key_`. As a result,
`it->Seek(key) && it->Prev()` can cause the iterator to be positioned at the
_next_, instead of the previous, entry (new test, written by @lovro, reproduces
the bug).

This diff changes `FindPrevUserKey()` to also skip keys that are _greater_ than
`saved_key_`.

Test Plan: db_test

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba, lovro

Differential Revision: https://reviews.facebook.net/D40791
2015-06-29 17:04:03 -07:00
Yueh-Hsuan Chiang
501591c423 Make column_family_test runnable in ROCKSDB_LITE
Summary: Make column_family_test runnable in ROCKSDB_LITE.

Test Plan: column_family_test

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40251
2015-06-29 14:39:01 -07:00
krad
6199cba998 Fix race in unit test.
Summary: Avoid falling victim to race condition.

Test Plan: Run the unit test

Reviewers: sdong igor

CC: leveldb@

Task ID: #7312624

Blame Rev:
2015-06-29 11:40:21 -07:00
Igor Canadi
4cbc4e6f88 Call merge operators with empty values
Summary: It's not really nice to call user's API with garbage data in new_value. This diff makes sure that new_value is empty before calling the merge operator.

Test Plan: Added assert to Merge operator in merge_test

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40773
2015-06-26 11:35:46 -07:00
Venkatesh Radhakrishnan
c9cd404bcd Make flush check for shutdown
Summary:
Fixes task 7156865 where a compaction causes a hang in flush
memtable if CancelAllBackgroundWork was called prior to it.
Stack trace is in : https://phabricator.fb.com/P19848829
We end up waiting for a flush which will never happen because there are no background threads.

Test Plan: PreShutdownFlush

Reviewers: sdong, igor

Reviewed By: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40617
2015-06-25 14:43:25 -07:00
Poornima Chozhiyath Raman
4fb09c6871 Updating SeekToLast with upper bound
Summary: #7124486: RocksDB's Iterator.SeekToLast should seek to the last key before iterate_upper_bound if presents

Test Plan: ./db_iter_test run successfully with the new testcase

Reviewers: rven, yhchiang, igor, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40425
2015-06-25 09:44:30 -07:00
Yueh-Hsuan Chiang
dec2c9f564 Make table_properties_collector_test runnable in ROCKSDB_LITE
Summary: Make table_properties_collector_test runnable in ROCKSDB_LITE

Test Plan: table_properties_collector_test

Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40581
2015-06-24 01:38:53 -07:00
Islam AbdelRahman
674b1181cf Bottommost level compaction option
Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction.

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40527
2015-06-23 13:32:40 -07:00
Giuseppe Ottaviano
782a1590f9 Implement a table-level row cache
Summary:
Implementation of a table-level row cache.
It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache.

Supports snapshots and merge operations.

Test Plan: Ran `make valgrind_check commit-prereq`

Reviewers: igor, philipp, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39849
2015-06-23 10:25:45 -07:00
krad
de85e4cadf Introduce WAL recovery consistency levels
Summary:
The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL.

1. RecoverAfterRestart (kTolerateCorruptedTailRecords)

This mocks the current recovery mode.

2. RecoverAfterCleanShutdown (kAbsoluteConsistency)

This is ideal for unit test and cases where the store is shutdown cleanly. We tolerate no corruption or incomplete writes.

3. RecoverPointInTime (kPointInTimeRecovery)

This is ideal when using devices with controller cache or file systems which can loose data on restart. We recover upto the point were is no corruption or incomplete write.

4. RecoverAfterDisaster (kSkipAnyCorruptRecord)

This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of salvaging as many records as possible.

Test Plan:
(1) Run added unit test to cover all levels.
(2) Run make check.

Reviewers: leveldb, sdong, igor

Subscribers: yoshinorim, dhruba

Differential Revision: https://reviews.facebook.net/D38487
2015-06-22 15:28:12 -07:00
Islam AbdelRahman
530534fceb Fix trivial move merge
Summary: Fixing bad merge

Test Plan: make -j64 check (this is not enough to verify the fix)

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40521
2015-06-22 15:20:30 -07:00
Venkatesh Radhakrishnan
04251e1e3a Add wal files to Checkpoint for multiple column families.
Summary:
When there are multiple column families, the flush in
GetLiveFiles is not atomic, so that there are entries in the wal files
which are needed to get a consisten RocksDB. We now add the log files to
the checkpoint.

Test Plan:
CheckpointCF - This test forces more data to be written to
the other column families after the flush of the first column family but
before the second.

Reviewers: igor, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40323
2015-06-19 16:08:31 -07:00
Igor Canadi
bf03f59c11 Disable CompressLevelCompaction() if Zlib is not supported
Summary: CompressLevelCompaction() depends on Zlib. We should skip it when zlib is not present.

Test Plan: `make check` without zlib

Reviewers: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40401
2015-06-18 18:46:26 -07:00
Igor Canadi
760e9a94de Fail DB::Open() when the requested compression is not available
Summary:
Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users.

This patch fails DB::Open() if we don't support the compression that is specified in the options.

Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails.

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39687
2015-06-18 14:55:05 -07:00
Islam AbdelRahman
4eabbdb7ec Skip bottommost level compaction if possible
Summary:
This is https://reviews.facebook.net/D39999 but after introducing an option to force compaction the bottom most level

Changes in this patch
- Introduce force_bottommost_level_compaction to CompactRangeOptions that force compacting bottommost level during compaction
- Skip bottommost level compaction if we dont have a compaction filter and force_bottommost_level_compaction options is not set

Although tests pass on my machine but I suspect that there maybe some tests that I am not aware of that  should use force_bottommost_level_compaction to pass in a deterministic way

Test Plan:
make check
adding new tests

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40059
2015-06-18 11:03:31 -07:00
Igor Canadi
4b8bb62f0a Don't dump DBOptions for each column family
Summary: Currently we dump DBOptions for each column family options we dump. This leads to duplicate lines in our LOG file. This diff fixes that.

Test Plan: Check out the LOG

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: IslamAbdelRahman, yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39729
2015-06-18 10:15:54 -07:00
Poornima Chozhiyath Raman
176f0bedce Merge branch 'master' of github.com:facebook/rocksdb
D40233: Replace %llu with format macros in ParsedInternalKey::DebugString())
2015-06-18 10:12:55 -07:00
Yueh-Hsuan Chiang
bb1c74ce18 Fixed a bug of CompactionStats in multi-level universal compaction case
Summary:
Universal compaction can involves in multiple levels.  However,
the current implementation of bytes_readn and bytes_readnp1
(and some other stats with postfix `n` and `np1`) assumes compaction
can only have two levels.

This patch fixes this bug and redefines bytes_readn and bytes_readnp1:
* bytes_readnp1: the number of bytes read in the compaction output level.
* bytes_readn: the total number of bytes read minus bytes_readnp1

Test Plan: Add a test in compaction_job_stats_test

Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40239
2015-06-17 23:40:34 -07:00
Poornima Chozhiyath Raman
a66b8157df Merge branch 'master' of github.com:facebook/rocksdb
D40233: Replace %llu with format macros in ParsedInternalKey::DebugString())
2015-06-17 20:44:52 -07:00
Poornima Chozhiyath Raman
f06be62fd2 Replace %llu with format macros in ParsedInternalKey::DebugString())
Test Plan: successfully compiled the code

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40233
2015-06-17 20:44:26 -07:00
Igor Canadi
2dc3910b5e Add --benchmark_write_rate_limit option to db_bench
Summary:
So far, we benchmarked RocksDB by writing as fast as possible. With this change, we're able to limit our write throughput, which should help us better understand how RocksDB performes under varying write workloads.

Specifically, I'm currently interested in the shape of the graph that has write throughput on one axis and write rate on another. This should help us with designing our stall system, as we have started to do with D36351.

Test Plan:
    $ ./db_bench --benchmarks=fillrandom --benchmark_write_rate_limit=1000000
    fillrandom   :     118.523 micros/op 8437 ops/sec;    0.9 MB/s
    $ ./db_bench --benchmarks=fillrandom --benchmark_write_rate_limit=2000000
    fillrandom   :      59.136 micros/op 16910 ops/sec;    1.9 MB/s

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39759
2015-06-17 16:44:52 -07:00
Islam AbdelRahman
12e030a992 Use CompactRangeOptions for CompactRange
Summary:
This diff update DB::CompactRange to use RangeCompactionOptions instead of using multiple parameters
Old CompactRange is still available but deprecated

Test Plan:
make all check
make rocksdbjava
USE_CLANG=1 make all
OPT=-DROCKSDB_LITE make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40209
2015-06-17 14:36:14 -07:00
Igor Canadi
4716ab4d16 Merge pull request #638 from HolodovAlexander/master
C api: human-readable statistics
2015-06-17 13:16:20 -07:00
Igor Canadi
25d600569d Clean up InstallSuperVersion
Summary:
We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of write thread. But anyway, it's still called in the mutex, so it's not that much cheaper.

This diff removes the "optimization" and cleans up the code a bit.

Test Plan: make check

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40113
2015-06-17 12:37:59 -07:00
Yueh-Hsuan Chiang
1a08d0beb5 Block c_test in ROCKSDB_LITE
Summary: Block c_test in ROCKSDB_LITE as it's not supported in ROCKSDB_LITE.

Test Plan: c_test

Reviewers: sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40257
2015-06-17 10:54:51 -07:00
sdong
40f562e747 Allow GetApproximateSize() to include mem table size if it is skip list memtable
Summary:
Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables.
To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>.

Test Plan: Add a test case

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40119
2015-06-16 18:13:23 -07:00
Igor Canadi
d59d90bb1f db_bench periodically writes QPS to CSV file
Summary:
This is part of an effort to better understand and optimize RocksDB stalls under high load. I added a feature to db_bench to periodically write QPS to CSV files. That way we can nicely see how our QPS changes in time (especially when DB is stalled) and can do a better job of evaluating our stall system (i.e. we want the QPS to be as constant as possible, as opposed to having bunch of stalls)

Cool part of CSV files is that we can easily graph them -- there are a bunch of tools available.

Test Plan:
Ran ./db_bench --report_interval_seconds=10 --benchmarks=fillrandom --num=10000000
and observed this in report.csv:

secs_elapsed,interval_qps
10,2725860
20,1980480
30,1863456
40,1454359
50,1460389

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40047
2015-06-12 14:31:53 -07:00
Yueh-Hsuan Chiang
5fec963877 Fixed false alarm of size comparison in compaction_job_stats_test
Summary: Fixed false alarm of size comparison in compaction_job_stats_test

Test Plan: compaction_job_stats_test

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39921
2015-06-12 10:44:07 -07:00
Islam AbdelRahman
cccd2199a6 Revert skip bottommost compaction
Summary:
Reverting this diff https://reviews.facebook.net/D39999
Will add an option to force bottom most level compaction and then re submit it

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D40041
2015-06-12 10:43:33 -07:00
Islam AbdelRahman
20f2b54252 Skip bottom most level compaction if no compaction filter
Summary: If we don't have a compaction filter then we can skip compacting the bottom most level

Test Plan:
make check
added unit tests

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39999
2015-06-12 09:56:08 -07:00
sdong
7842920be5 Slow down writes by bytes written
Summary:
We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch.

The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work

hard_rate_limit is deprecated.

options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up.

Test Plan: Add new unit tests in db_test

Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor

Reviewed By: igor

Subscribers: ikabiljo, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36351
2015-06-11 20:42:18 -07:00
Igor Canadi
a84df655f3 Don't let two L0->L1 compactions run in parallel
Summary: With experimental feature SuggestCompactRange() we don't restrict running two L0->L1 compactions in parallel. This diff fixes this.

Test Plan: added a unit test to reproduce the failure. fixed the unit test

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39981
2015-06-11 15:42:16 -07:00
Islam AbdelRahman
d6ce0f7c61 Add largest sequence to FlushJobInfo
Summary:
Adding largest sequence number to FlushJobInfo
and passing flushed file metadata to NotifyOnFlushCompleted which include alot of other values that we may want to expose in FlushJobInfo

Test Plan: make check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39927
2015-06-11 15:22:22 -07:00
Yueh-Hsuan Chiang
3eddd1abe9 Add Env::GetThreadID(), which returns the ID of the current thread.
Summary:
Add Env::GetThreadID(), which returns the ID of the current thread.

In addition, make GetThreadList() and InfoLog use same unique ID for the same thread.

Test Plan:
db_test
listener_test

Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39735
2015-06-11 14:18:02 -07:00
Islam AbdelRahman
73faa3d41d Handling edge cases for ReFitLevel
Summary:
Right now the level we pass to ReFitLevel is the maximum level with files (before compaction), there are multiple cases where this maximum level have changed after compaction
- all files where in L0 (now maximum level is L1)
- using kCompactionStyleUniversal (now maximum level in the last level)
- level_compaction_dynamic_level_bytes ??

We can handle each of these cases individually, but I felt it's safer to calculate max_level_with_files again if we want to do a ReFitLevel

Test Plan:
adding some tests
make -j64 check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: ott, dhruba

Differential Revision: https://reviews.facebook.net/D39663
2015-06-11 14:15:52 -07:00
Reed Allman
735df66552 C: add WriteBatch.PutLogData support 2015-06-10 00:12:33 -07:00
sdong
e409d3d745 Make "make all" work for CYGWIN
Summary: Some test and benchmark codes don't build for CYGWIN. Fix it.

Test Plan: Build "make all" with TARGET_OS=Cygwin on cygwin and make sure it passes.

Reviewers: rven, yhchiang, anthony, igor, kradhakrishnan

Reviewed By: igor, kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39711
2015-06-09 16:36:07 -07:00
sdong
75d7075a8a Print info message about files need compaction for debuging purpose
Summary:
When there are files marked for compaction after compactions, print extra messages to help debugging. Example:

2015/06/08-23:12:55.212855 7ff5013ff700 [default] [JOB 121] Generated table #75: 54 keys, 4807 bytes (need compaction)

2015/06/08-23:12:55.556194 7ff5013ff700 (Original Log Time 2015/06/08-23:12:55.556160) [default] compacted to: base level 1 max bytes base
10240 files[0 1 9 32 12 0 0 0] max score 0.96 (2 files need compaction), MB/sec: 0.0 rd, 0.1 wr, level 2, files in(1, 3) out(5) MB in(0.0,
0.0) out(0.0), read-write-amplify(11.3) write-amplify(5.7) OK, records in: 40, records dropped: 0

Test Plan:
Run test and see LOG files.

valgrind test DBTest.TablePropertiesNeedCompactTest

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: yoshinorim, maykov, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39771
2015-06-09 11:23:29 -07:00
Venkatesh Radhakrishnan
406a5682eb Fix hang when closing a DB after doing loads with WAL disabled.
Summary:
There is a hang during DB close in the following scenario:
a) a load with WAL disabled was done,
b) CancelAllBackgroundWork was called,
c) DB Close was called
This was because in that we will wait for a flush but we cannot do a
background flush because we have called CancelAllBackgroundWork which
marks the DB as shutting downn.

Test Plan: Added DBTest FlushOnDestroy

Reviewers: sdong

Reviewed By: sdong

Subscribers: yoshinorim, hermanlee4, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39747
2015-06-09 10:39:49 -07:00
sdong
d8c8f08c12 GetSnapshot() and ReleaseSnapshot() to move new and free out of DB mutex
Summary: We currently issue malloc and free inside DB mutex in GetSnapshot() and ReleaseSnapshot(). Move them out.

Test Plan:
Go through all tests
make valgrind_check

Reviewers: yhchiang, rven, IslamAbdelRahman, anthony, igor

Reviewed By: igor

Subscribers: maykov, hermanlee4, MarkCallaghan, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39753
2015-06-08 21:57:02 -07:00
Islam AbdelRahman
643bbbf081 Use nullptr for default compaction_filter_factory
Summary:
Replacing the default value for compaction_filter_factory and compaction_filter_factory_v2 to be nullptr instead of DefaultCompactionFilterFactory / DefaultCompactionFilterFactoryV2
The reason for this is to be able to determine easily if we have compaction filter factory or not without depending on RTTI

Test Plan: make check

Reviewers: yoshinorim, ott, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39693
2015-06-08 16:34:26 -07:00
Igor Canadi
f02ce0c651 Fix ASAN errors in c_test
Summary: key_sizes claims that 3rd key is of length 8, but it's really only 3. This diff makes it length 8.

Test Plan: asan c_test works again.

Reviewers: sdong, yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39699
2015-06-08 11:28:40 -07:00
Igor Canadi
133130a4f5 Merge pull request #625 from rdallman/c-slice-parts-support
C: add support for WriteBatch SliceParts params
2015-06-08 13:14:44 -04:00
Igor Canadi
de4d172d0f Merge pull request #622 from rdallman/c-multiget
C: add MultiGet support
2015-06-08 13:13:54 -04:00
sdong
6df589b446 Add TablePropertiesCollector::NeedCompact() to suggest DB to further compact output files
Summary:
It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file.
It can be used to allow users to suggest DB to clear up delete tombstones faster.

Test Plan: Add a unit test.

Reviewers: igor, yhchiang, kradhakrishnan, rven

Reviewed By: rven

Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39585
2015-06-05 20:18:21 -07:00
Yueh-Hsuan Chiang
2e764f06ea [API Change] Improve EventListener::OnFlushCompleted interface
Summary:
EventListener::OnFlushCompleted() now passes a structure instead
of a list of parameters.  This minimizes the API change in the
future.

Test Plan:
listener_test
compact_files_test
example/compact_files_example

Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor

Reviewed By: rven, igor

Subscribers: IslamAbdelRahman, rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39543
2015-06-05 12:28:51 -07:00
Yueh-Hsuan Chiang
7322c74012 Revert incorrect commit
Summary: Revert incorrect commit

Test Plan: db_test

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39651
2015-06-05 11:23:09 -07:00
Islam AbdelRahman
31e60e2a77 Unlock mutex in ReFitLevel
Summary: I encountered an issue where the database hang, it looks like the mutex is not unlocked on return in ReFitLevel function

Test Plan: make -j64 check

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D39609
2015-06-05 11:06:14 -07:00
Yueh-Hsuan Chiang
7647df8f9e Fixed the tsan failure in util/compaction_job_stats_impl.cc
Summary:
The type of smallest_output_key_prefix and largest_output_key_prefix
have been changed to std::string in https://reviews.facebook.net/D39537.
As a result, we shouldn't do smallest_output_key_prefix[0] = 0 in the
initialization.

Test Plan: compile db_test with tsan enabled and repeat DBTest.CompactionDeletionTrigger test to verify the tsan issue has been gone.

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39645
2015-06-05 11:05:35 -07:00
Igor Canadi
b2785472c8 Fix compile
Summary:
This commit broke the compile: 3ce3bb3da2
As evidenced here: https://evergreen.mongodb.com/task/mongodb_mongo_master_ubuntu1404_rocksdb_compile_ce2b1d11d42de93f7b375f7e6c41fb709f66e969_15_06_04_23_09_36

This should fix it

Test Plan: make check

Reviewers: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39627
2015-06-05 09:41:45 -04:00
Islam AbdelRahman
3ce3bb3da2 Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping

The conditions for trivial move have been updated

Introduced conditions:
  - Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
  - Input level files cannot be overlapping

Removed conditions:
  - Trivial move only run when the compaction is not manual
  - Input level should can contain only 1 file

More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```

```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```

```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```

```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```

```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```

This diff is based on a work by @sdong https://reviews.facebook.net/D34149

Test Plan: make -j64 check

Reviewers: rven, sdong, igor

Reviewed By: igor

Subscribers: yhchiang, ott, march, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D34797
2015-06-04 16:51:25 -07:00
Yueh-Hsuan Chiang
bb808eaddb Changed the CompactionJobStats::output_key_prefix type from char[] to string.
Summary:
Keys in RocksDB can be arbitrary byte strings.  However, in the current
CompactionJobStats, smallest_output_key_prefix and largest_output_key_prefix
are of type char[] without having a length, which is insufficient to handle
non-null terminated strings.

This patch change their type to std::string.

Test Plan: compaction_job_stats_test

Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39537
2015-06-04 12:31:12 -07:00
Yueh-Hsuan Chiang
0b3172d071 Add EventListener::OnTableFileDeletion()
Summary:
Add EventListener::OnTableFileDeletion(), which will be
called when a table file is deleted.

Test Plan: Extend three existing tests in db_test to verify the deleted files.

Reviewers: rven, anthony, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38931
2015-06-03 19:57:01 -07:00
Reed Allman
211a195d41 C: add MultiGet support 2015-06-03 17:57:42 -07:00
Reed Allman
5dc174e11a C: add support for WriteBatch SliceParts params 2015-06-03 17:08:00 -07:00
Igor Canadi
2d0b9e5f0a Fix compile on darwin
Summary: We need to start doing some CI on Macs.

Test Plan: works now

Reviewers: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39489
2015-06-03 16:52:51 -04:00
sdong
3af668ed17 Fix DBTest.MigrateToDynamicLevelMaxBytesBase slowness with valgrind
Summary:
DBTest.MigrateToDynamicLevelMaxBytesBase with valgrind test is
extremely slow. Work it around by not having both threads running
everything non-stop.

Test Plan: Run the test with valgrind which used to take too long to finish and see it finish in reasonable time.

Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39477
2015-06-03 12:08:37 -07:00
Igor Canadi
408cc4b8e0 Revert "Merge pull request #621 from rdallman/c-slice-parts-support"
This reverts commit 78382d4ba7, reversing
changes made to ca8b85ac04.
2015-06-03 13:34:07 -04:00
Igor Canadi
78382d4ba7 Merge pull request #621 from rdallman/c-slice-parts-support
C: add support for WriteBatch SliceParts params
2015-06-03 13:16:14 -04:00
Yueh-Hsuan Chiang
0483dab2ab Remove a TODO that has been done
Summary: Remove a TODO that has been done

Test Plan: make

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39429
2015-06-02 18:38:57 -07:00
Yueh-Hsuan Chiang
8afafc2783 Fix compile warning in db/db_impl
Summary:
Fix the following compile warning in db/db_impl

  db/db_impl.cc:1603:19: error: implicit conversion loses integer precision: 'const uint64_t' (aka 'const unsigned long') to 'int' [-Werror,-Wshorten-64-to-32]
     info.job_id = job_id;
                 ~ ^~~~~~

Test Plan: db_test

Reviewers: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39423
2015-06-02 17:36:45 -07:00
Yueh-Hsuan Chiang
fe5c6321cb Allow EventListener::OnCompactionCompleted to return CompactionJobStats.
Summary:
Allow EventListener::OnCompactionCompleted to return CompactionJobStats,
which contains useful information about a compaction.

Example CompactionJobStats returned by OnCompactionCompleted():
    smallest_output_key_prefix 05000000
    largest_output_key_prefix 06990000
    elapsed_time 42419
    num_input_records 300
    num_input_files 3
    num_input_files_at_output_level 2
    num_output_records 200
    num_output_files 1
    actual_bytes_input 167200
    actual_bytes_output 110688
    total_input_raw_key_bytes 5400
    total_input_raw_value_bytes 300000
    num_records_replaced 100
    is_manual_compaction 1

Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats.

Reviewers: rven, igor, anthony, sdong

Reviewed By: sdong

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38463
2015-06-02 17:07:16 -07:00
Yueh-Hsuan Chiang
3083ed2129 Fixed heap-use-after-free error in compaction_job_test.cc
Summary: Fixed heap-use-after-free error in compaction_job_test.cc

Test Plan: compaction_job_test

Reviewers: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39411
2015-06-02 16:23:01 -07:00
Yueh-Hsuan Chiang
ab946af08a Fix a compile warning in listener_test.cc
Summary:
Fixed the following compile warning in listener_test.cc:
db/listener_test.cc:214:8: error: 'OnTableFileCreated' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
14:16:46   void OnTableFileCreated(

Test Plan:
make listener_test

Reviewers: sdong, igor

Subscribers: leveldb
2015-06-02 14:20:27 -07:00
Yueh-Hsuan Chiang
fc83821270 Add EventListener::OnTableFileCreated()
Summary:
Add EventListener::OnTableFileCreated(), which will be called
when a table file is created.  This patch is part of the
EventLogger and EventListener integration.

Test Plan: Augment existing test in db/listener_test.cc

Reviewers: anthony, kradhakrishnan, rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38865
2015-06-02 14:12:23 -07:00
Yueh-Hsuan Chiang
898e803fc5 Add a stats counter for DB_WRITE back which was mistakenly removed.
Summary: Add a stats counter for DB_WRITE back which was mistakenly removed.

Test Plan: augment GroupCommitTest

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39399
2015-06-02 12:35:12 -07:00
Mike Kolupaev
ec7a944360 more times in perf_context and iostats_context
Summary:
We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls:
 - perf_context.db_mutex_lock_nanos
 - perf_context.db_condition_wait_nanos
 - iostats_context.open_time
 - iostats_context.allocate_time
 - iostats_context.write_time
 - iostats_context.range_sync_time
 - iostats_context.logger_time

In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long.

Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took.

Reviewers: rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: march, dhruba, tnovak

Differential Revision: https://reviews.facebook.net/D39177
2015-06-02 02:07:58 -07:00
sdong
4266d4fd90 Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange()
Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true.

Test Plan: Add a unit test for it.

Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D39099
2015-06-01 18:21:14 -07:00
Yueh-Hsuan Chiang
d333820bad Removed DBImpl::notifying_events_
Summary:
DBImpl::notifying_events_ is a internal counter in DBImpl which is
used to prevent DB close when DB is notifying events.  However, as
the current events all rely on either compaction or flush which
already have similar counters to prevent DB close, it is safe to
remove notifying_events_.

Test Plan:
listener_test
examples/compact_files_example

Reviewers: igor, anthony, kradhakrishnan, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39315
2015-06-01 15:32:23 -07:00
Igor Canadi
a187e66ad0 Merge pull request #617 from rdallman/wb-merge-sliceparts
WriteBatch.Merge w/ SliceParts support
2015-05-31 13:34:34 -04:00
Igor Canadi
4c181f08bc Fix compile on darwin
Summary: As title

Test Plan: make check

Reviewers: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39243
2015-05-30 12:25:45 -04:00
agiardullo
bc7a7a400c fix LITE build
Summary: Broken by optimistic transaction diff.  (I only built 'release' not 'static_lib' when testing).

Test Plan: build

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39219
2015-05-29 15:22:00 -07:00
agiardullo
dc9d70de65 Optimistic Transactions
Summary: Optimistic transactions supporting begin/commit/rollback semantics.  Currently relies on checking the memtable to determine if there are any collisions at commit time.  Not yet implemented would be a way of enuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty.  You should probably start with transaction.h to get an overview of what is currently supported.

Test Plan: Added a new test, but still need to look into stress testing.

Reviewers: yhchiang, igor, rven, sdong

Reviewed By: sdong

Subscribers: adamretter, MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D33435
2015-05-29 14:36:35 -07:00
Reed Allman
21cd6b7ad8 C: add support for WriteBatch SliceParts params 2015-05-29 10:23:43 -07:00
Reed Allman
a0635ba3f6 WriteBatch.Merge w/ SliceParts support
also hooked up WriteBatchInternal
2015-05-29 04:30:03 -07:00
agiardullo
c815351038 Support saving history in memtable_list
Summary:
For transactions, we are using the memtables to validate that there are no write conflicts.  But after flushing, we don't have any memtables, and transactions could fail to commit.  So we want to someone keep around some extra history to use for conflict checking.  In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.

After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure).  It seems like the best place for this is abstracted inside the memtable_list.  I decide to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much.

This diff adds a new parameter to control how much memtable history to keep around after flushing.  However, it sounds like people aren't too fond of adding new parameters.  So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers.  This should not change the maximum amount of memory used, but make it more likely we're using closer the the limit.  (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached).  So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit).

However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.

Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests.  Added testing in memtablelist_test and planning on adding more testing here.

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37443
2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang
ec4ff4e99c Rename EventLoggerHelpers EventHelpers
Summary:
Rename EventLoggerHelpers EventHelpers, as it's going to include
all event-related helper functions instead of EventLogger only stuffs.

Test Plan: make

Reviewers: sdong, rven, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39093
2015-05-28 13:37:47 -07:00
Yueh-Hsuan Chiang
672dda9b3b [API Change] Move listeners from ColumnFamilyOptions to DBOptions
Summary: Move listeners from ColumnFamilyOptions to DBOptions

Test Plan:
listener_test
compact_files_test

Reviewers: rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D39087
2015-05-28 13:21:39 -07:00
Yueh-Hsuan Chiang
3ab8ffd4dd Compaction now conditionally boosts the size of deletion entries.
Summary:
Compaction now boosts the size of deletion entries of a file only when
the number of deletion entries is greater than the number of non-deletion
entries in the file.  The motivation here is that in a stable workload,
the number of deletion entries should be roughly equal to the number of
non-deletion entries.  If we compensate the size of deletion entries in a
stable workload, the deletion compensation logic might introduce unwanted
effet which changes the shape of LSM tree.

Test Plan: db_test --gtest_filter="*Deletion*"

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38703
2015-05-26 14:05:38 -07:00
Igor Canadi
a81ac24127 Merge pull request #615 from rdallman/master
C: add more block based table stuff, some aux slice transform/merge ops
2015-05-26 14:19:31 -04:00
Yueh-Hsuan Chiang
6d299b70b8 Fixed a bug in EventLoggerHelpers::LogTableFileCreation
Summary:
Fixed a missing "}" at the end of the generated JSON Log
in EventLoggerHelpers::LogTableFileCreation.

Test Plan: db_bench

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38919
2015-05-26 10:55:46 -07:00
Yueh-Hsuan Chiang
a0580205c8 Removed an unused private variable in db_impl.h
Summary: Removed an unused private variable in db_impl.h

Test Plan: make db_test

Reviewers: sdong, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38925
2015-05-26 10:46:26 -07:00
Reed Allman
9c38ce1d02 C: extra bbto / noop slice transform 2015-05-22 22:56:28 -07:00
Igor Canadi
ea6d3a8ac0 Don't skip last level when calculating compaction stats
Summary: We have a bug where we don't report the last level's files as being compacted. This fixes it.

Test Plan: See the fix in action here: https://phabricator.fb.com/P19845738

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38727
2015-05-22 15:30:43 -04:00
Yueh-Hsuan Chiang
5c224d1b70 Fixed two bugs on logging file deletion.
Summary:
This patch fixes the following two bugs on logging file deletion.

1.  Previously, file deletion failure was only logged in INFO_LEVEL.
    This patch changes it to ERROR_LEVEL and does some code clean.

2.  EventLogger previously will always generate the same log on
    table file deletion even when file deletion is not successful.
    Now the resulting status of file deletion will also be logged.

Test Plan: make all check

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38817
2015-05-22 12:10:51 -07:00
Yueh-Hsuan Chiang
dc81efe415 Change the log-level of DB summary and options from INFO_LEVEL to WARN_LEVEL
Summary: Change the log-level of DB summary and options from INFO_LEVEL to WARN_LEVEL

Test Plan:
Use db_bench to verify the log level.

Sample output:
    2015/05/22-00:20:39.778064 7fff75b41300 [WARN] RocksDB version: 3.11.0
    2015/05/22-00:20:39.778095 7fff75b41300 [WARN] Git sha rocksdb_build_git_sha:7fee8775a459134c4cb04baae5bd1687e268f2a0
    2015/05/22-00:20:39.778099 7fff75b41300 [WARN] Compile date May 22 2015
    2015/05/22-00:20:39.778101 7fff75b41300 [WARN] DB SUMMARY
    2015/05/22-00:20:39.778145 7fff75b41300 [WARN] SST files in /tmp/rocksdbtest-691931916/dbbench dir, Total Num: 0, files:
    2015/05/22-00:20:39.778148 7fff75b41300 [WARN] Write Ahead Log file in /tmp/rocksdbtest-691931916/dbbench:
    2015/05/22-00:20:39.778150 7fff75b41300 [WARN]          Options.error_if_exists: 0
    2015/05/22-00:20:39.778152 7fff75b41300 [WARN]        Options.create_if_missing: 1
    2015/05/22-00:20:39.778153 7fff75b41300 [WARN]          Options.paranoid_checks: 1

Reviewers: MarkCallaghan, igor, kradhakrishnan

Reviewed By: igor

Subscribers: sdong, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38835
2015-05-22 11:54:59 -07:00
Yueh-Hsuan Chiang
687214f878 Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used.
Summary: Ensure ColumnFamilyOptions.num_levels >= 2 when level compaction is used.

Test Plan: Extend SanitizeOptions test in column_family_test

Reviewers: sdong, rven, anthony, krishnanm86, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38829
2015-05-22 11:35:40 -07:00
Yueh-Hsuan Chiang
2abb592688 Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery().
Summary: Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery().

Test Plan: make all check

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38823
2015-05-22 11:24:12 -07:00
Yueh-Hsuan Chiang
7fee8775a4 Allow EventLogger to directly log from a JSONWriter.
Summary:
Allow EventLogger to directly log from a JSONWriter.  This allows
the JSONWriter to be shared by EventLogger and potentially EventListener,
which is an important step to integrate EventLogger and EventListener.

This patch also rewrites EventLoggerHelpers::LogTableFileCreation(),
which uses the new API to generate identical log.

Test Plan:
Run db_bench in debug mode and make sure the log is correct and no
assertions fail.

Reviewers: sdong, anthony, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38709
2015-05-21 15:39:30 -07:00
Igor Canadi
7a3577519f Don't artificially inflate L0 score
Summary:
This turns out to be pretty bad because if we prioritize L0->L1 then L1 can grow artificially large, which makes L0->L1 more and more expensive. For example:
256MB @ L0 + 256MB @ L1 --> 512MB @ L1
256MB @ L0 + 512MB @ L1 --> 768MB @ L1
256MB @ L0 + 768MB @ L1 --> 1GB @ L1

....

256MB @ L0 + 10GB @ L1 --> 10.2GB @ L1

At some point we need to start compacting L1->L2 to speed up L0->L1.

Test Plan:
The performance improvement is massive for heavy write workload. This is the benchmark I ran: https://phabricator.fb.com/P19842671. Before this change, the benchmark took 47 minutes to complete. After, the benchmark finished in 2minutes. You can see full results here: https://phabricator.fb.com/P19842674

Also, we ran this diff on MongoDB on RocksDB on one replicaset. Before the change, our initial sync was so slow that it couldn't keep up with primary writes. After the change, the import finished without any issues

Reviewers: dynamike, MarkCallaghan, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38637
2015-05-21 11:40:48 -07:00
Yueh-Hsuan Chiang
e2c1d4b57f [Public API Change] Make DB::GetDbIdentity() be const function.
Summary: Make DB::GetDbIdentity() be const function.

Test Plan: make db_test

Reviewers: igor, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38745
2015-05-21 11:01:48 -07:00
Yueh-Hsuan Chiang
812c461c96 Dump db stats in WARN level
Summary: Dump db stats in WARN level

Test Plan: run db_bench and verify the LOG

Reviewers: igor, MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38691
2015-05-19 18:42:17 -07:00
Mark Callaghan
944043d683 Add --wal_bytes_per_sync for db_bench and more IO stats
Summary:
See https://gist.github.com/mdcallag/89ebb2b8cbd331854865 for the IO stats.
I added "Cumulative compaction:" and "Interval compaction:" lines. The IO rates
can be confusing. Rates fro per-level stats lines, Wr(MB/s) & Rd(MB/s), are computed
using the duration of the compaction job. If the job reads 10MB, writes 9MB and the job
(IO & merging) takes 1 second then the rates are 10MB/s for read and 9MB/s for writes.
The IO rates in the Cumulative compaction line uses the total uptime. The IO rates in the
Interval compaction line uses the interval uptime. So these Cumalative & Interval
compaction IO rates cannot be compared to the per-level IO rates. But both forms of
the rates are useful for debugging perf.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D38667
2015-05-19 16:19:30 -07:00
Igor Canadi
04feaeebb9 Fix comparison between signed and usigned integers
Summary: Not sure why this fails on some compilers and doesn't on others.

Test Plan: none

Reviewers: meyering, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38673
2015-05-19 10:59:30 -07:00
Igor Canadi
4a855c0799 Add an option wal_bytes_per_sync to control sync_file_range for WAL files
Summary:
sync_file_range is not always asyncronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html

Some users don't want us to call sync_file_range on WALs. Some other do.
Thus, I'm adding a separate option wal_bytes_per_sync to control calling
sync_file_range on WAL files. bytes_per_sync will apply only to table
files now.

Test Plan: no more sync_file_range for WAL as evidenced by strace

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38253
2015-05-18 17:03:59 -07:00
Igor Canadi
b0fdda4ff0 Allow flushes to run in parallel with manual compaction
Summary: As title. I spent some time thinking about it and I don't think there should be any issue with running manual compaction and flushes in parallel

Test Plan: make check works

Reviewers: rven, yhchiang, sdong

Reviewed By: yhchiang, sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38355
2015-05-18 15:34:33 -07:00
sdong
fb5bdbf987 DBTest.DynamicLevelMaxBytesCompactRange: make sure L0 is not empty before running compact range
Summary: DBTest.DynamicLevelMaxBytesCompactRange needs to make sure L0 is not empty to properly cover the code paths we want to cover. However, current codes have a bug that might leave the condition not held. Improve the test to ensure it.

Test Plan: Run the test in an environment that is used to fail. Also run it many times.

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D38631
2015-05-18 11:49:45 -07:00
sdong
6fa7085121 CompactRange skips levels 1 to base_level -1 for dynamic level base size
Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip those not used levels between level 0 an base level.

Test Plan: Run all unit tests

Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37125
2015-05-18 10:54:11 -07:00
Holodov Alexander
eeb44366ba C api: human-readable statistics 2015-05-16 12:34:28 +04:00
Yueh-Hsuan Chiang
3f0867c0fe Allow GetThreadList to report Flush properties.
Summary:
Allow GetThreadList to report Flush properties, which includes:
* job id
* number of bytes that has been written since flush started.
* total size of input mem-tables

Test Plan:
./db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=100 --value_size=1000

Sample output from db_bench which tracks same flush job

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140213879898240   High Pri      default                Flush       5789 us                    FlushJob::WriteLevel0Table              BytesMemtables 4112835 | BytesWritten 577104 | JobID 8 |

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140213879898240   High Pri      default                Flush     30.634 ms                    FlushJob::WriteLevel0Table              BytesMemtables 4112835 | BytesWritten 1734865 | JobID 8 |

Reviewers: rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38505
2015-05-15 23:22:22 -07:00
Igor Canadi
7413306d94 Take a chance on a random file when choosing compaction
Summary:
When trying to compact entire database with SuggestCompactRange(), we'll first try the left-most files. This is pretty bad, because:
1) the left part of LSM tree will be overly compacted, but right part will not be touched
2) First compaction will pick up the left-most file. Second compaction will try to pick up next left-most, but this will not be possible, because there's a big chance that second's file range on N+1 level is already being compacted.

I observe both of those problems when running Mongo+RocksDB and trying to compact the DB to clean up tombstones. I'm unable to clean them up :(

This diff adds a bit of randomness into choosing a file. First, it chooses a file at random and tries to compact that one. This should solve both problems specified here.

Test Plan: make check

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38379
2015-05-15 14:14:40 -07:00
sdong
5aad881298 DBTest.DynamicLevelMaxBytesBase2: remove an unnecesary check
Summary: DBTest.DynamicLevelMaxBytesBase2 has a check that is not necessary and may fail. Remove it, and add two unrelated check.

Test Plan: Run the test

Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D38457
2015-05-14 09:22:43 -07:00
sdong
ec43a8b9fb Universal Compaction with multiple levels won't allocate up to output size
Summary: Universal compactions with multiple levels should use file preallocation size based on file size if output level is not level 0

Test Plan: Run all tests.

Reviewers: igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D38439
2015-05-13 14:15:46 -07:00
sdong
bc68bd5a13 db_bench to support rate limiter
Summary: Add --rate_limiter_bytes_per_sec to db_bench to allow rater limit to disk

Test Plan:
Run
./db_bench --benchmarks=fillseq --num=30000000 --rate_limiter_bytes_per_sec=3000000 --num_multi_db=8 -disable_wal
And see io_stats to have the rate limited.

Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D38385
2015-05-13 10:03:41 -07:00
Yueh-Hsuan Chiang
df1f87a882 Fixed compile error in db/column_family.cc
Summary:
Fixed the following compile error in db/column_family.cc
    db/column_family.cc:633:33: error: ‘ASSERT_GT’ was not declared in this scope
    16:14:45    ASSERT_GT(listeners.size(), 0U);

Test Plan: make db_test

Reviewers: igor, sdong, rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38367
2015-05-12 16:20:03 -07:00
Yueh-Hsuan Chiang
14431e971d Fixed a bug in EventListener::OnCompactionCompleted().
Summary:
Fixed a bug in EventListener::OnCompactionCompleted() that returns
incorrect list of input / output file names.

Test Plan: Extend existing test in listener_test.cc

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38349
2015-05-12 16:10:23 -07:00
Igor Canadi
dbd95b7532 Add more table properties to EventLogger
Summary:
Example output:

    {"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}

Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.

Test Plan: make check + check out the output in the log

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38343
2015-05-12 15:53:55 -07:00
Igor Canadi
b5881762bc Reset parent_index and base_index when picking files marked for compaction
Summary: This caused a crash of our MongoDB + RocksDB instance. PickCompactionBySize() sets its own parent_index. We never reset this parent_index when picking PickFilesMarkedForCompactionExperimental(). So we might end up doing SetupOtherInputs() with parent_index that was set by PickCompactionBySize, although we're using compaction calculated using PickFilesMarkedForCompactionExperimental.

Test Plan: Added a unit test that fails with assertion on master.

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38337
2015-05-12 11:16:25 -07:00
agiardullo
711465ccec API to fetch from both a WriteBatchWithIndex and the db
Summary:
Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch.  (This is needed for transactions).

I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep).

Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc.  WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)).

Test Plan: Added new unit tests.

Reviewers: rven, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38037
2015-05-11 14:51:51 -07:00
Igor Canadi
3996fff8a1 Fix clang build - add override
Summary: In new clang we need to add override to every overriden function

Test Plan: none

Reviewers: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D38259
2015-05-09 11:04:14 -07:00
Igor Canadi
d978139063 SuggestCompactRange() is manual compaction
Summary: When reporting compaction that was started because of SuggestCompactRange() we should treat it as manual compaction.

Test Plan: none

Reviewers: yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38139
2015-05-08 19:37:02 -07:00
Yueh-Hsuan Chiang
77a5a543a5 Allow GetThreadList() to report basic compaction operation properties.
Summary:
Now we're able to show more details about a compaction in
GetThreadList() :)

This patch allows GetThreadList() to report basic compaction
operation properties.  Basic compaction properties include:
    1. job id
    2. compaction input / output level
    3. compaction property flags (is_manual, is_deletion, .. etc)
    4. total input bytes
    5. the number of bytes has been read currently.
    6. the number of bytes has been written currently.

Flush operation properties will be done in a seperate diff.

Test Plan:
/db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=1

Sample output of tracking same job:

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140664171987072    Low Pri      default           Compaction     31.357 ms     CompactionJob::FinishCompactionOutputFile              BaseInputLevel 1 | BytesRead 2264663 | BytesWritten 1934241 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 |

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140664171987072    Low Pri      default           Compaction     59.440 ms     CompactionJob::FinishCompactionOutputFile              BaseInputLevel 1 | BytesRead 2264663 | BytesWritten 1934241 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 |

          ThreadID ThreadType       cfName            Operation   ElapsedTime                                         Stage        State OperationProperties
   140664171987072    Low Pri      default           Compaction    226.375 ms                        CompactionJob::Install              BaseInputLevel 1 | BytesRead 3958013 | BytesWritten 3621940 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 |

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37653
2015-05-06 22:51:06 -07:00
Igor Canadi
65fe1cfbb3 Cleanup CompactionJob
Summary:
Couple changes:
1. instead of SnapshotList, just take a vector of snapshots
2. don't take a separate parameter is_snapshots_supported. If there are snapshots in the list, that means they are supported. I actually think we should get rid of this notion of snapshots not being supported.
3. don't pass in mutable_cf_options as a parameter. Lifetime of mutable_cf_options is a bit tricky to maintain, so it's better to not pass it in for the whole compaction job. We only really need it when we install the compaction results.

Test Plan: make check

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36627
2015-05-05 19:01:12 -07:00
Laurent Demailly
df4130ad85 fix crashes in stats and compaction filter for db_ttl_impl
Summary: fix crashes in stats and compaction filter for db_ttl_impl

Test Plan:
Ran build with lots of debugging
https://reviews.facebook.net/differential/diff/194175/

Reviewers: yhchiang, igor, rven

Reviewed By: igor

Subscribers: rven, dhruba

Differential Revision: https://reviews.facebook.net/D38001
2015-05-05 16:54:47 -07:00
Venkatesh Radhakrishnan
7ea769487f Fix flakiness in column_family_test
Summary:
Fixes #6840824, running "make check" on centos6 hits
a deadlock in column_family_test

Test Plan:
seq 10000 | parallel --gnu --eta 't=/dev/shm/rdb-{}; rm -rf
$t; mkdir $t && export TEST_TMPDIR=$t; ./column_family_test > $t/log-{}'
Made the test deterministic by narrrowing the window for the flush.

Reviewers: igor, meyering

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38079
2015-05-05 15:59:02 -07:00
Liangjun Feng
9aa011fa36 Optimize GetRange Function
Summary: Optimize GetRange Function by checking the level of the files

Test Plan: pass make all check

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37977
2015-05-05 09:57:47 -07:00
Igor Canadi
36a7408896 Fix UNLIKELY parenthesis
Summary: Ooops :) status.ok() is acutally highly likely :)

Test Plan: none

Reviewers: rven, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D38043
2015-05-05 08:57:34 -07:00
Jim Meyering
2ab7065af1 build: avoid unused-variable warning
Summary:
[noticed a new warning when building with the very latest gcc]
* db/memtablerep_bench.cc (FLAGS_env): Remove declaration
of unused varaible, to avoid this warning/error:

db/memtablerep_bench.cc:135:22: error: ‘FLAGS_env’ defined but not\
  used [-Werror=unused-variable]
 static rocksdb::Env* FLAGS_env = rocksdb::Env::Default();
                      ^

Test Plan: compile

Reviewers: ljin, rven, igor.sugak, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37983
2015-05-02 13:19:10 -07:00
Venkatesh Radhakrishnan
d2346c2cf0 Fix hang with large write batches and column families.
Summary:
This diff fixes a hang reported by a Github user.
https://www.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fissues%2F595%23issuecomment-96983273&h=9AQFYOWlo
Multiple large write batches with column families cause a hang.
The issue was caused by not doing flushes/compaction when the
write controller was stopped.

Test Plan: Create a DBTest from the user's test case

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37929
2015-05-01 15:41:50 -07:00
Mark Callaghan
b6b100fe04 Remove iter_refresh_interval_us
Summary:
The default, use one iter for the whole test, isn't good. This cost me
a few hours of debugging and a few days of tessting. For readonly
that isn't realistic and for read-write that keeps a lot of old sst files around.
I remove the option because nothing uses it and not calling gettimeofday per
loop iteration adds about 3% to QPS at 20 threads.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37965
2015-05-01 14:17:45 -07:00
Igor Canadi
dddceefe5e Fix clang build
Summary: fix build

Test Plan: works

Reviewers: kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37911
2015-04-30 11:11:35 -07:00
krad
d4540654e9 Optimize GetApproximateSizes() to use lesser CPU cycles.
Summary:
CPU profiling reveals GetApproximateSizes as a bottleneck for performance. The current implementation is sub-optimal, it scans every file in every level to compute the result.

We can take advantage of the fact that all levels above 0 are sorted in the increasing order of key ranges and use binary search to locate the starting index. This can reduce the number of comparisons required to compute the result.

Test Plan: We have good test coverage. Run the tests.

Reviewers: sdong, igor, rven, dynamike

Subscribers: dynamike, maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37755
2015-04-30 10:55:03 -07:00
Igor Canadi
7246ad34d0 Don't compact bottommost level in SuggestCompactRange
Summary: Before the fix we also marked the bottommost level for compaction. This is wrong because then RocksDB has N+1 levels instead of N as before the compaction.

Test Plan: SuggestCompactRangeTest in db_test

Reviewers: yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37869
2015-04-29 13:35:48 -07:00
Igor Canadi
7f47ba0e26 Fix possible SIGSEGV in CompactRange (github issue #596)
Summary: For very detailed explanation of what's happening read this: https://github.com/facebook/rocksdb/issues/596

Test Plan: make check + new unit test

Reviewers: yhchiang, anthony, rven

Reviewed By: rven

Subscribers: adamretter, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37779
2015-04-29 10:52:31 -07:00
agiardullo
d6f39c5ae3 Helper function to time Merges
Summary: Remove duplicate code.  If this diff looks good, I will cleanup other call sites as well.

Test Plan: unit tests

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37761
2015-04-27 20:23:50 -07:00
Igor Canadi
1bb4928da9 Include bunch of more events into EventLogger
Summary:
Added these events:
* Recovery start, finish and also when recovery creates a file
* Trivial move
* Compaction start, finish and when compaction creates a file
* Flush start, finish

Also includes small fix to EventLogger

Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files.

Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff.

TODO:
* Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec
* Write tools that parse and analyze LOGs. This can be in python or go. Good intern task.

Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976

Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37521
2015-04-27 15:20:02 -07:00
clark.kang
6ede020dc4 fix typos 2015-04-25 18:14:27 +09:00
sdong
98a44559d5 Build for CYGWIN
Summary:
Make it build for CYGWIN.
Need to define "-std=gnu++11" instead of "-std=c++11" and use some replacement functions.

Test Plan: Build it and run some unit tests in CYGWIN

Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37605
2015-04-23 21:33:44 -07:00
sdong
d01bbb53ae Fix CompactRange for universal compaction with num_levels > 1
Summary:
CompactRange for universal compaction with num_levels > 1 seems to have a bug. The unit test also has a bug so it doesn't capture the problem.
Fix it. Revert the compact range to the logic equivalent to num_levels=1. Always compact all files together.

It should also fix DBTest.IncreaseUniversalCompactionNumLevels. The issue was that options.write_buffer_size = 100 << 10 and options.write_buffer_size = 100 << 10 are not used in later test scenarios. So write_buffer_size of 4MB was used. The compaction trigger condition is not anymore obvious as expected.

Test Plan: Run the new test and all test suites

Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37551
2015-04-23 19:12:31 -07:00
Igor Canadi
e003d3864c Abstract out SetMaxPossibleForUserKey() and SetMinPossibleForUserKey
Summary:
Based on feedback from D37083.

Are all of these correct? In some spaces it seems like we're doing SetMaxPossibleForUserKey() although we want the smallest possible internal key for user key.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37341
2015-04-23 18:08:37 -07:00
Igor Canadi
aa14670b27 Add an assertion in CompactionPicker
Summary: Reading CompactionPicker I noticed this dangerous substraction of two unsigned integers. We should assert to mark this as safe.

Test Plan: make check

Reviewers: anthony, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37041
2015-04-23 17:46:15 -07:00
Giuseppe Ottaviano
2dc421df48 Implement DB::PromoteL0 method
Summary:
This diff implements a new `DB` method `PromoteL0` which moves all files in L0
to a given level skipping compaction, provided that the files have disjoint
ranges and all levels up to the target level are empty.

This method provides finer-grain control for trivial compactions, and it is
useful for bulk-loading pre-sorted keys. Compared to D34797, it does not change
the semantics of an existing operation, which can impact existing code.

PromoteL0 is designed to work well in combination with the proposed
`GetSstFileWriter`/`AddFile` interface, enabling to "design" the level structure
by populating one level at a time. Such fine-grained control can be very useful
for static or mostly-static databases.

Test Plan: `make check`

Reviewers: IslamAbdelRahman, philipp, MarkCallaghan, yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37107
2015-04-23 12:10:36 -07:00
sdong
9bf40b64d0 Print max score in level summary
Summary: Add more logging to help debugging issues.

Test Plan: Run test suites

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37401
2015-04-23 11:34:36 -07:00
sdong
397b6588bd options.paranoid_file_checks to read all rows after writing to a file.
Summary: To further distinguish the corruption cases were caused by storage media or in memory states when writing it, add a paranoid check after writing the file to iterate all the rows.

Test Plan: Add a new unit test for it

Reviewers: rven, igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37335
2015-04-23 11:34:35 -07:00
Venkatesh Radhakrishnan
618d07b068 Making PreShutdown tests more reliable.
Summary:
A couple of times on Travis, we have had the thread status say that there were no compactions done and since we assert for it, the test failed.
We now fix this by waiting till compaction started.

Test Plan:
run DBTEST::*PreShutdown*

d=/tmp/j; rm -rf $d; seq 200 | parallel --gnu --eta 'd=/tmp/j/d-{}; mkdir -p $d; TEST_TMPDIR=$d ./db_test --gtest_filter=DBTest.PreShutdown* >& '$d'/log-{}'

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37545
2015-04-23 08:35:02 -07:00
Igor Canadi
6059bdf86a Add experimental API MarkForCompaction()
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).

All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.

MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.

Test Plan: added a new unit test

Reviewers: yhchiang, rven, MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37083
2015-04-17 16:44:45 -07:00
Jim Meyering
acf8a4141d maint: use ASSERT_TRUE, not ASSERT_EQ(true; same for false
Summary:
The usage I'm fixing here caused trouble on Fedora 21 when
compiling with the current gcc version 4.9.2 20150212 (Red Hat 4.9.2-6) (GCC):

  db/write_controller_test.cc: In member function ‘virtual void rocksdb::WriteControllerTest_SanityTest_Test::TestBody()’:
  db/write_controller_test.cc:23:165: error: converting ‘false’ to pointer type for argument 1 of ‘char testing::internal::IsNullLiteralHelper(testing::internal::Secret*)’ [-Werror=conversion-null]
     ASSERT_EQ(false, controller.IsStopped());
                                                                                                                                                                          ^

This change was induced mechanically via:

  git grep -l -E 'ASSERT_EQ\(false'|xargs perl -pi -e 's/ASSERT_EQ\(false, /ASSERT_FALSE(/'
  git grep -l -E 'ASSERT_EQ\(true'|xargs perl -pi -e 's/ASSERT_EQ\(true, /ASSERT_TRUE(/'

Except for the three in utilities/backupable/backupable_db_test.cc for which
I ended up reformatting (joining lines) in the result.

As for why this problem is exhibited with that version of gcc, and none
of the others I've used (from 4.8.1 through gcc-5.0.0 and newer), I suspect
it's a bug in F21's gcc that has been fixed in gcc-5.0.0.

Test Plan:
  "make" now succeed on Fedora 21

Reviewers: ljin, rven, igor.sugak, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D37329
2015-04-17 14:54:17 -07:00
Igor Canadi
b5400f90fe Kill dead code
Summary: this is not used anywhere

Test Plan: compiles

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37053
2015-04-17 12:07:47 -07:00
Igor Canadi
00c2afcd38 Fix bug in ExpandWhileOverlapping()
Summary: If ExpandWhileOverlapping() we don't clear inputs. That's a bug introduced by my recent patch https://reviews.facebook.net/D36687. However, we have no tests covering ExpandWhileOverlapping(). I created a task t6771252 to add ExpandWhileOverlapping() tests.

Test Plan: make check

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D37077
2015-04-16 19:31:10 -07:00
sdong
debaf85ef5 Bug of trivial move of dynamic level
Summary: D36669 introduces a bug that trivial moved data is not going to specific level but the next level, which will incorrectly be level 1 for level 0 compaciton if base level is not level 1. Fixing it by appreciating the output level

Test Plan: Run all tests

Reviewers: MarkCallaghan, rven, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37119
2015-04-14 21:42:08 -07:00
sdong
12d7d3d28d Fix and Improve DBTest.DynamicLevelCompressionPerLevel2
Summary:
Recent change of DBTest.DynamicLevelCompressionPerLevel2 has a bug that the second sync point is not enabled. Fix it. Also add an assert for that.
Also, flush compression is not tracked in the test. Add it.

Test Plan: Build everything

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37101
2015-04-14 21:42:08 -07:00
sdong
a1271c6c6f Fix build break introduced by new SyncPoint interface change
Summary: When commiting the sync point interface change, didn't resolve the new occurance of the old interface in rebase. Fix it.

Test Plan: Build and see it pass

Reviewers: igor, yhchiang, rven, anthony, kradhakrishnan

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D37095
2015-04-14 16:42:37 -07:00
sdong
fcb206b667 SyncPoint to allow a callback with an argument and use it to get DBTest.DynamicLevelCompressionPerLevel2 more straight-forward
Summary:
Allow users to give a callback function with parameter using sync point, so more complicated verification can be done in tests.
Use it in DBTest.DynamicLevelCompressionPerLevel2 so that failures will be more easy to debug.

Test Plan: Run all tests. Run DBTest.DynamicLevelCompressionPerLevel2 with valgrind check.

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36999
2015-04-14 16:18:50 -07:00
Igor Canadi
281db8bb62 Temporarily disable test CompactFilesOnLevelCompaction
Summary: https://reviews.facebook.net/D36963 made the debug build much faster and that triggered failures of CompactFilesOnLevelCompaction test. 3 out of 4 last tests on Jenkins failed. I'm disabling this test temporarily, since we likely know the reason why it's failing and there's already work in progress to address it -- https://reviews.facebook.net/D36225

Test Plan: none

Reviewers: sdong, rven, yhchiang, meyering

Reviewed By: meyering

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36993
2015-04-13 19:30:40 -07:00
Igor Canadi
9b983befa8 Fix flakiness of WalManagerTest
Summary: We should use mocked-out env for these tests to make it more realiable. Added benefit is that instead of actually sleeping for 3 seconds, we can instead pretend to sleep and just increase time counters.

Test Plan: for i in `seq 100`; do ./wal_manager_test --gtest_filter=WalManagerTest.WALArchivalTtl ;done

Reviewers: rven, meyering

Reviewed By: meyering

Subscribers: meyering, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36951
2015-04-13 16:15:05 -07:00
Igor Canadi
e7ad14926a Fix flakiness in FIFOCompaction test (github issue #573)
Summary:
The problem is that sometimes two memtables will be compacted together into a single file. In that case, our assertion

        ASSERT_EQ(NumTableFilesAtLevel(0), 5);

fails because same amount of data is in 4 files instead of 5. We should wait for flush so that we prevent two memtables merging into a single file.

Test Plan: `for i in `seq 20`; do mrtest FIFOCompactionTest; done` -- fails at least once before. fails zero times after.

Reviewers: rven

Reviewed By: rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36939
2015-04-13 11:39:45 -07:00
Igor Canadi
abb4052278 Kill benchharness
Summary:
1. it doesn't work
2. we're not using it

In the future, if we need general benchmark framework, we should probably use https://github.com/google/benchmark

Test Plan: make all

Reviewers: yhchiang, rven, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36777
2015-04-13 10:17:42 -07:00
Igor Canadi
590fadc407 Fix compile warning on CLANG
Summary: oops

Test Plan: compiles now

Reviewers: sdong, yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36867
2015-04-10 15:14:57 -07:00
Igor Canadi
47b8743984 Make Compaction class easier to use
Summary:
The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong).

Here are couple of things demonstrating that Compaction class is hard to use:
1. we have two constructors of Compaction class
2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles
3. it's easy to introduce a subtle and dangerous bug like this: D36225
4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: afbafeaeae/db/compaction.cc (L236-L241). It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: afbafeaeae/db/compaction_picker.cc (L204-L210)

The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup.

My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object.

This diff doesn't fully achieve that goal, but it comes pretty close. Here are some of the changes:
* have one Compaction constructor instead of two.
* inputs_ is constant after construction
* MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction.
* SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input.
* CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need.

Test Plan:
make check
make asan_check
make valgrind_check

Reviewers: rven, anthony, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: sdong, yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36687
2015-04-10 15:01:54 -07:00
agiardullo
753dd1fdd0 Fix valgrind issues in memtable_list_test
Summary: Need to remember to unref MemTableList->current() before deleting.

Test Plan: ran test with valgrind

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36855
2015-04-10 14:16:03 -07:00
krad
697380f3d7 Repairer documentation improvement.
Summary: Adding verbosity to existing comments.

Test Plan: None

Reviewers: sdong

CC: leveldb

Task ID: #6718960

Blame Rev:
2015-04-10 12:35:28 -07:00
agiardullo
0feeee6433 Fix memtable_list_test
Summary:
Test failing due to a missing directory caused by a simple bug (did not run into this on my dev box since the path already existed).

We should look into deleting test::TmpDir() before each test run.

Test Plan: ran test

Reviewers: igor, yhchiang, meyering, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36831
2015-04-09 22:11:35 -07:00
Yueh-Hsuan Chiang
7b9581bc3b Fixed xfunc related compile errors in ROCKSDB_LITE
Summary:
Fixed xfunc related compile errors in ROCKSDB_LITE

Now make OPT=-DROCKSDB_LITE shared_lib -j32 would work

Test Plan:
make clean
make OPT=-DROCKSDB_LITE shared_lib -j32
make clean
make OPT=-DROCKSDB_LITE static_lib -j32

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36825
2015-04-09 21:05:18 -07:00
agiardullo
fabc115690 MemTableList tests
Summary: Add tests for MemTableList

Test Plan: run test

Reviewers: yhchiang, kradhakrishnan, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36735
2015-04-09 18:01:11 -07:00
Yueh-Hsuan Chiang
9741dec0e5 Fix a compile error in ROCKSDB_LITE in db/db_impl.cc
Summary:
Fix a compile error in ROCKSDB_LITE in db/db_impl.cc
related to internal_stats.

Test Plan: make OPT=-DROCKSDB_LITE shared_lib

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36819
2015-04-09 17:07:29 -07:00
Yueh-Hsuan Chiang
d2a056241a Fix a compilation error in ROCKSDB_LITE in db/internal_stats.h
Summary:
Fix a compilation error in ROCKSDB_LITE in db/internal_stats.h

Other compilation errors will be fixed in a separate diff.

Test Plan: make OPT=-DROCKSDB_LITE

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36807
2015-04-09 16:43:54 -07:00
sdong
316ec80bf8 fault_injection_test: add a test case to cover log syncing after a log roll
Summary:
Add a test case:
Write some keys without sync, flush, write other keys and do sync. Before flush finishes, host crashes and unsync data is dropped.
Tag the new test as disabled since it is not passing.

Test Plan: Run the test

Reviewers: MarkCallaghan, rven, anthony, igor, kradhakrishnan

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36741
2015-04-09 16:15:42 -07:00
Mark Callaghan
ed229a0dee Fixes for readcache-flashcache
Summary:
This fixes two problems:
1) the env should not be created twice when use_existing_db is false
2) the env dtor should run before cachedev_fd_ is closed.

Task ID: #

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36795
2015-04-09 15:51:34 -07:00
agiardullo
84c5bd7eb9 Add thread-safety documentation to MemTable and related classes
Summary: Other than making some class members private, this is a documentation-only change

Test Plan: unit tests

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36567
2015-04-08 21:10:35 -07:00
krad
2b019a1512 Enabling checksum in repair db as it should have been.
Summary: I think the checksum was turned off by mistake.

Test Plan: Run make check

Reviewers: igor sdong chip

CC:

Task ID:

Blame Rev:
2015-04-08 15:52:02 -07:00
sdong
b1bbdd7919 Create EnvOptions using sanitized DB Options
Summary: Now EnvOptions uses unsanitized DB options. bytes_per_sync is tuned off when rate_limiter is used, but this change doesn't take effort.

Test Plan: See different I/O pattern in db_bench running fillseq.

Reviewers: yhchiang, kradhakrishnan, rven, anthony, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36723
2015-04-08 14:40:42 -07:00
sdong
b118238a57 Trivial move to cover multiple input levels
Summary: Now trivial move is only triggered when moving from level n to n+1. With dynamic level base, it is possible that file is moved from level 0 to level n, while levels from 1 to n-1 are empty. Extend trivial move to this case.

Test Plan: Add a more unit test of sequential loading. Non-trivial compaction happened without the patch and now doesn't happen.

Reviewers: rven, yhchiang, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba, IslamAbdelRahman

Differential Revision: https://reviews.facebook.net/D36669
2015-04-08 09:26:40 -07:00
krad
58346b9e29 Log writer record format doc.
Summary: Added a ASCII doodle to represent the log writer format.

Test Plan: None

Reviewers: sdong

CC: leveldb

Task ID: 6179896

Blame Rev:
2015-04-07 16:25:56 -07:00
Yoshinori Matsunobu
f12614070f Fix TSAN build error of D36447
Summary:
D36447 caused build error when using COMPILE_WITH_TSAN=1.
This diff fixes the error.

Test Plan: jenkins

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36579
2015-04-06 17:37:36 -07:00
Yoshinori Matsunobu
824e646341 Adding another NewFlashcacheAwareEnv function to support pre-opened fd
Summary:
There are some cases when flachcache file descriptor was
already allocated (i.e. fb-MySQL). Then NewFlashcacheAwareEnv returns an
error at open() because fd was already assigned. This diff adds another
function to instantiate FlashcacheAwareEnv, with pre-allocated fd cachedev_fd.

Test Plan: Tested with MyRocks using this function, then worked

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, MarkCallaghan, rven

Differential Revision: https://reviews.facebook.net/D36447
2015-04-06 16:50:36 -07:00
Igor Canadi
5e067a7b19 Clean up compression logging
Summary: Now we add warnings when user configures compression and the compression is not supported.

Test Plan:
Configured compression to non-supported values. Observed messages in my log:

    2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2.

    2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data.

Reviewers: rven, sdong, yhchiang

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35979
2015-04-06 12:50:44 -07:00
sdong
953a885ebf A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.

Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties

Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.

Reviewers: yhchiang, igor.sugak, rven, igor

Reviewed By: rven, igor

Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D35373
2015-04-06 10:27:21 -07:00
Jim Meyering
d2a92c13bc avoid returning a number-of-active-keys estimate of nearly 2^64
Summary:
If accumulated_num_non_deletions_ were ever smaller than
accumulated_num_deletions_, the computation of
"accumulated_num_non_deletions_ - accumulated_num_deletions_"
would result in a logically "negative" value, but since
the two operands are unsigned (uint64_t), the result corresponding
to e.g., -1 would 2^64-1.

Instead, return 0 in that case.

Test Plan:
  - ensure "make check" still passes
  - temporarily add an "abort();" call in the new "if"-block, and
      observe that it fails in some test cases.  However, note that
      this case is triggered only when the two numbers are equal.
      Thus, no test case triggers the erroneous behavior this
      change is designed to avoid. If anyone can construct a
      scenario in which that bug would be triggered, I'll be
      happy to add a test case.

Reviewers: ljin, igor, rven, igor.sugak, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36489
2015-04-03 14:46:35 -07:00
sdong
a7ac6cef1f Fix level size overflow for options_.level_compaction_dynamic_level_bytes=true
Summary: Int is used for level size targets when options_.level_compaction_dynamic_level_bytes=true, which will cause overflow when database grows big. Fix it.

Test Plan: Add a new unit test which fails without the fix.

Reviewers: rven, yhchiang, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D36453
2015-04-03 09:04:35 -07:00
sdong
089509b847 db_test: clean up sync points in test cleaning up
Summary: In some db_test tests sync points are not cleared which will cause unexpected results in the next tests. Clean them up in test cleaning up.

Test Plan:
Run the same tests that used to fail:

build using USE_CLANG=1 and run
./db_test --gtest_filter="DBTest.CompressLevelCompaction:*DBTestUniversalCompactionParallel*"

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36429
2015-04-02 16:17:58 -07:00
Venkatesh Radhakrishnan
afbafeaeae Disallow trivial move if compression level is different
Summary:
Check compression level of start_level with output_compression
before allowing trivial move

Test Plan: New DBTest CompressLevelCompactionThirdPath added

Reviewers: igor, yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36213
2015-04-02 11:06:30 -07:00
Venkatesh Radhakrishnan
d0695f3e26 Fix crash caused by opening an empty DB in readonly mode
Summary:
This diff fixes a crash found when an empty database is opened in readonly mode.
We now check the number of levels before we open the DB as a compacted DB.

Test Plan: DBTest.EmptyCompactedDB

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36327
2015-04-01 16:55:08 -07:00
Herman Lee
51c8133a72 Fix make unity build compiler warning about "stats" shadowing global variable
Summary:
Fix the make unity build. The local stats variable name was shadowing a
global stats variable.

Test Plan:
Run the build
OPT=-DTRAVIS V=1 make unity

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36285
2015-04-01 10:48:42 -07:00
sdong
76d63b4525 Fix one non-determinism of DBTest.DynamicCompactionOptions
Summary:
After recent change of DBTest.DynamicCompactionOptions, occasionally hit another non-deterministic case where L0 showdown is triggered while timeout should not triggered for hard limit.
Fix it by increasing L0 slowdown trigger at the same time.

Test Plan: Run the failed test.

Reviewers: igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D36219
2015-03-30 15:53:44 -07:00
sdong
b23bbaa82a Universal Compactions with Small Files
Summary:
With this change, we use L1 and up to store compaction outputs in universal compaction.
The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible.

If options.num_levels=1, it behaves all the same as now.

Test Plan:
1) convert most of existing unit tests for universal comapaction to include the option of one level and multiple levels.
2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels
3) add unit test to migrate from multiple level setting back to one level setting
4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results.

Reviewers: rven, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: meyering, leveldb, MarkCallaghan, dhruba

Differential Revision: https://reviews.facebook.net/D34539
2015-03-30 15:12:02 -07:00
Igor Canadi
2511b7d947 Makefile minor cleanup
Summary:
Just couple of small changes:
1. removed signal_test, since it doesn't seem useful and we don't even run it as part of `make check`
2. moved perf_context_test to TESTS instead of PROGRAMS
3. `make release` probably shouldn't compile benchmarks. We currently rely on `make release` building db_bench (via Jenkins), so I left db_bench there.

This is just a minor cleanup. We need to rethink our targets since they are a bit messy right now. We can do this during our tech debt week.

Test Plan: make release

Reviewers: anthony, rven, yhchiang, sdong, meyering

Reviewed By: meyering

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36171
2015-03-30 16:05:35 -04:00
Mark Callaghan
1bd70fb54a Add --stats_interval_seconds to db_bench
Summary:
The --stats_interval_seconds determines interval for stats reporting
and overrides --stats_interval when set. I also changed tools/benchmark.sh
to report stats every 60 seconds so I can avoid trying to figure out a
good value for --stats_interval per test and per storage device.

Task ID: #6631621

Blame Rev:

Test Plan:
run tools/run_flash_bench, look at output

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36189
2015-03-30 12:58:32 -07:00
Igor Canadi
fd3dbef22b Clean up old log files in background threads
Summary:
Cleaning up log files can do heavy IO, since we call ftruncate() in the destructor. We don't want to call ftruncate() in user threads.

This diff moves cleaning to background threads (flush and compaction)

Test Plan: make check, will also run valgrind

Reviewers: yhchiang, rven, MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D36177
2015-03-30 15:04:10 -04:00
Mark Callaghan
99ec2412e5 Make the benchmark scripts configurable and add tests
Summary:
This makes run_flash_bench.sh configurable. Previously it was hardwired for 1B keys and tests
ran for 12 hours each. That kept me from using it. This makes it configuable, adds more tests,
makes the duration per-test configurable and refactors the test scripts.

Adds the seekrandomwhilemerging test to db_bench which is the same as seekrandomwhilewriting except
the writer thread does Merge rather than Put.

Forces the stall-time column in compaction IO stats to use a fixed format (H:M:S) which makes
it easier to scrape and parse. Also adds an option to AppendHumanMicros to force a fixed format.
Sometimes automation and humans want different format.

Calls thread->stats.AddBytes(bytes); in db_bench for more tests to get the MB/sec summary
stats in the output at test end.

Adds the average ingest rate to compaction IO stats. Output now looks like:
https://gist.github.com/mdcallag/2bd64d18be1b93adc494

More information on the benchmark output is at https://gist.github.com/mdcallag/db43a58bd5ac624f01e1

For benchmark.sh changes default RocksDB configuration to reduce stalls:
* min_level_to_compress from 2 to 3
* hard_rate_limit from 2 to 3
* max_grandparent_overlap_factor and max_bytes_for_level_multiplier from 10 to 8
* L0 file count triggers from 4,8,12 to 4,12,20 for (start,stall,stop)

Task ID: #6596829

Blame Rev:

Test Plan:
run tools/run_flash_bench.sh

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D36075
2015-03-30 11:28:25 -07:00
Igor Canadi
d61cb0b9de db_bench can now disable flashcache for background threads
Summary: Most of the approach is copied from WebSQL's MySQL branch. It's nice that we can do this without touching core RocksDB code.

Test Plan: Compiles and runs. Didn't test flashback code, as I don't have flashback device and most if it is c/p

Reviewers: MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: rven, lgalanis, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35391
2015-03-30 09:51:11 -07:00
Herman Lee
e018892bb6 Formalize the DB properties string definitions.
Summary:
Assign the string properties to const string variables under the
DB::Properties namespace. This helps catch typos during compilation and
also consolidates the property definition in one place.

Test Plan: Run rocksdb unit tests

Reviewers: sdong, yoshinorim, igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D35991
2015-03-27 14:50:20 -07:00
Igor Canadi
030859eb5d Dump compression info on startup
Summary: It's useful to know if we have compression support or no

Test Plan:
Observed this in my LOG:

      2015/03/26-10:34:35.460681 7f5b322b7840 Snappy supported
      2015/03/26-10:34:35.460682 7f5b322b7840 Zlib supported
      2015/03/26-10:34:35.460686 7f5b322b7840 Bzip supported
      2015/03/26-10:34:35.460687 7f5b322b7840 LZ4 NOT supported

Reviewers: sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35955
2015-03-26 11:22:20 -07:00
Alexander.Mikhaylov
a3e4b32483 fix compilation error (same as fix #284)
[maa@srv2-nskb-devg2 rocksdb-master]$ CXX=/usr/local/CC/gcc-4.7.4/bin/g++ EXTRA_CXXFLAGS=-std=c++11 DISABLE_WARNING_AS_ERROR=1  make db_bench
  CC       db/db_bench.o
db/db_bench.cc: In member function 'rocksdb::Slice rocksdb::Benchmark::AllocateKey(std::unique_ptr<const char []>*)':
db/db_bench.cc:1434:41: error: use of deleted function 'void std::unique_ptr<_Tp [], _Dp>::reset(_Up) [with _Up = char*; _Tp = const char; _Dp = std::default_delete<const char []>]'
In file included from /usr/local/CC/gcc-4.7.4/lib/gcc/x86_64-unknown-linux-gnu/4.7.4/../../../../include/c++/4.7.4/memory:86:0,
                 from ./include/rocksdb/db.h:14,
                 from ./db/dbformat.h:14,
                 from ./db/db_impl.h:21,
                 from db/db_bench.cc:33:
2015-03-26 14:53:42 +06:00
Anurag Indu
3d1a924ff3 Adding stats for the merge and filter operation
Summary:
We have addded new stats and perf_context for measuring the merge and filter operation time consumption.
We have bounded all the merge operations within the GUARD statment and collected the total time for these operations in the DB.

Test Plan: WIP

Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34377
2015-03-24 14:42:04 -07:00
Yueh-Hsuan Chiang
248c063ba1 Report elapsed time in micros in ThreadStatus instead of start time.
Summary:
Report elapsed time of a thread operation in micros in ThreadStatus
instead of start time of a thread operation in seconds since the
Epoch, 1970-01-01 00:00:00 (UTC).

Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
--max_background_compactions=10 --max_background_flushes=3 \
--thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
--num_column_families=10

Sample Output:
            ThreadID ThreadType                    cfName    Operation  ElapsedTime                                         Stage        State
     140667724562496   High Pri column_family_name_000002        Flush   772.419 ms                    FlushJob::WriteLevel0Table
     140667728756800   High Pri                   default        Flush   617.845 ms                    FlushJob::WriteLevel0Table
     140667732951104   High Pri column_family_name_000005        Flush   772.078 ms                    FlushJob::WriteLevel0Table
     140667875557440    Low Pri column_family_name_000008   Compaction  1409.216 ms                        CompactionJob::Install
     140667737145408    Low Pri
     140667749728320    Low Pri
     140667816837184    Low Pri column_family_name_000007   Compaction  1071.815 ms      CompactionJob::ProcessKeyValueCompaction
     140667787477056    Low Pri column_family_name_000009   Compaction   772.516 ms      CompactionJob::ProcessKeyValueCompaction
     140667741339712    Low Pri
     140667758116928    Low Pri column_family_name_000004   Compaction   620.739 ms      CompactionJob::ProcessKeyValueCompaction
     140667753922624    Low Pri
     140667842003008    Low Pri column_family_name_000006   Compaction  1260.079 ms      CompactionJob::ProcessKeyValueCompaction
     140667745534016    Low Pri

Reviewers: sdong, igor, rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35769
2015-03-24 11:32:25 -07:00
Yueh-Hsuan Chiang
a057bb2a8e Improve ThreadStatusSingleCompaction
Summary:
Improve ThreadStatusSingleCompaction in two ways:
1. Use SYNC_POINT to ensure compaction won't happen
   before the test finishes its "Put Phase" instead of
   using sleep.
2. In Put Phase, it continues until we have sufficient
   number of L0 files.  Note that during the put phase,
   there won't be any compaction that consumes L0 files
   because of item 1.

Test Plan: ./db_test  --gtest_filter="*ThreadStatusSingleCompaction*"

Reviewers: sdong, igor, rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35727
2015-03-23 15:30:45 -07:00
sdong
38d286f146 Clean-up WAL directory before running db_test
Summary: DBTest doesn't clean up wal directory. It might cause failure after a failure test run. Fix it.

Test Plan:
Run unit tests
Try open DB with non-empty db_path/wal.

Reviewers: rven, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D35559
2015-03-20 12:34:24 -07:00
Igor Sugak
9405b5ef8f rocksdb: Remove #include "util/string_util.h" from util/testharness.h
Summary:
1. Manually deleted #include "util/string_util.h" from util/testharness.h
2.
```
% USE_CLANG=1 make all -j55 -k 2> build.log
% perl -naF: -E 'say $F[0] if /: error:/' build.log | sort -u | xargs sed -i '/#include "util\/testharness.h"/i #include "util\/string_util.h"'
```

Test Plan:
Make sure make all completes with no errors.
```
% make all -j55
```

Reviewers: meyering, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35493
2015-03-19 17:29:37 -07:00
Igor Canadi
b088c83e6e Don't delete files when column family is dropped
Summary:
To understand the bug read t5943287 and check out the new test in column_family_test (ReadDroppedColumnFamily), iter 0.

RocksDB contract allowes you to read a drop column family as long as there is a live reference. However, since our iteration ignores dropped column families, AddLiveFiles() didn't mark files of a dropped column families as live. So we deleted them.

In this patch I no longer ignore dropped column families in the iteration. I think this behavior was confusing and it also led to this bug. Now if an iterator client wants to ignore dropped column families, he needs to do it explicitly.

Test Plan: Added a new unit test that is failing on master. Unit test succeeds now.

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32535
2015-03-19 17:04:29 -07:00
Igor Canadi
52e0f3353f Clean up compactions_in_progress_
Summary: Suprisingly, the only way we use this vector is to keep track of level0 compactions. Thus, I simplified it.

Test Plan: make check

Reviewers: rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35313
2015-03-18 18:25:15 -07:00
Igor Sugak
6b626ff24c rocksdb: change db_test::MultiThreadedDBTest as value parameterized test.
Summary: This is a simple change to make db_test::MultiThreadedDBTest as value parameterized test. There is a value of creating a separate set of such tests later.

Test Plan:
```lang=bash
% make db_test
% ./make db_test
```

Also with the following command I can execute all db_test in 2:37.87 on my box
```
% ./db_test --gtest_list_tests | sed 's/\# GetParam.*//' | tr -d ' ' | env time parallel --gnu --eta --joblog=LOG -- 'TEST_TMPDIR=/dev/shm/rocksdb-{} ./db_test --gtest_filter="*{}"'
```

Reviewers: igor, rven, meyering, sdong

Reviewed By: meyering

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35361
2015-03-18 18:18:12 -07:00
sdong
0831a35994 Add a DB Property For Number of Deletions in Memtables
Summary: Add a DB property for number of deletions in memtables. It can sometimes help people debug slowness because of too many deletes.

Test Plan: Add test cases.

Reviewers: rven, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D35247
2015-03-18 17:03:59 -07:00
Mark Callaghan
dfccc7b4e2 Add readwhilemerging benchmark
Summary:
This is like readwhilewriting but uses Merge rather than Put in the writer thread.
I am using it for in-progress benchmarks. I don't think the other benchmarks for Merge
cover this behavior. The purpose for this test is to measure read performance when
readers might have to merge results. This will also benefit from work-in-progress
to add skewed key generation.

Task ID: #

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D35115
2015-03-18 13:50:52 -07:00
agiardullo
81345b90f9 Create an abstract interface for write batches
Summary: WriteBatch and WriteBatchWithIndex now both inherit from a common abstract base class.  This makes it easier to write code that is agnostic toward the implementation of the particular write batch.  In particular, I plan on utilizing this abstraction to allow transactions to support using either implementation of a write batch.

Test Plan: modified existing WriteBatchWithIndex tests to test new functions.  Running all tests.

Reviewers: igor, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34017
2015-03-17 19:23:08 -07:00
Igor Canadi
c88ff4ca76 Deprecate removeScanCountLimit in NewLRUCache
Summary: It is no longer used by the implementation, so we should also remove it from the public API.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34971
2015-03-17 15:04:37 -07:00
Igor Sugak
b4b69e4f77 rocksdb: switch to gtest
Summary:
Our existing test notation is very similar to what is used in gtest. It makes it easy to adopt what is different.
In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest.

There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to fallow gtest notation. But eventually we can remove them and use implementation of `main` that gtest provides.

```lang=bash
% cat ~/transform
#!/bin/sh
files=$(git ls-files '*test\.cc')
for file in $files
do
  if grep -q "rocksdb::test::RunAllTests()" $file
  then
    if grep -Eq '^class \w+Test {' $file
    then
      perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file
      perl -pi -e 's/^(TEST)/${1}_F/g' $file
    fi
    perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file
    perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file
  fi
done
% sh ~/transform
% make format
```

Second iteration of this diff contains only scripted changes.

Third iteration contains manual changes to fix last errors and make it compilable.

Test Plan:
Build and notice no errors.
```lang=bash
% USE_CLANG=1 make check -j55
```
Tests are still testing.

Reviewers: meyering, sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35157
2015-03-17 14:08:00 -07:00
Venkatesh Radhakrishnan
98c37fda5d Remove unused parameter in CancelAllBackgroundWork
Summary: Some suggestions for cleanup from Igor.

Test Plan: Regression tests.

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D35169
2015-03-16 21:07:54 -07:00
Igor Sugak
9fd6edf81c rocksdb: Replace ASSERT* with EXPECT* in functions that does not return void value
Summary:
gtest does not use exceptions to fail a unit test by design, and `ASSERT*`s are implemented using `return`. As a consequence we cannot use `ASSERT*` in a function that does not return `void` value ([[ https://code.google.com/p/googletest/wiki/AdvancedGuide#Assertion_Placement | 1]]), and have to fix our existing code. This diff does this in a generic way, with no manual changes.

In order to detect all existing `ASSERT*` that are used in functions that doesn't return void value, I change the code to generate compile errors for such cases.

In `util/testharness.h` I defined `EXPECT*` assertions, the same way as `ASSERT*`, and redefined `ASSERT*` to return `void`. Then executed:

```lang=bash
% USE_CLANG=1 make all -j55 -k 2> build.log
% perl -naF: -e 'print "-- -number=".$F[1]." ".$F[0]."\n" if  /: error:/' \
build.log | xargs -L 1 perl -spi -e 's/ASSERT/EXPECT/g if $. == $number'
% make format
```
After that I reverted back change to `ASSERT*` in `util/testharness.h`. But preserved introduced `EXPECT*`, which is the same as `ASSERT*`. This will be deleted once switched to gtest.

This diff is independent and contains manual changes only in `util/testharness.h`.

Test Plan:
Make sure all tests are passing.
```lang=bash
% USE_CLANG=1 make check
```

Reviewers: igor, lgalanis, sdong, yufei.zhu, rven, meyering

Reviewed By: meyering

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33333
2015-03-16 20:52:32 -07:00
Venkatesh Radhakrishnan
b2b3086524 Speed up rocksDB close call.
Summary:
On RocksDB, when there are multiple instances doing
flushes/compactions in the background, the close call takes a long time
because the flushes/compactions need to complete before the database can
shut down. If another instance is using the background threads and the compaction for this instance is in the queue since it has been scheduled, we still cannot shutdown. We now remove the scheduled background tasks which have not yet started running, so that shutdown is speeded up.

Test Plan: DB Test added.

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33741
2015-03-16 18:49:14 -07:00
Igor Sugak
95344346af rocksdb: Small refactoring before migrating to gtest
Summary: These changes are necessary to make tests look more generic, and avoid feature conflicts with gtest.

Test Plan:
Make sure no build errors, and all test are passing.
```
% make check
```

Reviewers: igor, meyering

Reviewed By: meyering

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35145
2015-03-16 18:08:59 -07:00
Mark Callaghan
56337faf3e Fix compaction IO stats to handle large file counts
Summary:
The output did not have space for 6-digit file counts or for 3-digit
counts of files being compacted. This adds space for that while preserving
existing alignment. See https://gist.github.com/mdcallag/0a61c6a18dd467224c11

Task ID: #

Blame Rev:

Test Plan:
run db_bench, look at output

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D35091
2015-03-16 11:50:23 -07:00
Igor Canadi
c6967a1a5e Make RecordIn/RecordOut human readable
Summary: I had hard time understanding these big numbers. Here's how the output looks like now: https://gist.github.com/igorcanadi/4c39c17685049584a992

Test Plan: db_bench

Reviewers: sdong, MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35073
2015-03-14 15:12:41 -07:00
Mark Callaghan
c8da670325 Stop printing per-level stall times.
Summary:
Per-level stall times are the suggested stall time, not the actual stall time so this change stops printing them
both in the per-level output lines and in the summary. Also changed output for total stall time to include units
in all cases. The new output looks like:
Level   Files   Size(MB) Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt)    RecordIn   RecordDrop
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     4/1          7   0.8      0.0     0.0      0.0       0.6      0.6       0.0   0.0      0.0     12.9        50       352    0.141        882            0            0
  L1     5/0          9   0.9      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000          0            0            0
  L2    54/0         99   1.0      0.0     0.0      0.0       0.0      0.0       0.6   0.0      0.0      0.0         0         0    0.000          0            0            0
  L3   289/0        527   0.5      0.0     0.0      0.0       0.0      0.0       0.5   0.0      0.0      0.0         0         0    0.000          0            0            0
 Sum   352/1        642   0.0      0.0     0.0      0.0       0.6      0.6       1.7   1.0      0.0     12.9        50       352    0.141        882            0            0
 Int     0/0          0   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     15.5         0         3    0.118          7            0            0
Flush(GB): accumulative 0.627, interval 0.005
Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 882 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard

Task ID: #6493861

Blame Rev:

Test Plan:
run db_bench, look at output

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D35085
2015-03-14 15:01:43 -07:00
Yueh-Hsuan Chiang
12134139e3 Fixed the unit-test issue in PreShutdownCompactionMiddle
Summary: Fixed the unit-test issue in PreShutdownCompactionMiddle

Test Plan: export ROCKSDB_TESTS=PreShutdownCompactionMiddle

Reviewers: rven, sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35061
2015-03-14 08:25:27 -07:00
Yueh-Hsuan Chiang
fd1b3f385a Fix the issue in PreShutdownMultipleCompaction
Summary: Fix the issue in PreShutdownMultipleCompaction

Test Plan:
export ROCKSDB_TESTS=PreShutdownMultipleCompaction
./db_test

Reviewers: rven, sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35055
2015-03-14 08:03:02 -07:00
Igor Canadi
417367c42d Fix SIGSEGV when not using cache 2015-03-13 16:41:00 -07:00
Venkatesh Radhakrishnan
e25ff039c8 Prevent slowdowns and stalls in PreShutdown tests
Summary:
The preshutdown tests check for stopped compactions/flushes.
Removing stalls on the write path.

Test Plan: DBTests.PreShutdown*

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35037
2015-03-13 14:51:40 -07:00
Igor Canadi
f690712652 Speed up db_bench shutdown
Summary: See t6489044

Test Plan: compiles

Reviewers: MarkCallaghan

Reviewed By: MarkCallaghan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34977
2015-03-13 14:45:15 -07:00
Yueh-Hsuan Chiang
c1b3cde18a Improve the robustness of ThreadStatusSingleCompaction
Summary:
Improve the robustness of ThreadStatusSingleCompaction
by ensuring the number of files flushed in the test.

Test Plan:
export ROCKSDB_TESTS=ThreadStatus
./db_test

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35019
2015-03-13 13:16:53 -07:00
Yueh-Hsuan Chiang
8c12426c93 Fix the deadlock issue in ThreadStatusSingleCompaction.
Summary:
Fix the deadlock issue in ThreadStatusSingleCompaction.

In the previous version of ThreadStatusSingleCompaction, the compaction
thread will wait for a SYNC_POINT while its db_mutex is held.  However,
if the test hasn't finished its Put cycle while a compaction is running,
a deadlock will happen in the test.

Test Plan:
export ROCKSDB_TESTS=ThreadStatus
./db_test

Reviewers: sdong, igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D35001
2015-03-13 12:53:00 -07:00
sdong
b16ead531d DBTest.DynamicLevelCompressionPerLevel should not run without snappy support
Summary: The test depends on snappy to be used. Skip the test if it is not supported.

Test Plan: Run the test

Reviewers: meyering, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D34995
2015-03-13 11:26:17 -07:00
Yueh-Hsuan Chiang
a5e60bafc2 Fix a typo / test failure in ThreadStatusSingleCompaction
Summary:
Fix a typo / test failure in ThreadStatusSingleCompaction

Test Plan:
export ROCKSDB_TESTS=ThreadStatus
./db_test
2015-03-13 11:20:17 -07:00
Igor Canadi
cb2c91850c Don't run some tests is snappy is not present
Summary: Currently, we have `ifdef SNAPPY` around bunch of db_test code. Some tests that don't even use compression are also blocked when running system doesn't have snappy. This also causes hard-to-catch bugs, like D34983. We should dynamically figure out if compression is supported or not.

Test Plan: compiles

Reviewers: sdong, meyering

Reviewed By: meyering

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34989
2015-03-13 11:08:50 -07:00
Yueh-Hsuan Chiang
c594b0e89d Allow GetThreadList() to report operation stage.
Summary: Allow GetThreadList() to report operation stage.

Test Plan:
  ./thread_list_test
  ./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
    --max_background_compactions=10 --max_background_flushes=3 \
    --thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
    --num_column_families=10

  export ROCKSDB_TESTS=ThreadStatus
  ./db_test

Sample output
          ThreadID ThreadType                    cfName    Operation        OP_StartTime    ElapsedTime                                         Stage        State
   140116265861184    Low Pri
   140116270055488    Low Pri
   140116274249792   High Pri column_family_name_000005        Flush 2015/03/10-14:58:11           0 us                    FlushJob::WriteLevel0Table
   140116400078912    Low Pri column_family_name_000004   Compaction 2015/03/10-14:58:11           0 us     CompactionJob::FinishCompactionOutputFile
   140116358135872    Low Pri column_family_name_000006   Compaction 2015/03/10-14:58:10           1 us     CompactionJob::FinishCompactionOutputFile
   140116341358656    Low Pri
   140116295221312   High Pri                   default        Flush 2015/03/10-14:58:11           0 us                    FlushJob::WriteLevel0Table
   140116324581440    Low Pri column_family_name_000009   Compaction 2015/03/10-14:58:11           0 us      CompactionJob::ProcessKeyValueCompaction
   140116278444096    Low Pri
   140116299415616    Low Pri column_family_name_000008   Compaction 2015/03/10-14:58:11           0 us     CompactionJob::FinishCompactionOutputFile
   140116291027008   High Pri column_family_name_000001        Flush 2015/03/10-14:58:11           0 us                    FlushJob::WriteLevel0Table
   140116286832704    Low Pri column_family_name_000002   Compaction 2015/03/10-14:58:11           0 us     CompactionJob::FinishCompactionOutputFile
   140116282638400    Low Pri

Reviewers: rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34683
2015-03-13 10:45:40 -07:00
Igor Canadi
52d8347a91 EventLogger
Summary:
Here's my proposal for making our LOGs easier to read by machines.

The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize.

I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc)

Test Plan:
Ran db_bench. Observed:
2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699}
2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12}

Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31647
2015-03-13 10:15:54 -07:00
Islam AbdelRahman
9d22a1f136 Allow negative Wnew
Summary:
we are using uint64_t for Wnew this is not correct since this value can be negative
https://github.com/facebook/rocksdb/issues/535

Test Plan: run db_bench and check what happens when Wnew is -ve

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34935
2015-03-12 20:53:18 -07:00
Venkatesh Radhakrishnan
b411d06031 Prevent stalls in preshutdown tests
Summary:
The tests using sync_point for intent to shutdown stop
compaction and this results in stalls if too many rows are written. We
now limit the number of rows written to prevent stalls, since the focus
of the test is to cancel background work, which is being correctly
tested. This fixes a Jenkins issue.

Test Plan: DBTest.PreShutdown*

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34893
2015-03-12 10:49:06 -07:00
Islam AbdelRahman
1d43bc41fb Fixing segmentation fault in db_bench
Summary:
Fixing segmentation fault when running db_bench

This seg fault happens because num_created is used without being initialized

Test Plan:
running db_bench using these arguments
bpl=10485760;overlap=10;mcz=2;del=300000000;levels=6;ctrig=4; delay=8; stop=12; wbn=3; mbc=20; mb=67108864;wbs=134217728; dds=0; sync=0; r=1000000; t=1; vs=800; bs=65536; cs=1048576; of=500000; si=1000000; ./db_bench --benchmarks=overwrite --disable_seek_compaction=1 --mmap_read=0 --statistics=1 --histogram=1 --num=$r --threads=$t --value_size=$vs --block_size=$bs --cache_size=$cs --bloom_bits=10 --cache_numshardbits=4 --open_files=$of --verify_checksum=1 --db=/home/tec/koko/ --sync=$sync --disable_wal=1 --compression_type=zlib --stats_interval=$si --compression_ratio=0.5 --disable_data_sync=$dds --write_buffer_size=$wbs --target_file_size_base=$mb --max_write_buffer_number=$wbn --max_background_compactions=$mbc --level0_file_num_compaction_trigger=$ctrig --level0_slowdown_writes_trigger=$delay --level0_stop_writes_trigger=$stop --num_levels=$levels --delete_obsolete_files_period_micros=$del --min_level_to_compress=$mcz --max_grandparent_overlap_factor=$overlap --stats_per_interval=1 --max_bytes_for_level_base=$bpl --use_existing_db=1

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34881
2015-03-11 17:57:16 -07:00
sdong
e9de8b65a6 Change the way options.compression_per_level is used when options.level_compaction_dynamic_level_bytes=true
Summary:
Change the way options.compression_per_level is used when options.level_compaction_dynamic_level_bytes=true so that options.compression_per_level[1] determines compression for the level L0 is merged to, options.compression_per_level[2] to the level after that, etc.

Test Plan: run all tests

Reviewers: rven, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D34431
2015-03-11 13:14:52 -07:00
Yueh-Hsuan Chiang
2b785d76b8 Fixed a bug where CompactFiles won't delete obsolete files until flush.
Summary: Fixed a bug where CompactFiles won't delete obsolete files until flush.

Test Plan:
./compact_files_test
export ROCKSDB_TESTS=CompactFiles
./db_test

Reviewers: rven, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34671
2015-03-11 13:06:59 -07:00
sdong
2884b100ba db_bench: Better way to randomize repeated read keys in -read_random_exp_range
Summary: Use a better way to map from a key with locality to a random location. Now with the same -read_random_exp_range setting, hit rate drops, which it is expected.

Test Plan: ./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=<multiple_values>

Reviewers: MarkCallaghan, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D34761
2015-03-11 11:46:14 -07:00
Venkatesh Radhakrishnan
284be570c8 Provide a mechanism to inform Rocksdb that it is shutting down
Summary:
Provide an API which enables users to infor Rocksdb that it is
shutting down.

Test Plan: db_test

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34617
2015-03-11 10:31:02 -07:00
Igor Canadi
2ddf53b2ca Get OptimizeFilterForHits work on Mac
Summary: Got it working by some voodoo programming

Test Plan: works!

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34611
2015-03-10 17:53:22 -07:00
Yueh-Hsuan Chiang
89597bb66b Allow GetThreadList() to report the start time of the current operation.
Summary: Allow GetThreadList() to report the start time of the current operation.

Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \
  --max_background_compactions=10 --max_background_flushes=3 \
  --thread_status_per_interval=1000 --key_size=16 --value_size=1000 \
  --num_column_families=10

Sample output:
          ThreadID ThreadType                    cfName    Operation        OP_StartTime         State
   140338840797248   High Pri column_family_name_000003        Flush 2015/03/09-17:49:59
   140338844991552   High Pri column_family_name_000004        Flush 2015/03/09-17:49:59
   140338849185856    Low Pri
   140338983403584    Low Pri
   140339008569408    Low Pri
   140338861768768    Low Pri
   140338924683328    Low Pri
   140338899517504    Low Pri
   140338853380160    Low Pri
   140338882740288    Low Pri
   140338865963072   High Pri column_family_name_000006        Flush 2015/03/09-17:49:59
   140338954043456    Low Pri
   140338857574464    Low Pri

Reviewers: igor, rven, sdong

Reviewed By: sdong

Subscribers: lgalanis, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34689
2015-03-10 14:51:28 -07:00
sdong
37921b4997 db_bench: Add Option -read_random_exp_range to allow read skewness.
Summary: Introduce parameter -read_random_exp_range in db_bench to provide some key skewness in readrandom and multireadrandom benchmarks. It will helpful to cover block cache better.

Test Plan:
Run benchmarks with this new parameter. I can clearly see block cache hit rate change while I increase this value (DB size is about 66MB):

./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=0.0
rocksdb.block.cache.data.miss COUNT : 958418
rocksdb.block.cache.data.hit COUNT : 41582

./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=5.0
rocksdb.block.cache.data.miss COUNT : 819518
rocksdb.block.cache.data.hit COUNT : 180482

./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=10.0
rocksdb.block.cache.data.miss COUNT : 450479
rocksdb.block.cache.data.hit COUNT : 549521

./db_bench --benchmarks=readrandom -statistics -use_existing_db -cache_size=5000000 --read_random_exp_range=20.0
rocksdb.block.cache.data.miss COUNT : 223192
rocksdb.block.cache.data.hit COUNT : 776808

Reviewers: MarkCallaghan, kradhakrishnan, yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D34629
2015-03-09 11:34:52 -07:00
Igor Canadi
485ac0dbd0 Add rate_limiter to string options
Summary: I want to be able to set this through mongo config.

Test Plan: added unit test

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34599
2015-03-06 14:21:15 -08:00
Yueh-Hsuan Chiang
dc4532c497 Add --thread_status_per_interval to db_bench
Summary:
Add --thread_status_per_interval to db_bench, which allows
db_bench to optionally enable print the current thread status
periodically.

Test Plan:
./db_bench --benchmarks=fillrandom --num=100000 --threads=40 --max_background_compactions=10 --max_background_flushes=3 --thread_status_per_interval=1000 --key_size=16 --value_size=1000 --num_column_families=10

Sample output:
          ThreadID ThreadType                         dbName                     cfName       Operation           State
   140281571770432    Low Pri
   140281575964736   High Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000001           Flush
   140281710182464    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000008      Compaction
   140281638879296    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000007      Compaction
   140281592741952    Low Pri
   140281580159040   High Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000002           Flush
   140281676628032    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000006      Compaction
   140281584353344    Low Pri
   140281622102080    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000009      Compaction
   140281605324864    Low Pri  /tmp/rocksdbtest-5297/dbbench  column_family_name_000004      Compaction
   140281601130560   High Pri  /tmp/rocksdbtest-5297/dbbench                    default           Flush
   140281596936256    Low Pri
   140281588547648    Low Pri

Reviewers: igor, rven, sdong

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34515
2015-03-06 11:22:06 -08:00
Yueh-Hsuan Chiang
694988b627 Fix a bug in stall time counter. Improve its output format.
Summary: Fix a bug in stall time counter.  Improve its output format.

Test Plan:
export ROCKSDB_TESTS=Timeout
./db_test

./db_bench --benchmarks=fillrandom --stats_interval=10000 --statistics=true --stats_per_interval=1 --num=1000000 --threads=4 --level0_stop_writes_trigger=3 --level0_slowdown_writes_trigger=2

sample output:
    Uptime(secs): 35.8 total, 0.0 interval
    Cumulative writes: 359590 writes, 359589 keys, 183047 batches, 2.0 writes per batch, 0.04 GB user ingest, stall seconds: 1786.008 ms
    Cumulative WAL: 359591 writes, 183046 syncs, 1.96 writes per sync, 0.04 GB written
    Interval writes: 253 writes, 253 keys, 128 batches, 2.0 writes per batch, 0.0 MB user ingest, stall time: 0 us
    Interval WAL: 253 writes, 128 syncs, 1.96 writes per sync, 0.00 MB written

Reviewers: MarkCallaghan, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34275
2015-03-03 12:48:12 -08:00
Igor Canadi
db03739340 options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.

In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.

Test Plan: New unit tests and pass tests suites including valgrind.

Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo

Reviewed By: ikabiljo

Subscribers: yoshinorim, ikabiljo, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31437
2015-03-02 22:40:41 -08:00
Mark Callaghan
c4bd03a97e Fix typo in log message
Summary:
fix typo

Task ID: #

Blame Rev:

Test Plan:
Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34251
2015-03-02 09:35:50 -08:00
Igor Canadi
3cf7f353d9 Instrument memtable seeks
Summary: As title

Test Plan: compiles

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34191
2015-02-27 17:06:06 -08:00
Sameet Agarwal
e7c434c364 Add columnfamily option optimize_filters_for_hits to optimize for key hits only
Summary:
    Summary:
    Added a new option to ColumnFamllyOptions  - optimize_filters_for_hits. This option can be used in the case where most
    accesses to the store are key hits and we dont need to optimize performance for key misses.
    This is useful when you have a very large database and most of your lookups succeed.  The option allows the store to
     not store and use filters in the last level (the largest level which contains data). These filters can take a large amount of
     space for large databases (in memory and on-disk). For the last level, these filters are only useful for key misses and not
     for key hits. If we are not optimizing for key misses, we can choose to not store these filters for that level.

    This option is only provided for BlockBasedTable. We skip the filters when we are compacting

Test Plan:
1. Modified db_test toalso run tests with an additonal option (skip_filters_on_last_level)
 2. Added another unit test to db_test which specifically tests that filters are being skipped

Reviewers: rven, igor, sdong

Reviewed By: sdong

Subscribers: lgalanis, yoshinorim, MarkCallaghan, rven, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33717
2015-02-26 16:25:56 -08:00
Igor Sugak
62247ffa3b rocksdb: Add missing override
Summary:
When using latest clang (3.6 or 3.7/trunck) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes.

Prerequisites: bear and clang 3.5 build with extra tools

```lang=bash
% USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html
% clang-modernize -p . -include . -add-override
% make format
```

Test Plan:
Make sure all tests are passing.
```lang=bash
% #Use default fb code clang.
% make check
```
Verify less error and no missing override errors.
```lang=bash
% # Have trunk clang present in path.
% ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make
```

Reviewers: igor, kradhakrishnan, rven, meyering, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34077
2015-02-26 11:28:41 -08:00
Mark Callaghan
182b4ceacd Limit key range to number of keys, not number of writes
Summary:
An old commit (482401) changed DoWrite to use the value of --writes rather
than --num to determine the range for keys. This restores the old and correct
behavior which is to limit it using --num.

Task ID: #6353043

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34065
2015-02-25 15:53:45 -08:00
Venkatesh Radhakrishnan
4ade89962d Fix compile error on MacOS.
Summary:
In a release build, a member was not being accessed. This
member was only being accessed in a debug build. We now add an accessor
function for this member and the buid succeeds.

Test Plan: build release/unity/debug on linux/mac

Reviewers: sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D34035
2015-02-24 16:24:53 -08:00
Igor Canadi
ace3d85068 Revert "Unused managed iterator"
This reverts commit bd339a9798.

Conflicts:
	db/managed_iterator.cc
2015-02-24 13:27:41 -08:00
Igor Canadi
7b8f348e56 Attempt at fixing travis issue 2015-02-24 12:20:43 -08:00
Igor Canadi
bd339a9798 Unused managed iterator
Summary: This causes warnings on OS X

Test Plan: compiles

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33969
2015-02-24 09:51:52 -08:00
Jinfu Leng
96d989f70d catch config errors with L0 file count triggers
Test Plan: Run "make clean && make all check"

Reviewers: rven, igor, yhchiang, kradhakrishnan, MarkCallaghan, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D33627
2015-02-23 16:08:27 -08:00
Igor Sugak
62f7a1be4f rocksdb: Fixed 'Dead assignment' and 'Dead initialization' scan-build warnings
Summary:
This diff contains trivial fixes for 6 scan-build warnings:

**db/c_test.c**
`db` variable is never read. Removed assignment.
scan-build report:
http://home.fburl.com/~sugak/latest20/report-9b77d2.html#EndPath

**db/db_iter.cc**
`skipping` local variable is assigned to false. Then in the next switch block the only "non return" case assign `skipping` to true, the rest cases don't use it and all do return.
scan-build report:
http://home.fburl.com/~sugak/latest20/report-13fca7.html#EndPath

**db/log_reader.cc**
In `bool Reader::SkipToInitialBlock()` `offset_in_block` local variable is assigned to 0 `if (offset_in_block > kBlockSize - 6)` and then never used. Removed the assignment and renamed it to `initial_offset_in_block` to avoid confusion.
scan-build report:
http://home.fburl.com/~sugak/latest20/report-a618dd.html#EndPath

In `bool Reader::ReadRecord(Slice* record, std::string* scratch)` local variable `in_fragmented_record` in switch case `kFullType` block is assigned to false and then does `return` without use. In the other switch case `kFirstType` block the same `in_fragmented_record` is assigned to false, but later assigned to true without prior use. Removed assignment for both cases.
scan-build reprots:
http://home.fburl.com/~sugak/latest20/report-bb86b0.html#EndPath
http://home.fburl.com/~sugak/latest20/report-a975be.html#EndPath

**table/plain_table_key_coding.cc**
Local variable `user_key_size` is assigned when declared. But then in both places where it is used assigned to `static_cast<uint32_t>(key.size() - 8)`. Changed to initialize the variable to the proper value in declaration.
scan-build report:
http://home.fburl.com/~sugak/latest20/report-9e6b86.html#EndPath

**tools/db_stress.cc**
Missing `break` in switch case block. This seems to be a bug. Added missing `break`.

Test Plan:
Make sure all tests are passing and scan-build does not report 'Dead assignment' and 'Dead initialization' bugs.
```lang=bash
% make check
% make analyze
```

Reviewers: meyering, igor, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33795
2015-02-23 14:10:09 -08:00
Jim Meyering
a2b911b63f inputs: restore "const" attribute removed by D33759
Summary:
The "const" attribute applies to the type, and placing it
before that return type retains the desired semantics,
yet avoids the compiler error/warning.

Test Plan: Run make

Reviewers: ljin, sdong, igor.sugak, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D33789
2015-02-20 11:52:20 -08:00
Jim Meyering
c6d54b5037 fix erroneous assert: cast kBlockSize (of type unsigned int) to "int"
Summary:
Otherwise, we would assert that an unsigned expression is always >= 0.
The intent was to form a possibly negative number, and to assert that
that value is always >= 0, but since one variable in the computation
was unsigned, the result was guaranteed to be unsigned, too, rendering
the assertion useless.

Cast that unsigned variable to "int", so that all operands
are signed, and thus so that the result can be negative.

Test Plan:
  Run "make EXTRA_CXXFLAGS='-W -Wextra'" and see fewer errors.

Reviewers: ljin, sdong, igor.sugak, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D33771
2015-02-20 11:07:17 -08:00
Jim Meyering
c37937a9ce maint: remove extraneous "const" attribute from return type
Summary:
The "const" attribute does not make sense on a return type,
and provokes a warning/error from gcc -W -Wextra.

Test Plan:
  Run "make EXTRA_CXXFLAGS='-W -Wextra'" and see fewer errors.

Reviewers: ljin, sdong, igor.sugak, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D33759
2015-02-20 11:07:07 -08:00
Jim Meyering
9283c7afd2 build: remove always-true assertions
Summary:
Remove some always-true assertions.
They provoke these compilation failures:

  table/plain_table_key_coding.cc:279:20: error: comparison of unsigned expression >= 0 is always true [-Werror=type-limits]
  db/version_set.cc:336:15: error: comparison of unsigned expression >= 0 is always true [-Werror=type-limits]

* table/plain_table_key_coding.cc (rocksdb): Remove assertion that
unsigned type variable is >= 0.
* db/version_set.cc (DoGenerateLevelFilesBrief): Likewise.

Test Plan:
  Run "make EXTRA_CXXFLAGS='-W -Wextra'" and see fewer errors.

Reviewers: ljin, sdong, igor.sugak, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D33747
2015-02-20 11:07:03 -08:00
Igor Sugak
73711f956c rocksdb: Fix scan-build bug 'Memory leak' in db/db_bench.cc
Summary:
The bug is detected by scan-build.

In `void WriteSeqSeekSeq(ThreadState* thread)` memory is allocated in line 3118 `Slice key = AllocateKey();` but `Slice` is not responsible deleting `Slice::data()`.

Added `std::unique_ptr<const char[]>*` parameter to ` AllocateKey()`, so that it requires caller to not forget about Slice::data() management.

scan-build bug report: http://home.fburl.com/~sugak/latest6/report-6e9754.html#EndPath

Test Plan:
Make sure scan-build does not report 'Memory leak' in db/db_bench.cc and all tests are passing.
```lang=bash
% make analyze
% make check
```

Reviewers: lgalanis, igor, meyering, sdong

Reviewed By: meyering, sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33501
2015-02-19 14:27:48 -08:00
Jim Meyering
a42324e370 build: do not relink every single binary just for a timestamp
Summary:
Prior to this change, "make check" would always waste a lot of
time relinking 60+ binaries. With this change, it does that
only when the generated file, util/build_version.cc, changes,
and that happens only when the date changes or when the
current git SHA changes.

This change makes some other improvements: before, there was no
rule to build a deleted util/build_version.cc. If it was somehow
removed, any attempt to link a program would fail.
There is no longer any need for the separate file,
build_tools/build_detect_version.  Its functionality is
now in the Makefile.

* Makefile (DEPFILES): Don't filter-out util/build_version.cc.
No need, and besides, removing that dependency was wrong.
(date, git_sha, gen_build_version): New helper variables.
(util/build_version.cc): New rule, to create this file
and update it only if it would contain new information.
* build_tools/build_detect_platform: Remove file.
* db/db_impl.cc: Now, print only date (not the time).
* util/build_version.h (rocksdb_build_compile_time): Remove
declaration.  No longer used.

Test Plan:
- Run "make check" twice, and note that the second time no linking is performed.
- Remove util/build_version.cc and ensure that any "make"
command regenerates it before doing anything else.
- Run this: strings librocksdb.a|grep _build_.
That prints output including the following:

  rocksdb_build_git_date:2015-02-19
  rocksdb_build_git_sha:2.8.fb-1792-g3cb6cc0

Reviewers: ljin, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D33591
2015-02-19 13:11:10 -08:00
sdong
d45a6a4002 Add rocksdb.num-live-versions: number of live versions
Summary: Add a DB property about live versions. It can be helpful to figure out whether there are files not live but not yet deleted, in some use cases.

Test Plan: make all check

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33327
2015-02-19 13:10:37 -08:00
Venkatesh Radhakrishnan
7d817268b9 Managed iterator
Summary:
This is a diff for managed iterator. A managed iterator
is a wrapper around an iterator which saves the options for that
iterator as well as the current key/value so that the underlying iterator
and its associated memory can be released when it is aged out
automatically or on the request of the user. Will provide the automatic release as a follow-up diff.

Test Plan: Managed* tests in db_test and XF tests for managed iterator

Reviewers: igor, yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31401
2015-02-18 11:49:31 -08:00
Yueh-Hsuan Chiang
12753130ec Remove ThreadStatusMultiCompaction test
Summary:
Remove ThreadStatusMultiCompaction test as it's currently written
in a way that depends on some randomness, while the flush / compaction
status of a single thread is also covered in ThreadStatusFlush
and ThreadStatusSingleCompaction tests.

Test Plan: ./db_test

Reviewers: igor, sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33537
2015-02-17 12:04:56 -08:00
Yueh-Hsuan Chiang
e60bc99fe0 Allow GetThreadList to reflect flush activity.
Summary: Allow GetThreadList to reflect flush activity.

Test Plan:
Developed ThreadStatusFlush test and updated ThreadStatusMultiCompaction test.

./db_test  ./thread_list_test

Reviewers: sdong, rven, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D32871
2015-02-17 10:13:52 -08:00
Igor Canadi
e7ea51a8e7 Introduce job_id for flush and compaction
Summary:
It would be good to assing background job their IDs. Two benefits:
1) makes LOGs more readable
2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize

Test Plan: ran rocksdb, read the LOG

Reviewers: sdong, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D31617
2015-02-12 09:54:48 -08:00
sdong
5f00af4570 DBTest.DestroyDBMetaDatabase: create DB directories if not exists
Summary: DBTest.DestroyDBMetaDatabase occasionally fails on my dev host, for file not existing. Always create directories to avoid that.

Test Plan: Run the test

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33321
2015-02-11 16:16:50 -08:00
sdong
68af7811ea Remember whole key/prefix filtering on/off in SST file
Summary: Remember whole key or prefix filtering on/off in SST files. If user opens the DB with a different setting that cannot be satisfied while reading the SST file, ignore the bloom filter.

Test Plan: Add a unit test for it

Reviewers: yhchiang, igor, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D32889
2015-02-11 11:20:04 -08:00
sdong
6d6305dd7d Perf Context to report DB mutex waiting time
Summary: Add counters in perf context to allow users to figure out how time spent on waiting for DB mutex

Test Plan: Add a test and run it.

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D33177
2015-02-09 17:55:12 -08:00
Igor Canadi
863009b5a5 Fix deleting obsolete files #2
Summary: For description of the bug, see comment in db_test. The fix is pretty straight forward.

Test Plan: added unit test. eventually we need better testing of FOF/POF process.

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33081
2015-02-09 17:38:32 -08:00
Grace Law
1851f977c2 Added RocksDB stats GET_HIT_L0 and GET_HIT_L1
Summary:
  - In statistics.h , added tickers.
  - In version_set.cc,
  -- Added a getter method for hit_file_level_ in the class FilePicker
  -- Added a line in the Get() method in case of a found, increment the corresponding counters based on the level of the file respectively.

Corresponding task: https://our.intern.facebook.com/intern/tasks/?s=506100481&t=5952818
Personal fork: 0c3f2e3600

Test Plan:
In terminal,
```
make -j32 db_test
ROCKSDB_TESTS=L0L1L2AndUpHitCounter ./db_test
```

Or to use debugger,
```
make -j32 db_test
export ROCKSDB_TESTS=L0L1L2AndUpHitCounter
gdb db_test
```

Reviewers: rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D32205
2015-02-09 14:53:58 -08:00
sdong
91ac3b2067 Print DB pointer when opening a DB
Summary: Having a pointer for DB will be helpful to debug when GDB or working on a dump. If the client process doesn't have any thread actively working on RocksDB, it can be hard to find out.

Test Plan: make all check

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33159
2015-02-09 12:52:58 -08:00
fyrz
cfe8837e43 Switch logv with loglevel to virtual 2015-02-09 20:59:29 +01:00
Igor Canadi
aaceef3638 Fix formatting 2015-02-09 09:53:30 -08:00
Igor Canadi
ee4aa9a0ee Merge pull request #481 from mkevac/backupable
Allow creating and restoring backups from C
2015-02-09 09:51:19 -08:00
Marko Kevac
9651308307 renamed backup to backup_and_restore in c_test for clarity 2015-02-09 12:11:42 +03:00
Marko Kevac
7e50ed8c24 Added some more wrappers and wrote a test for backup in C 2015-02-07 14:25:10 +03:00
Igor Canadi
2a979822b6 Fix deleting obsolete files
Summary:
This diff basically reverts D30249 and also adds a unit test that was failing before this patch.

I have no idea how I didn't catch this terrible bug when writing a diff, sorry about that :(

I think we should redesign our system of keeping track of and deleting files. This is already a second bug in this critical piece of code. I'll think of few ideas.

BTW this diff is also a regression when running lots of column families. I plan to revisit this separately.

Test Plan: added a unit test

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D33045
2015-02-06 08:44:30 -08:00
Igor Canadi
6f10130354 Fix DestroyDB
Summary:
When DestroyDB() finds a wal file in the DB directory, it assumes it is actually in WAL directory. This can lead to confusion, since it reports IO error when it tries to delete wal file from DB directory. For example: https://ci-builds.fb.com/job/rocksdb_clang_build/296/console

This change will fix our unit tests.

Test Plan: unit tests work

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32907
2015-02-05 20:09:42 -08:00
Igor Canadi
7de4e99a8e Revert "Fix wal_dir not getting cleaned"
This reverts commit f36d394aed.
2015-02-05 11:44:17 -08:00
Yueh-Hsuan Chiang
181191a1e4 Add a counter for collecting the wait time on db mutex.
Summary:
Add a counter for collecting the wait time on db mutex.
Also add MutexWrapper and CondVarWrapper for measuring wait time.

Test Plan:
./db_test
export ROCKSDB_TESTS=MutexWaitStats
./db_test

verify stats output using db_bench
make clean
make release
./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10

Sample output:
    rocksdb.db.mutex.wait.micros COUNT : 7546866

Reviewers: MarkCallaghan, rven, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32787
2015-02-04 21:39:45 -08:00
Igor Canadi
f36d394aed Fix wal_dir not getting cleaned 2015-02-04 18:57:22 -08:00
sdong
53ae09c398 db_test: fix a data race in SpecialEnv
Summary: db_test's test class SpecialEnv has a thread unsafe variable rnd_ but it can be accessed by multiple threads. It is complained by TSAN. Protect it by a mutex.

Test Plan: Run the test

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D32895
2015-02-04 18:32:53 -08:00
Igor Canadi
3e53760fc4 Fix compaction_picker_test 2015-02-04 16:20:25 -08:00
Igor Canadi
e39f4f6cf9 Fix data race #3
Summary: Added requirement that ComputeCompactionScore() be executed in mutex, since it's accessing being_compacted bool, which can be mutated by other threads. Also added more comments about thread safety of FileMetaData, since it was a bit confusing. However, it seems that FileMetaData doesn't have data races (except being_compacted)

Test Plan: Ran 100 ConvertCompactionStyle tests with thread sanitizer. On master -- some failures. With this patch -- none.

Reviewers: yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32283
2015-02-04 16:04:51 -08:00
sdong
e63140d52b Get() to use prefix bloom filter when filter is not block based
Summary:
Get() now doesn't make use of bloom filter if it is prefix based. Add the check.
Didn't touch block based bloom filter. I can't fully reason whether it is correct to do that. But it's straight-forward to for full bloom filter.

Test Plan:
make all check
Add a test case in DBTest

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D31941
2015-02-04 15:15:41 -08:00
Venkatesh Radhakrishnan
dad98dd4ae Changes for supporting cross functional tests for inplace_update
Summary:
This diff containes the changes to the code and db_test
for supporting cross functional tests for inplace_update

Test Plan: Run XF with inplace_test and also without

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32367
2015-02-03 12:19:56 -08:00
sdong
9898f63988 Divide test DBIteratorTest.DBIterator to smaller tests
Summary:
When building on my host, I saw warning:

In file included from db/db_iter_test.cc:17:0:
db/db_iter_test.cc: In member function ‘void rocksdb::_Test_DBIterator::_Run()’:
./util/testharness.h:147:14: note: variable tracking size limit exceeded with -fvar-tracking-assignments, retrying without
 void TCONCAT(_Test_,name)::_Run()
              ^
./util/testharness.h:134:23: note: in definition of macro ‘TCONCAT1’
 #define TCONCAT1(a,b) a##b
                       ^
./util/testharness.h:147:6: note: in expansion of macro ‘TCONCAT’
 void TCONCAT(_Test_,name)::_Run()
      ^
db/db_iter_test.cc:589:1: note: in expansion of macro ‘TEST’
 TEST(DBIteratorTest, DBIterator) {
 ^

By dividing the test into small tests, it should fix the problem

Test Plan: Run the test

Reviewers: rven, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32679
2015-02-03 09:48:03 -08:00
Venkatesh Radhakrishnan
0b8dec7172 Cross functional test infrastructure for RocksDB.
Summary:
This Diff provides the implementation of the cross functional
test infrastructure. This provides the ability to test a single feature
with every existing regression test in order to identify issues with
interoperability between features.

Test Plan:
Reference implementation of inplace update support cross
functional test. Able to find interoperability issues with inplace
support and ran all of db_test. Will add separate diff for those changes.

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32247
2015-02-02 14:49:22 -08:00
Marko Kevac
86e2a1eeea Allow creating backups from C 2015-01-31 15:47:49 +03:00
sdong
5917de0bae CappedFixTransform: return fixed length prefix, or full key if key is shorter than the fixed length
Summary: Add CappedFixTransform, which is the same as fixed length prefix extractor, except that when slice is shorter than the fixed length, it will use the full key.

Test Plan:
Add a test case for
db_test
options_test
and a new test

Reviewers: yhchiang, rven, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D31887
2015-01-30 16:04:30 -08:00
Igor Canadi
6c6037f60c Expose Snapshot's SequenceNumber
Summary:
Requested here: https://www.facebook.com/groups/rocksdb.dev/permalink/705524519546065/

It might also help with mongo. I don't see a reason why we shouldn't expose this info.

Test Plan: make check

Reviewers: sdong, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D32547
2015-01-29 18:22:43 -08:00