Commit Graph

449 Commits

Author SHA1 Message Date
SherlockNoMad
a6dd0831d5 Add Option to Skip Flushing in TableBuilder 2015-10-29 22:10:25 -07:00
Dmitri Smirnov
5c8f2ee786 Fix MockTable ID storage
On Windows two tests fail that use MockTable:
  flush_job_test and compaction_job_test with the following message:
  compaction_job_test_je.exe : Assertion failed: result.size() == 4,
  file c:\dev\rocksdb\rocksdb\table\mock_table.cc, line 110

  Investigation reveals that this failure occurs when a 4 byte
  ID written to a beginning of the physically open file (main
  contents remains in a in-memory map) can not be read back.

  The reason for the failure is that the ID is written directly
  to a WritableFile bypassing WritableFileWriter. The side effect of that
  is that pending_sync_ never becomes true so the file is never flushed,
  however, the direct cause of the failure is that the filesize_ member
  of the WritableFileWriter remains zero. At Close() the file is truncated
  to that size and the file becomes empty so the ID can not be read back.
2015-10-28 10:53:14 -07:00
Venkatesh Radhakrishnan
a98fbacfa0 Moving memtable related files from util to a new directory memtable
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D48915
2015-10-16 14:10:33 -07:00
sdong
35ad531be3 Seperate InternalIterator from Iterator
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.

This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.

Test Plan: Run all existing tests.

Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven

Reviewed By: rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48549
2015-10-13 15:32:13 -07:00
Islam AbdelRahman
1fe78a4073 Fix tests failing in ROCKSDB_LITE
Summary:
Fix tests that compile under ROCKSDB_LITE but currently failing.

table_test:
RandomizedLongDB test is using internal stats which is not supported in ROCKSDB_LITE

compaction_job_test:
Using CompactionJobStats which is not supported

perf_context_test:
KeyComparisonCount test try to open DB in ReadOnly mode which is not supported

Test Plan: run the tests under ROCKSDB_LITE

Reviewers: yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48585
2015-10-13 10:32:05 -07:00
Alexey Maykov
3d07b815f6 Passing table properties to compaction callback
Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online.

Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats.

Reviewers: igor, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48267
2015-10-09 18:10:55 -07:00
sdong
776bd8d5eb Pass column family ID to table property collector
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.

Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D48411
2015-10-09 14:36:51 -07:00
Islam AbdelRahman
51fa7ecec5 Bytes read/written from cache statistics
Summary: Add 2 new counters BLOCK_CACHE_BYTES_WRITE, BLOCK_CACHE_BYTES_READ to keep track of how many bytes were written to the cache and how many bytes that we read from cache

Test Plan: make check

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D48195
2015-10-07 15:17:20 -07:00
dyniusz
a065cdb388 bloom hit/miss stats for SST and memtable
Summary:
	hit and miss bloom filter stats for memtable and SST
	stats added to perf_context struct
	key matches and prefix matches combined into one stat

Test Plan: unit test veryfing the functionality added, see BloomStatsTest in db_test.cc for details

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47859
2015-10-07 11:23:20 -07:00
Igor Canadi
d80ce7f99a Compaction filter on merge operands
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.

The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)

Do we also need a compaction filter to get called on merge results?

Test Plan: make && make check

Reviewers: lovro, tnovak, rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: noetzli, kolmike, leveldb, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D47847
2015-10-07 09:30:03 -07:00
sdong
a70d08ec07 Fix the bug of using freed memory introduced by recent plain table reader patch
Summary: Recent patch introduced a bug that if non-mmap mode is used, in prefix encoding case, there is a resizing of cur_key_ within the same prefix, we still read prefix from the released buffer. It fails ASAN tests and this commit fixes it.

Test Plan: Run the ASAN tests for the failing test case.

Reviewers: IslamAbdelRahman, yhchiang, anthony, igor, kradhakrishnan, rven

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D47457
2015-09-23 16:16:26 -07:00
Islam AbdelRahman
f03b5c987b Add experimental DB::AddFile() to plug sst files into empty DB
Summary:
This is an initial version of bulk load feature

This diff allow us to create sst files, and then bulk load them later, right now the restrictions for loading an sst file are
(1) Memtables are empty
(2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0
(3) Added sst files values are not overlapping

Test Plan: unit testing

Reviewers: igor, ott, sdong

Reviewed By: sdong

Subscribers: leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D39081
2015-09-23 12:42:43 -07:00
sdong
df34aea331 PlainTableReader to support non-mmap mode
Summary:
PlainTableReader now only allows mmap-mode. Add the support to non-mmap mode for more flexibility.
Refactor the codes to move all logic of reading data to PlainTableKeyDecoder, and consolidate the calls to Read() call and ReadVarint32() call. Implement the calls for both of mmap and non-mmap case seperately. For non-mmap mode, make copy of keys in several places when we need to move the buffer after reading the keys.

Test Plan: Add the mode of non-mmap case in plain_table_db_test. Run it in valgrind mode too.

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D47187
2015-09-23 11:41:07 -07:00
sdong
d746eaad5e RandomAccessFileReader should not inherit RandomAccessFile
Summary: RandomAccessFileReader unnecessarily inherited RandomAccessFile, which can introduce unnecessarily extra costs. Remove it.

Test Plan: Run all existing tests

Reviewers: yhchiang, anthony, igor, kradhakrishnan, rven, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D47409
2015-09-23 11:00:41 -07:00
Andres Noetzli
014fd55adc Support for SingleDelete()
Summary:
This patch fixes #7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).

In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.

Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.

Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
  deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
  this patch supported this so it could be resurrected if needed)

Test Plan: make all check

Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor

Reviewed By: igor

Subscribers: maykov, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43179
2015-09-17 11:42:56 -07:00
Igor Canadi
81a61d75dc Skipped tests shouldn't be failures [part 2]
Summary: Missed one file in the previous commit

Test Plan: compiles

Reviewers: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47055
2015-09-15 22:59:53 -07:00
Islam AbdelRahman
45e9e4f0bb Refactor NewTableReader to accept TableReaderOptions
Summary:
Refactoring NewTableReader to accept TableReaderOptions
This will make it easier to add new options in the future, for example in this diff https://reviews.facebook.net/D46071

Test Plan: run existing tests

Reviewers: igor, yhchiang, anthony, rven, sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D46179
2015-09-11 11:36:33 -07:00
Andres Noetzli
6bdc484fd8 Added Equal method to Comparator interface
Summary:
In some cases, equality comparisons can be done more efficiently than three-way
comparisons. There are quite a few places in the code where we only care about
equality. This patch adds an Equal() method that defaults to using the
Compare() method.

Test Plan: make clean all check

Reviewers: rven, anthony, yhchiang, igor, sdong

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46233
2015-09-08 15:30:49 -07:00
Igor Canadi
14456aea52 Fix compile
Summary: There was a merge conflict with https://reviews.facebook.net/D45993

Test Plan: make check

Reviewers: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46065
2015-09-02 16:05:53 -07:00
Igor Canadi
76f286cc82 Optimize bloom filter cache misses
Summary:
This optimizes the case when (cache_index_and_filter_blocks=1) and bloom filter is not present in the cache.

Previously we did:
1. Read meta block from file
2. Read the filter position from the meta block
3. Read the filter

Now, we pre-load the filter position on Table::Open(), so we can skip steps (1) and (2) on bloom filter cache miss. Instead of 2 IOs, we do only 1.

Test Plan: make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D46047
2015-09-02 15:36:47 -07:00
Andres Noetzli
3c9cef1eed Unified maps with Comparator for sorting, other cleanup
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).

Test Plan: make clean check all

Reviewers: rven, anthony, yhchiang, sdong, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45993
2015-09-02 13:58:22 -07:00
sdong
7a0dbdf3ac Add ZSTD (not final format) compression type
Summary: Add ZSTD compression type. The same way as adding LZ4.

Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D45747
2015-08-28 11:01:13 -07:00
Yueh-Hsuan Chiang
6996de87af Expose per-level aggregated table properties via GetProperty()
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN", the former
returns the aggreated table properties of a column family,
while the later returns the aggregated table properties
of the specified level N.

Test Plan: Added tests in db_test

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D45087
2015-08-25 12:03:54 -07:00
sdong
c852968465 db_iter_test: add more test cases for the data race bug
Summary: Add more test cases of data race causing wrong iterating results. Tag tests not passing as DISABLED_

Test Plan: Run the tests

Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D44907
2015-08-21 12:14:12 -07:00
sdong
4637207120 Add test case to repro the mispositional iterator in a low-chance data race case
Summary: Iterator has a bug: if a child iterator reaches its end, and user issues a Prev(), and just before SeekToLast() of the child iterator is called, some extra rows is added in the end, the position of iterator can be misplaced.

Test Plan: Run the tests with or without valgrind

Reviewers: rven, yhchiang, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: tnovak, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D43671
2015-08-12 10:50:52 -07:00
Andres Notzli
68f934355a Better CompactionJob testing
Summary:
Changed compaction_job_test to support better/more thorough
tests and added two tests. Also changed MockFileContents
to order using InternalKeyComparator.

Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check

Reviewers: sdong, rven, igor, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42837
2015-08-07 21:59:51 -07:00
Andres Notzli
c465071029 Removing duplicate code
Summary:
While working on https://reviews.facebook.net/D43179 , I found
duplicate code in the tests. This patch removes it.

Test Plan: make clean all check

Reviewers: igor, sdong, rven, anthony, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D43263
2015-08-05 07:33:27 -07:00
agiardullo
064294081b Improved FileExists API
Summary: Add new CheckFileExists method.  Considered changing the FileExists api but didn't want to break anyone's builds.

Test Plan: unit tests

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42003
2015-07-20 17:20:40 -07:00
Islam AbdelRahman
4853e228ef Make table_test runnable in ROCKSDB_LITE
Summary: Remove plain table tests from table_test since plain table is not supported in ROCKSDB_LITE

Test Plan: table_test

Reviewers: igor, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42153
2015-07-20 11:09:14 -07:00
Islam AbdelRahman
cf6a7bebc8 Block cuckoo table tests in ROCKSDB_LITE
Summary: Cuckoo table is not supported in ROCKSDB_LITE, blocking it's tests

Test Plan:
cuckoo_table_builder_test
cuckoo_table_db_test
cuckoo_table_reader_test

Reviewers: sdong, igor, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D42141
2015-07-20 10:50:46 -07:00
sdong
6e9fbeb27c Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.

Test Plan: Run all existing unit tests.

Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D42321
2015-07-17 16:58:18 -07:00
Andres Notzli
1d20fa9d0f Fixed and simplified merge_helper
Summary:
MergeUntil was not reporting a success when merging an operand with
a Value/Deletion despite the comments in MergeHelper and CompactionJob
indicating otherwise. This lead to operands being written to the compaction
output unnecessarily:

M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M2 M3 M4 M5 (before the diff)
M1 M2 M3 P M4 M5 --> (P+M1+M2+M3) M4 M5 (after the diff)

In addition, the code handling Values/Deletion was basically identical.
This patch unifies the code. Finally, this patch also adds testing for
merge_helper.

Test Plan: make && make check

Reviewers: sdong, rven, yhchiang, tnovak, igor

Reviewed By: igor

Subscribers: tnovak, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D42351
2015-07-17 09:27:24 -07:00
Dmitri Smirnov
d1a457181d Ensure Windows build w/o port/port.h in public headers
- Remove make file defines from public headers and use _WIN32 because it is compiler defined
 - use __GNUC__ and __clang__ to guard non-portable attributes
 - add #include "port/port.h" to some new .cc files.
 - minor changes in CMakeLists to reflect recent changes
2015-07-16 12:10:16 -07:00
lovro
e1c99e10c1 Replace std::priority_queue in MergingIterator with custom heap, take 2
Summary: Repeat of b6655a679d (reverted in b7a2369fb2) with a proper fix for the issue that 57d216ea65 was trying to fix.

Test Plan:
make check

for i in $(seq 100); do ./db_stress --test_batches_snapshots=1 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=20 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/rocksdb_crashtest_KdCI5F --max_key=100000000 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --progress_reports=0 --disable_wal=0 --disable_data_sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0 --memtablerep=prefix_hash --prefix_size=7 --ops_per_thread=200 || break; done

Reviewers: anthony, sdong, igor, yhchiang

Reviewed By: igor, yhchiang

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D41391
2015-07-15 03:34:40 -07:00
sdong
f9728640f3 "make format" against last 10 commits
Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other codes too becasue last 10 commits include more.

Test Plan: Build it.

Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41961
2015-07-13 13:50:18 -07:00
sdong
76d3cd3286 Fix public API dependency on internal codes and dependency on MAX_INT32
Summary:
Public API depends on port/port.h which is wrong. Fix it.
Also with gcc 4.8.1 build was broken as MAX_INT32 was not recognized. Fix it by using ::max in linux.

Test Plan: Build it and try to build an external project on top of it.

Reviewers: anthony, yhchiang, kradhakrishnan, igor

Reviewed By: igor

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41745
2015-07-11 10:32:11 -07:00
Siying Dong
e41cbd9c2f Merge pull request #646 from yuslepukhin/ms_win_port
Windows Port from Microsoft
2015-07-10 15:53:39 -07:00
sdong
041b6f95a2 perf_context: report time spent on reading index and bloom blocks
Summary: Add a perf context counter to help users figure out time spent on reading indexes and bloom filter blocks.

Test Plan: Will write a unit test

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D41433
2015-07-10 14:45:42 -07:00
Dmitri Smirnov
c903ccc4c2 Merge from github/master 2015-07-09 18:01:08 -07:00
unknown
5c79132335 Revert the changes related to Options, as requested to seperate them into
a different patch.
2015-07-09 11:31:42 -07:00
Dmitri Smirnov
ef4b87f1b2 Commit both PR and internal code review changes 2015-07-07 16:58:20 -07:00
Yueh-Hsuan Chiang
b7a2369fb2 Revert "Replace std::priority_queue in MergingIterator with custom heap"
Summary:
This patch reverts "Replace std::priority_queue in MergingIterator
with custom heap" (commit commit b6655a679d)
as it causes db_stress failure.

Test Plan: ./db_stress --test_batches_snapshots=1 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=20 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/rocksdb_crashtest_KdCI5F --max_key=100000000 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --progress_reports=0 --disable_wal=0 --disable_data_sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0 --memtablerep=prefix_hash --prefix_size=7 --ops_per_thread=200 --kill_random_test=97

Reviewers: igor, anthony, lovro, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41343
2015-07-07 14:45:20 -07:00
Yueh-Hsuan Chiang
57d216ea65 Remove assert(current_ == CurrentReverse()) in MergingIterator::Prev()
Summary:
Remove assert(current_ == CurrentReverse()) in MergingIterator::Prev()
because it is possible to have some keys larger than the seek-key
inserted between Seek() and SeekToLast(), which makes current_ not
equal to CurrentReverse().

Test Plan: db_stress

Reviewers: igor, sdong, IslamAbdelRahman, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41331
2015-07-07 12:45:06 -07:00
Yueh-Hsuan Chiang
685582a0b4 Revert two diffs related to DBIter::FindPrevUserKey()
Summary:
This diff reverts the following two previous diffs related to
DBIter::FindPrevUserKey(), which makes db_stress unstable.
We should bake a better fix for this.

* "Fix a comparison in DBIter::FindPrevUserKey()"
  ec70fea4c4.

* "Fixed endless loop in DBIter::FindPrevUserKey()"
  acee2b08a2.

Test Plan: db_stress

Reviewers: anthony, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41301
2015-07-07 11:36:24 -07:00
lovro
b6655a679d Replace std::priority_queue in MergingIterator with custom heap
Summary:
While profiling compaction in our service I noticed a lot of CPU (~15% of compaction) being spent in MergingIterator and key comparison.  Looking at the code I found MergingIterator was (understandably) using std::priority_queue for the multiway merge.

Keys in our dataset include sequence numbers that increase with time.  Adjacent keys in an L0 file are very likely to be adjacent in the full database.  Consequently, compaction will often pick a chunk of rows from the same L0 file before switching to another one.  It would be great to avoid the O(log K) operation per row while compacting.

This diff replaces std::priority_queue with a custom binary heap implementation.  It has a "replace top" operation that is cheap when the new top is the same as the old one (i.e. the priority of the top entry is decreased but it still stays on top).

Test Plan:
make check

To test the effect on performance, I generated databases with data patterns that mimic what I describe in the summary (rows have a mostly increasing sequence number).  I see a 10-15% CPU decrease for compaction (and a matching throughput improvement on tmpfs).  The exact improvement depends on the number of L0 files and the amount of locality.  Performance on randomly distributed keys seems on par with the old code.

Reviewers: kailiu, sdong, igor

Reviewed By: igor

Subscribers: yoshinorim, dhruba, tnovak

Differential Revision: https://reviews.facebook.net/D29133
2015-07-06 04:24:09 -07:00
Yueh-Hsuan Chiang
acee2b08a2 Fixed endless loop in DBIter::FindPrevUserKey()
Summary: Fixed endless loop in DBIter::FindPrevUserKey()

Test Plan: ./db_stress --test_batches_snapshots=1 --threads=32 --write_buffer_size=4194304 --destroy_db_initially=0 --reopen=20 --readpercent=45 --prefixpercent=5 --writepercent=35 --delpercent=5 --iterpercent=10 --db=/tmp/rocksdb_crashtest_KdCI5F --max_key=100000000 --mmap_read=0 --block_size=16384 --cache_size=1048576 --open_files=500000 --verify_checksum=1 --sync=0 --progress_reports=0 --disable_wal=0 --disable_data_sync=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --max_write_buffer_number=3 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --filter_deletes=0 --memtablerep=prefix_hash --prefix_size=7 --ops_per_thread=200 --kill_random_test=97

Reviewers: tnovak, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D41085
2015-07-02 16:10:31 -07:00
Dmitri Smirnov
9dbde7277c Merge remote-tracking branch 'origin' into ms_win_port 2015-07-02 11:34:22 -07:00
Dmitri Smirnov
18285c1e2f Windows Port from Microsoft
Summary: Make RocksDb build and run on Windows to be functionally
 complete and performant. All existing test cases run with no
 regressions. Performance numbers are in the pull-request.

 Test plan: make all of the existing unit tests pass, obtain perf numbers.

 Co-authored-by: Praveen Rao praveensinghrao@outlook.com
 Co-authored-by: Sherlock Huang baihan.huang@gmail.com
 Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com
 Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com
2015-07-01 16:13:56 -07:00
sdong
05e2831966 Allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena
Summary: Try to allocate LevelFileIteratorState and LevelFileNumIterator from DB iterator's arena, instead of calling malloc and free.

Test Plan: valgrind check

Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D40929
2015-06-30 17:30:38 -07:00
Igor Canadi
0a019d74a0 Use malloc_usable_size() for accounting block cache size
Summary:
Currently, when we insert something into block cache, we say that the block cache capacity decreased by the size of the block. However, size of the block might be less than the actual memory used by this object. For example, 4.5KB block will actually use 8KB of memory. So even if we configure block cache to 10GB, our actually memory usage of block cache will be 20GB!

This problem showed up a lot in testing and just recently also showed up in MongoRocks production where we were using 30GB more memory than expected.

This diff will fix the problem. Instead of counting the block size, we will count memory used by the block. That way, a block cache configured to be 10GB will actually use only 10GB of memory.

I'm using non-portable function and I couldn't find info on portability on Google. However, it seems to work on Linux, which will cover majority of our use-cases.

Test Plan:
1. fill up mongo instance with 80GB of data
2. restart mongo with block cache size configured to 10GB
3. do a table scan in mongo
4. memory usage before the diff: 12GB. memory usage after the diff: 10.5GB

Reviewers: sdong, MarkCallaghan, rven, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D40635
2015-06-26 11:48:09 -07:00