Commit Graph

375 Commits

Author SHA1 Message Date
Lei Jin
5187d896b9 unfriend Compaction and CompactionPicker from VersionSet
Summary: as title

Test Plan: running make all check

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27549
2014-10-28 09:59:56 -07:00
Lei Jin
f1841985e4 dynamic inplace_update options
Summary:
Make inplace_update_support and inplace_update_num_locks dynamic.
inplace_callback becomes immutable
We are almost free of references to cfd->options() in db_impl

Test Plan: unit test

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D25293
2014-10-27 12:10:13 -07:00
Lei Jin
122f98e0b9 dynamic max_mem_compact_level
Summary: as title

Test Plan: unit test

Reviewers: sdong, yhchiang, rven, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D25347
2014-10-23 15:37:14 -07:00
Yueh-Hsuan Chiang
6c66918645 Speed up DB::Open() and Version creation by limiting the number of FileMetaData initialization.
Summary:
This diff speeds up DB::Open() and Version creation by limiting the number of FileMetaData initialization. The behavior of Version::UpdateAccumulatedStats() is changed as follows:

* It only initializes the first 20 uninitialized FileMetaData from file.  This guarantees the size of the latest 20 files will always be compensated when they have any deletion entries.  Previously it may initialize all FileMetaData by loading all files at DB::Open().
* In case none the first 20 files has any data entry, UpdateAccumulatedStats() will initialize the FileMetaData of the oldest file.

Test Plan: db_test

Reviewers: igor, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24255
2014-10-17 14:58:30 -07:00
Igor Canadi
d6987216c9 Merge pull request #327 from dalgaaf/wip-da-SCA-20141001
Fix some issues from SCA
2014-10-02 10:59:52 -07:00
Lei Jin
5ec53f3edf make compaction related options changeable
Summary:
make compaction related options changeable. Most of changes are tedious,
following the same convention: grabs MutableCFOptions at the beginning
of compaction under mutex, then pass it throughout the job and register
it in SuperVersion at the end.

Test Plan: make all check

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23349
2014-10-01 16:19:16 -07:00
Danny Al-Gaaf
4a171882d6 db/version_set.cc: remove unnecessary checks
Fix for:

[db/version_set.cc:1219]: (style) Unsigned variable 'last_file'
 can't be negative so it is unnecessary to test it.
[db/version_set.cc:1234]: (style) Unsigned variable 'first_file'
 can't be negative so it is unnecessary to test it.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2014-10-01 11:09:22 +02:00
Danny Al-Gaaf
b8b7117e97 db/version_set.cc: use !empty() instead of 'size() > 0'
Use empty() since it should be prefered as it has, following
the standard, a constant time complexity regardless of the
containter type. The same is not guaranteed for size().

Fix for:
[db/version_set.cc:2250]: (performance) Possible inefficient
 checking for 'column_families_not_found' emptiness.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
2014-09-30 23:30:31 +02:00
Lei Jin
2faf49d5f1 use GetContext to replace callback function pointer
Summary:
Intead of passing callback function pointer and its arg on Table::Get()
interface, passing GetContext. This makes the interface cleaner and
possible better perf. Also adding a fast pass for SaveValue()

Test Plan: make all check

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D24057
2014-09-29 11:09:09 -07:00
Lei Jin
3c68006109 CompactedDBImpl
Summary:
Add a CompactedDBImpl that will enabled when calling OpenForReadOnly()
and the DB only has one level (>0) of files. As a performan comparison,
CuckooTable performs 2.1M/s with CompactedDBImpl vs. 1.78M/s with
ReadOnlyDBImpl.

Test Plan: db_bench

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23553
2014-09-25 11:14:01 -07:00
Lei Jin
a062e1f2c4 SetOptions() for memtable related options
Summary: as title

Test Plan:
make all check
I will think a way to set up stress test for this

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23055
2014-09-17 12:49:13 -07:00
Igor Canadi
4a27a2f193 Don't sync manifest when disableDataSync = true
Summary: As we discussed offline

Test Plan: compiles

Reviewers: yhchiang, sdong, ljin, dhruba

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22989
2014-09-15 11:32:01 -07:00
Lei Jin
9b0f7ffa1c rename version_set options_ to db_options_ to avoid confusion
Summary: as title

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D23007
2014-09-08 15:25:01 -07:00
Lei Jin
048560a642 reduce references to cfd->options() in DBImpl
Summary:
I found it is almost impossible to get rid of this function in a single
batch. I will take a step by step approach

Test Plan: make release

Reviewers: sdong, yhchiang, igor

Reviewed By: igor

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22995
2014-09-08 15:04:34 -07:00
Igor Canadi
a2bb7c3c33 Push- instead of pull-model for managing Write stalls
Summary:
Introducing WriteController, which is a source of truth about per-DB write delays. Let's define an DB epoch as a period where there are no flushes and compactions (i.e. new epoch is started when flush or compaction finishes). Each epoch can either:
* proceed with all writes without delay
* delay all writes by fixed time
* stop all writes

The three modes are recomputed at each epoch change (flush, compaction), rather than on every write (which is currently the case).

When we have a lot of column families, our current pull behavior adds a big overhead, since we need to loop over every column family for every write. With new push model, overhead on Write code-path is minimal.

This is just the start. Next step is to also take care of stalls introduced by slow memtable flushes. The final goal is to eliminate function MakeRoomForWrite(), which currently needs to be called for every column family by every write.

Test Plan: make check for now. I'll add some unit tests later. Also, perf test.

Reviewers: dhruba, yhchiang, MarkCallaghan, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22791
2014-09-08 11:20:25 -07:00
liuhuahang
bb6ae0f80c fix more compile warnings
N/A

Change-Id: I5b6f9c70aea7d3f3489328834fed323d41106d9f
Signed-off-by: liuhuahang <liuhuahang@zerus.co>
2014-09-05 14:14:37 +08:00
Stanislau Hlebik
45a5e3ede0 Remove path with arena==nullptr from NewInternalIterator
Summary:
Simply code by removing code path which does not use Arena
from NewInternalIterator

Test Plan:
make all check
make valgrind_check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22395
2014-09-04 17:40:41 -07:00
Yueh-Hsuan Chiang
570ba5aca8 Avoid retrying to read property block from a table when it does not exist.
Summary:
Avoid retrying to read property block from a table when it does not exist
in updating stats for compensating deletion entries.

In addition, ReadTableProperties() now returns Status::NotFound instead
of Status::Corruption when table properties does not exist in the file.

Test Plan:
make db_test -j32
export ROCKSDB_TESTS=CompactionDeleteionTrigger
./db_test

Reviewers: ljin, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21867
2014-08-15 12:17:44 -07:00
Stanislau Hlebik
0c9dc9f8e0 Remove malloc from FormatFileNumber
Summary: Replace unnecessary malloc with stack allocation

Test Plan: make all check

Reviewers: sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21771
2014-08-13 11:57:40 -07:00
miguelportilla
93e6b5e9d9 Changes to support unity build:
* Script for building the unity.cc file via Makefile
* Unity executable Makefile target for testing builds
* Source code changes to fix compilation of unity build
2014-08-11 13:22:47 -04:00
sdong
1242bfcad7 Add DB property "rocksdb.estimate-table-readers-mem"
Summary:
Add a DB Property "rocksdb.estimate-table-readers-mem" to return estimated memory usage by all loaded table readers, other than allocated from block cache.

Refactor the property codes to allow getting property from a version, with DB mutex not acquired.

Test Plan: Add several checks of this new property in existing codes for various cases.

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, igor, leveldb

Differential Revision: https://reviews.facebook.net/D20733
2014-08-06 11:39:46 -07:00
Feng Zhu
1129921e9b logging_when_create_and_delete_manifest
Summary:
  1. logging when create and delete manifest file
  2. fix formating in table/format.cc

Test Plan:
  make all check
  run db_bench, track the LOG file.

Reviewers: ljin, yhchiang, igor, yufei.zhu, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D21009
2014-08-04 11:25:42 -07:00
Yueh-Hsuan Chiang
49ee5a4ac4 Fixed the crash when merge_operator is not properly set after reopen.
Summary:
Fixed the crash when merge_operator is not properly set after reopen
and added two test cases for this.

Test Plan:
make merge_test
./merge_test

Reviewers: igor, ljin, sdong

Reviewed By: sdong

Subscribers: benj, mvikjord, leveldb

Differential Revision: https://reviews.facebook.net/D20793
2014-07-30 17:24:36 -07:00
sdong
f6784766db Add DB property estimated number of keys
Summary: Add a DB property of estimated number of live keys, by adding number of entries of all mem tables and all files, subtracted by all deletions in all files.

Test Plan: Add the case in unit tests

Reviewers: hobbymanyp, ljin

Reviewed By: ljin

Subscribers: MarkCallaghan, yoshinorim, leveldb, igor, dhruba

Differential Revision: https://reviews.facebook.net/D20631
2014-07-28 15:27:58 -07:00
Lei Jin
40fa8a4cd5 make statistics forward-able
Summary:
Make StatisticsImpl being able to forward stats to provided statistics
implementation. The main purpose is to allow us to collect internal
stats in the future even when user supplies custom statistics
implementation. It avoids intrumenting 2 sets of stats collection code.
One immediate use case is tuning advisor, which needs to collect some
internal stats, users may not be interested.

Test Plan:
ran db_bench and see stats show up at the end of run
Will run make all check since some tests rely on statistics

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D20145
2014-07-28 12:05:36 -07:00
Yueh-Hsuan Chiang
6480717a26 Fixed compaction-related errors where number of input levels are hard-coded.
Summary:
Fixed compaction-related errors where number of input levels are hard-coded.
It's a bug found in compaction branch.
This diff will be pushed into master.

Test Plan:
export ROCKSDB_TESTS=Compact
make db_test -j32
./db_test
also passed the tests in compaction branch

Reviewers: igor, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20577
2014-07-24 17:06:00 -07:00
sdong
f6b7e1ed1a Allow user to specify DB path of output file of manual compaction
Summary: Add a parameter path_id to DB::CompactRange(), to indicate where the output file should be placed to.

Test Plan: add a unit test

Reviewers: yhchiang, ljin

Reviewed By: ljin

Subscribers: xjin, igor, dhruba, MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D20085
2014-07-21 19:06:00 -07:00
Yueh-Hsuan Chiang
052ddbe0e2 Add MaxInputLevel() to CompactionPicker
Summary:
Having if-then branch for different compaction strategies is considered
hacky and make CompactionPicker less pluggable.  This diff removes two
of such if-then branches in version_set.cc by adding MaxInputLevel() to
CompactionPicker.

    // Given the current number of levels, returns the lowest allowed level
    // for compaction input.
    virtual int MaxInputLevel(int current_num_levels) const;

Test Plan:
make db_test
export ROCKSDB_TESTS=Compaction
./db_test

Reviewers: igor, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19971
2014-07-17 18:01:04 -07:00
Radheshyam Balasundaram
0d57e3ad7d Guarding files_ attribute with #ifndef NDEBUG guard in FilePicker class.
Summary: Adding guards to files_ attribute of FilePicker class. This attribute is used only in DEBUG mode. This fixes build of static_lib in mac.

Test Plan:
make static_lib in mac
make check all in devserver

Reviewers: ljin, igor, sdong

Reviewed By: sdong

Differential Revision: https://reviews.facebook.net/D20163
2014-07-17 15:07:05 -07:00
Yueh-Hsuan Chiang
296e340753 Add struct CompactionInputFiles to manage compaction input files.
Summary: Add struct CompactionInputFiles to manage compaction input files.

Test Plan:
export ROCKSDB_TESTS=Compact
make db_test
./db_test

Reviewers: ljin, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D20061
2014-07-16 18:12:17 -07:00
Radheshyam Balasundaram
0418e66e2a Refactoring Version::Get()
Summary: Refactoring Version::Get() method to move file picker logic to a separate class.

Test Plan: make check all

Reviewers: igor, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19713
2014-07-16 13:33:02 -07:00
Feng Zhu
c11d604ab3 store file_indexer info in sequential memory
Summary:
  use arena to allocate space for next_level_index_ and level_rb_
  Thus increasing data locality and make Version::Get faster.

Benchmark detail
Base version: commit d2a727c182

command used:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=2097152 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=fillseq, readrandom,readrandom,readrandom --use_existing_db=0 --num=52428800 --threads=1

Result:
cpu running percentage:
Version::Get, improved from 7.98% to 7.42%
FileIndexer::GetNextLevelIndex, improved from 1.18% to 0.68%.

Test Plan:
  make all check

Reviewers: ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, igor

Differential Revision: https://reviews.facebook.net/D19845
2014-07-16 11:21:30 -07:00
sdong
0abaed2e08 Support multiple DB directories in universal compaction style
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.

Test Plan: Add new unit tests

Reviewers: ljin, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, dhruba, igor, leveldb

Differential Revision: https://reviews.facebook.net/D19869
2014-07-15 12:06:28 -07:00
Feng Zhu
178fd6f9db use FileLevel in LevelFileNumIterator
Summary:
  Use FileLevel in LevelFileNumIterator, thus use new version of findFile.
  Old version of findFile function is deleted.
  Write a function in version_set.cc to generate FileLevel from files_.
  Add GenerateFileLevelTest in version_set_test.cc

Test Plan:
  make all check

Reviewers: ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, dhruba

Differential Revision: https://reviews.facebook.net/D19659
2014-07-11 12:52:41 -07:00
Feng Zhu
f697cad15c create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
    Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
    Thus increase the file meta data locality, speed up "Get" and "FindFile"

    benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"

benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920

Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.

Test Plan:
    make all check

Reviewers: ljin, haobo, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, igor

Differential Revision: https://reviews.facebook.net/D19419
2014-07-09 22:14:39 -07:00
Yueh-Hsuan Chiang
70828557ef Some fixes on size compensation logic for deletion entry in compaction
Summary:
This patch include two fixes:
1. newly created Version will now takes the aggregated stats for average-value-size from the latest Version.
2. compensated size of a file is now computed only for newly created / loaded file, this addresses the issue where files are already sorted by their compensated file size but might sometimes observe some out-of-order due to later update on compensated file size.

Test Plan:
export ROCKSDB_TESTS=CompactionDele
./db_test

Reviewers: ljin, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19557
2014-07-09 12:46:08 -07:00
Igor Canadi
4203431e71 Fix mac os compile error 2014-07-03 23:03:24 +02:00
sdong
2459f7ec4e Support Multiple DB paths (without having an interface to expose to users)
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.

Test Plan: make all check

Reviewers: igor, haobo, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18921
2014-07-02 21:14:44 -07:00
Igor Canadi
a2e0d890ed No need for files_by_size_ in universal compaction
Summary: files_by_size_ is sorted by time in case of universal compaction. However, Version::files_ is also sorted by time. So no need for files_by_size_

Test Plan:
1) make check with the change
2) make check with `assert(last_index == c->input_version_->files_[level].size() - 1);` in compaction picker

Reviewers: dhruba, haobo, yhchiang, sdong, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19125
2014-07-01 08:55:04 +02:00
Yueh-Hsuan Chiang
e813f5b6d9 Allow compaction to reclaim storage more effectively.
Summary:
This diff allows compaction to reclaim storage more effectively.
In the current design, compactions are mainly triggered based on
the file sizes.  However, since deletion entries does not have
value, files which have many deletion entries are less likely
to be compacted.  As a result, it may took a while to make
deletion entries to be compacted.

This diff address issue by compensating the size of deletion
entries during compaction process: the size of each deletion
entry in the compaction process is augmented by 2x average
value size.  The diff applies to both leveled and universal
compacitons.

Test Plan:
develop CompactionDeletionTrigger
make db_test
./db_test

Reviewers: haobo, igor, ljin, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19029
2014-06-24 16:37:06 -06:00
Igor Canadi
d4a8423334 Remove seek compaction
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.

This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.

There is one test case that relied on didIO information. I'll try to find another way to implement it.

Test Plan: make check

Reviewers: sdong, haobo, yhchiang, ljin, dhruba

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19161
2014-06-20 10:23:02 +02:00
Igor Canadi
107e08baa7 Use same sorting for all level 0 files
Summary:
We decided that one of the long term goals is to unify level and universal compaction.

As a small first step, I'm unifying level 0 sorting methods.

Previously, we used to sort level 0 files in level compaction by file number and in universal compaction by sequence number.

But it turns out that in level compaction, sorting by file number is exactly the same as sorting by sequence number.

Test Plan:
Ran make check with bunch of asserts to verify the sorting order is exactly the same.
Also, make check with this patch

Reviewers: haobo, yhchiang, ljin, dhruba, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19131
2014-06-20 09:12:14 +02:00
sdong
cadc1adffa Refactor: group metadata needed to open an SST file to a separate copyable struct
Summary:
We added multiple fields to FileMetaData recently and are planning to add more.
This refactoring separate the minimum information for accessing the file. This object is copyable (FileMetaData is not copyable since the ref counter). I hope this refactoring can enable further improvements:

(1) use it to design a more efficient data structure to speed up read queries.
(2) in the future, when we add information of storage level, we can easily do the encoding, instead of enlarge this structure, which might expand memory work set for file meta data.

The definition is same as current EncodedFileMetaData used in two level iterator, so now the logic in two level iterator is easier to understand.

Test Plan: make all check

Reviewers: haobo, igor, ljin

Reviewed By: ljin

Subscribers: leveldb, dhruba, yhchiang

Differential Revision: https://reviews.facebook.net/D18933
2014-06-16 16:10:52 -07:00
sdong
983c93d731 VersionSet::Get(): Bring back the logic of skipping key range check when there are <=3 level 0 files
Summary:
https://reviews.facebook.net/D17205 removed the logic of skipping file key range check when there are less than 3 level 0 files. This patch brings it back.

Other than that, add another small optimization to avoid to check all the levels if most higher levels don't have any file.

Test Plan: make all check

Reviewers: ljin

Reviewed By: ljin

Subscribers: yhchiang, igor, haobo, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D19035
2014-06-13 15:51:44 -07:00
Lei Jin
c83b085770 prefetch bloom filter data block for L0 files
Summary: as title

Test Plan:
db_bench
the initial result is very promising. I will post results of complete
runs

Reviewers: dhruba, haobo, sdong, igor

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D18867
2014-06-12 10:06:18 -07:00
sdong
df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00
Igor Canadi
6de6a06631 FIFO compaction style
Summary:
Introducing new compaction style -- FIFO.

FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.

FIFO compaction style is suited for storing high-frequency event logs.

Test Plan: Added a unit test

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

Subscribers: alberts, leveldb

Differential Revision: https://reviews.facebook.net/D18765
2014-05-21 11:43:35 -07:00
Igor Canadi
f4574449e9 Clean up compaction logging
Summary: Cleaned up compaction logging a little bit. Now file sizes are easier to read. Also, removed the trailing space.

Test Plan:
verified that i'm happy with logging output:

        files_size[#33(seq=101,sz=98KB,0) #31(seq=81,sz=159KB,0) #26(seq=0,sz=637KB,0)]

Reviewers: sdong, haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18549
2014-05-14 12:13:50 -07:00
sdong
9efbd85ac9 fsync directory after creating current file in NewDB()
Summary: One of our users reported current file corruption. The machine was rebooted during the time. This is the only think I can think of which could cause current file corruption. Just add this paranoid check.

Test Plan: make all check

Reviewers: haobo, igor

Reviewed By: haobo

CC: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18495
2014-05-06 17:51:33 -07:00
Igor Canadi
096f5be0ed Put column family information in LiveFileMetaData
Summary: As summary

Test Plan: compiles :)

Reviewers: dhruba, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18405
2014-04-30 16:24:52 -04:00
Lei Jin
ccaca59bee avoid calling FindFile twice in TwoLevelIterator for PlainTable
Summary:
this is to reclaim the regression introduced in
https://reviews.facebook.net/D17853

Test Plan: make all check

Reviewers: igor, haobo, sdong, dhruba, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17985
2014-04-25 12:23:07 -07:00
Lei Jin
d642c60bdc Check PrefixMayMatch on Seek()
Summary:
As a follow-up diff for https://reviews.facebook.net/D17805, add
optimization to check PrefixMayMatch on Seek()

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17853
2014-04-25 12:22:23 -07:00
Lei Jin
3995e801ab kill ReadOptions.prefix and .prefix_seek
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor

Test Plan: make all check

Reviewers: igor, haobo, sdong, yhchiang

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D17805
2014-04-25 12:21:34 -07:00
Igor Canadi
ad3cd39ccd Column family logging
Summary:
Now that we have column families involved, we need to add extra context to every log message. They now start with "[column family name] log message"

Also added some logging that I think would be useful, like level summary after every flush (I often needed that when going through the logs).

Test Plan: make check + ran db_bench to confirm I'm happy with log output

Reviewers: dhruba, haobo, ljin, yhchiang, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18303
2014-04-25 09:51:16 -04:00
Lei Jin
0f2d768191 hints for narrowing down FindFile range and avoiding checking unrelevant L0 files
Summary:
The file tree structure in Version is prebuilt and the range of each file is known.
On the Get() code path, we do binary search in FindFile() by comparing
target key with each file's largest key and also check the range for each L0 file.
With some pre-calculated knowledge, each key comparision that has been done can serve
as a hint to narrow down further searches:
(1) If a key falls within a L0 file's range, we can safely skip the next
file if its range does not overlap with the current one.
(2) If a key falls within a file's range in level L0 - Ln-1, we should only
need to binary search in the next level for files that overlap with the current one.

(1) will be able to skip some files depending one the key distribution.
(2) can greatly reduce the range of binary search, especially for bottom
levels, given that one file most likely only overlaps with N files from
the level below (where N is max_bytes_for_level_multiplier). So on level
L, we will only look at ~N files instead of N^L files.

Some inital results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this
improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s =
9.6M/s), it gives us ~13% improvement.

Test Plan: make all check

Reviewers: haobo, igor, dhruba, sdong, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17205
2014-04-21 09:10:12 -07:00
sdong
651792251a Fix bugs introduced by D17961
Summary:
D17961 has two bugs:
(1) two level iterator fails to populate FileMetaData.table_reader, causing performance regression.
(2) table cache handle the !status.ok() case in the wrong place, causing seg fault which shouldn't happen.

Test Plan: make all check

Reviewers: ljin, igor, haobo

Reviewed By: ljin

CC: yhchiang, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D17991
2014-04-17 17:25:28 -07:00
sdong
fa430bfd04 Minimize accessing multiple objects in Version::Get()
Summary:
One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch:
(1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects
(2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it
(3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: igor, yhchiang, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D17739
2014-04-17 14:14:00 -07:00
Igor Canadi
588bca2020 RocksDBLite
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.

Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)

Test Plan: compiles :)

Reviewers: dhruba, haobo, ljin, sdong, yhchiang

Reviewed By: yhchiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17835
2014-04-15 13:39:26 -07:00
Igor Canadi
e6acb874cd Don't roll empty logs
Summary:
With multiple column families, especially when manual Flush is executed, we might roll the log file, although the current log file is empty (no data has been written to the log).

After the diff, we won't create new log file if current is empty.

Next, I will write an algorithm that will flush column families that reference old log files (i.e., that weren't flushed in a while)

Test Plan: Added an unit test. Confirmed that unit test failes in master

Reviewers: dhruba, haobo, ljin, sdong

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17631
2014-04-15 09:57:25 -07:00
Igor Canadi
2014915d32 Fix ASAN issue 2014-04-09 10:38:05 -07:00
Igor Canadi
b947fdc89d Column family support for DB::OpenForReadOnly()
Summary: When opening DB in read-only mode, client can choose to only specify a subset of column families ("default" column family can't be omitted, though)

Test Plan: added a unit test in column_family_test

Reviewers: haobo, sdong, ljin, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17565
2014-04-09 09:56:17 -07:00
Igor Canadi
5b345b76cb Remove env_ from MergingIterator
Summary: env_ is not used. Compiling for iOS complains.

Test Plan: compiles now

Reviewers: ljin, haobo, sdong, dhruba

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17589
2014-04-08 13:40:42 -07:00
Igor Canadi
3d2fe844ab Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/memtable_list.cc
	db/version_set.cc
2014-04-07 11:31:11 -07:00
sdong
284c365b77 Fix valgrind error caused by FileMetaData as two level iterator's index block handle
Summary: It is a regression valgrind bug caused by using FileMetaData as index block handle. One of the fields of FileMetaData is not initialized after being contructed and copied, but I'm not able to find which one. Also, I realized that it's not a good idea to use FileMetaData as in TwoLevelIterator::InitDataBlock(), a copied FileMetaData can be compared with the one in version set byte by byte, but the refs can be changed. Also comparing such a large structure is slightly more expensive. Use a simpler structure instead

Test Plan:
Run the failing valgrind test (Harness.RandomizedLongDB)
make all check

Reviewers: igor, haobo, ljin

Reviewed By: igor

CC: yhchiang, leveldb

Differential Revision: https://reviews.facebook.net/D17409
2014-04-02 14:38:28 -07:00
Igor Canadi
ddbd1ece88 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-31 13:39:24 -07:00
Igor Canadi
577556d5f9 Don't store version number in MANIFEST
Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this.

Test Plan: make check

Reviewers: haobo, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17343
2014-03-31 11:33:09 -07:00
Igor Canadi
6a08bc042a Fix no return warning in FileComparator 2014-03-26 14:46:07 -07:00
Igor Canadi
1e9621d4e5 Sort files correctly in Builder::SaveTo
Summary:
Previously, we used to sort all files by BySmallestFirst comparator and then re-sort level0 files in the Finalize() (recently moved to end of SaveTo).

In this diff, I chose the correct comparator at the beginning and sort the files correctly in Builder::SaveTo.

I also added a verification that all files are sorted correctly in CheckConsistency()

NOTE: This diff depends on D17037

Test Plan: make check. Will also run db_stress

Reviewers: dhruba, haobo, sdong, ljin

Reviewed By: ljin

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17049
2014-03-26 13:30:14 -07:00
Igor Canadi
ad9a39c9b4 [RocksDB] Preallocate new MANIFEST files
Summary: We don't preallocate MANIFEST file, even though we have an option for that. This diff preallocates manifest file every time we create it

Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17163
2014-03-26 09:37:53 -07:00
sdong
6b2e7a2a01 When Options.max_num_files=-1, non level0 files also by pass table cache
Summary:
This is the part that was not finished when doing the Options.max_num_files=-1 feature. For iterating non level0 SST files (which was done using two level iterator), table cache is not bypassed. With this patch, the leftover feature is done.

Test Plan: make all check; change Options.max_num_files=-1 in one of the tests to cover the codes.

Reviewers: haobo, igor, dhruba, ljin, yhchiang

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17001
2014-03-25 18:40:52 -07:00
Igor Canadi
e8168382c4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	include/rocksdb/options.h
	util/options.cc
2014-03-25 11:09:40 -07:00
Yueh-Hsuan Chiang
cda4006e87 Enhance partial merge to support multiple arguments
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
  number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
  includes necessary changes such as updating the pure C api for
  partial merge.

Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
  operands.

TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
  use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.

Reviewers: haobo, igor, vamsi

Reviewed By: haobo

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D16815
2014-03-24 17:57:13 -07:00
Igor Canadi
e20fa3f8a4 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/internal_stats.cc
	db/internal_stats.h
	db/version_set.cc
2014-03-19 17:22:20 -07:00
Igor Canadi
fcd5c5e828 ComputeCompactionScore in CompactionPicker
Summary:
As it turns out, we need the call to ComputeCompactionScore (previously: Finalize) in CompactionPicker.

The issue caused a deadlock in db_stress: http://ci-builds.fb.com/job/rocksdb_crashtest/290/console

The last two lines before a deadlock were:
2014/03/18-22:43:41.481029 7facafbee700 (Original Log Time 2014/03/18-22:43:41.480989) Compaction nothing to do
2014/03/18-22:43:41.481041 7faccf7fc700 wait for fewer level0 files...

"Compaction nothing to do" and other thread waiting for fewer level0 files. Hm hm.

I moved the pre-sorting to SaveTo, which should fix both the original and the new issue.

Test Plan: make check for now, will run db_stress in jenkins

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17037
2014-03-19 16:52:26 -07:00
Lei Jin
6dc940d4c9 avoid shared_ptr assignment in Version::Get()
Summary:
This is a 500ns operation while the whole Get() call takes only a few
micro!

Test Plan: ran db_bench, for a DB with 50M keys, QPS jumps from 5.2M/s to 7.2M/s

Reviewers: haobo, igor, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17007
2014-03-19 10:54:32 -07:00
Igor Canadi
69aa6ecb26 Finalize fist version in column family 2014-03-18 14:23:47 -07:00
Igor Canadi
e25819a185 Merge branch 'master' into columnfamilies
Conflicts:
	db/version_set.cc
2014-03-18 14:00:20 -07:00
Igor Canadi
758fa8c359 Don't Finalize in CompactionPicker
Summary:
Finalize re-sorts (read: mutates) the files_ in Version* and it is called by CompactionPicker during normal runtime. At the same time, this same Version* lives in the SuperVersion* and is accessed without the mutex in GetImpl() code path.

Mutating the files_ in one thread and reading the same files_ in another thread is a bad idea. It caused this issue: http://ci-builds.fb.com/job/rocksdb_crashtest/285/console

Long-term, we need to be more careful with method contracts and clearly document what state can be mutated when. Now that we are much faster because we don't lock in GetImpl(), we keep running into data races that were not a problem before when we were slower. db_stress has been very helpful in detecting those.

Short-term, I removed Finalize() from CompactionPicker.

Note: I believe this is an issue in current 2.7 version running in production.

Test Plan:
make check
Will also run db_stress to see if issue is gone

Reviewers: sdong, ljin, dhruba, haobo

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16983
2014-03-18 13:59:59 -07:00
Igor Canadi
3055a15b29 Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/version_edit.cc
	db/version_edit.h
	db/version_set.cc
2014-03-18 13:24:27 -07:00
Lei Jin
63cef90078 disable the log_number check in Recover()
Summary:
There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed.
This check would fail them. Change it to log instead of returning a
Corruption status.

Test Plan: make

Reviewers: haobo, igor

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16923
2014-03-18 12:46:29 -07:00
Igor Canadi
bcea9c1296 Finalize version in dumpmanifest 2014-03-18 09:45:52 -07:00
Igor Canadi
f26cb0f093 Optimize fallocation
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761
2014-03-17 21:52:14 -07:00
Igor Canadi
ae25742af9 Fix race condition in manifest roll
Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked

If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)

I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.

However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?

Test Plan: make check

Reviewers: ljin, dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16929
2014-03-17 21:50:15 -07:00
Lei Jin
0cf6c8f7ce fix: use the correct edit when comparing log_number
Summary:
In the last fix, I forgot to point to the writer when comparing edit,
which is apparently not correct.

Test Plan: still running make whitebox_crash_test

Reviewers: igor, haobo, igor2

Reviewed By: igor2

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16911
2014-03-15 23:30:43 -07:00
Lei Jin
453ec52ca1 journal log_number correctly in MANIFEST
Summary:
Here is what it can cause probelm:
There is one memtable flush and one compaction. Both call LogAndApply(). If both edits are applied in the same batch with flush edit first and the compaction edit followed. LogAndApplyHelper() will assign compaction edit current VersionSet's log number(which should be smaller than the log number from flush edit). It cause log_numbers in MANIFEST to be not monotonic increasing, which violates the assume Recover() makes. What is more is after comitting to MANIFEST file, log_number_ in VersionSet is updated to the log_number from the last edit, which is the compaction one. It ends up not updating the log_number.

Test Plan:
make whitebox_crash_test
got another assertion about iter->valid(), not sure if that is related
to this.

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16875
2014-03-14 18:36:47 -07:00
Igor Canadi
a782bb989e Fix log_number in LogAndApply 2014-03-14 13:45:30 -07:00
Igor Canadi
f0e1e3ebf1 CF cleanup part 2 2014-03-13 14:34:01 -07:00
Igor Canadi
b5d6ad69fc Bug fixes introduced by code cleanup 2014-03-12 11:10:26 -07:00
Igor Canadi
fb2346fc1f [CF] Code cleanup part 1
Summary:
I'm cleaning up some code preparing for the big diff review tomorrow. This is the first part of the cleanup.

Changes are mostly cosmetic. The goal is to decrease amount of code difference between columnfamilies and master branch.

This diff also fixes race condition when dropping column family.

Test Plan: Ran db_stress with variety of parameters

Reviewers: dhruba, haobo

Differential Revision: https://reviews.facebook.net/D16833
2014-03-12 09:56:53 -07:00
Igor Canadi
dad8603fc4 [CF] Fix column family dropping
Summary: Column family should be dropped after the change has been commited

Test Plan: db stress

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16779
2014-03-11 09:47:29 -07:00
Igor Canadi
9634ba42ac Merge branch 'master' into columnfamilies
Conflicts:
	db/compaction_picker.cc
	db/db_impl.cc
	db/db_impl.h
	db/tailing_iter.cc
	db/version_set.h
	include/rocksdb/options.h
	util/options.cc
2014-03-10 17:26:09 -07:00
Igor Canadi
04d2c26e17 Add option verify_checksums_in_compaction
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.

Test Plan: corruption_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16695
2014-03-10 10:06:34 -07:00
sdong
ecb1ffa2a8 Buffer info logs when picking compactions and write them out after releasing the mutex
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.

Test Plan:
make all check
check the log lines while running some tests that trigger compactions.

Reviewers: haobo, igor, dhruba

Reviewed By: dhruba

CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-

Differential Revision: https://reviews.facebook.net/D16515
2014-03-05 15:36:32 -08:00
Igor Canadi
9625acbf70 [CF] Dont reuse dropped column family IDs
Summary:
Column family IDs should be unique, even if column family is dropped. To achieve this, we save max column family in manifest.

Note that the diff is still not ready. I'm only using differential to move the patch to my Mac machine.

Test Plan: added a test to column_family_test

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16581
2014-03-05 12:13:44 -08:00
Igor Canadi
9d0577a6be Merge branch 'master' into columnfamilies
Conflicts:
	db/db_impl.cc
	db/db_impl.h
	db/transaction_log_impl.cc
	db/transaction_log_impl.h
	include/rocksdb/options.h
	util/env.cc
	util/options.cc
2014-03-03 18:29:03 -08:00
Igor Canadi
5142b37000 Fix a group commit bug in LogAndApply
Summary:
EncodeTo(&record) does not overwrite, it appends to it.

This means that group commit log and apply will look something like:
record1
record1record2
record1record2record3

I'm surprised this didn't show up in production, but I think the reason is that MANIFEST group commit almost never happens.

This bug turned up in column family work, where opening a database failed with "adding a same column family twice".

Test Plan: Tested the change in column family branch and observed that the problem is gone (with db_stress)

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16461
2014-03-03 17:10:43 -08:00
kailiu
bf86af5174 Remove the terrible hack in for flush_block_policy_factory
Summary:
Previous code is too convoluted and I must be drunk for letting
such code to be written without a second thought.

Thanks to the discussion with @sdong, I added the `Options` when
generating the flusher, thus avoiding the tricks.

Just FYI: I resisted to add Options in flush_block_policy.h since I
wanted to avoid cyclic dependencies: FlushBlockPolicy dpends on Options
and Options also depends FlushBlockPolicy... While I appreciate my
effort to prevent it, the old design turns out creating more troubles than
it tried to avoid.

Test Plan: ran ./table_test

Reviewers: sdong

Reviewed By: sdong

CC: sdong, leveldb

Differential Revision: https://reviews.facebook.net/D16503
2014-02-28 16:39:27 -08:00
Igor Canadi
8ea21a778b [CF] Rething LogAndApply for column families
Summary:
I though I might get away with as little changes to LogAndApply() as possible. It turns out this is not the case.

This diff introduces different behavior of LogAndApply() for three cases:
1. column family add
2. column family drop
3. no-column family manipulation

(1) and (2) don't support group commit yet.

There were a lot of problems with old version od LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if column family add is in manifest, it also has to be in in-memory data structure).

Test Plan: db_stress

Reviewers: dhruba, haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16491
2014-02-28 14:46:48 -08:00
Igor Canadi
58ca641d53 Make Log::Reader more robust
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files

(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.

Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.

Reviewers: dhruba, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16119
2014-02-28 13:19:47 -08:00
Igor Canadi
12966ec1bb Fix LogAndApply() group commit 2014-02-28 12:22:45 -08:00