Commit Graph

257 Commits

Author SHA1 Message Date
Dhruba Borthakur
e45c7a8444 Abilty to support upto a million .sst files in the database
Summary:
There was an artifical limit of 50K files per database. This is
insifficient if the database is 1 TB in size and each file is 2 MB.

Test Plan: make check

Reviewers: sheki, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8919
2013-02-26 16:27:51 -08:00
Abhishek Kona
a9866b721b Refactor statistics. Remove individual functions like incNumFileOpens
Summary:
Use only the counter mechanism. Do away with
incNumFileOpens, incNumFileClose, incNumFileErrors
s/NULL/nullptr/g in db/table_cache.cc

Test Plan: make clean check

Reviewers: dhruba, heyongqiang, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8841
2013-02-25 13:58:34 -08:00
Abhishek Kona
959337ed5b Measure compaction time.
Summary: just record time consumed in compaction

Test Plan: compile

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8781
2013-02-22 11:38:40 -08:00
Abhishek Kona
ec77366e14 Counters for bytes written and read.
Summary:
* Counters for bytes read and write.
as a part of this diff, I want to=>
* Measure compaction times. @dhruba can you point which function, should
* I time to get Compaction-times. Was looking at CompactRange.

Test Plan: db_test

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8763
2013-02-21 16:06:32 -08:00
Vamsi Ponnekanti
6abb30d4d0 [Missed adding cmdline parsing for new flags added in D8685]
Summary:
I had added FLAGS_numdistinct and FLAGS_deletepercent for randomwithverify
but forgot to add cmdline parsing for those flags.

Test Plan:
[nponnekanti@dev902 /data/users/nponnekanti/rocksdb] ./db_bench --benchmarks=randomwithverify --numdistinct=500
LevelDB:    version 1.5
Date:       Thu Feb 21 10:34:40 2013
CPU:        24 * Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
CPUCache:   12288 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Compression: snappy
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Created bg thread 0x7fbf90bff700
randomwithverify :       4.693 micros/op 213098 ops/sec; ( get:900000 put:80000 del:20000 total:1000000 found:714556)

[nponnekanti@dev902 /data/users/nponnekanti/rocksdb] ./db_bench --benchmarks=randomwithverify --deletepercent=5
LevelDB:    version 1.5
Date:       Thu Feb 21 10:35:03 2013
CPU:        24 * Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
CPUCache:   12288 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Compression: snappy
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Created bg thread 0x7fe14dfff700
randomwithverify :       4.883 micros/op 204798 ops/sec; ( get:900000 put:50000 del:50000 total:1000000 found:443847)
[nponnekanti@dev902 /data/users/nponnekanti/rocksdb]
[nponnekanti@dev902 /data/users/nponnekanti/rocksdb] ./db_bench --benchmarks=randomwithverify --deletepercent=5 --numdistinct=500
LevelDB:    version 1.5
Date:       Thu Feb 21 10:36:18 2013
CPU:        24 * Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
CPUCache:   12288 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Compression: snappy
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Created bg thread 0x7fc31c7ff700
randomwithverify :       4.920 micros/op 203233 ops/sec; ( get:900000 put:50000 del:50000 total:1000000 found:445522)

Revert Plan: OK

Task ID: #

Reviewers: dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8769
2013-02-21 12:26:32 -08:00
Vamsi Ponnekanti
945d2b59b9 [Add randomwithverify benchmark option]
Summary: Added RandomWithVerify benchmark option.

Test Plan:
This whole diff is to test.
[nponnekanti@dev902 /data/users/nponnekanti/rocksdb] ./db_bench --benchmarks=randomwithverify
LevelDB:    version 1.5
Date:       Tue Feb 19 17:50:28 2013
CPU:        24 * Intel(R) Xeon(R) CPU           X5650  @ 2.67GHz
CPUCache:   12288 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Compression: snappy
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Created bg thread 0x7fa9c3fff700
randomwithverify :       5.004 micros/op 199836 ops/sec; ( get:900000 put:80000 del:20000 total:1000000 found:711992)

Revert Plan: OK

Task ID: #

Reviewers: dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8685
2013-02-21 10:27:02 -08:00
amayank
b2c50f1c3f Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value
Summary:
Changed the Get and Scan options with openForReadOnly mode to have access to the memtable.
Changed the visibility of NewInternalIterator in db_impl from private to protected so that
the derived class db_impl_read_only can call that in its NewIterator function for the
scan case. The previous approach which changed the default for flush_on_destroy_ from false to true
caused many problems in the unit tests due to empty sst files that it created. All
unit tests pass now.

Test Plan: make clean; make all check; ldb put and get and scans

Reviewers: dhruba, heyongqiang, sheki

Reviewed By: dhruba

CC: kosievdmerwe, zshao, dilipj, kailiu

Differential Revision: https://reviews.facebook.net/D8697
2013-02-20 10:45:52 -08:00
Abhishek Kona
fe10200ddc Introduce histogram in statistics.h
Summary:
* Introduce is histogram in statistics.h
* stop watch to measure time.
* introduce two timers as a poc.
Replaced NULL with nullptr to fight some lint errors
Should be useful for google.

Test Plan:
ran db_bench and check stats.
make all check

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8637
2013-02-20 10:43:32 -08:00
amayank
f3901e0647 Revert "Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value"
This reverts commit 4c696ed001.
2013-02-18 22:32:27 -08:00
Dhruba Borthakur
fd367e677e Fix unit test failure in db_filename.cc
Summary:
    c_test: db/filename.cc:74: std::string leveldb::DescriptorFileName(const string&,....

Test Plan:
  this is a failure in a unit test

Differential Revision: https://reviews.facebook.net/D8667
2013-02-18 21:53:56 -08:00
Dhruba Borthakur
4564915446 Zero out redundant sequence numbers for kvs to increase compression efficiency
Summary:
The sequence numbers in each record eat up plenty of space on storage.
The optimization zeroes out sequence numbers on kvs in the Lmax
layer that are earlier than the earliest snapshot.

Test Plan: Unit test attached.

Differential Revision: https://reviews.facebook.net/D8619
2013-02-18 21:51:15 -08:00
amayank
4c696ed001 Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value
Summary:
flush_on_destroy has a default value of false and the memtable is flushed
in the dbimpl-destructor only when that is set to true. Because we want the memtable to be flushed everytime that
the destructor is called(db is closed) and the cases where we work with the memtable only are very less
it is a good idea to give this a default value of true. Thus the put from ldb
wil have its data flushed to disk in the destructor and the next Get will be able to
read it when opened with OpenForReadOnly. The reason that ldb could read the latest value when
the db was opened in the normal Open mode is that the Get from normal Open first reads
the memtable and directly finds the latest value written there and the Get from OpenForReadOnly
doesn't have access to the memtable (which is correct because all its Put/Modify) are disabled

Test Plan: make all; ldb put and get and scans

Reviewers: dhruba, heyongqiang, sheki

Reviewed By: heyongqiang

CC: kosievdmerwe, zshao, dilipj, kailiu

Differential Revision: https://reviews.facebook.net/D8631
2013-02-15 16:56:06 -08:00
Kai Liu
b63aafce42 Allow the logs to be purged by TTL.
Summary:
* Add a SplitByTTLLogger to enable this feature. In this diff I implemented generalized AutoSplitLoggerBase class to simplify the
development of such classes.
* Refactor the existing AutoSplitLogger and fix several bugs.

Test Plan:
* Added a unit tests for different types of "auto splitable" loggers individually.
* Tested the composited logger which allows the log files to be splitted by both TTL and log size.

Reviewers: heyongqiang, dhruba

Reviewed By: heyongqiang

CC: zshao, leveldb

Differential Revision: https://reviews.facebook.net/D8037
2013-02-04 19:42:40 -08:00
Chip Turner
0b83a83191 Fix poor error on num_levels mismatch and few other minor improvements
Summary:
Previously, if you opened a db with num_levels set lower than
the database, you received the unhelpful message "Corruption:
VersionEdit: new-file entry."  Now you get a more verbose message
describing the issue.

Also, fix handling of compression_levels (both the run-over-the-end
issue and the memory management of it).

Lastly, unique_ptr'ify a couple of minor calls.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8151
2013-01-25 15:37:26 -08:00
Chip Turner
772f75b3fb Stop continually re-creating build_version.c
Summary:
We continually rebuilt build_version.c because we put the
current date into it, but that's what __DATE__ already is.  This makes
builds faster.

This also fixes an issue with 'make clean FOO' not working properly.

Also tweak the build rules to be more consistent, always have warnings,
and add a 'make release' rule to handle flags for release builds.

Test Plan: make, make clean

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D8139
2013-01-24 17:51:39 -08:00
Chip Turner
3dafdfb2c4 Use fallocate to prevent excessive allocation of sst files and logs
Summary:
On some filesystems, pre-allocation can be a considerable
amount of space.  xfs in our production environment pre-allocates by
1GB, for instance.  By using fallocate to inform the kernel of our
expected file sizes, we eliminate this wasteage (that isn't recovered
until the file is closed which, in the case of LOG files, can be a
considerable amount of time).

Test Plan:
created an xfs loopback filesystem, mounted with
allocsize=4M, and ran db_stress.  LOG file without this change was 4M,
and with it it was 128k then grew to normal size.

Reviewers: dhruba

Reviewed By: dhruba

CC: adsharma, leveldb

Differential Revision: https://reviews.facebook.net/D7953
2013-01-24 12:25:13 -08:00
Chip Turner
2fdf91a4f8 Fix a number of object lifetime/ownership issues
Summary:
Replace manual memory management with std::unique_ptr in a
number of places; not exhaustive, but this fixes a few leaks with file
handles as well as clarifies semantics of the ownership of file handles
with log classes.

Test Plan: db_stress, make check

Reviewers: dhruba

Reviewed By: dhruba

CC: zshao, leveldb, heyongqiang

Differential Revision: https://reviews.facebook.net/D8043
2013-01-23 16:54:11 -08:00
Abhishek Kona
16903c35b0 Add counters to count gets and writes
Summary: Add Tickers to count Write's and Get's

Test Plan: make check

Reviewers: dhruba, chip

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7977
2013-01-17 12:27:56 -08:00
Kosie van der Merwe
3c3df7402f Fixed issues Valgrind found.
Summary:
Found issues with `db_test` and `db_stress` when running valgrind.

`DBImpl` had an issue where if an compaction failed then it will use the uninitialised file size of an output file is used. This manifested as the final call to output to the log in `DoCompactionWork()` branching on uninitialized memory (all the way down in printf's innards).

Test Plan:
Ran `valgrind --track_origins=yes ./db_test` and `valgrind ./db_stress` to see if issues disappeared.

Ran `make check` to see if there were no regressions.

Reviewers: vamsi, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8001
2013-01-17 10:04:45 -08:00
Abhishek Kona
7d5a4383bb rollover manifest file.
Summary:
Check in LogAndApply if the file size is more than the limit set in
Options.
Things to consider : will this be expensive?

Test Plan: make all check. Inputs on a new unit test?

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7701
2013-01-16 12:09:44 -08:00
Chip Turner
a2dcd79c1e Add optional clang compile mode
Summary:
clang is an alternate compiler based on llvm.  It produces
nicer error messages and finds some bugs that gcc doesn't, such as the
size_t change in this file (which caused some write return values to be
misinterpreted!)

Clang isn't the default; to try it, do "USE_CLANG=1 make" or "export
USE_CLANG=1" then make as normal

Test Plan: "make check" and "USE_CLANG=1 make check"

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7899
2013-01-15 18:48:37 -08:00
Chip Turner
9bbcab57a9 Fix broken build
Summary: Mis-merged from HEAD, had a duplicate declaration.

Test Plan: make -j32 OPT=-g

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7911
2013-01-15 14:05:49 -08:00
Kosie van der Merwe
28fe86c48a Fixed bug with seek compactions on Level 0
Summary: Due to how the code handled compactions in Level 0 in `PickCompaction()` it could be the case that two compactions on level 0 ran that produced tables in level 1 that overlap. However, this case seems like it would only occur on a seek compaction which is unlikely on level 0. Furthermore, level 0 and level 1 had to have a certain arrangement of files.

Test Plan:
make check

Reviewers: dhruba, vamsi

Reviewed By: dhruba

CC: leveldb, sheki

Differential Revision: https://reviews.facebook.net/D7923
2013-01-15 12:43:09 -08:00
Chip Turner
c0cb289d57 Various build cleanups/improvements
Summary:
Specific changes:

1) Turn on -Werror so all warnings are errors
2) Fix some warnings the above now complains about
3) Add proper dependency support so changing a .h file forces a .c file
to rebuild
4) Automatically use fbcode gcc on any internal machine rather than
whatever system compiler is laying around
5) Fix jemalloc to once again be used in the builds (seemed like it
wasn't being?)
6) Fix issue where 'git' would fail in build_detect_version because of
LD_LIBRARY_PATH being set in the third-party build system

Test Plan:
make, make check, make clean, touch a header file, make sure
rebuild is expected

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7887
2013-01-14 18:40:22 -08:00
Mark Callaghan
2ba125faf6 fix warning for unused variable
Test Plan: compile

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7857
2013-01-11 15:00:47 -08:00
Abhishek Kona
85ad13be1a Port fix for Leveldb manifest writing bug from Open-Source
Summary:
Pretty much a blind copy of the patch in open source.
Hope to get this in before we make a release

Test Plan: make clean check

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7809
2013-01-10 12:06:03 -08:00
Kosie van der Merwe
d8371ef1f6 Fixing some issues Valgrind found
Summary: Found some issues running Valgrind on `db_test` (there are still some outstanding ones) and fixed them.

Test Plan:
make check

ran `valgrind ./db_test` and saw that errors no longer occur

Reviewers: dhruba, vamsi, emayanke, sheki

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7803
2013-01-08 12:16:40 -08:00
Dhruba Borthakur
628dc2aad9 db_bench should use the default value for max_grandparent_overlap_factor.
Summary:
This was a peformance regression caused by https://reviews.facebook.net/D6729.
The default value of max_grandparent_overlap_factor was erroneously
set to 0 in db_bench.

This was causing compactions to create really really small files because the max_grandparent_overlap_factor was erroneously set to zero in the benchmark.

Test Plan: Run --benchmarks=overwrite

Reviewers: heyongqiang, emayanke, sheki, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7797
2013-01-08 11:21:11 -08:00
Kosie van der Merwe
d6e873f22f Added clearer error message for failure to create db directory in DBImpl::Recover()
Summary:
Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes and error.

Fixed CreateDirIfMissing() and added Env.DirExists()

Test Plan:
make check to test for regessions

Ran the following to test if the error message is not about lock files not existing
./db_bench --db=dir/testdb

After creating a file "testdb", ran the following to see if it failed with sane error message:
./db_bench --db=testdb

Reviewers: dhruba, emayanke, vamsi, sheki

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7707
2013-01-07 10:11:18 -08:00
Mark Callaghan
4069f66cc7 Add --seed, --read_range to db_bench
Summary:
Adds the option --seed to db_bench to specify the base for the per-thread RNG.
When not set each thread uses the same value across runs of db_bench which defeats
IO stress testing.

Adds the option --read_range. When set to a value > 1 an iterator is created and
each query done for the randomread benchmark will do a range scan for that many
rows. When not set or set to 1 the existing behavior (a point lookup) is done.

Fixes a bug where a printf format string was missing.

Test Plan: run db_bench

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7749
2013-01-07 09:56:10 -08:00
Kosie van der Merwe
8cd86a7be5 Fixing and adding some comments
Summary:
`MemTableList::Add()` neglected to mention that it took ownership of the reference held by its caller.

The comment in `MemTable::Get()` was wrong in describing the format of the key.

Test Plan: None

Reviewers: dhruba, sheki, emayanke, vamsi

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7755
2013-01-03 17:13:56 -08:00
Dhruba Borthakur
d7d43ae21a ExtendOverlappingInputs too slow for large databases.
Summary:
There was a bug in the ExtendOverlappingInputs method so that
the terminating condition for the backward search was incorrect.

Test Plan: make clean check

Reviewers: sheki, emayanke, MarkCallaghan

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7725
2013-01-02 13:19:06 -08:00
Dhruba Borthakur
f4c2b7cf97 Enhance ReadOnly mode to process the all committed transactions.
Summary:
Leveldb has an api OpenForReadOnly() that opens the database
in readonly mode. This call had an option to not process the
transaction log.  This patch removes this option and always
processes all transactions that had been committed. It has
been done in such a way that it does not create/write to
any new files in the process. The invariant of "no-writes"
to the leveldb data directory is still true.

This enhancement allows multiple threads to open the same database
in readonly mode and access all trancations that were committed right
upto the OpenForReadOnly call.

I changed the public API to match the new semantics because
there are no users who are currently using this api.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7479
2012-12-19 16:30:46 -08:00
Dhruba Borthakur
3d1e92b05a Enhancements to rocksdb for better support for replication.
Summary:
1. The OpenForReadOnly() call should not lock the db. This is useful
so that multiple processes can open the same database concurrently
for reading.
2. GetUpdatesSince should not error out if the archive directory
does not exist.
3. A new constructor for WriteBatch that can takes a serialized
string as a parameter of the constructor.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7449
2012-12-17 11:40:19 -08:00
Kosie van der Merwe
62d48571de Added meta-database support.
Summary:
Added kMetaDatabase for meta-databases in db/filename.h along with supporting
fuctions.
Fixed switch in DBImpl so that it also handles kMetaDatabase.
Fixed DestroyDB() that it can handle destroying meta-databases.

Test Plan: make check

Reviewers: sheki, emayanke, vamsi, dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7245
2012-12-17 11:26:59 -08:00
Abhishek Kona
2f0585fb97 Fix a bug. Where DestroyDB deletes a non-existant archive directory.
Summary:
C tests would fail sometimes as DestroyDB would return a Failure Status
message when deleting an archival directory which was not created
(WAL_ttl_seconds = 0).

Fix: Ignore the Status returned on Deleting Archival Directory.

Test Plan: * make check

Reviewers: dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7395
2012-12-17 10:25:26 -08:00
Zheng Shao
c28097538a manifest_dump: Add --hex=1 option
Summary: Without this option, manifest_dump does not print binary keys for files in a human-readable way.

Test Plan:
./manifest_dump --hex=1 --verbose=0 --file=/data/users/zshao/fdb_comparison/leveldb/fbobj.apprequest-0_0_original/MANIFEST-000002
manifest_file_number 589 next_file_number 590 last_sequence 2311567 log_number 543  prev_log_number 0
--- level 0 --- version# 0 ---
 532:1300357['0000455BABE20000' @ 2183973 : 1 .. 'FFFCA5D7ADE20000' @ 2184254 : 1]
 536:1308170['000198C75CE30000' @ 2203313 : 1 .. 'FFFCF94A79E30000' @ 2206463 : 1]
 542:1321644['0002931AA5E50000' @ 2267055 : 1 .. 'FFF77B31C5E50000' @ 2270754 : 1]
 544:1286390['000410A309E60000' @ 2278592 : 1 .. 'FFFE470A73E60000' @ 2289221 : 1]
 538:1298778['0006BCF4D8E30000' @ 2217050 : 1 .. 'FFFD77DAF7E30000' @ 2220489 : 1]
 540:1282353['00090D5356E40000' @ 2231156 : 1 .. 'FFFFF4625CE40000' @ 2231969 : 1]
--- level 1 --- version# 0 ---
 510:2112325['000007F9C2D40000' @ 1782099 : 1 .. '146F5B67B8D80000' @ 1905458 : 1]
 511:2121742['146F8A3023D60000' @ 1824388 : 1 .. '28BC8FBB9CD40000' @ 1777993 : 1]
 512:801631['28BCD396F1DE0000' @ 2080191 : 1 .. '3082DBE9ADDB0000' @ 1989927 : 1]

Reviewers: dhruba, sheki, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7425
2012-12-16 08:58:28 -08:00
Abhishek Kona
2ba866e0c5 GetSequence API in write batch.
Summary:
WriteBatch is now used by the GetUpdatesSinceAPI. This API is external
and will be used by the rocks server. Rocks Server and others will need
to know about the Sequence Number in the WriteBatch. This public method
will allow for that.

Test Plan: make all check.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7293
2012-12-12 22:21:10 -08:00
Abhishek Kona
22c283625b Fix Bug in Binary Search for files containing a seq no. and delete Archived Log Files during Destroy DB.
Summary:
* Fixed implementation bug in Binary_Searvch introduced in https://reviews.facebook.net/D7119
* Binary search is also overflow safe.
* Delete archive log files and archive dir during DestroyDB

Test Plan: make check

Reviewers: dhruba

CC: kosievdmerwe, emayanke

Differential Revision: https://reviews.facebook.net/D7263
2012-12-11 16:15:02 -08:00
Dhruba Borthakur
24fc379273 An public api to fetch the latest transaction id.
Summary:
Implement a interface to retrieve the most current transaction
id from the database.

Test Plan: Added unit test.

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7269
2012-12-10 16:04:19 -08:00
Abhishek Kona
1c6742e32f Refactor GetArchivalDirectoryName to filename.h
Summary:
filename.h has functions to do similar things.
Moving code away from db_impl.cc

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7251
2012-12-10 10:51:07 -08:00
Abhishek Kona
8055008909 GetUpdatesSince API to enable replication.
Summary:
How it works:
* GetUpdatesSince takes a SequenceNumber.
* A LogFile with the first SequenceNumber nearest and lesser than the requested Sequence Number is found.
* Seek in the logFile till the requested SeqNumber is found.
* Return an iterator which contains logic to return record's one by one.

Test Plan:
* Test case included to check the good code path.
* Will update with more test-cases.
* Feedback required on test-cases.

Reviewers: dhruba, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7119
2012-12-07 11:42:13 -08:00
Dhruba Borthakur
c847a31727 Print compaction score for every compaction run.
Summary:
A compaction is picked based on its score. It is useful to
print the compaction score in the LOG because it aids in
debugging. If one looks at the logs, one can find out why
a compaction was preferred over another.

Test Plan: make clean check

Differential Revision: https://reviews.facebook.net/D7137
2012-12-04 10:03:47 -08:00
sheki
d4627e6de4 Move WAL files to archive directory, instead of deleting.
Summary:
Create a directory "archive" in the DB directory.
During DeleteObsolteFiles move the WAL files (*.log) to the Archive directory,
instead of deleting.

Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move.

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6975
2012-11-28 17:28:08 -08:00
Abhishek Kona
d29f181923 Fix all the lint errors.
Summary:
Scripted and removed all trailing spaces and converted all tabs to
spaces.

Also fixed other lint errors.
All lint errors from this point of time should be taken seriously.

Test Plan: make all check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7059
2012-11-28 17:18:41 -08:00
Dhruba Borthakur
9a357847eb Delete non-visible keys during a compaction even in the presense of snapshots.
Summary:
 LevelDB should delete almost-new keys when a long-open snapshot exists.
The previous behavior is to keep all versions that were created after the
oldest open snapshot. This can lead to database size bloat for
high-update workloads when there are long-open snapshots and long-open
snapshot will be used for logical backup. By "almost new" I mean that the
key was updated more than once after the oldest snapshot.

If there were two snapshots with seq numbers s1 and s2 (s1 < s2), and if
we find two instances of the same key k1 that lie entirely within s1 and
s2 (i.e. s1 < k1 < s2), then the earlier version
of k1 can be safely deleted because that version is not visible in any snapshot.

Test Plan:
unit test attached
make clean check

Differential Revision: https://reviews.facebook.net/D6999
2012-11-28 15:47:40 -08:00
Dhruba Borthakur
3366eda839 Print out status at the end of a compaction run.
Summary:
Print out status at the end of a compaction run. This helps in
debugging.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D7035
2012-11-27 22:17:38 -08:00
sheki
43f5a07989 Remove unused varibles. Cause compiler warnings.
Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: emayanke

Differential Revision: https://reviews.facebook.net/D6993
2012-11-26 20:55:24 -08:00
Dhruba Borthakur
2a39699900 Assertion failure while running with unit tests with OPT=-g
Summary:
When we expand the range of keys for a level 0 compaction, we
need to invoke ParentFilesInCompaction() only once for the
entire range of keys that is being compacted. We were invoking
it for each file that was being compacted, but this triggers
an assertion because each file's range were contiguous but
non-overlapping.

I renamed ParentFilesInCompaction to ParentRangeInCompaction
to adequately represent that it is the range-of-keys and
not individual files that we compact in a single compaction run.

Here is the assertion that is fixed by this patch.
db_test: db/version_set.cc:585: void leveldb::Version::ExtendOverlappingInputs(int, const leveldb::Slice&, const leveldb::Slice&, std::vector<leveldb::FileMetaData*, std::allocator<leveldb::FileMetaData*> >*, int): Assertion `user_cmp->Compare(flimit, user_begin) >= 0' failed.

Test Plan: make clean check OPT=-g

Reviewers: sheki

Reviewed By: sheki

CC: MarkCallaghan, emayanke, leveldb

Differential Revision: https://reviews.facebook.net/D6963
2012-11-26 14:00:39 -08:00
Dhruba Borthakur
e0cd6bf0e9 The c_test was sometimes failing with an assertion.
Summary:
On fast filesystems (e.g. /dev/shm and ext4), the flushing
of memstore to disk was fast and quick, and the background compaction
thread was not getting scheduled fast enough to delete obsolete
files before the db was closed. This caused the repair method
to pick up those files that were not part of the db and the unit
test was failing.

The fix is to enhance the unti test to run a compaction before
closing the database so that all files that are not part of the
database are truly deleted from the filesystem.

Test Plan: make c_test; ./c_test

Reviewers: chip, emayanke, sheki

Reviewed By: chip

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6915
2012-11-26 11:59:51 -08:00
Dhruba Borthakur
7632fdb5cb Support taking a configurable number of files from the same level to compact in a single compaction run.
Summary:
The compaction process takes some files from LevelK and
merges it into LevelK+1. The number of files it picks from
LevelK was capped such a way that the total amount of
data picked does not exceed the maxfilesize of that level.
This essentially meant that only one file from LevelK
is picked for a single compaction.

For bulkloads, we would like to take many many file from
LevelK and compact them using a single compaction run.

This patch introduces a option called the 'source_compaction_factor'
(similar to expanded_compaction_factor). It is a multiplier
that is multiplied by the maxfilesize of that level to arrive
at the limit that is used to throttle the number of source
files from LevelK.  For bulk loads, set source_compaction_factor
to a very high number so that multiple files from the same
level are picked for compaction in a single compaction.

The default value of source_compaction_factor is 1, so that
we can keep backward compatibilty with existing compaction semantics.

Test Plan: make clean check

Reviewers: emayanke, sheki

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6867
2012-11-21 08:37:03 -08:00
Dhruba Borthakur
fbb73a4ac3 Support to disable background compactions on a database.
Summary:
This option is needed for fast bulk uploads. The goal is to load
all the data into files in L0 without any interference from
background compactions.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6849
2012-11-20 21:12:06 -08:00
Dhruba Borthakur
3754f2f4ff A major bug that was not considering the compaction score of the n-1 level.
Summary:
The method Finalize() recomputes the compaction score of each
level and then sorts these score from largest to smallest. The
idea is that the level with the largest compaction score will
be a better candidate for compaction.  There are usually very
few levels, and a bubble sort code was used to sort these
compaction scores. There existed a bug in the sorting code that
skipped looking at the score for the n-1 level. This meant that
even if the compaction score of the n-1 level is large, it will
not be picked for compaction.

This patch fixes the bug and also introduces "asserts" in the
code to detect any possible inconsistencies caused by future bugs.

This bug existed in the very first code change that introduced
multi-threaded compaction to the leveldb code. That version of
code was committed on Oct 19th via
1ca0584345

Test Plan: make clean check OPT=-g

Reviewers: emayanke, sheki, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6837
2012-11-20 15:44:21 -08:00
Dhruba Borthakur
dde70898a1 Fix asserts
Summary:
make check OPT=-g fails with the following assert.
==== Test DBTest.ApproximateSizes
db_test: db/version_set.cc:765: void leveldb::VersionSet::Builder::CheckConsistencyForDeletes(leveldb::VersionEdit*, int, int): Assertion `found' failed.

The assertion was that file #7 that was being deleted did not
preexists, but actualy it did pre-exist as shown in the manifest
dump shows below. The bug was that we did not check for file
existance at the same level.

*************************Edit[0] = VersionEdit {
  Comparator: leveldb.BytewiseComparator
}

*************************Edit[1] = VersionEdit {
  LogNumber: 8
  PrevLogNumber: 0
  NextFile: 9
  LastSeq: 80
  AddFile: 0 7 8005319 'key000000' @ 1 : 1 .. 'key000079' @ 80 : 1
}

*************************Edit[2] = VersionEdit {
  LogNumber: 8
  PrevLogNumber: 0
  NextFile: 13
  LastSeq: 80
  CompactPointer: 0 'key000079' @ 80 : 1
  DeleteFile: 0 7
  AddFile: 1 9 2101425 'key000000' @ 1 : 1 .. 'key000020' @ 21 : 1
  AddFile: 1 10 2101425 'key000021' @ 22 : 1 .. 'key000041' @ 42 : 1
  AddFile: 1 11 2101425 'key000042' @ 43 : 1 .. 'key000062' @ 63 : 1
  AddFile: 1 12 1701165 'key000063' @ 64 : 1 .. 'key000079' @ 80 : 1
}

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2012-11-19 14:51:22 -08:00
Dhruba Borthakur
a4b79b6e28 Merge branch 'master' into performance 2012-11-19 13:20:25 -08:00
Dhruba Borthakur
74054fa993 Fix compilation error while compiling unit tests with OPT=-g
Summary:
Fix compilation error while compiling with OPT=-g

Test Plan:
make clean check OPT=-g

Reviewers:

CC:

Task ID: #

Blame Rev:
2012-11-19 13:16:46 -08:00
Dhruba Borthakur
48dafb2c59 Fix compilation error introduced by previous commit
7889e09455

Summary:
Fix compilation error introduced by previous commit
7889e09455

Test Plan:
make clean check
2012-11-19 12:16:45 -08:00
Dhruba Borthakur
7889e09455 Enhance manifest_dump to print each individual edit.
Summary:
The manifest file contains a series of edits. If the verbose
option is switched on, then print each individual edit in the
manifest file. This helps in debugging.

Test Plan: make clean manifest_dump

Reviewers: emayanke, sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6807
2012-11-19 12:04:35 -08:00
amayank
65b035a47f Fix a coding error in db_test.cc
Summary: The new function MinLevelToCompress in db_test.cc was incomplete. It needs to tell the calling function-TEST whether the test has to be skipped or not

Test Plan: make all;./db_test

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: sheki

Differential Revision: https://reviews.facebook.net/D6771
2012-11-19 12:04:35 -08:00
Dhruba Borthakur
4b622ab0f2 Enhance manifest_dump to print each individual edit.
Summary:
The manifest file contains a series of edits. If the verbose
option is switched on, then print each individual edit in the
manifest file. This helps in debugging.

Test Plan: make clean manifest_dump

Reviewers: emayanke, sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6807
2012-11-19 12:02:27 -08:00
Dhruba Borthakur
62e7583f94 enhance dbstress to simulate hard crash
Summary:
dbstress has an option to reopen the database. Make it such that the
previous handle is not closed before we reopen, this simulates a
situation similar to a process crash.

Added new api to DMImpl to remove the lock file.

Test Plan: run db_stress

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6777
2012-11-18 23:16:17 -08:00
amayank
de278a6de9 Fix a coding error in db_test.cc
Summary: The new function MinLevelToCompress in db_test.cc was incomplete. It needs to tell the calling function-TEST whether the test has to be skipped or not

Test Plan: make all;./db_test

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: sheki

Differential Revision: https://reviews.facebook.net/D6771
2012-11-16 14:56:50 -08:00
Dhruba Borthakur
6c5a4d646a Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2012-11-14 21:39:52 -08:00
Dhruba Borthakur
e988c11f58 Enhance db_bench to be able to specify a grandparent_overlap_factor.
Summary:
The value specified in max_grandparent_overlap_factor is used to
limit the file size in a compaction run. This patch makes it
configurable when using db_bench.

Test Plan: make clean db_bench

Reviewers: MarkCallaghan, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6729
2012-11-14 16:20:13 -08:00
Dhruba Borthakur
5d16e503a6 Improved CompactionFilter api: pass in a opaque argument to CompactionFilter invocation.
Summary:
There are applications that operate on multiple leveldb instances.
These applications will like to pass in an opaque type for each
leveldb instance and this type should be passed back to the application
with every invocation of the CompactionFilter api.

Test Plan: Enehanced unit test for opaque parameter to CompactionFilter.

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan, sheki, emayanke

Differential Revision: https://reviews.facebook.net/D6711
2012-11-13 16:22:26 -08:00
Dhruba Borthakur
43d9a8225a Fix asserts so that "make check OPT=-g" works on performance branch
Summary:
Compilation used to fail with the error:
db/version_set.cc:1773: error: ‘number_of_files_to_sort_’ is not a member of ‘leveldb::VersionSet’

I created a new method called CheckConsistencyForDeletes() so that
all the high cost checking is done only when OPT=-g is specified.

I also fixed a bug in PickCompactionBySize that was triggered when
OPT=-g was switched on. The base_index in the compaction record
was not set correctly.

Test Plan: make check OPT=-g

Differential Revision: https://reviews.facebook.net/D6687
2012-11-13 10:40:52 -08:00
Dhruba Borthakur
a785e029f7 The db_bench utility was broken in 1.5.4.fb because of a signed-unsigned comparision.
Summary:
The db_bench utility was broken in 1.5.4.fb because of a
signed-unsigned comparision.

The static variable FLAGS_min_level_to_compress was recently
changed from int to 'unsigned in' but it is initilized to a
nagative value -1.

The segfault is of this type:
Program received signal SIGSEGV, Segmentation fault.
Open (this=0x7fffffffdee0) at db/db_bench.cc:939
939	db/db_bench.cc: No such file or directory.
(gdb) where

Test Plan: run db_bench with no options.

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan, emayanke, sheki

Differential Revision: https://reviews.facebook.net/D6663
2012-11-12 13:59:35 -08:00
Dhruba Borthakur
9c6c232e47 Compilation error while compiling with OPT=-g
Summary:
make clean check OPT=-g fails
leveldb::DBStatistics::getTickerCount(leveldb::Tickers)’:
./db/db_statistics.h:34: error: ‘MAX_NO_TICKERS’ was not declared in this scope
util/ldb_cmd.cc:255: warning: left shift count >= width of type

Test Plan:
make clean check OPT=-g

Reviewers:

CC:

Task ID: #

Blame Rev:
2012-11-11 00:20:40 -08:00
Abhishek Kona
0f8e4721a5 Metrics: record compaction drop's and bloom filter effectiveness
Summary: Record BloomFliter hits and drop off reasons during compaction.

Test Plan: Unit tests work.

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6591
2012-11-09 11:38:45 -08:00
heyongqiang
20d18a89a3 disable size compaction in ldb reduce_levels and added compression and file size parameter to it
Summary:
disable size compaction in ldb reduce_levels, this will avoid compactions rather than the manual comapction,

added --compression=none|snappy|zlib|bzip2 and --file_size= per-file size to ldb reduce_levels command

Test Plan: run ldb

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: sheki, emayanke

Differential Revision: https://reviews.facebook.net/D6597
2012-11-09 10:14:47 -08:00
Abhishek Kona
391885c4e4 stat's collection in leveldb
Summary:
Prototype stat's collection. Diff is a good estimate of what
the final code will look like.
A few assumptions :
  * Used a global static instance of the statistics object. Plan to pass
  it to each internal function. Static allows metrics only at app
  level.
  * In the Ticker's do not do any locking. Depend on the mutex at each
   function of LevelDB. If we ever remove the mutex, we should change
   here too. The other option is use atomic objects anyways as there
   won't be any contention as they will be always acquired only by one
   thread.
  * The counters are dumb, increment through lifecycle. Plan to use ods
    etc to get last5min stat etc.

Test Plan:
made changes in db_bench
Ran ./db_bench --statistics=1 --num=10000 --cache_size=5000
This will print the cache hit/miss stats.

Reviewers: dhruba, heyongqiang

Differential Revision: https://reviews.facebook.net/D6441
2012-11-08 13:55:49 -08:00
Dhruba Borthakur
95dda37858 Move filesize-based-sorting to outside the Mutex
Summary:
When a new version is created, we sort all the files at every
level based on their size. This is necessary because we want
to compact the largest file first. The sorting takes quite a
bit of CPU.

Moved the sorting code to be outside the mutex. Also, the
earlier code was sorting files at all levels but we do not
need to sort the highest-number level because those files
are never the cause of any compaction. To reduce sorting
costs, we sort only the first few files in each level
because it is likely that those are the only files in that
level that will be picked for compaction.

At steady state, I have seen that this patch increase
throughout from 1500 writes/sec to 1700 writes/sec at the
end of a 72 hour run. The cpu saving by not sorting the
last level was not distinctive in this test run because
there were only 100K files in the highest numbered level.
I expect the cpu saving to be significant when the number of
files is much higher.

This is mostly an early preview and not ready for rigorous review.

With this patch, the writs/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs.

Test Plan: make check

Reviewers: MarkCallaghan, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D6411
2012-11-07 15:39:44 -08:00
Dhruba Borthakur
18cb6004d2 Fixed compilation error in previous merge.
Summary:
Fixed compilation error in previous merge.

Test Plan:

Reviewers:

CC:

Task ID: #

Blame Rev:
2012-11-07 15:24:47 -08:00
Dhruba Borthakur
8143062edd Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/version_set.cc
	util/options.cc
2012-11-07 15:11:37 -08:00
heyongqiang
3fcf533ed0 Add a readonly db
Summary: as subject

Test Plan: run db_bench readrandom

Reviewers: dhruba

Reviewed By: dhruba

CC: MarkCallaghan, emayanke, sheki

Differential Revision: https://reviews.facebook.net/D6495
2012-11-07 14:19:48 -08:00
Dhruba Borthakur
9b87a2bae8 Avoid doing a exhaustive search when looking for overlapping files.
Summary:
The Version::GetOverlappingInputs() is called multiple times in
the compaction code path. Eack invocation does a binary search
for overlapping files in the specified key range.
This patch remembers the offset of an overlapped file when
GetOverlappingInputs() is called the first time within
a compaction run. Suceeding calls to GetOverlappingInputs()
uses the remembered index to avoid the binary search.

I measured that 1000 iterations of GetOverlappingInputs
takes around 4500 microseconds without this patch. If I use
this patch with the hint on every invocation, then 1000
iterations take about 3900 microsecond.

Test Plan: make check OPT=-g

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan, emayanke, sheki

Differential Revision: https://reviews.facebook.net/D6513
2012-11-07 11:47:17 -08:00
Abhishek Kona
4e413df3d0 Flush Data at object destruction if disableWal is used.
Summary:
Added a conditional flush in ~DBImpl to flush.
There is still a chance of writes not being persisted if there is a
crash (not a clean shutdown) before the DBImpl instance is destroyed.

Test Plan: modified db_test to meet the new expectations.

Reviewers: dhruba, heyongqiang

Differential Revision: https://reviews.facebook.net/D6519
2012-11-06 15:04:42 -08:00
Dhruba Borthakur
aa42c66814 Fix all warnings generated by -Wall option to the compiler.
Summary:
The default compilation process now uses "-Wall" to compile.
Fix all compilation error generated by gcc.

Test Plan: make all check

Reviewers: heyongqiang, emayanke, sheki

Reviewed By: heyongqiang

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6525
2012-11-06 14:07:31 -08:00
Dhruba Borthakur
5f91868cee Merge branch 'master' into performance
Conflicts:
	db/version_set.cc
	util/options.cc
2012-11-05 16:51:55 -08:00
Dhruba Borthakur
cb7a00227f The method GetOverlappingInputs should use binary search.
Summary:
The method Version::GetOverlappingInputs used a sequential search
to map a kay-range to a set of files. But the files are arranged
in ascending order of key, so a biary search is more effective.

This patch implements Version::GetOverlappingInputsBinarySearch
that finds one file that corresponds to the specified key range
and then iterates backwards and forwards to find all overlapping
files.

This patch is critical for making compactions efficient, especially
when there are thousands of files in a single level.

I measured that 1000 iterations of TEST_MaxNextLevelOverlappingBytes
takes 16000 microseconds without this patch. With this patch, the
same method takes about 4600 microseconds.

Test Plan: Almost all unit tests in db_test uses this method to lookup keys.

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan, emayanke, sheki

Differential Revision: https://reviews.facebook.net/D6465
2012-11-05 16:08:01 -08:00
Dhruba Borthakur
5273c81483 Ability to invoke application hook for every key during compaction.
Summary:
There are certain use-cases where the application intends to
delete older keys aftre they have expired a certian time period.
One option for those applications is to periodically scan the
entire database and delete appropriate keys.

A better way is to allow the application to hook into the
compaction process. This patch allows the application to set
a method callback for every key that is being compacted. If
this method returns true, then the key is not preserved in
the output of the compaction.

Test Plan:
This is mostly to preview the proposed new public api.
Since it is a public api, please do due diligence on reviewing it.

I will be writing test cases for this api in mynext version of
this patch.

Reviewers: MarkCallaghan, heyongqiang

Reviewed By: heyongqiang

CC: sheki, adsharma

Differential Revision: https://reviews.facebook.net/D6285
2012-11-05 16:02:13 -08:00
heyongqiang
f1a7c735b5 fix complie error
Summary:

as subject

Test Plan:n/a
2012-11-05 10:30:19 -08:00
heyongqiang
d55c2ba305 Add a tool to change number of levels
Summary: as subject.

Test Plan: manually test it, will add a testcase

Reviewers: dhruba, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6345
2012-11-05 10:17:39 -08:00
Dhruba Borthakur
81f735d97c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	util/options.cc
2012-11-05 09:41:38 -08:00
Dhruba Borthakur
a1bd5b7752 Compilation problem introduced by previous
commit 854c66b089.

Summary:
Compilation problem introduced by previous
commit 854c66b089.

Test Plan:  make check
2012-11-04 22:04:14 -08:00
amayank
854c66b089 Make compression options configurable. These include window-bits, level and strategy for ZlibCompression
Summary: Leveldb currently uses windowBits=-14 while using zlib compression.(It was earlier 15). This makes the setting configurable. Related changes here: https://reviews.facebook.net/D6105

Test Plan: make all check

Reviewers: dhruba, MarkCallaghan, sheki, heyongqiang

Differential Revision: https://reviews.facebook.net/D6393
2012-11-02 11:26:39 -07:00
heyongqiang
3096fa7534 Add two more options: disable block cache and make table cache shard number configuable
Summary:

as subject

Test Plan:

run db_bench and db_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6111
2012-11-01 13:23:21 -07:00
Mark Callaghan
3e7e269292 Use timer to measure sleep rather than assume it is 1000 usecs
Summary:
This makes the stall timers in MakeRoomForWrite more accurate by timing
the sleeps. From looking at the logs the real sleep times are usually
about 2000 usecs each when SleepForMicros(1000) is called. The modified LOG messages are:
2012/10/29-12:06:33.271984 2b3cc872f700 delaying write 13 usecs for level0_slowdown_writes_trigger
2012/10/29-12:06:34.688939 2b3cc872f700 delaying write 1728 usecs for rate limits with max score 3.83

Task ID: #

Blame Rev:

Test Plan:
run db_bench, look at DB/LOG

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6297
2012-10-30 07:21:37 -07:00
heyongqiang
fb8d437325 fix test failure
Summary: as subject

Test Plan: db_test

Reviewers: dhruba, MarkCallaghan

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6309
2012-10-29 18:55:52 -07:00
heyongqiang
925f60d39d add a test case to make sure chaning num_levels will fail Summary:
Summary: as subject

Test Plan: db_test

Reviewers: dhruba, MarkCallaghan

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6303
2012-10-29 15:27:07 -07:00
Dhruba Borthakur
53e04311b1 Merge branch 'master' into performance
Conflicts:
	db/db_bench.cc
	util/options.cc
2012-10-29 14:18:00 -07:00
Dhruba Borthakur
321dfdc3ae Allow having different compression algorithms on different levels.
Summary:
The leveldb API is enhanced to support different compression algorithms at
different levels.

This adds the option min_level_to_compress to db_bench that specifies
the minimum level for which compression should be done when
compression is enabled. This can be used to disable compression for levels
0 and 1 which are likely to suffer from stalls because of the CPU load
for memtable flushes and (L0,L1) compaction.  Level 0 is special as it
gets frequent memtable flushes. Level 1 is special as it frequently
gets all:all file compactions between it and level 0. But all other levels
could be the same. For any level N where N > 1, the rate of sequential
IO for that level should be the same. The last level is the
exception because it might not be full and because files from it are
not read to compact with the next larger level.

The same amount of time will be spent doing compaction at any
level N excluding N=0, 1 or the last level. By this standard all
of those levels should use the same compression. The difference is that
the loss (using more disk space) from a faster compression algorithm
is less significant for N=2 than for N=3. So we might be willing to
trade disk space for faster write rates with no compression
for L0 and L1, snappy for L2, zlib for L3. Using a faster compression
algorithm for the mid levels also allows us to reclaim some cpu
without trading off much loss in disk space overhead.

Also note that little is to be gained by compressing levels 0 and 1. For
a 4-level tree they account for 10% of the data. For a 5-level tree they
account for 1% of the data.

With compression enabled:
* memtable flush rate is ~18MB/second
* (L0,L1) compaction rate is ~30MB/second

With compression enabled but min_level_to_compress=2
* memtable flush rate is ~320MB/second
* (L0,L1) compaction rate is ~560MB/second

This practicaly takes the same code from https://reviews.facebook.net/D6225
but makes the leveldb api more general purpose with a few additional
lines of code.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D6261
2012-10-29 11:48:09 -07:00
Mark Callaghan
acc8567b24 Add more rates to db_bench output
Summary:
Adds the "MB/sec in" and "MB/sec out" to this line:
Amplification: 1.7 rate, 0.01 GB in, 0.02 GB out, 8.24 MB/sec in, 13.75 MB/sec out

Changes all values to be reported per interval and since test start for this line:
... thread 0: (10000,60000) ops and (19155.6,27307.5) ops/second in (0.522041,2.197198) seconds

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6291
2012-10-29 11:30:07 -07:00
Dhruba Borthakur
de7689b1d7 Fix unit test failure caused by delaying deleting obsolete files.
Summary:
A previous commit 4c107587ed introduced
the idea that some version updates might not delete obsolete files.
This means that if a unit test blindly counts the number of files
in the db directory it might not represent the true state of the database.

Use GetLiveFiles() insteads to count the number of live files in the database.

Test Plan:
make check
2012-10-29 11:12:24 -07:00
Mark Callaghan
70c42bf05f Adds DB::GetNextCompaction and then uses that for rate limiting db_bench
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6243
2012-10-29 10:17:43 -07:00
Kai Liu
d50f8eb603 Enable LevelDb to create a new log file if current log file is too large.
Summary: Enable LevelDb to create a new log file if current log file is too large.

Test Plan:
Write a script and manually check the generated info LOG.

Task ID: 1803577

Blame Rev:

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: zshao

Differential Revision: https://reviews.facebook.net/D6003
2012-10-26 14:55:02 -07:00
Mark Callaghan
65855dd8d4 Normalize compaction stats by time in compaction
Summary:
I used server uptime to compute per-level IO throughput rates. I
intended to use time spent doing compaction at that level. This fixes that.

Task ID: #

Blame Rev:

Test Plan:
run db_bench, look at results

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6237
2012-10-26 14:19:13 -07:00
Dhruba Borthakur
ea9e087851 Merge branch 'master' into performance
Conflicts:
	db/db_bench.cc
	db/db_impl.cc
	db/db_test.cc
2012-10-26 08:57:56 -07:00
Dhruba Borthakur
8eedf13a82 Fix unit test failure caused by delaying deleting obsolete files.
Summary:
A previous commit 4c107587ed introduced
the idea that some version updates might not delete obsolete files.
This means that if a unit test blindly counts the number of files
in the db directory it might not represent the true state of the database.

Use GetLiveFiles() insteads to count the number of live files in the database.

Test Plan: make check

Reviewers: heyongqiang, MarkCallaghan

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6207
2012-10-26 08:42:05 -07:00
Dhruba Borthakur
5b0fe6c73b Greedy algorithm for picking files to compact.
Summary:
It is best if we pick the largest file to compact in a level.
This reduces the write amplification factor for compactions.
Each level has an auxiliary data structure called files_by_size_
that sorts all files by their size. This data structure is
updated when a new version is created.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D6195
2012-10-25 18:27:53 -07:00