Summary:
This is the first step to fix unit tests and bugs for universal
compactiion. I added universal compaction option to ChangeOptions(), and
fixed all unit tests calling ChangeOptions(). Some of these tests
obviously assume more than 1 level and check file number/values in level
1 or above levels. I set kSkipUniversalCompaction for these tests.
The major bug I found is manual compaction with universal compaction never stops. I have put a fix for
it.
I have also set universal compaction as the default compaction and found
at least 20+ unit tests failing. I haven't looked into the details. The
next step is to check all unit tests without calling ChangeOptions().
Test Plan: make all check
Reviewers: dhruba, haobo
Differential Revision: https://reviews.facebook.net/D12051
Summary: rocksdb replicaiton will need this when writing value+TS from master to slave 'as is'
Test Plan: make
Reviewers: dhruba, vamsi, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11919
Summary:
This diff adds support for both soft and hard rate limiting. The following changes are included:
1) Options.rate_limit is renamed to Options.hard_rate_limit.
2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
3) Options.soft_rate_limit is added.
4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.
Test Plan:
make -j32 check
./db_stress
Reviewers: dhruba, haobo, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12003
Summary:
Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size.
I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer.
Test Plan:
new unit test, I am running db_test.
For passing paramter from configured option to Arena, I tried tests like:
TEST(DBTest, Arena_Option) {
std::string dbname = test::TmpDir() + "/db_arena_option_test";
DestroyDB(dbname, Options());
DB* db = nullptr;
Options opts;
opts.create_if_missing = true;
opts.arena_block_size = 1000000; // tested 99, 999999
Status s = DB::Open(opts, dbname, &db);
db->Put(WriteOptions(), "a", "123");
}
and printed some debug info. The results look good. Any suggestion for such a unit-test?
Reviewers: haobo, dhruba, emayanke, jpaton
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D11799
Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations.
Test Plan:
make clean
make -j32 check
./db_stress
Reviewers: dhruba, emayanke, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11739
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.
Test Plan: db_stress;make check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D11745
Summary:
Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
1. Put of delete type
2. Space in the db,and
3. Compaction time
Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved
Reviewers: dhruba, haobo, vamsi
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11607
Summary: This diff added a command 'idump' to ldb tool, which dumps the internal key/value pairs. It could be useful for diagnosis and estimating the per user key 'overhead'. Also cleaned up the ldb code a bit where I touched.
Test Plan: make check; ldb idump
Reviewers: emayanke, sheki, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11517
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions. Files from L0 are
compacted back into L0. This meat of this compaction algorithm
is in PickCompactionHybrid().
All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted. Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.
The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.
The above logic should reduce write amplification to a
large extent... will publish numbers shortly.
Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
Differential Revision: https://reviews.facebook.net/D11289
Summary: [start_time, end_time) is waht I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff.
Test Plan: python ldb_test.py, make ttl_test, Run mcrocksdb-backup tool, Run the ldb tool on 2 mcrocksdb production backups form sigmafio033.prn1
Reviewers: vamsi, haobo
Reviewed By: vamsi
Differential Revision: https://reviews.facebook.net/D11433
Summary:
Scan and Dump commands in ldb use iterator. We need to also print timestamp for ttl databases for debugging. For this I create a TtlIterator class pointer in these functions and assign it the value of Iterator pointer which actually points to t TtlIterator object, and access the new function ValueWithTS which can return TS also. Buckets feature for dump command: gives a count of different key-values in the specified time-range distributed across the time-range partitioned according to bucket-size. start_time and end_time are specified in unixtimestamp and bucket in seconds on the user-commandline
Have commented out 3 ines from ldb_test.py so that the test does not break right now. It breaks because timestamp is also printed now and I have to look at wildcards in python to compare properly.
Test Plan: python tools/ldb_test.py
Reviewers: vamsi, dhruba, haobo, sheki
Reviewed By: vamsi
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11403
Summary: as title, also removed an incorrect assertion
Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0
Reviewers: dhruba, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11367
Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise.
Test Plan: make check; db_bench
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11295
Summary:
Merge multiple multiple memtables in memory before writing it
out to a file in L0.
There is a new config parameter min_write_buffer_number_to_merge
that specifies the number of write buffers that should be merged
together to a single file in storage. The system will not flush
wrte buffers to storage unless at least these many buffers have
accumulated in memory.
The default value of this new parameter is 1, which means that
a write buffer will be immediately flushed to disk as soon it is
ready.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D11241
Summary:
Use a bit set to keep track of which random number is generated.
Currently only supports single-threaded. All our perf tests are run with threads=1
Copied over bitset implementation from common/datastructures
Test Plan: printed the generated keys, and verified all keys were present.
Reviewers: MarkCallaghan, haobo, dhruba
Reviewed By: MarkCallaghan
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11247
Summary:
During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
This diff simply asks the OS to sync data incrementally as they are written, on the background. The hope is that, at the final sync, most of the data are already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer number of compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
For the write only test 800bytes test:
Before the change: await occillate between 10ms and 3ms
After the change: await ranges 1-3ms
Will test against read-modify-write workload too, see if high read latency P99 could be resolved.
Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11115
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
Test Plan: make check; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11175
Summary:
This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now.
Quickly tested 8 thread cpu bound 100 byte random read.
No fast mutex: ~750k/s ops
With fast mutex: ~880k/s ops
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D11031
Summary:
Current posix advice implementation ties up the access pattern hint with the creation of a file.
It is not possible to apply different advice for different access (random get vs compaction read),
without keeping two open files for the same table. This patch extended the RandomeAccessFile interface
to accept new access hint at anytime. Particularly, we are able to set different access hint on the same
table file based on when/how the file is used.
Two options are added to set the access hint, after the file is first opened and after the file is being
compacted.
Test Plan: make check; db_stress; db_bench
Reviewers: dhruba
Reviewed By: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D10905
Summary: a new option block_size_deviation is added.
Test Plan: run db_test and db_bench
Reviewers: dhruba, haobo
Reviewed By: haobo
Differential Revision: https://reviews.facebook.net/D10821
Summary: a new option block_size_deviation is added.
Test Plan: run db_test and db_bench
Reviewers: dhruba, haobo
Reviewed By: haobo
Differential Revision: https://reviews.facebook.net/D10821
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By defauly, it's set to a very big number 3600 (1 hour).
Test Plan: make check;
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10761
Summary:
The valgrind errors were in the unit tests where we change the
number of levels of a database using internal methods.
Test Plan:
valgrind ./reduce_levels_test
valgrind ./db_test
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10893
Summary:
This is initial version. A few ways in which this could
be extended in the future are:
(a) Killing from more places in source code
(b) Hashing stack and using that hash in determining whether to crash.
This is to avoid crashing more often at source lines that are executed
more often.
(c) Raising exceptions or returning errors instead of killing
Test Plan:
This whole thing is for testing.
Here is part of output:
python2.7 tools/db_crashtest2.py -d 600
Running db_stress
db_stress retncode -15 output LevelDB version : 1.5
Number of threads : 32
Ops per thread : 10000000
Read percentage : 50
Write-buffer-size : 4194304
Delete percentage : 30
Max key : 1000
Ratio #ops/#keys : 320000
Num times DB reopens: 0
Batches/snapshots : 1
Purge redundant % : 50
Num keys per lock : 4
Compression : snappy
------------------------------------------------
No lock creation because test_batches_snapshots set
2013/04/26-17:55:17 Starting database operations
Created bg thread 0x7fc1f07ff700
... finished 60000 ops
Running db_stress
db_stress retncode -15 output LevelDB version : 1.5
Number of threads : 32
Ops per thread : 10000000
Read percentage : 50
Write-buffer-size : 4194304
Delete percentage : 30
Max key : 1000
Ratio #ops/#keys : 320000
Num times DB reopens: 0
Batches/snapshots : 1
Purge redundant % : 50
Num keys per lock : 4
Compression : snappy
------------------------------------------------
Created bg thread 0x7ff0137ff700
No lock creation because test_batches_snapshots set
2013/04/26-17:56:15 Starting database operations
... finished 90000 ops
Revert Plan: OK
Task ID: #2252691
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D10581
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If client is ok with losing most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.
Test Plan: make check; db_stress
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, emayanke
Differential Revision: https://reviews.facebook.net/D10869
Summary:
There is an existing field Options.max_bytes_for_level_multiplier that
sets the multiplier for the size of each level in the database.
This patch introduces the ability to set different multipliers
for every level in the database. The size of a level is determined
by using both max_bytes_for_level_multiplier as well as the
per-level fanout.
size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier
* fanout[i-1]
The default value of fanout is 1, so that it is backward compatible.
Test Plan: make check
Reviewers: haobo, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10863
Summary:
PosixLogger and AutoRollLogger do not seem to be thread safe.
For PosixLogger, log_size_ is not atomically updated.
For AutoRollLogger, the underlying logger_ might be deleted by
one thread while still being accessed by another.
Test Plan: make check
Reviewers: kailiu, dhruba, heyongqiang
Reviewed By: kailiu
CC: leveldb, zshao, sheki
Differential Revision: https://reviews.facebook.net/D9699
Summary:
Make stop watch a simple implementation, instead of subclass of a virtual class
Allocate stop watches off the stack instead of heap.
Code is more terse now.
Test Plan: make all check, db_bench with --statistics=1
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10809
Summary: Statistics.h and histogram.h had double based api's to record values. Remove them as they are not used anywhere
Test Plan: make all check
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10815
Summary: ldb works with raw data from the database and needs to be aware of ttl-database to work with it meaningfully. '-ttl' option now tells it that. Also added onto the ldb_test.py test. This option may be specified alongwith put, get, scan or dump. There is no support to provide a ttl-value and it uses default forever because there is no use-case for this currently.
Test Plan: make ldb_test; python tools/ldb_test.py
Reviewers: dhruba, sheki, haobo, vamsi
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10797
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOpertor, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10773
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.
Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h
Please go over local/my_test.cc carefully, as it is a concerete use case.
Please also review the impelmentation files to see if the straw man implementation makes sense.
Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.
Future work:
- Integration with compaction
- A raw Get implementation
I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.
Test Plan: run all local tests
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb, zshao, sheki, emayanke, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D9651
Summary:
Mark's task description from #2316777
Env::Default() comes from util/env_posix.cc
This is a static global.
static PosixEnv default_env;
Env* Env::Default() {
return &default_env;
}
-----
These globals assume default_env was initialized first. I don't think that is safe or correct to do (http://stackoverflow.com/questions/1005685/c-static-initialization-order)
const string AutoRollLoggerTest::kTestDir(
test::TmpDir() + "/db_log_test");
const string AutoRollLoggerTest::kLogFile(
test::TmpDir() + "/db_log_test/LOG");
Env* AutoRollLoggerTest::env = Env::Default();
Test Plan:
run make clean && make && make check
But how can I know if it works in Ubuntu?
Reviewers: MarkCallaghan, chip
Reviewed By: chip
CC: leveldb, dhruba, haobo
Differential Revision: https://reviews.facebook.net/D10491
Summary:
RocksDB doesn't build on Ubuntu VM .. shoudl be fixed with this patch.
g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
util/env_posix.cc:68:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:68:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer’
util/env_posix.cc:113:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:113:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer
Test Plan: make check
Reviewers: sheki, leveldb
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D10461
Summary:
Adds the --writes_per_second rate limit for the readwhilewriting test.
The purpose is to optionally avoid saturating storage with writes & compaction
and test read response time when some writes are being done.
Changes the histogram code to also print the p99.99 value
Task ID: #
Blame Rev:
Test Plan:
make check, ran db_bench with it
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10305
Summary: forgot to include signal_test.cc
Test Plan: make check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10281
Summary:
This diff provides the ability to print out a stacktrace when the process receives certain signals.
Currently, we enable this for the following signals (program error related):
SIGILL SIGSEGV SIGBUS SIGABRT
Application simply #include "util/stack_trace.h" and call leveldb::InstallStackTraceHandler() during initialization, if signal handler is needed. It's not done automatically when openning db, because it's the application(process)'s responsibility to install signal handler and some applications might already have their own (like fbcode).
Sample output:
Received signal 11 (Segmentation fault)
#0 0x408ff0 ./signal_test() [0x408ff0] /home/haobo/rocksdb/util/signal_test.cc:4
#1 0x40827d ./signal_test() [0x40827d] /home/haobo/rocksdb/util/signal_test.cc:24
#2 0x7f8bb183172e /usr/local/fbcode/gcc-4.7.1-glibc-2.14.1/lib/libc.so.6(__libc_start_main+0x10e) [0x7f8bb183172e] ??:0
#3 0x408ebc ./signal_test() [0x408ebc] /home/engshare/third-party/src/glibc/glibc-2.14.1/glibc-2.14.1/csu/../sysdeps/x86_64/elf/start.S:113
Segmentation fault (core dumped)
For each frame, we print the raw pointer, the symbol provided by backtrace_symbols (still not good enough), and the source file/line. Note that address translation is done by directly shell out to addr2line. ??:0 means addr2line fails to do the translation. Hacky, but I think it's good for now.
Test Plan: signal_test.cc
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10173
Summary: As title. Found out this when testing stack_trace.cc portability.
Test Plan: make check; manual test 'non-linux' build by forcing OS_LINUX2
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10263
Summary: Primarily a refactor. Introduced LDBTool interface to which customers can plug in their options and this will create their own version of ldb tool.
Test Plan: made ldb tool and tried it.
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10191
Summary:
The background compaction threads are never exitted and therefore caused
memory-leaks while running rpcksdb tests. Have changed the PosixEnv destructor to exit and join them and changed the tests likewise
The memory leaked has reduced from 320 bytes to 64 bytes in all the tests. The 64
bytes is relating to
pthread_exit, but still have to figure out why. The stack-trace right now with
table_test.cc = 64 bytes in 1 blocks are possibly lost in loss record 4 of 5
at 0x475D8C: malloc (jemalloc.c:914)
by 0x400D69E: _dl_map_object_deps (dl-deps.c:505)
by 0x4013393: dl_open_worker (dl-open.c:263)
by 0x400F015: _dl_catch_error (dl-error.c:178)
by 0x4013B2B: _dl_open (dl-open.c:569)
by 0x5D3E913: do_dlopen (dl-libc.c:86)
by 0x400F015: _dl_catch_error (dl-error.c:178)
by 0x5D3E9D6: __libc_dlopen_mode (dl-libc.c:47)
by 0x5048BF3: pthread_cancel_init (unwind-forcedunwind.c:53)
by 0x5048DC9: _Unwind_ForcedUnwind (unwind-forcedunwind.c:126)
by 0x5046D9F: __pthread_unwind (unwind.c:130)
by 0x50413A4: pthread_exit (pthreadP.h:289)
Test Plan: make all check
Reviewers: dhruba, sheki, haobo
Reviewed By: dhruba
CC: leveldb, chip
Differential Revision: https://reviews.facebook.net/D9573
Summary: as subject. This is causing problem in adsconv. Ideally, this flags should be set in open. But that is only supported in Linux kernel ≥2.6.23 and glibc ≥2.7.
Test Plan:
db_test
run db_test
Reviewers: dhruba, MarkCallaghan, haobo
Reviewed By: dhruba
CC: leveldb, chip
Differential Revision: https://reviews.facebook.net/D10089
Summary:
1. The stock LRUCache nukes itself whenever the working set (the total number of entries not released by client at a certain time) is bigger than the cache capacity.
See https://our.dev.facebook.com/intern/tasks/?t=2252281
2. There's a bug in shard calculation leading to segmentation fault when only one shard is needed.
Test Plan: make check
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb, zshao, sheki
Differential Revision: https://reviews.facebook.net/D9927
Summary:
1. SetBackgroundThreads was not thread safe
2. queue_size_ does not seem necessary
3. moved condition signal after shared state change. Even though the original
order is in practice ok (because the mutex is still held), it looks fishy
and non-intuitive.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D9825
Summary:
Use non mmapd files for Write-Ahead log.
Earlier use of MMaped files. made the log iterator read ahead and miss records.
Now the reader and writer will point to the same physical location.
There is no perf regression :
./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
with This diff :
fillseq : 10.756 micros/op 185281 ops/sec; 20.5 MB/s
without this dif :
fillseq : 11.085 micros/op 179676 ops/sec; 19.9 MB/s
Test Plan: unit test included
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9741
Summary:
Earlier Statistics object was a raw pointer. This meant the user had to clear up
the Statistics object after creating the database. In most use cases the database is created in a function and the statistics pointer is out of scope. Hence the statistics object would never be deleted.
Now Using a shared_ptr to manage this.
Want this in before the next release.
Test Plan: make all check.
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9735
Summary: This caused compilation problems on some gcc platforms during the third-partyrelease
Test Plan: make
Reviewers: sheki
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9627
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.
The default setting remains the same (and is backward compatible):
1. use bufferedio
2. do not use mmaps for reads
3. use mmap for writes
4. use readaheads for reads needed for compaction
I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.
Test Plan: make check
Reviewers: sheki, heyongqiang, MarkCallaghan
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9429
Summary: Getting rid of boost in our github codebase which caused problems on third-party
Test Plan: make ldb; python tools/ldb_test.py
Reviewers: sheki, dhruba
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9543
Summary: Was causing error(warning) in third-party saying unused result
Test Plan: make
Reviewers: sheki, dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D9447
Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base
Test Plan: make
Reviewers: chip, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9531
Summary: negation of the condition checked currently had to be checkd actually
Test Plan: make ldb; python ldb_test.py
Reviewers: sheki, dhruba
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9459
Summary: boost functions cause complications while deploying to third-party
Test Plan: make
Reviewers: sheki, dhruba
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9441
Summary:
Ftruncate does not throw an error on disk-full. This causes Sig-bus in
the case where the database tries to issue a Put call on a full-disk.
Use posix_fallocate for allocation instead of truncate.
Add a check to use MMaped files only on ext4, xfs and tempfs, as
posix_fallocate is very slow on ext3 and older.
Test Plan: make all check
Reviewers: dhruba, chip
Reviewed By: dhruba
CC: adsharma, leveldb
Differential Revision: https://reviews.facebook.net/D9291
Summary: Fix for memory leaks in rocksdb tests. Also modified the variable NUM_FAILED_TESTS to print the actual number of failed tests.
Test Plan: make <test>; valgrind --leak-check=full ./<test>
Reviewers: sheki, dhruba
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9333
Summary:
1. Create only 2 levels so that manual compactions are fast.
2. Set target file size to a large value
Test Plan: make clean check
Reviewers: kailiu, zshao
Reviewed By: zshao
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9231
Summary:
Add a shortcut function to make it easier for people
to efficiently bulk_load data into RocksDB.
Test Plan:
Tried ldb with "--bulk_load" and "--bulk_load --compact" and verified the outcome.
Needs to consult the team on how to test this automatically.
Reviewers: sheki, dhruba, emayanke, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8907
Summary:
This adds the rate_delay_limit_milliseconds option to make the delay
configurable in MakeRoomForWrite when the max compaction score is too high.
This delay is called the Ln slowdown. This change also counts the Ln slowdown
per level to make it possible to see where the stalls occur.
From IO-bound performance testing, the Level N stalls occur:
* with compression -> at the largest uncompressed level. This makes sense
because compaction for compressed levels is much
slower. When Lx is uncompressed and Lx+1 is compressed
then files pile up at Lx because the (Lx,Lx+1)->Lx+1
compaction process is the first to be slowed by
compression.
* without compression -> at level 1
Task ID: #1832108
Blame Rev:
Test Plan:
run with real data, added test
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D9045
Summary:
Rocks accumulates recent writes and deletes in the in-memory memtable.
When the memtable is full, it writes the contents on the memtable to
a file in L0.
This patch removes redundant records at the time of the flush. If there
are multiple versions of the same key in the memtable, then only the
most recent one is dumped into the output file. The purging of
redundant records occur only if the most recent snapshot is earlier
than the earliest record in the memtable.
Should we switch on this feature by default or should we keep this feature
turned off in the default settings?
Test Plan: Added test case to db_test.cc
Reviewers: sheki, vamsi, emayanke, heyongqiang
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8991
Summary: LDB tool to print the deleted/put keys in hex in the wal file.
Test Plan: run ldb on a db to check if output was satisfactory
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8691
Summary:
Changed the Get and Scan options with openForReadOnly mode to have access to the memtable.
Changed the visibility of NewInternalIterator in db_impl from private to protected so that
the derived class db_impl_read_only can call that in its NewIterator function for the
scan case. The previous approach which changed the default for flush_on_destroy_ from false to true
caused many problems in the unit tests due to empty sst files that it created. All
unit tests pass now.
Test Plan: make clean; make all check; ldb put and get and scans
Reviewers: dhruba, heyongqiang, sheki
Reviewed By: dhruba
CC: kosievdmerwe, zshao, dilipj, kailiu
Differential Revision: https://reviews.facebook.net/D8697
Summary:
* Introduce is histogram in statistics.h
* stop watch to measure time.
* introduce two timers as a poc.
Replaced NULL with nullptr to fight some lint errors
Should be useful for google.
Test Plan:
ran db_bench and check stats.
make all check
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8637
Summary:
I missed InitTestDb() in one of my tess. InitTestDb() initializes the test directory, without which the test will throw IO error.
This problem didn't occur before because I've already run the tests before so the test directory is already there.
Test Plan:
Reviewers: dhruba
CC:
Task ID: #
Blame Rev:
Summary:
flush_on_destroy has a default value of false and the memtable is flushed
in the dbimpl-destructor only when that is set to true. Because we want the memtable to be flushed everytime that
the destructor is called(db is closed) and the cases where we work with the memtable only are very less
it is a good idea to give this a default value of true. Thus the put from ldb
wil have its data flushed to disk in the destructor and the next Get will be able to
read it when opened with OpenForReadOnly. The reason that ldb could read the latest value when
the db was opened in the normal Open mode is that the Get from normal Open first reads
the memtable and directly finds the latest value written there and the Get from OpenForReadOnly
doesn't have access to the memtable (which is correct because all its Put/Modify) are disabled
Test Plan: make all; ldb put and get and scans
Reviewers: dhruba, heyongqiang, sheki
Reviewed By: heyongqiang
CC: kosievdmerwe, zshao, dilipj, kailiu
Differential Revision: https://reviews.facebook.net/D8631
Summary: Fix the warning [-Werror=format-security] and [-Werror=unused-result].
Test Plan:
enforced the Werror and run make
Task ID: 2101673
Blame Rev:
Reviewers: heyongqiang
Differential Revision: https://reviews.facebook.net/D8553
Summary:
$SUBJECT -- cosmetic fix for histograms, print P75/P99, and
make sure zlib is enabled for our command line tools.
Test Plan: compile, test db_bench with --compression_type=zlib
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: adsharma, leveldb
Differential Revision: https://reviews.facebook.net/D8445
Summary:
* Add a SplitByTTLLogger to enable this feature. In this diff I implemented generalized AutoSplitLoggerBase class to simplify the
development of such classes.
* Refactor the existing AutoSplitLogger and fix several bugs.
Test Plan:
* Added a unit tests for different types of "auto splitable" loggers individually.
* Tested the composited logger which allows the log files to be splitted by both TTL and log size.
Reviewers: heyongqiang, dhruba
Reviewed By: heyongqiang
CC: zshao, leveldb
Differential Revision: https://reviews.facebook.net/D8037
Summary:
The existing code did not initialize a few doubles in histogram.cc.
Cropped up when I wrote a unit-test.
Test Plan: make all check
Reviewers: chip
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8319
Summary:
Earlier way to record in histogram=>
Linear search BucketLimit array to find the bucket and increment the
counter
Current way to record in histogram=>
Store a HistMap statically which points the buckets of each value in the
range [kFirstValue, kLastValue);
In the proccess use vectors instead of array's and refactor some code to
HistogramHelper class.
Test Plan:
run db_bench with histogram=1 and see a histogram being
printed.
Reviewers: dhruba, chip, heyongqiang
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8265
Summary:
Added function to `RandomAccessFile` to generate an unique ID for that file. Currently only `PosixRandomAccessFile` has this behaviour implemented and only on Linux.
Changed how key is generated in `Table::BlockReader`.
Added tests to check whether the unique ID is stable, unique and not a prefix of another unique ID. Added tests to see that `Table` uses the cache more efficiently.
Test Plan: make check
Reviewers: chip, vamsi, dhruba
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8145
Summary: fallocate is linux only, so let's protect it with ifdef's
Test Plan: make
Reviewers: sheki, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8223
Summary:
Previously, if you opened a db with num_levels set lower than
the database, you received the unhelpful message "Corruption:
VersionEdit: new-file entry." Now you get a more verbose message
describing the issue.
Also, fix handling of compression_levels (both the run-over-the-end
issue and the memory management of it).
Lastly, unique_ptr'ify a couple of minor calls.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8151
Summary:
We continually rebuilt build_version.c because we put the
current date into it, but that's what __DATE__ already is. This makes
builds faster.
This also fixes an issue with 'make clean FOO' not working properly.
Also tweak the build rules to be more consistent, always have warnings,
and add a 'make release' rule to handle flags for release builds.
Test Plan: make, make clean
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D8139
Summary:
On some filesystems, pre-allocation can be a considerable
amount of space. xfs in our production environment pre-allocates by
1GB, for instance. By using fallocate to inform the kernel of our
expected file sizes, we eliminate this wasteage (that isn't recovered
until the file is closed which, in the case of LOG files, can be a
considerable amount of time).
Test Plan:
created an xfs loopback filesystem, mounted with
allocsize=4M, and ran db_stress. LOG file without this change was 4M,
and with it it was 128k then grew to normal size.
Reviewers: dhruba
Reviewed By: dhruba
CC: adsharma, leveldb
Differential Revision: https://reviews.facebook.net/D7953
Summary:
Replace manual memory management with std::unique_ptr in a
number of places; not exhaustive, but this fixes a few leaks with file
handles as well as clarifies semantics of the ownership of file handles
with log classes.
Test Plan: db_stress, make check
Reviewers: dhruba
Reviewed By: dhruba
CC: zshao, leveldb, heyongqiang
Differential Revision: https://reviews.facebook.net/D8043
Summary:
Check in LogAndApply if the file size is more than the limit set in
Options.
Things to consider : will this be expensive?
Test Plan: make all check. Inputs on a new unit test?
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7701
Summary:
clang is an alternate compiler based on llvm. It produces
nicer error messages and finds some bugs that gcc doesn't, such as the
size_t change in this file (which caused some write return values to be
misinterpreted!)
Clang isn't the default; to try it, do "USE_CLANG=1 make" or "export
USE_CLANG=1" then make as normal
Test Plan: "make check" and "USE_CLANG=1 make check"
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D7899
Summary: `~ShardedLRUCache()` was empty despite `init()` allocating memory on the heap. Fixed the leak by freeing memory allocated by `init()`.
Test Plan:
make check
Ran valgrind on db_test before and after patch and saw leaked memory went down
Reviewers: vamsi, dhruba, emayanke, sheki
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7791
Summary:
Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes and error.
Fixed CreateDirIfMissing() and added Env.DirExists()
Test Plan:
make check to test for regessions
Ran the following to test if the error message is not about lock files not existing
./db_bench --db=dir/testdb
After creating a file "testdb", ran the following to see if it failed with sane error message:
./db_bench --db=testdb
Reviewers: dhruba, emayanke, vamsi, sheki
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7707
Summary:
There is a compilation error while using gcc 4.7.1.
util/ldb_cmd.cc:381:3: error: ‘leveldb::ReadOptions::ReadOptions’ names the constructor, not the type
util/ldb_cmd.cc:381:37: error: expected ‘;’ before ‘read_options’
util/ldb_cmd.cc:381:49: error: statement cannot resolve address of overloaded function
Test Plan: make clean check
Reviewers: sheki, emayanke, zshao
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7659
Summary: The queries will come from stdin. One key per line. The output will be in stdout, in the format of "<key> ==> <value>" if found, or "<key>" if not found. "--hex" uses HEX-encoded keys and values in both input and output.
Test Plan: ldb query --db=leveldb_db --hex
Reviewers: dhruba, emayanke, sheki
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7617
Summary: We were ignoring additional chars at the end of an arg. This can create confusion, e.g. --disable_wal=0 will act the same as --disable_wal without any warnings.
Test Plan:
Tried this:
[zshao@dev485 ~/git/rocksdb] ./ldb dump --statsAAA
Failed: Unknown argument:--statsAAA
Reviewers: dhruba, sheki, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7635
Summary:
This allows ldb to control the write_buffer_size (which reflects to L0 file size) and file_size (which reflects to L1 file size). Since the target_file_size_ratio is 1 by default, all other levels will also have the same file size as L1.
As part of the diff, I also cleaned up some unused code and help messages.
Test Plan: ./ldb load --db=/data/users/zshao/test_leveldb --file_size=64000000 --write_buffer_size=32000000 --create_if_missing --input_hex --disable_wal
Reviewers: dhruba, sheki, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7569
Summary: This command accepts key-value pairs from stdin with the same format of "ldb dump" command. This allows us to try out different compression algorithms/block sizes easily.
Test Plan: dump, load, dump, verify the data is the same.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7443
Summary: The old code was omitting the 0 if the char is less than 16.
Test Plan:
Tried the following program:
int main() {
unsigned char c = 1;
printf("%X\n", c);
printf("%02X\n", c);
return 0;
}
The output is:
1
01
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7437
Summary: This allows us to use ldb to do more experiments like block_size changes.
Test Plan: run it by hand.
Reviewers: dhruba, sheki, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7431
Summary:
Suppose you submit 100 background tasks one after another. The first
enqueu task finds that the queue is empty and wakes up one worker thread.
Now suppose that all remaining 99 work items are enqueued, they do not
wake up any worker threads because the queue is already non-empty.
This causes a situation when there are 99 tasks in the task queue but
only one worker thread is processing a task while the remaining
worker threads are waiting.
The fix is to always wakeup one worker thread while enqueuing a task.
I also added a check to count the number of elements in the queue
to help in debugging.
Test Plan: make clean check.
Reviewers: chip
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7203
Summary:
Added the following two options:
[--bloom_bits=<int,e.g.:14>]
[--compression_type=<no|snappy|zlib|bzip2>]
These options will be used when ldb opens the leveldb database.
Test Plan: Tried by hand for both success and failure cases. We do need a test framework.
Reviewers: dhruba, emayanke, sheki
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7197
Summary: Added 1 to indices where I shouldn't have so overrun array.
Test Plan: make check
Reviewers: sheki, emayanke, vamsi, dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D7227
Summary: Added BitStreamPutInt() and BitStreamGetInt() which take a stream of chars and can write integers of arbitrary bit sizes to that stream at arbitrary positions. There are also convenience versions of these functions that take std::strings and leveldb::Slices.
Test Plan: make check
Reviewers: sheki, vamsi, dhruba, emayanke
Reviewed By: vamsi
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7071
Summary:
Create a directory "archive" in the DB directory.
During DeleteObsolteFiles move the WAL files (*.log) to the Archive directory,
instead of deleting.
Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move.
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6975
Summary:
Scripted and removed all trailing spaces and converted all tabs to
spaces.
Also fixed other lint errors.
All lint errors from this point of time should be taken seriously.
Test Plan: make all check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7059
Summary:
It would appear our unit tests make use of code from ldb_cmd,
and don't always require a valid database handle. D6855 was not aware
db_ could sometimes be NULL for such commands, and so it broke
reduce_levels_test.
This moves the check elsewhere to (at least) fix the 'ldb dump' case of
segfaulting when it couldn't open a database.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6903
Summary:
Link statically against snappy, using the gvfs one for facebook
environments, and the bundled one otherwise.
In addition, fix a few minor segfaults in ldb when it couldn't open the
database, and update .gitignore to include a few other build artifacts.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6855
Summary:
The compaction process takes some files from LevelK and
merges it into LevelK+1. The number of files it picks from
LevelK was capped such a way that the total amount of
data picked does not exceed the maxfilesize of that level.
This essentially meant that only one file from LevelK
is picked for a single compaction.
For bulkloads, we would like to take many many file from
LevelK and compact them using a single compaction run.
This patch introduces a option called the 'source_compaction_factor'
(similar to expanded_compaction_factor). It is a multiplier
that is multiplied by the maxfilesize of that level to arrive
at the limit that is used to throttle the number of source
files from LevelK. For bulk loads, set source_compaction_factor
to a very high number so that multiple files from the same
level are picked for compaction in a single compaction.
The default value of source_compaction_factor is 1, so that
we can keep backward compatibilty with existing compaction semantics.
Test Plan: make clean check
Reviewers: emayanke, sheki
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6867
Summary:
This option is needed for fast bulk uploads. The goal is to load
all the data into files in L0 without any interference from
background compactions.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6849
Summary:
StringStream.clear() does not clear the stream. It sets some flags.
Who knew? Fixing that is not printing the stuff again and again.
Test Plan: ran it on a local db
Reviewers: dhruba, emayanke
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6795
Summary:
There are applications that operate on multiple leveldb instances.
These applications will like to pass in an opaque type for each
leveldb instance and this type should be passed back to the application
with every invocation of the CompactionFilter api.
Test Plan: Enehanced unit test for opaque parameter to CompactionFilter.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, sheki, emayanke
Differential Revision: https://reviews.facebook.net/D6711
Summary:
I changed the reduce_num_levels logic to avoid "compactRange()" call if the current number of levels in use (levels that contain files) is smaller than the new num of levels.
And that change breaks the assert in reduce_levels_test
Test Plan: run reduce_levels_test
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6651
Summary:
make clean check OPT=-g fails
leveldb::DBStatistics::getTickerCount(leveldb::Tickers)’:
./db/db_statistics.h:34: error: ‘MAX_NO_TICKERS’ was not declared in this scope
util/ldb_cmd.cc:255: warning: left shift count >= width of type
Test Plan:
make clean check OPT=-g
Reviewers:
CC:
Task ID: #
Blame Rev:
Summary:
disable size compaction in ldb reduce_levels, this will avoid compactions rather than the manual comapction,
added --compression=none|snappy|zlib|bzip2 and --file_size= per-file size to ldb reduce_levels command
Test Plan: run ldb
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: sheki, emayanke
Differential Revision: https://reviews.facebook.net/D6597
Summary:
The default compilation process now uses "-Wall" to compile.
Fix all compilation error generated by gcc.
Test Plan: make all check
Reviewers: heyongqiang, emayanke, sheki
Reviewed By: heyongqiang
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6525
Summary:
There are certain use-cases where the application intends to
delete older keys aftre they have expired a certian time period.
One option for those applications is to periodically scan the
entire database and delete appropriate keys.
A better way is to allow the application to hook into the
compaction process. This patch allows the application to set
a method callback for every key that is being compacted. If
this method returns true, then the key is not preserved in
the output of the compaction.
Test Plan:
This is mostly to preview the proposed new public api.
Since it is a public api, please do due diligence on reviewing it.
I will be writing test cases for this api in mynext version of
this patch.
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
CC: sheki, adsharma
Differential Revision: https://reviews.facebook.net/D6285
Summary: as subject.
Test Plan: manually test it, will add a testcase
Reviewers: dhruba, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6345
Summary: Leveldb currently uses windowBits=-14 while using zlib compression.(It was earlier 15). This makes the setting configurable. Related changes here: https://reviews.facebook.net/D6105
Test Plan: make all check
Reviewers: dhruba, MarkCallaghan, sheki, heyongqiang
Differential Revision: https://reviews.facebook.net/D6393
Summary:
as subject
Test Plan:
run db_bench and db_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6111
Summary:
The leveldb API is enhanced to support different compression algorithms at
different levels.
This adds the option min_level_to_compress to db_bench that specifies
the minimum level for which compression should be done when
compression is enabled. This can be used to disable compression for levels
0 and 1 which are likely to suffer from stalls because of the CPU load
for memtable flushes and (L0,L1) compaction. Level 0 is special as it
gets frequent memtable flushes. Level 1 is special as it frequently
gets all:all file compactions between it and level 0. But all other levels
could be the same. For any level N where N > 1, the rate of sequential
IO for that level should be the same. The last level is the
exception because it might not be full and because files from it are
not read to compact with the next larger level.
The same amount of time will be spent doing compaction at any
level N excluding N=0, 1 or the last level. By this standard all
of those levels should use the same compression. The difference is that
the loss (using more disk space) from a faster compression algorithm
is less significant for N=2 than for N=3. So we might be willing to
trade disk space for faster write rates with no compression
for L0 and L1, snappy for L2, zlib for L3. Using a faster compression
algorithm for the mid levels also allows us to reclaim some cpu
without trading off much loss in disk space overhead.
Also note that little is to be gained by compressing levels 0 and 1. For
a 4-level tree they account for 10% of the data. For a 5-level tree they
account for 1% of the data.
With compression enabled:
* memtable flush rate is ~18MB/second
* (L0,L1) compaction rate is ~30MB/second
With compression enabled but min_level_to_compress=2
* memtable flush rate is ~320MB/second
* (L0,L1) compaction rate is ~560MB/second
This practicaly takes the same code from https://reviews.facebook.net/D6225
but makes the leveldb api more general purpose with a few additional
lines of code.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6261
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6243
Summary: Enable LevelDb to create a new log file if current log file is too large.
Test Plan:
Write a script and manually check the generated info LOG.
Task ID: 1803577
Blame Rev:
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: zshao
Differential Revision: https://reviews.facebook.net/D6003
Summary:
The parameter delete_obsolete_files_period_micros controls the
periodicity of deleting obsolete files. db_bench was reading in
this parameter intoa local variable called 'l' but was incorrectly
using another local variable called 'n' while setting it in the
db.options data structure.
This patch also logs the value of delete_obsolete_files_period_micros
in the LOG file at db startup time.
I am hoping that this will improve the overall write throughput drastically.
Test Plan: run db_bench
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6099
published in https://reviews.facebook.net/D5997.
Summary:
This patch allows compaction to occur in multiple background threads
concurrently.
If a manual compaction is issued, the system falls back to a
single-compaction-thread model. This is done to ensure correctess
and simplicity of code. When the manual compaction is finished,
the system resumes its concurrent-compaction mode automatically.
The updates to the manifest are done via group-commit approach.
Test Plan: run db_bench
Summary:
The method DeleteObsolete files is a very costly methind, especially
when the number of files in a system is large. It makes a list of
all live-files and then scans the directory to compute the diff.
By default, this method is executed after every compaction run.
This patch makes it such that DeleteObsolete files is never
invoked twice within a configured period.
Test Plan: run all unit tests
Reviewers: heyongqiang, MarkCallaghan
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6045
Summary:
Each assoc is identified by (id1, assocType). This is the rowkey.
Each row has a read/write rowlock. There is statically allocated array
of 2000 read/write locks. A rowkey is murmur-hashed to one of the
read/write locks.
assocPut and assocDelete acquires the rowlock in Write mode.
The key-updates are done within the rowlock with a atomic nosync
batch write to leveldb. Then the rowlock is released and
a write-with-sync is done to sync leveldb transaction log.
Test Plan: added unit test
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5859
Summary:
We have seen that reading data via the pread call (instead of
mmap) is much faster on Linux 2.6.x kernels. This patch makes
an equivalent option to switch off mmaps for the write path
as well.
db_bench --mmap_write=0 will use write() instead of mmap() to
write data to a file.
This change is backward compatible, the default
option is to continue using mmap for writing to a file.
Test Plan: "make check all"
Differential Revision: https://reviews.facebook.net/D5781
Summary:
Implement ReadWrite locks for leveldb. These will be helpful
to implement a read-modify-write operation (e.g. atomic increments).
Test Plan: does not modify any existing code
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D5787
Summary: Print the block cache size in the LOG.
Test Plan: run db_bench and look at LOG. This is helpful while I was debugging one use-case.
Reviewers: heyongqiang, MarkCallaghan
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5739
Summary:
The GetLiveFiles() api lists the set of sst files and the current
MANIFEST file. But the database continues to append new data to the
MANIFEST file even when the application is backing it up to the
backup location. This means that the database-version that is
stored in the MANIFEST FILE in the backup location
does not correspond to the sst files returned by GetLiveFiles.
This API adds a new parameter to GetLiveFiles. This new parmeter
returns the current size of the MANIFEST file.
Test Plan: Unit test attached.
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5631
Summary:
The background threads are necessary for compaction.
For slower storage, it might be necessary to have more than
one compaction thread per DB. This patch allows creating
a configurable number of worker threads.
The default reamins at 1 (to maintain backward compatibility).
Test Plan:
run all unit tests. changes to db-bench coming in
a separate patch.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D5559
Summary:
as subject. This diff should be good for benchmarking.
will send another diff to make it better in the case the seek compaction is enable.
In that coming diff, will not count a seek if the bloomfilter filters.
Test Plan: build
Reviewers: dhruba, MarkCallaghan
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D5481
Summary:
as subject. this can be used for benchmarking.
If we want it for some cases, we can do more changes to make this part of the option.
Test Plan: db_test
Reviewers: dhruba
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D5451
Summary:
Reads via mmap on concurrent workloads are much slower than pread.
For example on a 24-core server with storage that can do 100k IOPS or more
I can get no more than 10k IOPS with mmap reads and 32+ threads.
Test Plan: db_bench benchmarks
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5433
Summary:
Ability to switch off filesystem read-aheads. This change is
backward-compatible: the default setting is to allow file
system read-aheads.
Test Plan: run benchmarks
Reviewers: heyongqiang, adsharma
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5391
Summary:
When posix_fadvise(offset, offset) is usedm it frees up only those
pages in that specified range. But the filesystem could have done some
read-aheads and those get cached in the OS cache.
Do not cache readahead-pages in the OS cache.
Test Plan: run db_bench benchmark.
Reviewers: vamsi, heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5379
Summary: added a new option db_log_dir, which points the log dir. Inside that dir, in order to make log names unique, the log file name is prefixed with the leveldb data dir absolute path.
Test Plan: db_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D5205
Summary:
Clean up compiler warnings generated by -Wall option.
make clean all OPT=-Wall
This is a pre-requisite before making a new release.
Test Plan: compile and run unit tests
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5019
Summary:
The numbers of shards that the block cache is divided into is
configurable. However, if the user specifies that he/she wants
the block cache to be divided into more than 2**20 pieces, then
the system will rey to allocate a huge array of that size) that
could fail.
It is better to limit the sharding of the block cache to an
upper bound. The default sharding is 16 shards (i.e. 2**4)
and the maximum is now 2 million shards (i.e. 2**20).
Also, fixed a bug with the LRUCache where the numShardBits
should be a private member of the LRUCache object rather than
a static variable.
Test Plan:
run db_bench with --cache_numshardbits=64.
Task ID: #
Blame Rev:
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5013
Summary:
Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync).
This is needed for data durability when running on ext3 filesystems.
Added options to the benchmark db_bench to generate performance numbers
with either fsync or fdatasync enabled.
Cleaned up Makefile to build leveldb_shell only when building the thrift
leveldb server.
Test Plan: build and run benchmark
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D4911
Summary: Log the open-options to the LOG. Use options_ instead of options because SanitizeOptions could modify the max_file_open limit.
Test Plan: num db_bench
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D4833
Summary: Record the version of the source that we are compiling. We keep a record of the git revision in util/version.cc. This source file is then built as a regular source file as part of the compilation process. One can run "strings executable_filename | grep _build_" to find the version of the source that we used to build the executable file.
Test Plan: none
Differential Revision: https://reviews.facebook.net/D4785
Summary:
as subject.
A new log is written to scribe via thrift client when a new db is opened and when there is
a compaction.
a new option var scribe_log_db_stats is added.
Test Plan: manually checked using command "ptail -time 0 leveldb_deploy_stats"
Reviewers: dhruba
Differential Revision: https://reviews.facebook.net/D4659
Summary:
The fcntl call cannot detect lock conflicts when invoked multiple times
from the same thread.
Use a static lockedFile Set to record the paths that are locked.
A lockfile request checks to see if htis filename already exists in
lockedFiles, if so, then it triggers an error. Otherwise, it inserts
the filename in the lockedFiles Set.
A unlock file request verifies that the filename is in the lockedFiles
set and removes it from lockedFiles set.
Test Plan: unit test attached
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D4755
My change in this patch is to make the Hash code that is used for blooms to be confgurable. In fact,
one can specify a modified HashCode that inspects only parts of the Key to generate the Hash (used
by booms).
Test Plan: none
Differential Revision: https://reviews.facebook.net/D4059
Summary:
First draft.
Unit tests pass.
Test Plan: unit tests attached
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D3969
Summary:
This speeds up CRC computation significantly on
hardware that supports it. Enabled via -msse4.
Note: the binary won't be usable on older CPUs
that don't support the instruction.
Test Plan: crc32c_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D3201
Summary:
Some code reorganization in-preparation for replacing with a hardware
instruction.
* Use u64 for some of the key types
* Use an ALIGN macro so code is easier to read
Test Plan: crc32c_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D3135
In particular, we add a new FilterPolicy class. An instance
of this class can be supplied in Options when opening a
database. If supplied, the instance is used to generate
summaries of keys (e.g., a bloom filter) which are placed in
sstables. These summaries are consulted by DB::Get() so we
can avoid reading sstable blocks that are guaranteed to not
contain the key we are looking for.
This change provides one implementation of FilterPolicy
based on bloom filters.
Other changes:
- Updated version number to 1.4.
- Some build tweaks.
- C binding for CompactRange.
- A few more benchmarks: deleteseq, deleterandom, readmissing, seekrandom.
- Minor .gitignore update.
- Pass system's values of CFLAGS,LDFLAGS.
Don't override OPT if it's already set.
Original patch by Alessio Treglia <alessio@debian.org>:
http://code.google.com/p/leveldb/issues/detail?id=27#c6
- Remove 1 exit time destructor from leveldb.
See http://crbug.com/101600
- Fix problem where sstable building code would pass an
internal key to the user comparator.
(Sync with uptream at 25436817.)
- Replace raw slice comparison with a call to user comparator.
Added test for custom comparators.
- Fix end of namespace comments.
- Fixed bug in picking inputs for a level-0 compaction.
When finding overlapping files, the covered range may expand
as files are added to the input set. We now correctly expand
the range when this happens instead of continuing to use the
old range. For example, suppose L0 contains files with the
following ranges:
F1: a .. d
F2: c .. g
F3: f .. j
and the initial compaction target is F3. We used to search
for range f..j which yielded {F2,F3}. However we now expand
the range as soon as another file is added. In this case,
when F2 is added, we expand the range to c..j and restart the
search. That picks up file F1 as well.
This change fixes a bug related to deleted keys showing up
incorrectly after a compaction as described in Issue 44.
(Sync with upstream @25072954)
- Added DB::CompactRange() method.
Changed manual compaction code so it breaks up compactions of
big ranges into smaller compactions.
Changed the code that pushes the output of memtable compactions
to higher levels to obey the grandparent constraint: i.e., we
must never have a single file in level L that overlaps too
much data in level L+1 (to avoid very expensive L-1 compactions).
Added code to pretty-print internal keys.
- Fixed bug where we would not detect overlap with files in
level-0 because we were incorrectly using binary search
on an array of files with overlapping ranges.
Added "leveldb.sstables" property that can be used to dump
all of the sstables and ranges that make up the db state.
- Removing post_write_snapshot support. Email to leveldb mailing
list brought up no users, just confusion from one person about
what it meant.
- Fixing static_cast char to unsigned on BIG_ENDIAN platforms.
Fixes Issue 35 and Issue 36.
- Comment clarification to address leveldb Issue 37.
- Change license in posix_logger.h to match other files.
- A build problem where uint32 was used instead of uint32_t.
Sync with upstream @24408625
Fix GCC -Wshadow warnings in LevelDB's public header files,
reported by Dustin.
Add in-memory Env implementation (helpers/memenv/*).
This enables users to create LevelDB databases in-memory.
Initialize ShardedLRUCache::last_id_ to zero.
This fixes a Valgrind warning.
(Also delete port/sha1_* which were removed upstream some time ago.)
- Fix for issue 33 (non-null-terminated result from
leveldb_property_value())
- Support for running multiple instances of a benchmark in parallel.
- Reduce lock contention on Get():
(1) Do not hold the lock while searching memtables.
(2) Shard block and table caches 16-ways.
Benchmark for evaluating this change:
$ db_bench --benchmarks=fillseq1,readrandom --threads=$n
(fillseq1 is a small hack to make sure fillseq runs once regardless
of number of threads specified on the command line).
git-svn-id: https://leveldb.googlecode.com/svn/trunk@49 62dab493-f737-651d-591e-8d6aee1b9529
- Fix bug in Iterator::Prev where it would return the wrong key.
Fixes issues 29 and 30.
- Added a tweak to testharness to allow running just some tests.
- Fixing two minor documentation errors based on issues 28 and 25.
- Cleanup; fix namespaces of export-to-C code.
Also fix one "const char*" vs "char*" mismatch.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@48 62dab493-f737-651d-591e-8d6aee1b9529
- LevelDB patch for FreeBSD. This resolves Issue 22.
Contributed by dforsythe (thanks!).
- Removing Chromium-specific files.
They are now going to live in the Chromium repository.
- Adding a benchmark page comparing LevelDB performance
to SQLite and Kyoto Cabinet's TreeDB, along with
code to generate the benchmarks.
Thanks to Kevin Tseng for compiling the benchmarks,
and Scott Hess and Mikio Hirabayashi for their
help and advice.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@40 62dab493-f737-651d-591e-8d6aee1b9529
- Removed one copy of an uncompressed block contents changing
the signature of Snappy_Uncompress() so it uncompresses into a
flat array instead of a std::string.
Speeds up readrandom ~10%.
- Instead of a combination of Env/WritableFile, we now have a
Logger interface that can be easily overridden applications
that want to supply their own logging.
- Separated out the gcc and Sun Studio parts of atomic_pointer.h
so we can use 'asm', 'volatile' keywords for Sun Studio.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@39 62dab493-f737-651d-591e-8d6aee1b9529
- LevelDB patch for Sun Studio
Based on a patch submitted by Theo Schlossnagle - thanks!
This fixes Issue 17.
- Fix a couple of test related memory leaks.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@38 62dab493-f737-651d-591e-8d6aee1b9529
Slight tweak to the no-overlap optimization: only push to
level 2 to reduce the amount of wasted space when the same
small key range is being repeatedly overwritten.
Fix for Issue 18: Avoid failure on Windows by avoiding
deletion of lock file until the end of DestroyDB().
Fix for Issue 19: Disregard sequence numbers when checking for
overlap in sstable ranges. This fixes issue 19: when writing
the same key over and over again, we would generate a sequence
of sstables that were never merged together since their sequence
numbers were disjoint.
Don't ignore map/unmap error checks.
Miscellaneous fixes for small problems Sanjay found while diagnosing
issue/9 and issue/16 (corruption_testr failures).
- log::Reader reports the record type when it finds an unexpected type.
- log::Reader no longer reports an error when it encounters an expected
zero record regardless of the setting of the "checksum" flag.
- Added a missing forward declaration.
- Documented a side-effects of larger write buffer sizes
(longer recovery time).
git-svn-id: https://leveldb.googlecode.com/svn/trunk@37 62dab493-f737-651d-591e-8d6aee1b9529
This revision adds two major changes:
1. build_detect_platform which generates build_config.mk
with platform-dependent flags for the build process
2. /port/atomic_pointer.h with anAtomicPointerimplementation
for platforms without <cstdatomic>
Some of this code is loosely based on patches submitted to the
LevelDB mailing list at https://groups.google.com/forum/#!forum/leveldb
Tip of the hat to Dave Smith and Edouard A, who both sent patches.
The presence of Snappy (http://code.google.com/p/snappy/) and
cstdatomic are now both detected in the build_detect_platform
script (1.) which gets executing during make.
For (2.), instead of broadly importing atomicops_* from Chromium or
the Google performance tools, we chose to just implement AtomicPointer
and the limited atomic load and store operations it needs.
This resulted in much less code and fewer files - everything is
contained in atomic_pointer.h.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@34 62dab493-f737-651d-591e-8d6aee1b9529
* Patch LevelDB to build for OSX and iOS
* Fix race condition in memtable iterator deletion.
* Other small fixes.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@29 62dab493-f737-651d-591e-8d6aee1b9529
* env_chromium.cc should not export symbols.
* Fix MSVC warnings.
* Removed large value support.
* Fix broken reference to documentation file
git-svn-id: https://leveldb.googlecode.com/svn/trunk@24 62dab493-f737-651d-591e-8d6aee1b9529