Summary: rocksdb replication will need this when writing value+TS from master to slave 'as is'
Test Plan: make
Reviewers: dhruba, vamsi, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11919
Summary:
This diff adds support for both soft and hard rate limiting. The following changes are included:
1) Options.rate_limit is renamed to Options.hard_rate_limit.
2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds.
3) Options.soft_rate_limit is added.
4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit.
5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit.
6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default.
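A minimal sketch of the delay policy in items 4 and 5, not the code in this diff. The helper name WriteDelayMicros is hypothetical, and the linear ramp between the two limits is an assumption: the summary only says the soft delay is 0-1 ms depending on how close the score is to hard_rate_limit. The cap from rate_limit_delay_max_milliseconds is left to the caller.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the diff): how long one write should sleep,
// in microseconds, for a given maximum compaction score.
// A limit of 0 disables the corresponding check.
inline uint64_t WriteDelayMicros(double score, double soft_rate_limit,
                                 double hard_rate_limit) {
  assert(score >= 0.0);
  const double kStepMicros = 1000.0;  // one 1 ms step

  // Item 4: above the hard limit, delay a full step; the caller repeats
  // this until the score falls back below hard_rate_limit.
  if (hard_rate_limit > 0.0 && score > hard_rate_limit) {
    return static_cast<uint64_t>(kStepMicros);
  }
  // Item 5: between the limits, delay 0-1 ms. The linear ramp is an
  // assumption, and this sketch only applies the soft delay when a larger
  // hard limit is also set.
  if (soft_rate_limit > 0.0 && hard_rate_limit > soft_rate_limit &&
      score > soft_rate_limit) {
    const double fraction =
        (score - soft_rate_limit) / (hard_rate_limit - soft_rate_limit);
    return static_cast<uint64_t>(fraction * kStepMicros);
  }
  return 0;  // below both limits, or both limits disabled
}
```

With soft_rate_limit = 0 the soft branch never fires, which matches the statement that the default preserves the old behavior.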
Test Plan:
make -j32 check
./db_stress
Reviewers: dhruba, haobo, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12003
Summary:
Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks of this size.
I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer.
Test Plan:
new unit test, I am running db_test.
For passing the parameter from the configured option to Arena, I tried tests like:
TEST(DBTest, Arena_Option) {
  std::string dbname = test::TmpDir() + "/db_arena_option_test";
  DestroyDB(dbname, Options());
  DB* db = nullptr;
  Options opts;
  opts.create_if_missing = true;
  opts.arena_block_size = 1000000;  // also tested 99 and 999999
  Status s = DB::Open(opts, dbname, &db);
  ASSERT_OK(s);  // do not touch db if the open failed
  ASSERT_OK(db->Put(WriteOptions(), "a", "123"));
  delete db;  // close the database so the test does not leak it
}
and printed some debug info. The results look good. Any suggestion for such a unit-test?
Reviewers: haobo, dhruba, emayanke, jpaton
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D11799
Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations.
Test Plan:
make clean
make -j32 check
./db_stress
Reviewers: dhruba, emayanke, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11739
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl, which allows checking since a specified sequence number in GetImpl; this is useful for checking a partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.
Test Plan: db_stress;make check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D11745
Summary:
Wrote a new function, CheckKeyMayExist, in db_impl.c. It calls Get with a new parameter turned on, which makes Get return false only if the bloom filters can guarantee that the key is not in the database. Delete calls this function, and if the option deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete is dropped, saving:
1. Put of delete type
2. Space in the db, and
3. Compaction time
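A toy sketch of the idea, assuming a single in-memory filter in place of the per-table bloom filters. ToyFilter and ToyDb are hypothetical names; only deletes_use_filter comes from the summary. The point is the one-sided guarantee: a delete is dropped only when the filter says the key is definitely absent.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for bloom filters: it may answer "maybe present" for a key
// that was never added, but never answers "absent" for a key that was.
class ToyFilter {
 public:
  void Add(const std::string& key) {
    const size_t slot = Slot(key);
    assert(slot < kBits);
    bits_.set(slot);
  }
  bool KeyMayExist(const std::string& key) const {
    return bits_.test(Slot(key));
  }

 private:
  static constexpr size_t kBits = 1024;
  static size_t Slot(const std::string& key) {
    return std::hash<std::string>()(key) % kBits;
  }
  std::bitset<kBits> bits_;
};

// Hypothetical store that only records which operations were written.
struct ToyDb {
  bool deletes_use_filter = false;
  ToyFilter filter;
  std::vector<std::string> written;

  void Put(const std::string& key) {
    filter.Add(key);
    written.push_back("PUT " + key);
  }

  // Returns true if a delete record was written, false if it was dropped
  // because the filter guarantees the key is not in the database.
  bool Delete(const std::string& key) {
    if (deletes_use_filter && !filter.KeyMayExist(key)) {
      return false;
    }
    written.push_back("DEL " + key);
    return true;
  }
};
```

A false positive in the filter only costs an unnecessary delete record; it never causes a needed delete to be skipped.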
Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved
Reviewers: dhruba, haobo, vamsi
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11607
Summary: This diff added a command 'idump' to ldb tool, which dumps the internal key/value pairs. It could be useful for diagnosis and estimating the per user key 'overhead'. Also cleaned up the ldb code a bit where I touched.
Test Plan: make check; ldb idump
Reviewers: emayanke, sheki, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11517
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions. Files from L0 are
compacted back into L0. The meat of this compaction algorithm
is in PickCompactionHybrid().
All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted. Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.
The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.
The above logic should reduce write amplification to a
large extent... will publish numbers shortly.
Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).
Differential Revision: https://reviews.facebook.net/D11289
Summary: [start_time, end_time) is what I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff.
Test Plan: python ldb_test.py, make ttl_test, Run mcrocksdb-backup tool, Run the ldb tool on 2 mcrocksdb production backups from sigmafio033.prn1
Reviewers: vamsi, haobo
Reviewed By: vamsi
Differential Revision: https://reviews.facebook.net/D11433
Summary:
The Scan and Dump commands in ldb use an iterator. For ttl databases we also need to print the timestamp for debugging. To do this, these functions create a TtlIterator pointer and assign it the Iterator pointer, which actually points to a TtlIterator object, and then call the new function ValueWithTS, which can also return the TS. Buckets feature for the dump command: it gives a count of the key-values in the specified time-range, distributed across buckets of bucket-size that partition the time-range. start_time and end_time are specified as unix timestamps and bucket in seconds on the user command line.
Have commented out 3 lines from ldb_test.py so that the test does not break right now. It breaks because the timestamp is now also printed, and I have to look at wildcards in python to compare properly.
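A small sketch of the bucket counting for the dump command, assuming the half-open [start_time, end_time) convention. CountPerBucket is a hypothetical name, and letting the last bucket be shorter when the range is not a multiple of the bucket size is an assumption, not something the summary states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts timestamps that fall in [start_time, end_time), split into buckets
// of bucket_size seconds. Timestamps outside the range are ignored.
inline std::vector<uint64_t> CountPerBucket(
    const std::vector<int64_t>& timestamps, int64_t start_time,
    int64_t end_time, int64_t bucket_size) {
  assert(bucket_size > 0);
  assert(start_time <= end_time);
  const int64_t span = end_time - start_time;
  // Round up so a trailing partial bucket still gets a slot (assumption).
  const size_t num_buckets =
      static_cast<size_t>((span + bucket_size - 1) / bucket_size);
  std::vector<uint64_t> counts(num_buckets, 0);
  for (int64_t ts : timestamps) {
    if (ts < start_time || ts >= end_time) continue;  // half-open range
    counts[static_cast<size_t>((ts - start_time) / bucket_size)]++;
  }
  return counts;
}
```

A key stamped exactly at end_time is not counted, which is the practical consequence of the half-open range.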
Test Plan: python tools/ldb_test.py
Reviewers: vamsi, dhruba, haobo, sheki
Reviewed By: vamsi
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11403
Summary: as title, also removed an incorrect assertion
Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0
Reviewers: dhruba, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11367
Summary: This diff added an option to control the incremental sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercises.
Test Plan: make check; db_bench
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11295
Summary:
Merge multiple memtables in memory before writing them
out to a file in L0.
There is a new config parameter min_write_buffer_number_to_merge
that specifies the number of write buffers that should be merged
together to a single file in storage. The system will not flush
write buffers to storage unless at least this many buffers have
accumulated in memory.
The default value of this new parameter is 1, which means that
a write buffer will be immediately flushed to disk as soon as it is
ready.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D11241
Summary:
Use a bit set to keep track of which random number is generated.
Currently only supports single-threaded. All our perf tests are run with threads=1
Copied over bitset implementation from common/datastructures
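A rough sketch of the technique, not the code copied from common/datastructures. UniqueRandom is a hypothetical name, std::vector<bool> stands in for the bit set, and probing forward to the next unused value on a collision is an assumption; the summary does not say how repeats are handled. Like the diff, it is single-threaded only.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hands out each value in [0, range) at most once, in random order.
// Not thread-safe.
class UniqueRandom {
 public:
  UniqueRandom(uint64_t range, uint64_t seed)
      : used_(range, false), remaining_(range), rng_(seed) {
    assert(range > 0);
  }

  // Returns a value that has not been returned before.
  // Must not be called more than `range` times.
  uint64_t Next() {
    assert(remaining_ > 0);
    std::uniform_int_distribution<uint64_t> dist(0, used_.size() - 1);
    uint64_t v = dist(rng_);
    // On a collision, walk forward to the next unused value (assumption).
    while (used_[v]) {
      v = (v + 1) % used_.size();
    }
    used_[v] = true;  // one bit per value records what was generated
    --remaining_;
    return v;
  }

  uint64_t remaining() const { return remaining_; }

 private:
  std::vector<bool> used_;
  uint64_t remaining_;
  std::mt19937_64 rng_;
};
```

Calling Next() exactly `range` times visits every key once, which is what the test plan checked by printing the generated keys.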
Test Plan: printed the generated keys, and verified all keys were present.
Reviewers: MarkCallaghan, haobo, dhruba
Reviewed By: MarkCallaghan
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11247
Summary:
During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
This diff simply asks the OS to sync data incrementally as it is written, in the background. The hope is that, at the final sync, most of the data is already on disk and we block less on the sync call. Thus, each compaction runs faster and we could use fewer compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
For the write-only 800-byte test:
Before the change: await oscillates between 10ms and 3ms
After the change: await ranges 1-3ms
Will test against read-modify-write workload too, see if high read latency P99 could be resolved.
Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.
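A sketch of the bookkeeping behind incremental sync, under stated assumptions. IncrementalSyncTracker is a hypothetical name, and the sync interval is written as bytes_per_sync only for illustration. The range_sync callback stands in for whatever call starts writeback of a byte range; on Linux that could be sync_file_range() with SYNC_FILE_RANGE_WRITE, but the summary does not say which call the diff uses.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Tracks how much of a file has been handed to the OS for writeback and
// triggers a range sync once enough unsynced bytes have accumulated.
class IncrementalSyncTracker {
 public:
  using RangeSync = std::function<void(uint64_t offset, uint64_t nbytes)>;

  IncrementalSyncTracker(uint64_t bytes_per_sync, RangeSync range_sync)
      : bytes_per_sync_(bytes_per_sync), range_sync_(std::move(range_sync)) {
    assert(range_sync_ != nullptr);
  }

  // Call after appending nbytes to the file.
  void OnAppend(uint64_t nbytes) {
    written_ += nbytes;
    if (bytes_per_sync_ == 0) return;  // incremental sync disabled
    if (written_ - synced_ >= bytes_per_sync_) {
      // Ask for writeback of everything appended since the last request.
      range_sync_(synced_, written_ - synced_);
      synced_ = written_;
    }
  }

 private:
  const uint64_t bytes_per_sync_;
  RangeSync range_sync_;
  uint64_t written_ = 0;  // total bytes appended so far
  uint64_t synced_ = 0;   // bytes already covered by a range sync
};
```

The final full sync is still needed for durability; the tracker only spreads the writeback out so that sync has less left to do.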
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11115
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite
Test Plan: make check; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11175
Summary:
This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also has this option now.
Quickly tested 8 thread cpu bound 100 byte random read.
No fast mutex: ~750k/s ops
With fast mutex: ~880k/s ops
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D11031
Summary:
Current posix advice implementation ties up the access pattern hint with the creation of a file.
It is not possible to apply different advice for different access (random get vs compaction read),
without keeping two open files for the same table. This patch extended the RandomAccessFile interface
to accept new access hint at anytime. Particularly, we are able to set different access hint on the same
table file based on when/how the file is used.
Two options are added to set the access hint, after the file is first opened and after the file is being
compacted.
Test Plan: make check; db_stress; db_bench
Reviewers: dhruba
Reviewed By: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D10905
Summary: a new option block_size_deviation is added.
Test Plan: run db_test and db_bench
Reviewers: dhruba, haobo
Reviewed By: haobo
Differential Revision: https://reviews.facebook.net/D10821
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By default, it's set to a very big number 3600 (1 hour).
Test Plan: make check;
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10761
Summary:
The valgrind errors were in the unit tests where we change the
number of levels of a database using internal methods.
Test Plan:
valgrind ./reduce_levels_test
valgrind ./db_test
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10893
Summary:
This is the initial version. A few ways in which this could
be extended in the future are:
(a) Killing from more places in source code
(b) Hashing stack and using that hash in determining whether to crash.
This is to avoid crashing more often at source lines that are executed
more often.
(c) Raising exceptions or returning errors instead of killing
Test Plan:
This whole thing is for testing.
Here is part of output:
python2.7 tools/db_crashtest2.py -d 600
Running db_stress
db_stress retncode -15 output LevelDB version : 1.5
Number of threads : 32
Ops per thread : 10000000
Read percentage : 50
Write-buffer-size : 4194304
Delete percentage : 30
Max key : 1000
Ratio #ops/#keys : 320000
Num times DB reopens: 0
Batches/snapshots : 1
Purge redundant % : 50
Num keys per lock : 4
Compression : snappy
------------------------------------------------
No lock creation because test_batches_snapshots set
2013/04/26-17:55:17 Starting database operations
Created bg thread 0x7fc1f07ff700
... finished 60000 ops
Running db_stress
db_stress retncode -15 output LevelDB version : 1.5
Number of threads : 32
Ops per thread : 10000000
Read percentage : 50
Write-buffer-size : 4194304
Delete percentage : 30
Max key : 1000
Ratio #ops/#keys : 320000
Num times DB reopens: 0
Batches/snapshots : 1
Purge redundant % : 50
Num keys per lock : 4
Compression : snappy
------------------------------------------------
Created bg thread 0x7ff0137ff700
No lock creation because test_batches_snapshots set
2013/04/26-17:56:15 Starting database operations
... finished 90000 ops
Revert Plan: OK
Task ID: #2252691
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D10581
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If client is ok with losing most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.
Test Plan: make check; db_stress
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb, emayanke
Differential Revision: https://reviews.facebook.net/D10869
Summary:
There is an existing field Options.max_bytes_for_level_multiplier that
sets the multiplier for the size of each level in the database.
This patch introduces the ability to set different multipliers
for every level in the database. The size of a level is determined
by using both max_bytes_for_level_multiplier as well as the
per-level fanout.
size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier
* fanout[i-1]
The default value of fanout is 1, so that it is backward compatible.
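The formula above can be sketched as follows. LevelTargetSizes and first_level_bytes are hypothetical names, treating a missing fanout entry as 1 is an assumption based on the stated default, and overflow is ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// size[i] = size[i-1] * max_bytes_for_level_multiplier * fanout[i-1]
// sizes[0] is the size of the first level the formula applies to.
inline std::vector<uint64_t> LevelTargetSizes(
    uint64_t first_level_bytes, uint64_t max_bytes_for_level_multiplier,
    const std::vector<uint64_t>& fanout, size_t num_levels) {
  assert(max_bytes_for_level_multiplier > 0);
  std::vector<uint64_t> sizes;
  if (num_levels == 0) return sizes;
  sizes.push_back(first_level_bytes);
  for (size_t i = 1; i < num_levels; ++i) {
    // Missing per-level entries fall back to the default fanout of 1.
    const uint64_t extra = (i - 1 < fanout.size()) ? fanout[i - 1] : 1;
    sizes.push_back(sizes[i - 1] * max_bytes_for_level_multiplier * extra);
  }
  return sizes;
}
```

For example, with a first level of 10 bytes, a multiplier of 10 and fanout {1, 2, 1}, the levels come out as 10, 100, 2000 and 20000 bytes; with an empty fanout list the result is the plain geometric series, which is the backward-compatible case.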
Test Plan: make check
Reviewers: haobo, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10863
Summary:
PosixLogger and AutoRollLogger do not seem to be thread safe.
For PosixLogger, log_size_ is not atomically updated.
For AutoRollLogger, the underlying logger_ might be deleted by
one thread while still being accessed by another.
Test Plan: make check
Reviewers: kailiu, dhruba, heyongqiang
Reviewed By: kailiu
CC: leveldb, zshao, sheki
Differential Revision: https://reviews.facebook.net/D9699
Summary:
Make stop watch a simple implementation, instead of subclass of a virtual class
Allocate stop watches off the stack instead of heap.
Code is more terse now.
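For illustration, a non-virtual, stack-allocated stop watch could look like the sketch below. ScopedStopWatch and TimingSink are hypothetical names and this is not the class from the diff; it only shows the shape: no base class, no heap allocation, and the elapsed time is reported when the object goes out of scope.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical sink standing in for a statistics/histogram object.
struct TimingSink {
  uint64_t count = 0;
  uint64_t total_micros = 0;
  void Add(uint64_t micros) {
    ++count;
    total_micros += micros;
  }
};

// Plain class with no virtual functions; meant to live on the stack.
class ScopedStopWatch {
 public:
  explicit ScopedStopWatch(TimingSink* sink)
      : sink_(sink), start_(std::chrono::steady_clock::now()) {
    assert(sink != nullptr);
  }
  // Reports the elapsed time once, when the scope ends.
  ~ScopedStopWatch() { sink_->Add(ElapsedMicros()); }

  ScopedStopWatch(const ScopedStopWatch&) = delete;
  ScopedStopWatch& operator=(const ScopedStopWatch&) = delete;

  uint64_t ElapsedMicros() const {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(elapsed)
            .count());
  }

 private:
  TimingSink* const sink_;
  const std::chrono::steady_clock::time_point start_;
};
```

Usage is a single declaration at the top of the timed block, e.g. `ScopedStopWatch sw(&sink);`.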
Test Plan: make all check, db_bench with --statistics=1
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10809
Summary: Statistics.h and histogram.h had double based api's to record values. Remove them as they are not used anywhere
Test Plan: make all check
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10815
Summary: ldb works with raw data from the database and needs to be aware of ttl-database to work with it meaningfully. '-ttl' option now tells it that. Also added onto the ldb_test.py test. This option may be specified along with put, get, scan or dump. There is no support for providing a ttl-value; it uses the default (forever) because there is no use-case for this currently.
Test Plan: make ldb_test; python tools/ldb_test.py
Reviewers: dhruba, sheki, haobo, vamsi
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10797
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOperator, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10773
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.
Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h
Please go over local/my_test.cc carefully, as it is a concrete use case.
Please also review the implementation files to see if the straw man implementation makes sense.
Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.
Future work:
- Integration with compaction
- A raw Get implementation
I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.
Test Plan: run all local tests
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb, zshao, sheki, emayanke, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D9651
Summary:
Mark's task description from #2316777
Env::Default() comes from util/env_posix.cc
This is a static global.
static PosixEnv default_env;
Env* Env::Default() {
  return &default_env;
}
-----
These globals assume default_env was initialized first. I don't think that is safe or correct to do (http://stackoverflow.com/questions/1005685/c-static-initialization-order)
const string AutoRollLoggerTest::kTestDir(
test::TmpDir() + "/db_log_test");
const string AutoRollLoggerTest::kLogFile(
test::TmpDir() + "/db_log_test/LOG");
Env* AutoRollLoggerTest::env = Env::Default();
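One common remedy for this kind of initialization-order problem is a function-local static, which is constructed on first use. The sketch below uses hypothetical stand-ins (this Env, PosixEnvSketch and TestPath are not the real classes) and is not necessarily what this diff does; it only shows why a global initialized from Env::Default() becomes safe under that pattern.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for the real Env interface.
class Env {
 public:
  virtual ~Env() {}
  virtual std::string TmpDir() const = 0;
  static Env* Default();
};

namespace {
class PosixEnvSketch : public Env {
 public:
  std::string TmpDir() const override { return "/tmp/sketch"; }
};
}  // namespace

Env* Env::Default() {
  // Constructed the first time Default() is called, so callers running
  // during static initialization of other globals always see a live object.
  static PosixEnvSketch default_env;
  return &default_env;
}

// Helper used by the global below; the path layout is illustrative only.
inline std::string TestPath(const std::string& name) {
  Env* env = Env::Default();
  assert(env != nullptr);
  return env->TmpDir() + "/" + name;
}

// Mirrors kTestDir in the summary: a namespace-scope object whose
// initializer depends on the default Env.
const std::string kTestDir = TestPath("db_log_test");
```

Since C++11 the initialization of a function-local static is also thread-safe, so no extra locking is needed around the first call.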
Test Plan:
run make clean && make && make check
But how can I know if it works in Ubuntu?
Reviewers: MarkCallaghan, chip
Reviewed By: chip
CC: leveldb, dhruba, haobo
Differential Revision: https://reviews.facebook.net/D10491
Summary:
RocksDB doesn't build on Ubuntu VM .. should be fixed with this patch.
g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
util/env_posix.cc:68:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:68:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer’
util/env_posix.cc:113:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:113:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer
Test Plan: make check
Reviewers: sheki, leveldb
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D10461
Summary:
Adds the --writes_per_second rate limit for the readwhilewriting test.
The purpose is to optionally avoid saturating storage with writes & compaction
and test read response time when some writes are being done.
Changes the histogram code to also print the p99.99 value
Test Plan:
make check, ran db_bench with it
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10305
Summary: forgot to include signal_test.cc
Test Plan: make check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10281
Summary:
This diff provides the ability to print out a stacktrace when the process receives certain signals.
Currently, we enable this for the following signals (program error related):
SIGILL SIGSEGV SIGBUS SIGABRT
An application simply includes "util/stack_trace.h" and calls leveldb::InstallStackTraceHandler() during initialization if a signal handler is needed. This is not done automatically when opening the db, because it is the application's (process's) responsibility to install signal handlers, and some applications might already have their own (like fbcode).
Sample output:
Received signal 11 (Segmentation fault)
#0 0x408ff0 ./signal_test() [0x408ff0] /home/haobo/rocksdb/util/signal_test.cc:4
#1 0x40827d ./signal_test() [0x40827d] /home/haobo/rocksdb/util/signal_test.cc:24
#2 0x7f8bb183172e /usr/local/fbcode/gcc-4.7.1-glibc-2.14.1/lib/libc.so.6(__libc_start_main+0x10e) [0x7f8bb183172e] ??:0
#3 0x408ebc ./signal_test() [0x408ebc] /home/engshare/third-party/src/glibc/glibc-2.14.1/glibc-2.14.1/csu/../sysdeps/x86_64/elf/start.S:113
Segmentation fault (core dumped)
For each frame, we print the raw pointer, the symbol provided by backtrace_symbols (still not good enough), and the source file/line. Note that address translation is done by shelling out directly to addr2line. ??:0 means addr2line failed to do the translation. Hacky, but I think it's good for now.
Test Plan: signal_test.cc
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10173
Summary: As title. Found out this when testing stack_trace.cc portability.
Test Plan: make check; manual test 'non-linux' build by forcing OS_LINUX2
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10263
Summary: Primarily a refactor. Introduced LDBTool interface to which customers can plug in their options and this will create their own version of ldb tool.
Test Plan: made ldb tool and tried it.
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D10191
Summary:
The background compaction threads never exit and therefore caused
memory leaks while running rocksdb tests. Changed the PosixEnv destructor to exit and join them, and changed the tests likewise.
The memory leaked has reduced from 320 bytes to 64 bytes in all the tests. The 64
bytes is relating to
pthread_exit, but still have to figure out why. The stack-trace right now with
table_test.cc = 64 bytes in 1 blocks are possibly lost in loss record 4 of 5
at 0x475D8C: malloc (jemalloc.c:914)
by 0x400D69E: _dl_map_object_deps (dl-deps.c:505)
by 0x4013393: dl_open_worker (dl-open.c:263)
by 0x400F015: _dl_catch_error (dl-error.c:178)
by 0x4013B2B: _dl_open (dl-open.c:569)
by 0x5D3E913: do_dlopen (dl-libc.c:86)
by 0x400F015: _dl_catch_error (dl-error.c:178)
by 0x5D3E9D6: __libc_dlopen_mode (dl-libc.c:47)
by 0x5048BF3: pthread_cancel_init (unwind-forcedunwind.c:53)
by 0x5048DC9: _Unwind_ForcedUnwind (unwind-forcedunwind.c:126)
by 0x5046D9F: __pthread_unwind (unwind.c:130)
by 0x50413A4: pthread_exit (pthreadP.h:289)
Test Plan: make all check
Reviewers: dhruba, sheki, haobo
Reviewed By: dhruba
CC: leveldb, chip
Differential Revision: https://reviews.facebook.net/D9573
Summary: as subject. This is causing problems in adsconv. Ideally, these flags should be set in open, but that is only supported in Linux kernel ≥2.6.23 and glibc ≥2.7.
Test Plan:
db_test
run db_test
Reviewers: dhruba, MarkCallaghan, haobo
Reviewed By: dhruba
CC: leveldb, chip
Differential Revision: https://reviews.facebook.net/D10089
Summary:
1. The stock LRUCache nukes itself whenever the working set (the total number of entries not released by client at a certain time) is bigger than the cache capacity.
See https://our.dev.facebook.com/intern/tasks/?t=2252281
2. There's a bug in shard calculation leading to segmentation fault when only one shard is needed.
Test Plan: make check
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb, zshao, sheki
Differential Revision: https://reviews.facebook.net/D9927
Summary:
1. SetBackgroundThreads was not thread safe
2. queue_size_ does not seem necessary
3. moved condition signal after shared state change. Even though the original
order is in practice ok (because the mutex is still held), it looks fishy
and non-intuitive.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D9825
Summary:
Use non mmapd files for Write-Ahead log.
The earlier use of mmapped files made the log iterator read ahead and miss records.
Now the reader and writer will point to the same physical location.
There is no perf regression :
./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
With this diff:
fillseq : 10.756 micros/op 185281 ops/sec; 20.5 MB/s
Without this diff:
fillseq : 11.085 micros/op 179676 ops/sec; 19.9 MB/s
Test Plan: unit test included
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9741
Summary:
Earlier Statistics object was a raw pointer. This meant the user had to clear up
the Statistics object after creating the database. In most use cases the database is created in a function and the statistics pointer is out of scope. Hence the statistics object would never be deleted.
Now Using a shared_ptr to manage this.
Want this in before the next release.
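A small sketch of the ownership change, using hypothetical stand-in types (StatsSketch, OptionsSketch and DbSketch are not the real classes). It shows the case described above: the database is opened inside a function, the local options go out of scope, and the statistics object is still released once the database is destroyed.

```cpp
#include <cassert>
#include <memory>

// Hypothetical statistics object that counts live instances so the
// lifetime is observable.
struct StatsSketch {
  static int live;
  StatsSketch() { ++live; }
  ~StatsSketch() {
    assert(live > 0);
    --live;
  }
  StatsSketch(const StatsSketch&) = delete;
  StatsSketch& operator=(const StatsSketch&) = delete;
};
int StatsSketch::live = 0;

struct OptionsSketch {
  std::shared_ptr<StatsSketch> statistics;  // was a raw pointer before
};

struct DbSketch {
  explicit DbSketch(const OptionsSketch& o) : options(o) {}
  OptionsSketch options;  // the copy keeps the statistics object alive
};

// The local OptionsSketch is destroyed on return, but the database still
// shares ownership of the statistics object.
inline std::unique_ptr<DbSketch> OpenWithStats() {
  OptionsSketch options;
  options.statistics = std::make_shared<StatsSketch>();
  return std::unique_ptr<DbSketch>(new DbSketch(options));
}
```

With a raw pointer, the equivalent code would leak the statistics object unless the caller kept the pointer around just to delete it.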
Test Plan: make all check.
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9735
Summary: This caused compilation problems on some gcc platforms during the third-party release
Test Plan: make
Reviewers: sheki
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9627
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.
The default setting remains the same (and is backward compatible):
1. use bufferedio
2. do not use mmaps for reads
3. use mmap for writes
4. use readaheads for reads needed for compaction
I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.
Test Plan: make check
Reviewers: sheki, heyongqiang, MarkCallaghan
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9429
Summary: Getting rid of boost in our github codebase which caused problems on third-party
Test Plan: make ldb; python tools/ldb_test.py
Reviewers: sheki, dhruba
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9543
Summary: Was causing error(warning) in third-party saying unused result
Test Plan: make
Reviewers: sheki, dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D9447
Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base
Test Plan: make
Reviewers: chip, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9531
Summary: The negation of the condition currently checked is what actually had to be checked
Test Plan: make ldb; python ldb_test.py
Reviewers: sheki, dhruba
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9459
Summary: boost functions cause complications while deploying to third-party
Test Plan: make
Reviewers: sheki, dhruba
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D9441
Summary:
Ftruncate does not throw an error on disk-full. This causes Sig-bus in
the case where the database tries to issue a Put call on a full-disk.
Use posix_fallocate for allocation instead of truncate.
Add a check to use MMaped files only on ext4, xfs and tmpfs, as
posix_fallocate is very slow on ext3 and older.
Test Plan: make all check
Reviewers: dhruba, chip
Reviewed By: dhruba
CC: adsharma, leveldb
Differential Revision: https://reviews.facebook.net/D9291
Summary: Fix for memory leaks in rocksdb tests. Also modified the variable NUM_FAILED_TESTS to print the actual number of failed tests.
Test Plan: make <test>; valgrind --leak-check=full ./<test>
Reviewers: sheki, dhruba
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9333
Summary:
1. Create only 2 levels so that manual compactions are fast.
2. Set target file size to a large value
Test Plan: make clean check
Reviewers: kailiu, zshao
Reviewed By: zshao
CC: leveldb
Differential Revision: https://reviews.facebook.net/D9231
Summary:
Add a shortcut function to make it easier for people
to efficiently bulk_load data into RocksDB.
Test Plan:
Tried ldb with "--bulk_load" and "--bulk_load --compact" and verified the outcome.
Needs to consult the team on how to test this automatically.
Reviewers: sheki, dhruba, emayanke, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8907
Summary:
This adds the rate_limit_delay_milliseconds option to make the delay
configurable in MakeRoomForWrite when the max compaction score is too high.
This delay is called the Ln slowdown. This change also counts the Ln slowdown
per level to make it possible to see where the stalls occur.
From IO-bound performance testing, the Level N stalls occur:
* with compression -> at the largest uncompressed level. This makes sense
because compaction for compressed levels is much
slower. When Lx is uncompressed and Lx+1 is compressed
then files pile up at Lx because the (Lx,Lx+1)->Lx+1
compaction process is the first to be slowed by
compression.
* without compression -> at level 1
Task ID: #1832108
Test Plan:
run with real data, added test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D9045
Summary:
Rocks accumulates recent writes and deletes in the in-memory memtable.
When the memtable is full, it writes the contents on the memtable to
a file in L0.
This patch removes redundant records at the time of the flush. If there
are multiple versions of the same key in the memtable, then only the
most recent one is dumped into the output file. The purging of
redundant records occurs only if the most recent snapshot is earlier
than the earliest record in the memtable.
Should we switch on this feature by default or should we keep this feature
turned off in the default settings?
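A sketch of the purge rule described above, under stated assumptions. Entry and EntriesToFlush are hypothetical names, the input is assumed to be in memtable order (key ascending, newest sequence number first within a key), and purging when no snapshot exists at all is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct Entry {
  std::string key;
  uint64_t seqno;
  std::string value;
};

// Returns the records that would be written to the L0 file.
// has_snapshot / newest_snapshot_seqno describe the most recent snapshot.
inline std::vector<Entry> EntriesToFlush(const std::vector<Entry>& entries,
                                         bool has_snapshot,
                                         uint64_t newest_snapshot_seqno) {
  if (entries.empty()) return entries;

  uint64_t earliest = entries[0].seqno;
  for (const Entry& e : entries) {
    if (e.seqno < earliest) earliest = e.seqno;
  }
  // A snapshot at or after the earliest record may still need an older
  // version of some key, so nothing is purged in that case.
  if (has_snapshot && newest_snapshot_seqno >= earliest) return entries;

  std::vector<Entry> out;
  for (const Entry& e : entries) {
    if (!out.empty() && out.back().key == e.key) {
      assert(out.back().seqno > e.seqno);  // input must be newest-first
      continue;                            // older version: drop it
    }
    out.push_back(e);  // most recent version of this key
  }
  return out;
}
```

A deletion marker that is the newest record for its key is kept like any other entry; only the versions it shadows are dropped.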
Test Plan: Added test case to db_test.cc
Reviewers: sheki, vamsi, emayanke, heyongqiang
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8991
Summary: LDB tool to print the deleted/put keys in hex in the wal file.
Test Plan: run ldb on a db to check if output was satisfactory
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8691
Summary:
Changed Get and Scan in OpenForReadOnly mode to have access to the memtable.
Changed the visibility of NewInternalIterator in db_impl from private to protected so that
the derived class db_impl_read_only can call it in its NewIterator function for the
scan case. The previous approach, which changed the default for flush_on_destroy_ from false to true,
caused many problems in the unit tests due to the empty sst files that it created. All
unit tests pass now.
Test Plan: make clean; make all check; ldb put and get and scans
Reviewers: dhruba, heyongqiang, sheki
Reviewed By: dhruba
CC: kosievdmerwe, zshao, dilipj, kailiu
Differential Revision: https://reviews.facebook.net/D8697
Summary:
* Introduce histograms in statistics.h.
* Add a stopwatch to measure time.
* Introduce two timers as a proof of concept.
Replaced NULL with nullptr to fix some lint errors.
Should be useful for google.
Test Plan:
ran db_bench and checked stats.
make all check
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8637
Summary:
I missed InitTestDb() in one of my tests. InitTestDb() initializes the test directory, without which the test throws an IO error.
This problem didn't occur before because I had already run the tests, so the test directory was already there.
Test Plan:
Reviewers: dhruba
Summary:
flush_on_destroy has a default value of false, and the memtable is flushed
in the DBImpl destructor only when it is set to true. Because we want the memtable to be flushed every time
the destructor is called (the db is closed), and the cases where we work with the memtable only are very few,
it is a good idea to give this a default value of true. Thus a Put from ldb
will have its data flushed to disk in the destructor, and the next Get will be able to
read it when the db is opened with OpenForReadOnly. The reason that ldb could read the latest value when
the db was opened in the normal Open mode is that a Get in normal Open mode first reads
the memtable and directly finds the latest value written there, whereas a Get in OpenForReadOnly mode
doesn't have access to the memtable (which is correct because all its Put/Modify operations are disabled).
Test Plan: make all; ldb put and get and scans
Reviewers: dhruba, heyongqiang, sheki
Reviewed By: heyongqiang
CC: kosievdmerwe, zshao, dilipj, kailiu
Differential Revision: https://reviews.facebook.net/D8631
Summary: Fix the warning [-Werror=format-security] and [-Werror=unused-result].
Test Plan:
enforced -Werror and ran make
Task ID: 2101673
Reviewers: heyongqiang
Differential Revision: https://reviews.facebook.net/D8553
Summary:
$SUBJECT -- cosmetic fix for histograms, print P75/P99, and
make sure zlib is enabled for our command line tools.
Test Plan: compile, test db_bench with --compression_type=zlib
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: adsharma, leveldb
Differential Revision: https://reviews.facebook.net/D8445
Summary:
* Add a SplitByTTLLogger to enable this feature. In this diff I implemented a generalized AutoSplitLoggerBase class to simplify the
development of such classes.
* Refactor the existing AutoSplitLogger and fix several bugs.
Test Plan:
* Added unit tests for the different types of "auto splittable" loggers individually.
* Tested the composite logger, which allows the log files to be split by both TTL and log size.
Reviewers: heyongqiang, dhruba
Reviewed By: heyongqiang
CC: zshao, leveldb
Differential Revision: https://reviews.facebook.net/D8037
Summary:
The existing code did not initialize a few doubles in histogram.cc.
Cropped up when I wrote a unit-test.
Test Plan: make all check
Reviewers: chip
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8319
Summary:
Earlier way to record in the histogram:
Linearly search the BucketLimit array to find the bucket and increment the
counter.
Current way to record in the histogram:
Store a static HistMap which maps each value in the
range [kFirstValue, kLastValue) to its bucket.
In the process, use vectors instead of arrays and refactor some code into a
HistogramHelper class.
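A rough sketch of the two lookup strategies described above. The bucket limits and the values of `kFirstValue` and `kLastValue` here are illustrative only, and `BucketLinear`, `BuildHistMap`, and `BucketMapped` are hypothetical names rather than the functions in histogram.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative bucket upper bounds; the real limits in histogram.cc differ.
static const std::vector<uint64_t> kBucketLimits = {1, 2, 4, 8, 16, 32, 64};
static const uint64_t kFirstValue = 1;   // assumed table start (inclusive)
static const uint64_t kLastValue = 64;   // assumed table end (exclusive)

// Earlier approach: scan the limits until one exceeds the value.
std::size_t BucketLinear(uint64_t value) {
  std::size_t b = 0;
  while (b + 1 < kBucketLimits.size() && kBucketLimits[b] <= value) ++b;
  return b;
}

// Precompute the bucket of every value in [kFirstValue, kLastValue).
std::vector<std::size_t> BuildHistMap() {
  std::vector<std::size_t> map(kLastValue - kFirstValue);
  for (uint64_t v = kFirstValue; v < kLastValue; ++v) {
    map[v - kFirstValue] = BucketLinear(v);
  }
  return map;
}

// Current approach: one table lookup for values inside the range,
// falling back to the linear scan for values outside it.
std::size_t BucketMapped(uint64_t value) {
  static const std::vector<std::size_t> kHistMap = BuildHistMap();
  if (value >= kFirstValue && value < kLastValue) {
    return kHistMap[value - kFirstValue];
  }
  return BucketLinear(value);
}
```

The table trades a small amount of static memory for a constant-time lookup on the common path; both functions return the same bucket for every value.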
Test Plan:
run db_bench with histogram=1 and see a histogram being
printed.
Reviewers: dhruba, chip, heyongqiang
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8265
Summary:
Added a function to `RandomAccessFile` to generate a unique ID for that file. Currently only `PosixRandomAccessFile` has this behaviour implemented and only on Linux.
Changed how key is generated in `Table::BlockReader`.
Added tests to check whether the unique ID is stable, unique and not a prefix of another unique ID. Added tests to see that `Table` uses the cache more efficiently.
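One way a cache key could be derived from a file's unique ID is sketched below. `MakeBlockCacheKey` is a hypothetical helper and the fixed 8-byte little-endian offset suffix is an assumed encoding; neither is taken from `Table::BlockReader`.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not the actual Table::BlockReader code.
// Builds a block-cache key from a per-file unique ID and a block offset.
std::string MakeBlockCacheKey(const std::string& file_unique_id,
                              uint64_t block_offset) {
  std::string key = file_unique_id;
  // Append the offset as 8 little-endian bytes (an assumed encoding).
  for (int i = 0; i < 8; ++i) {
    key.push_back(static_cast<char>((block_offset >> (8 * i)) & 0xff));
  }
  return key;
}
```

Because the key depends only on the file's ID and the block offset, two readers opened on the same file would produce the same key for the same block, which is presumably what lets the cache be used more efficiently.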
Test Plan: make check
Reviewers: chip, vamsi, dhruba
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8145
Summary: fallocate is Linux-only, so let's protect it with #ifdefs
Test Plan: make
Reviewers: sheki, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8223
Summary:
Previously, if you opened a db with num_levels set lower than
the number of levels in the database, you received the unhelpful
message "Corruption: VersionEdit: new-file entry." Now you get a
more verbose message describing the issue.
Also, fix handling of compression_levels (both the run-over-the-end
issue and the memory management of it).
Lastly, unique_ptr'ify a couple of minor calls.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D8151
Summary:
We continually rebuilt build_version.c because we put the
current date into it, but that's what __DATE__ already is. This makes
builds faster.
This also fixes an issue with 'make clean FOO' not working properly.
Also tweak the build rules to be more consistent, always have warnings,
and add a 'make release' rule to handle flags for release builds.
Test Plan: make, make clean
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D8139
Summary:
On some filesystems, pre-allocation can be a considerable
amount of space. xfs in our production environment pre-allocates by
1GB, for instance. By using fallocate to inform the kernel of our
expected file sizes, we eliminate this waste (which isn't recovered
until the file is closed which, in the case of LOG files, can be a
considerable amount of time).
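A minimal sketch of hinting the expected size to the kernel, guarded so that it still compiles off Linux. `PreallocateHint` is a hypothetical helper, and the use of `FALLOC_FL_KEEP_SIZE` (reserve space without changing the visible file size) is an assumption about how the hint would be given.

```cpp
#include <fcntl.h>
#include <sys/types.h>

// Hypothetical helper: reserves space for [offset, offset + len) so the
// filesystem does not apply its own, possibly much larger, preallocation.
// Returns true on success; false on failure or on non-Linux platforms.
bool PreallocateHint(int fd, off_t offset, off_t len) {
#ifdef __linux__
  // FALLOC_FL_KEEP_SIZE leaves the reported file size unchanged.
  return fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) == 0;
#else
  (void)fd;
  (void)offset;
  (void)len;
  return false;
#endif
}
```

A caller can treat a false return as "no hint applied" and carry on writing normally, since the hint only affects space accounting.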
Test Plan:
created an xfs loopback filesystem, mounted with
allocsize=4M, and ran db_stress. LOG file without this change was 4M,
and with it it was 128k then grew to normal size.
Reviewers: dhruba
Reviewed By: dhruba
CC: adsharma, leveldb
Differential Revision: https://reviews.facebook.net/D7953
Summary:
Replace manual memory management with std::unique_ptr in a
number of places; not exhaustive, but this fixes a few leaks with file
handles as well as clarifies semantics of the ownership of file handles
with log classes.
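The ownership idea can be illustrated with a small sketch. `FileCloser`, `UniqueFile`, and `LogWriterSketch` are hypothetical names, and a C `FILE*` stands in for the file classes the diff touches.

```cpp
#include <cstdio>
#include <memory>
#include <utility>

// Deleter that closes the handle when the owning unique_ptr goes away.
struct FileCloser {
  void operator()(std::FILE* f) const {
    if (f != nullptr) std::fclose(f);
  }
};
using UniqueFile = std::unique_ptr<std::FILE, FileCloser>;

// Hypothetical log writer: taking a UniqueFile by value makes it explicit
// that the writer owns the handle and will close it on destruction.
class LogWriterSketch {
 public:
  explicit LogWriterSketch(UniqueFile dest) : dest_(std::move(dest)) {}

  // Returns false if there is no file or the write fails.
  bool Append(const char* line) {
    return dest_ != nullptr && std::fputs(line, dest_.get()) >= 0;
  }

 private:
  UniqueFile dest_;
};
```

With this shape there is no code path that can forget to close the file, and the constructor signature documents who owns it.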
Test Plan: db_stress, make check
Reviewers: dhruba
Reviewed By: dhruba
CC: zshao, leveldb, heyongqiang
Differential Revision: https://reviews.facebook.net/D8043
Summary:
Check in LogAndApply if the file size is more than the limit set in
Options.
Things to consider : will this be expensive?
Test Plan: make all check. Inputs on a new unit test?
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7701
Summary:
clang is an alternate compiler based on llvm. It produces
nicer error messages and finds some bugs that gcc doesn't, such as the
size_t change in this file (which caused some write return values to be
misinterpreted!)
Clang isn't the default; to try it, do "USE_CLANG=1 make" or "export
USE_CLANG=1" then make as normal
Test Plan: "make check" and "USE_CLANG=1 make check"
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D7899
Summary: `~ShardedLRUCache()` was empty despite `init()` allocating memory on the heap. Fixed the leak by freeing memory allocated by `init()`.
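The shape of the fix can be sketched as below. `ShardSketch` and `ShardedCacheSketch` are hypothetical stand-ins; the live-instance counter exists only to make a leak observable and is not part of the real cache.

```cpp
#include <cstddef>

// Hypothetical shard type that counts live instances so leaks are visible.
struct ShardSketch {
  static int live;
  ShardSketch() { ++live; }
  ~ShardSketch() { --live; }
};
int ShardSketch::live = 0;

class ShardedCacheSketch {
 public:
  explicit ShardedCacheSketch(std::size_t num_shards) { init(num_shards); }

  // The destructor releases what init() allocated; leaving it empty
  // would leak the shard array.
  ~ShardedCacheSketch() { delete[] shards_; }

  ShardedCacheSketch(const ShardedCacheSketch&) = delete;
  ShardedCacheSketch& operator=(const ShardedCacheSketch&) = delete;

 private:
  void init(std::size_t n) { shards_ = new ShardSketch[n]; }

  ShardSketch* shards_ = nullptr;
};
```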
Test Plan:
make check
Ran valgrind on db_test before and after patch and saw leaked memory went down
Reviewers: vamsi, dhruba, emayanke, sheki
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7791
Summary:
Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes an error.
Fixed CreateDirIfMissing() and added Env.DirExists()
Test Plan:
make check to test for regressions
Ran the following to test if the error message is not about lock files not existing
./db_bench --db=dir/testdb
After creating a file "testdb", ran the following to see if it failed with sane error message:
./db_bench --db=testdb
Reviewers: dhruba, emayanke, vamsi, sheki
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7707
Summary:
There is a compilation error while using gcc 4.7.1.
util/ldb_cmd.cc:381:3: error: ‘leveldb::ReadOptions::ReadOptions’ names the constructor, not the type
util/ldb_cmd.cc:381:37: error: expected ‘;’ before ‘read_options’
util/ldb_cmd.cc:381:49: error: statement cannot resolve address of overloaded function
Test Plan: make clean check
Reviewers: sheki, emayanke, zshao
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7659
Summary: The queries will come from stdin. One key per line. The output will be in stdout, in the format of "<key> ==> <value>" if found, or "<key>" if not found. "--hex" uses HEX-encoded keys and values in both input and output.
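The input and output format described above can be sketched as a small loop. This is not the ldb implementation: a `std::map` stands in for the database, `RunQueries` and `RunQueriesOnString` are hypothetical names, and "--hex" handling is left out.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// std::map stands in for the database in this sketch.
using FakeDb = std::map<std::string, std::string>;

// Reads one key per line and prints "<key> ==> <value>" when the key is
// found, or just "<key>" when it is not.
void RunQueries(std::istream& in, std::ostream& out, const FakeDb& db) {
  std::string key;
  while (std::getline(in, key)) {
    auto it = db.find(key);
    if (it != db.end()) {
      out << key << " ==> " << it->second << "\n";
    } else {
      out << key << "\n";
    }
  }
}

// Convenience wrapper for trying the loop on an in-memory string.
std::string RunQueriesOnString(const std::string& input, const FakeDb& db) {
  std::istringstream in(input);
  std::ostringstream out;
  RunQueries(in, out, db);
  return out.str();
}
```

Passing `std::cin` and `std::cout` to `RunQueries` gives the stdin-to-stdout behaviour the summary describes.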
Test Plan: ldb query --db=leveldb_db --hex
Reviewers: dhruba, emayanke, sheki
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7617
Summary: We were ignoring additional chars at the end of an arg. This can create confusion, e.g. --disable_wal=0 will act the same as --disable_wal without any warnings.
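The difference between the old and new matching can be illustrated with two small predicates. Both function names are hypothetical, and the real ldb argument parser may be structured differently.

```cpp
#include <string>

// Lenient matching as described for the old behaviour: any argument that
// merely starts with the flag is accepted, so trailing characters are ignored.
bool MatchesByPrefix(const std::string& arg, const std::string& flag) {
  return arg.compare(0, flag.size(), flag) == 0;
}

// Strict matching: the argument must be exactly the flag.
bool MatchesExactly(const std::string& arg, const std::string& flag) {
  return arg == flag;
}
```

With strict matching, "--statsAAA" no longer matches "--stats" and can be reported as an unknown argument.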
Test Plan:
Tried this:
[zshao@dev485 ~/git/rocksdb] ./ldb dump --statsAAA
Failed: Unknown argument:--statsAAA
Reviewers: dhruba, sheki, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7635
Summary:
This allows ldb to control the write_buffer_size (which determines the L0 file size) and file_size (which determines the L1 file size). Since the target_file_size_ratio is 1 by default, all other levels will also have the same file size as L1.
As part of the diff, I also cleaned up some unused code and help messages.
Test Plan: ./ldb load --db=/data/users/zshao/test_leveldb --file_size=64000000 --write_buffer_size=32000000 --create_if_missing --input_hex --disable_wal
Reviewers: dhruba, sheki, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7569
Summary: This command accepts key-value pairs from stdin in the same format as the "ldb dump" command. This allows us to try out different compression algorithms/block sizes easily.
Test Plan: dump, load, dump, verify the data is the same.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7443
Summary: The old code was omitting the 0 if the char is less than 16.
Test Plan:
Tried the following program:
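#include <stdio.h>
int main() {
  unsigned char c = 1;
  printf("%X\n", c);    /* no padding: a value below 16 prints one digit */
  printf("%02X\n", c);  /* zero-padded to two hex digits */
  return 0;
}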
The output is:
1
01
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7437
Summary: This allows us to use ldb to do more experiments like block_size changes.
Test Plan: run it by hand.
Reviewers: dhruba, sheki, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7431
Summary:
Suppose you submit 100 background tasks one after another. The first
enqueued task finds that the queue is empty and wakes up one worker thread.
Now suppose that the remaining 99 work items are enqueued; they do not
wake up any worker threads because the queue is already non-empty.
This causes a situation where there are 99 tasks in the task queue but
only one worker thread is processing a task while the remaining
worker threads are waiting.
The fix is to always wake up one worker thread when enqueuing a task.
I also added a check to count the number of elements in the queue
to help in debugging.
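The fix can be sketched with a small standard-library task queue. `TaskQueueSketch` is a hypothetical class built on `std::thread` and `std::condition_variable`; the actual thread pool in the Env implementation is structured differently.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class TaskQueueSketch {
 public:
  explicit TaskQueueSketch(int num_workers) {
    for (int i = 0; i < num_workers; ++i) {
      workers_.emplace_back([this] { Run(); });
    }
  }

  // Lets workers drain the queue, then joins them.
  ~TaskQueueSketch() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_all();
    for (std::thread& t : workers_) t.join();
  }

  void Schedule(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push_back(std::move(task));
    }
    // Always wake one worker, even if the queue was already non-empty.
    cv_.notify_one();
  }

  // Number of queued (not yet started) tasks, for debugging.
  std::size_t QueueLength() {
    std::lock_guard<std::mutex> lock(mu_);
    return queue_.size();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (queue_.empty()) return;  // shutting down and fully drained
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool done_ = false;
  std::vector<std::thread> workers_;
};
```

Signalling only when the queue goes from empty to non-empty would reproduce the stall described above, because later enqueues would never wake the idle workers.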
Test Plan: make clean check.
Reviewers: chip
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7203
Summary:
Added the following two options:
[--bloom_bits=<int,e.g.:14>]
[--compression_type=<no|snappy|zlib|bzip2>]
These options will be used when ldb opens the leveldb database.
Test Plan: Tried by hand for both success and failure cases. We do need a test framework.
Reviewers: dhruba, emayanke, sheki
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7197