Commit Graph

384 Commits

Author SHA1 Message Date
Dhruba Borthakur
47c4191fe8 Reduce write amplification by merging files in L0 back into L0
Summary:
There is a new option called hybrid_mode which, when switched on,
causes HBase style compactions.  Files from L0 are
compacted back into L0. This meat of this compaction algorithm
is in PickCompactionHybrid().

All files reside in L0. That means all files have overlapping
keys. Each file has a time-bound, i.e. each file contains a
range of keys that were inserted around the same time. The
start-seqno and the end-seqno refers to the timeframe when
these keys were inserted.  Files that have contiguous seqno
are compacted together into a larger file. All files are
ordered from most recent to the oldest.

The current compaction algorithm starts to look for
candidate files starting from the most recent file. It continues to
add more files to the same compaction run as long as the
sum of the files chosen till now is smaller than the next
candidate file size. This logic needs to be debated
and validated.

The above logic should reduce write amplification to a
large extent... will publish numbers shortly.

Test Plan: dbstress runs for 6 hours with no data corruption (tested so far).

Differential Revision: https://reviews.facebook.net/D11289
2013-06-30 20:07:04 -07:00
Mayank Agarwal
b858da709a Simplify bucketing logic in ldb-ttl
Summary: [start_time, end_time) is waht I'm following for the buckets and the whole time-range. Also cleaned up some code in db_ttl.* Not correcting the spacing/indenting convention for util/ldb_cmd.cc in this diff.

Test Plan: python ldb_test.py, make ttl_test, Run mcrocksdb-backup tool, Run the ldb tool on 2 mcrocksdb production backups form sigmafio033.prn1

Reviewers: vamsi, haobo

Reviewed By: vamsi

Differential Revision: https://reviews.facebook.net/D11433
2013-06-21 09:49:24 -07:00
Mayank Agarwal
61f1baaedf Introducing timeranged scan, timeranged dump in ldb. Also the ability to count in time-batches during Dump
Summary:
Scan and Dump commands in ldb use iterator. We need to also print timestamp for ttl databases for debugging. For this I create a TtlIterator class pointer in these functions and assign it the value of Iterator pointer which actually points to t TtlIterator object, and access the new function ValueWithTS which can return TS also. Buckets feature for dump command: gives a count of different key-values in the specified time-range distributed across the time-range partitioned according to bucket-size. start_time and end_time are specified in unixtimestamp and bucket in seconds on the user-commandline
Have commented out 3 ines from ldb_test.py so that the test does not break right now. It breaks because timestamp is also printed now and I have to look at wildcards in python to compare properly.

Test Plan: python tools/ldb_test.py

Reviewers: vamsi, dhruba, haobo, sheki

Reviewed By: vamsi

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11403
2013-06-19 18:45:13 -07:00
Haobo Xu
96be2c4ee0 [RocksDB] Add mmap_read option for db_stress
Summary: as title, also removed an incorrect assertion

Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11367
2013-06-19 10:28:32 -07:00
Abhishek Kona
5ef6bb8c37 [rocksdb][refactor] statistic printing code to one place
Summary: $title

Test Plan: db_bench --statistics=1

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11373
2013-06-18 20:28:41 -07:00
Haobo Xu
3cc1af2062 [RocksDB] Option for incremental sync
Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise.

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11295
2013-06-18 15:00:32 -07:00
Dhruba Borthakur
6acbe0fc45 Compact multiple memtables before flushing to storage.
Summary:
Merge multiple multiple memtables in memory before writing it
out to a file in L0.

There is a new config parameter min_write_buffer_number_to_merge
that specifies the number of write buffers that should be merged
together to a single file in storage. The system will not flush
wrte buffers to storage unless at least these many buffers have
accumulated in memory.
The default value of this new parameter is 1, which means that
a write buffer will be immediately flushed to disk as soon it is
ready.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D11241
2013-06-18 14:28:04 -07:00
Abhishek Kona
00124683de [rocksdb] do not trim range for level0 in manual compaction
Summary:
https://code.google.com/p/leveldb/issues/detail?can=1&q=178&colspec=ID%20Type%20Status%20Priority%20Milestone%20Owner%20Summary&id=178

Ported the solution as is to RocksDB.

Test Plan: moved the unit test as manual_compaction_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11331
2013-06-17 13:58:17 -07:00
Abhishek Kona
bff718d81c [Rocksdb] Implement filluniquerandom
Summary:
Use a bit set to keep track of which random number is generated.
        Currently only supports single-threaded. All our perf tests are run with threads=1
        Copied over bitset implementation from common/datastructures

Test Plan: printed the generated keys, and verified all keys were present.

Reviewers: MarkCallaghan, haobo, dhruba

Reviewed By: MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11247
2013-06-14 16:17:56 -07:00
Haobo Xu
778e179046 [RocksDB] Sync file to disk incrementally
Summary:
During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
This diff simply asks the OS to sync data incrementally as they are written, on the background. The hope is that, at the final sync, most of the data are already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer number of compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.

Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
For the write only test 800bytes test:
Before the change:  await  occillate between 10ms and 3ms
After the change: await ranges 1-3ms

Will test against read-modify-write workload too, see if high read latency P99 could be resolved.

Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11115
2013-06-12 12:53:59 -07:00
Haobo Xu
bdf1085944 [RocksDB] cleanup EnvOptions
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite

Test Plan: make check; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11175
2013-06-12 11:17:19 -07:00
Haobo Xu
043573b24f [RocksDB] Include 64bit random number generator
Summary: As title.

Test Plan: make check;

Reviewers: chip, MarkCallaghan

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11061
2013-06-04 13:52:27 -07:00
Haobo Xu
d897d33bf1 [RocksDB] Introduce Fast Mutex option
Summary:
This diff adds an option to specify whether PTHREAD_MUTEX_ADAPTIVE_NP will be enabled for the rocksdb single big kernel lock. db_bench also have this option now.
Quickly tested 8 thread cpu bound 100 byte random read.
No fast mutex: ~750k/s ops
With fast mutex: ~880k/s ops

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D11031
2013-06-01 23:11:34 -07:00
Haobo Xu
ab8d2f6ab2 [RocksDB] [Performance] Allow different posix advice to be applied to the same table file
Summary:
Current posix advice implementation ties up the access pattern hint with the creation of a file.
It is not possible to apply different advice for different access (random get vs compaction read),
without keeping two open files for the same table. This patch extended the RandomeAccessFile interface
to accept new access hint at anytime. Particularly, we are able to set different access hint on the same
table file based on when/how the file is used.
Two options are added to set the access hint, after the file is first opened and after the file is being
compacted.

Test Plan: make check; db_stress; db_bench

Reviewers: dhruba

Reviewed By: dhruba

CC: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D10905
2013-05-30 19:08:44 -07:00
heyongqiang
4c47d8f345 add block deviation option to terminate a block before it exceeds block_size
Summary: a new option block_size_deviation is added.

Test Plan: run db_test and db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

Differential Revision: https://reviews.facebook.net/D10821
2013-05-24 16:21:52 -07:00
heyongqiang
4b29651206 add block deviation option to terminate a block before it exceeds block_size
Summary: a new option block_size_deviation is added.

Test Plan: run db_test and db_bench

Reviewers: dhruba, haobo

Reviewed By: haobo

Differential Revision: https://reviews.facebook.net/D10821
2013-05-24 15:52:49 -07:00
Haobo Xu
0e879c93de [RocksDB] dump leveldb.stats periodically in LOG file.
Summary:
Added an option stats_dump_period_sec to dump leveldb.stats to LOG periodically for diagnosis.
By defauly, it's set to a very big number 3600 (1 hour).

Test Plan: make check;

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10761
2013-05-23 16:56:59 -07:00
Dhruba Borthakur
898f79372f Fix valgrind errors introduced by https://reviews.facebook.net/D10863
Summary:
The valgrind errors were in the unit tests where we change the
number of levels of a database using internal methods.

Test Plan:
valgrind ./reduce_levels_test
valgrind ./db_test

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10893
2013-05-23 12:14:41 -07:00
Vamsi Ponnekanti
760dd4750f [Kill randomly at various points in source code for testing]
Summary:
This is initial version. A few ways in which this could
be extended in the future are:
(a) Killing from more places in source code
(b) Hashing stack and using that hash in determining whether to crash.
    This is to avoid crashing more often at source lines that are executed
    more often.
(c) Raising exceptions or returning errors instead of killing

Test Plan:
This whole thing is for testing.

Here is part of output:

python2.7 tools/db_crashtest2.py -d 600
Running db_stress

db_stress retncode -15 output LevelDB version     : 1.5
Number of threads   : 32
Ops per thread      : 10000000
Read percentage     : 50
Write-buffer-size   : 4194304
Delete percentage   : 30
Max key             : 1000
Ratio #ops/#keys    : 320000
Num times DB reopens: 0
Batches/snapshots   : 1
Purge redundant %   : 50
Num keys per lock   : 4
Compression         : snappy
------------------------------------------------
No lock creation because test_batches_snapshots set
2013/04/26-17:55:17  Starting database operations
Created bg thread 0x7fc1f07ff700
... finished 60000 ops
Running db_stress

db_stress retncode -15 output LevelDB version     : 1.5
Number of threads   : 32
Ops per thread      : 10000000
Read percentage     : 50
Write-buffer-size   : 4194304
Delete percentage   : 30
Max key             : 1000
Ratio #ops/#keys    : 320000
Num times DB reopens: 0
Batches/snapshots   : 1
Purge redundant %   : 50
Num keys per lock   : 4
Compression         : snappy
------------------------------------------------
Created bg thread 0x7ff0137ff700
No lock creation because test_batches_snapshots set
2013/04/26-17:56:15  Starting database operations
... finished 90000 ops

Revert Plan: OK

Task ID: #2252691

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D10581
2013-05-21 18:21:49 -07:00
Haobo Xu
87d0af15d8 [RocksDB] Introduce an option to skip log error on recovery
Summary:
Currently, with paranoid_check on, DB::Open will fail on any log read error on recovery.
If client is ok with losing most recent updates, we could simply skip those errors.
However, it's important to introduce an additional flag, so that paranoid_check can
still guard against more serious problems.

Test Plan: make check; db_stress

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb, emayanke

Differential Revision: https://reviews.facebook.net/D10869
2013-05-21 14:30:36 -07:00
Dhruba Borthakur
d1aaaf718c Ability to set different size fanout multipliers for every level.
Summary:
There is an existing field Options.max_bytes_for_level_multiplier that
sets the multiplier for the size of each level in the database.

This patch introduces the ability to set different multipliers
for every level in the database. The size of a level is determined
by using both max_bytes_for_level_multiplier as well as the
per-level fanout.

size of level[i] = size of level[i-1] * max_bytes_for_level_multiplier
                   * fanout[i-1]

The default value of fanout is 1, so that it is backward compatible.

Test Plan: make check

Reviewers: haobo, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10863
2013-05-21 13:50:20 -07:00
Haobo Xu
839f6db77e [RocksDB] Fix PosixLogger and AutoRollLogger thread safety
Summary:
PosixLogger and AutoRollLogger do not seem to be thread safe.
For PosixLogger, log_size_ is not atomically updated.
For AutoRollLogger, the underlying logger_ might be deleted by
one thread while still being accessed by another.

Test Plan: make check

Reviewers: kailiu, dhruba, heyongqiang

Reviewed By: kailiu

CC: leveldb, zshao, sheki

Differential Revision: https://reviews.facebook.net/D9699
2013-05-21 11:39:44 -07:00
Abhishek Kona
e1174306c5 [RocksDB] Simplify StopWatch implementation
Summary:
Make stop watch a simple implementation, instead of subclass of a virtual class
Allocate stop watches off the stack instead of heap.
Code is more terse now.

Test Plan: make all check, db_bench with --statistics=1

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10809
2013-05-17 10:55:34 -07:00
Abhishek Kona
446151cd20 [Rocksdb] Remove unused double apis to record into histograms
Summary: Statistics.h and histogram.h had double based api's to record values. Remove them as they are not used anywhere

Test Plan: make all check

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10815
2013-05-16 10:40:30 -07:00
Mayank Agarwal
8a48410f09 Enhance the ldb tool to support ttl databases
Summary: ldb works with raw data from the database and needs to be aware of ttl-database to work with it meaningfully. '-ttl' option now tells it that. Also added onto the ldb_test.py test. This option may be specified alongwith put, get, scan or dump. There is no support to provide a ttl-value and it uses default forever because there is no use-case for this currently.

Test Plan: make ldb_test; python tools/ldb_test.py

Reviewers: dhruba, sheki, haobo, vamsi

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10797
2013-05-15 12:10:00 -07:00
Haobo Xu
4ca3c67bd3 [RocksDB] Cleanup compaction filter to use a class interface, instead of function pointer and additional context pointer.
Summary:
This diff replaces compaction_filter_args and CompactionFilter with a single compaction_filter parameter. It gives CompactionFilter better encapsulation and a similar look to Comparator and MergeOpertor, which improves consistency of the overall interface.
The change is not backward compatible. Nevertheless, the two references in fbcode are not in production yet.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D10773
2013-05-13 14:06:10 -07:00
Haobo Xu
05e8854085 [Rocksdb] Support Merge operation in rocksdb
Summary:
This diff introduces a new Merge operation into rocksdb.
The purpose of this review is mostly getting feedback from the team (everyone please) on the design.

Please focus on the four files under include/leveldb/, as they spell the client visible interface change.
include/leveldb/db.h
include/leveldb/merge_operator.h
include/leveldb/options.h
include/leveldb/write_batch.h

Please go over local/my_test.cc carefully, as it is a concerete use case.

Please also review the impelmentation files to see if the straw man implementation makes sense.

Note that, the diff does pass all make check and truly supports forward iterator over db and a version
of Get that's based on iterator.

Future work:
- Integration with compaction
- A raw Get implementation

I am working on a wiki that explains the design and implementation choices, but coding comes
just naturally and I think it might be a good idea to share the code earlier. The code is
heavily commented.

Test Plan: run all local tests

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb, zshao, sheki, emayanke, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D9651
2013-05-03 16:59:02 -07:00
Kai Liu
958b9c80e1 Avoid global static initialization in Env::Default()
Summary:
Mark's task description from #2316777

Env::Default() comes from util/env_posix.cc

This is a static global.

static PosixEnv default_env;

Env* Env::Default() {
  return &default_env;
}

-----

These globals assume default_env was initialized first. I don't think that is safe or correct to do (http://stackoverflow.com/questions/1005685/c-static-initialization-order)

const string AutoRollLoggerTest::kTestDir(
test::TmpDir() + "/db_log_test");
const string AutoRollLoggerTest::kLogFile(
test::TmpDir() + "/db_log_test/LOG");
Env* AutoRollLoggerTest::env = Env::Default();

Test Plan:
run make clean && make && make check
But how can I know if it works in Ubuntu?

Reviewers: MarkCallaghan, chip

Reviewed By: chip

CC: leveldb, dhruba, haobo

Differential Revision: https://reviews.facebook.net/D10491
2013-04-22 18:10:28 -07:00
Dhruba Borthakur
3cb7bf8170 Initialize parameters in the constructor.
Summary:
RocksDB doesn't build on Ubuntu VM .. shoudl be fixed with this patch.

g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

util/env_posix.cc:68:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:68:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer’
util/env_posix.cc:113:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:113:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer

Test Plan: make check

Reviewers: sheki, leveldb

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D10461
2013-04-22 14:41:45 -07:00
Mark Callaghan
b1ff9ac9c5 Add --writes_per_second rate limit, print p99.99 in histogram
Summary:
Adds the --writes_per_second rate limit for the readwhilewriting test.
The purpose is to optionally avoid saturating storage with writes & compaction
and test read response time when some writes are being done.

Changes the histogram code to also print the p99.99 value

Task ID: #

Blame Rev:

Test Plan:
make check, ran db_bench with it

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10305
2013-04-20 10:26:51 -07:00
Haobo Xu
e0b60923ee [RocksDB] fix build
Summary: forgot to include signal_test.cc

Test Plan: make check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10281
2013-04-20 10:26:51 -07:00
Haobo Xu
1255dcd446 [RocksDB] Add stacktrace signal handler
Summary:
This diff provides the ability to print out a stacktrace when the process receives certain signals.
Currently, we enable this for the following signals (program error related):
SIGILL SIGSEGV SIGBUS SIGABRT
Application simply #include "util/stack_trace.h" and call leveldb::InstallStackTraceHandler() during initialization, if signal handler is needed. It's not done automatically when openning db, because it's the application(process)'s responsibility to install signal handler and some applications might already have their own (like fbcode).

Sample output:
Received signal 11 (Segmentation fault)
#0  0x408ff0 ./signal_test() [0x408ff0] /home/haobo/rocksdb/util/signal_test.cc:4
#1  0x40827d ./signal_test() [0x40827d] /home/haobo/rocksdb/util/signal_test.cc:24
#2  0x7f8bb183172e /usr/local/fbcode/gcc-4.7.1-glibc-2.14.1/lib/libc.so.6(__libc_start_main+0x10e) [0x7f8bb183172e] ??:0
#3  0x408ebc ./signal_test() [0x408ebc] /home/engshare/third-party/src/glibc/glibc-2.14.1/glibc-2.14.1/csu/../sysdeps/x86_64/elf/start.S:113
Segmentation fault (core dumped)

For each frame, we print the raw pointer, the symbol provided by backtrace_symbols (still not good enough), and the source file/line. Note that address translation is done by directly shell out to addr2line. ??:0 means addr2line fails to do the translation. Hacky, but I think it's good for now.

Test Plan: signal_test.cc

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10173
2013-04-20 10:26:50 -07:00
Haobo Xu
a29fc171a6 [RocksDB] posix_logger does not compile on non-linux platform
Summary: As title. Found out this when testing stack_trace.cc portability.

Test Plan: make check; manual test 'non-linux' build by forcing OS_LINUX2

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10263
2013-04-15 19:18:51 -07:00
Abhishek Kona
dae7379050 [RocksDB] Expose LDB functioanality as a library call - clients can build their own LDB binary with additional options
Summary: Primarily a refactor. Introduced LDBTool interface to which customers can plug in their options and this will create their own version of ldb tool.

Test Plan: made ldb tool and tried it.

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D10191
2013-04-11 20:21:49 -07:00
Mayank Agarwal
6594fef7ef Exit and Join the background compaction threads while running rocksdb tests
Summary:
The background compaction threads are never exitted and therefore caused
memory-leaks while running rpcksdb tests. Have changed the PosixEnv destructor to exit and join them and changed the tests likewise
The memory leaked has reduced from 320 bytes to 64 bytes in all the tests. The 64
bytes is relating to
pthread_exit, but still have to figure out why. The stack-trace right now with
table_test.cc = 64 bytes in 1 blocks are possibly lost in loss record 4 of 5
   at 0x475D8C: malloc (jemalloc.c:914)
   by 0x400D69E: _dl_map_object_deps (dl-deps.c:505)
   by 0x4013393: dl_open_worker (dl-open.c:263)
   by 0x400F015: _dl_catch_error (dl-error.c:178)
   by 0x4013B2B: _dl_open (dl-open.c:569)
   by 0x5D3E913: do_dlopen (dl-libc.c:86)
   by 0x400F015: _dl_catch_error (dl-error.c:178)
   by 0x5D3E9D6: __libc_dlopen_mode (dl-libc.c:47)
   by 0x5048BF3: pthread_cancel_init (unwind-forcedunwind.c:53)
   by 0x5048DC9: _Unwind_ForcedUnwind (unwind-forcedunwind.c:126)
   by 0x5046D9F: __pthread_unwind (unwind.c:130)
   by 0x50413A4: pthread_exit (pthreadP.h:289)

Test Plan: make all check

Reviewers: dhruba, sheki, haobo

Reviewed By: dhruba

CC: leveldb, chip

Differential Revision: https://reviews.facebook.net/D9573
2013-04-10 14:50:25 -07:00
heyongqiang
e21ba94a69 Set FD_CLOEXEC after each file open
Summary: as subject. This is causing problem in adsconv. Ideally, this flags should be set in open. But that is only supported in Linux kernel ≥2.6.23 and glibc ≥2.7.

Test Plan:
db_test

run db_test

Reviewers: dhruba, MarkCallaghan, haobo

Reviewed By: dhruba

CC: leveldb, chip

Differential Revision: https://reviews.facebook.net/D10089
2013-04-10 14:44:06 -07:00
Mayank Agarwal
adb4e4509b Fixing delete in env_posix.cc
Summary: Was deleting incorrectly. Should delete the whole array.

Test Plan: make;valgrind stops complaining about Mismatched free/delete

Reviewers: dhruba, sheki

Reviewed By: sheki

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D10059
2013-04-09 11:49:35 -07:00
Haobo Xu
a77bc3d14c [RocksDB] Fix LRUCache Eviction problem
Summary:
1. The stock LRUCache nukes itself whenever the working set (the total number of entries not released by client at a certain time) is bigger than the cache capacity.
See https://our.dev.facebook.com/intern/tasks/?t=2252281
2. There's a bug in shard calculation leading to segmentation fault when only one shard is needed.

Test Plan: make check

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb, zshao, sheki

Differential Revision: https://reviews.facebook.net/D9927
2013-04-04 11:22:50 -07:00
Haobo Xu
d815082159 [RocksDB] env_posix cleanup
Summary:
1. SetBackgroundThreads was not thread safe
2. queue_size_ does not seem necessary
3. moved condition signal after shared state change. Even though the original
   order is in practice ok (because the mutex is still held), it looks fishy
   and non-intuitive.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D9825
2013-04-02 11:36:51 -07:00
Abhishek Kona
7fdd5f5b33 Use non-mmapd files for Write-Ahead Files
Summary:
Use non mmapd files for Write-Ahead log.
Earlier use of MMaped files. made the log iterator read ahead and miss records.
Now the reader and writer will point to the same physical location.

There is no perf regression :
./db_bench --benchmarks=fillseq --db=/dev/shm/mmap_test --num=$(million 20) --use_existing_db=0 --threads=2
with This diff :
fillseq      :      10.756 micros/op 185281 ops/sec;   20.5 MB/s
without this dif :
fillseq      :      11.085 micros/op 179676 ops/sec;   19.9 MB/s

Test Plan: unit test included

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9741
2013-03-28 13:13:35 -07:00
Abhishek Kona
63f216ee0a memory manage statistics
Summary:
Earlier Statistics object was a raw pointer. This meant the user had to clear up
the Statistics object after creating the database. In most use cases the database is created in a function and the statistics pointer is out of scope. Hence the statistics object would never be deleted.
Now Using a shared_ptr to manage this.

Want this in before the next release.

Test Plan: make all check.

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9735
2013-03-27 11:27:39 -07:00
Simon Marlow
a8bf8fe504 Integrate the manifest_dump command with ldb
Summary:
Syntax:

   manifest_dump [--verbose] --num=<manifest_num>

e.g.

$ ./ldb --db=/home/smarlow/tmp/testdb manifest_dump --num=12
manifest_file_number 13 next_file_number 14 last_sequence 3 log_number
11  prev_log_number 0
--- level 0 --- version# 0 ---
 6:116['a1' @ 1 : 1 .. 'a1' @ 1 : 1]
 10:130['a3' @ 2 : 1 .. 'a4' @ 3 : 1]
--- level 1 --- version# 0 ---
--- level 2 --- version# 0 ---
--- level 3 --- version# 0 ---
--- level 4 --- version# 0 ---
--- level 5 --- version# 0 ---
--- level 6 --- version# 0 ---

Test Plan: - Tested on an example DB (see output in summary)

Reviewers: sheki, dhruba

Reviewed By: sheki

CC: leveldb, heyongqiang

Differential Revision: https://reviews.facebook.net/D9609
2013-03-22 09:17:30 -07:00
Mayank Agarwal
38d54832f7 Initialize variable in constructor for PosixEnv::checkedDiskForMmap_
Summary: This caused compilation problems on some gcc platforms during the third-partyrelease

Test Plan: make

Reviewers: sheki

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9627
2013-03-21 11:26:50 -07:00
Dhruba Borthakur
ad96563b79 Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database.
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.

The default setting remains the same (and is backward compatible):
 1. use bufferedio
 2. do not use mmaps for reads
 3. use mmap for writes
 4. use readaheads for reads needed for compaction

I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.

Test Plan: make check

Reviewers: sheki, heyongqiang, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9429
2013-03-20 23:14:03 -07:00
Mayank Agarwal
a6f4275403 Removing boost from ldb_cmd.cc
Summary: Getting rid of boost in our github codebase which caused problems on third-party

Test Plan: make ldb; python tools/ldb_test.py

Reviewers: sheki, dhruba

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9543
2013-03-20 11:19:12 -07:00
Mayank Agarwal
48abc06049 Using return value of fwrite in posix_logger.h
Summary: Was causing error(warning) in third-party saying unused result

Test Plan: make

Reviewers: sheki, dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9447
2013-03-19 21:33:01 -07:00
Mayank Agarwal
487168cdcf Fixed sign-comparison in rocksdb code-base and fixed Makefile
Summary: Makefile had options to ignore sign-comparisons and unused-parameters, which should be there. Also fixed the specific errors in the code-base

Test Plan: make

Reviewers: chip, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9531
2013-03-19 14:35:23 -07:00
Mayank Agarwal
f04cc368f7 Fixing a careless mistake in ldb
Summary: negation of the condition checked currently had to be checkd actually

Test Plan: make ldb; python ldb_test.py

Reviewers: sheki, dhruba

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9459
2013-03-15 13:59:11 -07:00
Mayank Agarwal
a78fb5e8bc Doing away with boost in ldb_cmd.h
Summary: boost functions cause complications while deploying to third-party

Test Plan: make

Reviewers: sheki, dhruba

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9441
2013-03-14 18:16:46 -07:00
Abhishek Kona
1ba5abca97 Use posix_fallocate as default.
Summary:
Ftruncate does not throw an error on disk-full. This causes Sig-bus in
the case where the database tries to issue a Put call on a full-disk.

Use posix_fallocate for allocation instead of truncate.
Add a check to use MMaped files only on ext4, xfs and tempfs, as
posix_fallocate is very slow on ext3 and older.

Test Plan: make all check

Reviewers: dhruba, chip

Reviewed By: dhruba

CC: adsharma, leveldb

Differential Revision: https://reviews.facebook.net/D9291
2013-03-13 13:50:26 -07:00
Mayank Agarwal
5b278b53ae Fix valgrind errors in rocksdb tests: auto_roll_logger_test, reduce_levels_test
Summary: Fix for memory leaks in rocksdb tests. Also modified the variable NUM_FAILED_TESTS to print the actual number of failed tests.

Test Plan: make <test>; valgrind --leak-check=full ./<test>

Reviewers: sheki, dhruba

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9333
2013-03-12 16:03:16 -07:00
Dhruba Borthakur
469724be7f Add appropriate parameters to make bulk-load go faster.
Summary:
1. Create only 2 levels so that manual compactions are fast.
2. Set target file size to a large value

Test Plan: make clean check

Reviewers: kailiu, zshao

Reviewed By: zshao

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9231
2013-03-08 10:52:16 -08:00
Zheng Shao
7b43500794 [RocksDB] Add bulk_load option to Options and ldb
Summary:
Add a shortcut function to make it easier for people
to efficiently bulk_load data into RocksDB.

Test Plan:
Tried ldb with "--bulk_load" and "--bulk_load --compact" and verified the outcome.
Needs to consult the team on how to test this automatically.

Reviewers: sheki, dhruba, emayanke, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8907
2013-03-05 00:34:53 -08:00
Mark Callaghan
993543d1be Add rate_delay_limit_milliseconds
Summary:
This adds the rate_delay_limit_milliseconds option to make the delay
configurable in MakeRoomForWrite when the max compaction score is too high.
This delay is called the Ln slowdown. This change also counts the Ln slowdown
per level to make it possible to see where the stalls occur.

From IO-bound performance testing, the Level N stalls occur:
* with compression -> at the largest uncompressed level. This makes sense
                      because compaction for compressed levels is much
                      slower. When Lx is uncompressed and Lx+1 is compressed
                      then files pile up at Lx because the (Lx,Lx+1)->Lx+1
                      compaction process is the first to be slowed by
                      compression.
* without compression -> at level 1

Task ID: #1832108

Blame Rev:

Test Plan:
run with real data, added test

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D9045
2013-03-04 07:41:15 -08:00
Dhruba Borthakur
806e264350 Ability for rocksdb to compact when flushing the in-memory memtable to a file in L0.
Summary:
Rocks accumulates recent writes and deletes in the in-memory memtable.
When the memtable is full, it writes the contents on the memtable to
a file in L0.

This patch removes redundant records at the time of the flush. If there
are multiple versions of the same key in the memtable, then only the
most recent one is dumped into the output file. The purging of
redundant records occur only if the most recent snapshot is earlier
than the earliest record in the memtable.

Should we switch on this feature by default or should we keep this feature
turned off in the default settings?

Test Plan: Added test case to db_test.cc

Reviewers: sheki, vamsi, emayanke, heyongqiang

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8991
2013-03-04 00:01:47 -08:00
Abhishek Kona
c41f1e995c Codemod NULL to nullptr
Summary:
scripted NULL to nullptr in
* include/leveldb/
* db/
* table/
* util/

Test Plan: make all check

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9003
2013-02-28 18:04:58 -08:00
Abhishek Kona
9bf91c74b8 ldb waldump to print the keys along with other stats + NULL to nullptr in ldb_cmd.cc
Summary: LDB tool to print the deleted/put keys in hex in the wal file.

Test Plan: run ldb on a  db to check if output was satisfactory

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8691
2013-02-20 11:01:37 -08:00
amayank
b2c50f1c3f Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value
Summary:
Changed the Get and Scan options with openForReadOnly mode to have access to the memtable.
Changed the visibility of NewInternalIterator in db_impl from private to protected so that
the derived class db_impl_read_only can call that in its NewIterator function for the
scan case. The previous approach which changed the default for flush_on_destroy_ from false to true
caused many problems in the unit tests due to empty sst files that it created. All
unit tests pass now.

Test Plan: make clean; make all check; ldb put and get and scans

Reviewers: dhruba, heyongqiang, sheki

Reviewed By: dhruba

CC: kosievdmerwe, zshao, dilipj, kailiu

Differential Revision: https://reviews.facebook.net/D8697
2013-02-20 10:45:52 -08:00
Abhishek Kona
fe10200ddc Introduce histogram in statistics.h
Summary:
* Introduce is histogram in statistics.h
* stop watch to measure time.
* introduce two timers as a poc.
Replaced NULL with nullptr to fight some lint errors
Should be useful for google.

Test Plan:
ran db_bench and check stats.
make all check

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8637
2013-02-20 10:43:32 -08:00
Kai Liu
45f0030458 Fix the "IO error" in auto_roll_logger_test
Summary:

I missed InitTestDb() in one of my tess. InitTestDb() initializes the test directory, without which the test will throw IO error.

This problem didn't occur before because I've already run the tests before so the test directory is already there.

Test Plan:

Reviewers: dhruba

CC:

Task ID: #

Blame Rev:
2013-02-19 00:13:22 -08:00
amayank
f3901e0647 Revert "Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value"
This reverts commit 4c696ed001.
2013-02-18 22:32:27 -08:00
amayank
4c696ed001 Fix for the weird behaviour encountered by ldb Get where it could read only the second-latest value
Summary:
flush_on_destroy has a default value of false and the memtable is flushed
in the dbimpl-destructor only when that is set to true. Because we want the memtable to be flushed everytime that
the destructor is called(db is closed) and the cases where we work with the memtable only are very less
it is a good idea to give this a default value of true. Thus the put from ldb
wil have its data flushed to disk in the destructor and the next Get will be able to
read it when opened with OpenForReadOnly. The reason that ldb could read the latest value when
the db was opened in the normal Open mode is that the Get from normal Open first reads
the memtable and directly finds the latest value written there and the Get from OpenForReadOnly
doesn't have access to the memtable (which is correct because all its Put/Modify) are disabled

Test Plan: make all; ldb put and get and scans

Reviewers: dhruba, heyongqiang, sheki

Reviewed By: heyongqiang

CC: kosievdmerwe, zshao, dilipj, kailiu

Differential Revision: https://reviews.facebook.net/D8631
2013-02-15 16:56:06 -08:00
Kai Liu
aaa0cbb97a Fix the warning introduced by auto_roll_logger_test
Summary: Fix the warning [-Werror=format-security] and [-Werror=unused-result].

Test Plan:
enforced the Werror and run make

Task ID: 2101673

Blame Rev:

Reviewers: heyongqiang

Differential Revision: https://reviews.facebook.net/D8553
2013-02-13 15:29:35 -08:00
Chip Turner
f02db1c118 Add zlib to our builds and tweak histogram output
Summary:
$SUBJECT -- cosmetic fix for histograms, print P75/P99, and
make sure zlib is enabled for our command line tools.

Test Plan: compile, test db_bench with --compression_type=zlib

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: adsharma, leveldb

Differential Revision: https://reviews.facebook.net/D8445
2013-02-07 15:31:53 -08:00
Kai Liu
b63aafce42 Allow the logs to be purged by TTL.
Summary:
* Add a SplitByTTLLogger to enable this feature. In this diff I implemented generalized AutoSplitLoggerBase class to simplify the
development of such classes.
* Refactor the existing AutoSplitLogger and fix several bugs.

Test Plan:
* Added a unit tests for different types of "auto splitable" loggers individually.
* Tested the composited logger which allows the log files to be splitted by both TTL and log size.

Reviewers: heyongqiang, dhruba

Reviewed By: heyongqiang

CC: zshao, leveldb

Differential Revision: https://reviews.facebook.net/D8037
2013-02-04 19:42:40 -08:00
Abhishek Kona
4dc02f7b7a Initialize all doubles to 0 in histogram.cc
Summary:
The existing code did not initialize a few doubles in histogram.cc.
Cropped up when I wrote a unit-test.

Test Plan: make all check

Reviewers: chip

Reviewed By: chip

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8319
2013-01-31 17:31:43 -08:00
Abhishek Kona
009034cf12 Performant util/histogram.
Summary:
Earlier way to record in histogram=>
Linear search BucketLimit array to find the bucket and increment the
counter
Current way to record in histogram=>
Store a HistMap statically which points the buckets of each value in the
range [kFirstValue, kLastValue);

In the proccess use vectors instead of array's and refactor some code to
HistogramHelper class.

Test Plan:
run db_bench with histogram=1 and see a histogram being
printed.

Reviewers: dhruba, chip, heyongqiang

Reviewed By: chip

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8265
2013-01-31 16:10:34 -08:00
Kosie van der Merwe
4dcc0c89f4 Fixed cache key for block cache
Summary:
Added function to `RandomAccessFile` to generate an unique ID for that file. Currently only `PosixRandomAccessFile` has this behaviour implemented and only on Linux.

Changed how key is generated in `Table::BlockReader`.

Added tests to check whether the unique ID is stable, unique and not a prefix of another unique ID. Added tests to see that `Table` uses the cache more efficiently.

Test Plan: make check

Reviewers: chip, vamsi, dhruba

Reviewed By: chip

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8145
2013-01-31 15:20:24 -08:00
Chip Turner
2c3565285e Add OS_LINUX ifdef protections around fallocate parts
Summary: fallocate is linux only, so let's protect it with ifdef's

Test Plan: make

Reviewers: sheki, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8223
2013-01-28 12:03:35 -08:00
Dilip Antony Joseph
11ce6a060e Enhanced ldb to support data access commands
Summary: Added put/get/scan/batchput/delete/approxsize

Test Plan: Added pyunit script to test the newly added commands

Reviewers: chip, leveldb

Reviewed By: chip

CC: zshao, emayanke

Differential Revision: https://reviews.facebook.net/D7947
2013-01-28 11:38:26 -08:00
Chip Turner
0b83a83191 Fix poor error on num_levels mismatch and few other minor improvements
Summary:
Previously, if you opened a db with num_levels set lower than
the database, you received the unhelpful message "Corruption:
VersionEdit: new-file entry."  Now you get a more verbose message
describing the issue.

Also, fix handling of compression_levels (both the run-over-the-end
issue and the memory management of it).

Lastly, unique_ptr'ify a couple of minor calls.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8151
2013-01-25 15:37:26 -08:00
Chip Turner
772f75b3fb Stop continually re-creating build_version.c
Summary:
We continually rebuilt build_version.c because we put the
current date into it, but that's what __DATE__ already is.  This makes
builds faster.

This also fixes an issue with 'make clean FOO' not working properly.

Also tweak the build rules to be more consistent, always have warnings,
and add a 'make release' rule to handle flags for release builds.

Test Plan: make, make clean

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D8139
2013-01-24 17:51:39 -08:00
Chip Turner
3dafdfb2c4 Use fallocate to prevent excessive allocation of sst files and logs
Summary:
On some filesystems, pre-allocation can be a considerable
amount of space.  xfs in our production environment pre-allocates by
1GB, for instance.  By using fallocate to inform the kernel of our
expected file sizes, we eliminate this wasteage (that isn't recovered
until the file is closed which, in the case of LOG files, can be a
considerable amount of time).

Test Plan:
created an xfs loopback filesystem, mounted with
allocsize=4M, and ran db_stress.  LOG file without this change was 4M,
and with it it was 128k then grew to normal size.

Reviewers: dhruba

Reviewed By: dhruba

CC: adsharma, leveldb

Differential Revision: https://reviews.facebook.net/D7953
2013-01-24 12:25:13 -08:00
Chip Turner
2fdf91a4f8 Fix a number of object lifetime/ownership issues
Summary:
Replace manual memory management with std::unique_ptr in a
number of places; not exhaustive, but this fixes a few leaks with file
handles as well as clarifies semantics of the ownership of file handles
with log classes.

Test Plan: db_stress, make check

Reviewers: dhruba

Reviewed By: dhruba

CC: zshao, leveldb, heyongqiang

Differential Revision: https://reviews.facebook.net/D8043
2013-01-23 16:54:11 -08:00
Abhishek Kona
7d5a4383bb rollover manifest file.
Summary:
Check in LogAndApply if the file size is more than the limit set in
Options.
Things to consider : will this be expensive?

Test Plan: make all check. Inputs on a new unit test?

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7701
2013-01-16 12:09:44 -08:00
Chip Turner
a2dcd79c1e Add optional clang compile mode
Summary:
clang is an alternate compiler based on llvm.  It produces
nicer error messages and finds some bugs that gcc doesn't, such as the
size_t change in this file (which caused some write return values to be
misinterpreted!)

Clang isn't the default; to try it, do "USE_CLANG=1 make" or "export
USE_CLANG=1" then make as normal

Test Plan: "make check" and "USE_CLANG=1 make check"

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7899
2013-01-15 18:48:37 -08:00
Kosie van der Merwe
4d339d7462 Fixed memory leak in ShardedLRUCache
Summary: `~ShardedLRUCache()` was empty despite `init()` allocating memory on the heap. Fixed the leak by freeing memory allocated by `init()`.

Test Plan:
make check

Ran valgrind on db_test before and after patch and saw leaked memory went down

Reviewers: vamsi, dhruba, emayanke, sheki

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7791
2013-01-08 11:24:15 -08:00
Kosie van der Merwe
d6e873f22f Added clearer error message for failure to create db directory in DBImpl::Recover()
Summary:
Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes and error.

Fixed CreateDirIfMissing() and added Env.DirExists()

Test Plan:
make check to test for regessions

Ran the following to test if the error message is not about lock files not existing
./db_bench --db=dir/testdb

After creating a file "testdb", ran the following to see if it failed with sane error message:
./db_bench --db=testdb

Reviewers: dhruba, emayanke, vamsi, sheki

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7707
2013-01-07 10:11:18 -08:00
Dhruba Borthakur
5b05417df3 Complication error when using gcc 4.7.1.
Summary:
There is a compilation error while using gcc 4.7.1.
util/ldb_cmd.cc:381:3: error: ‘leveldb::ReadOptions::ReadOptions’ names the constructor, not the type
util/ldb_cmd.cc:381:37: error: expected ‘;’ before ‘read_options’
util/ldb_cmd.cc:381:49: error: statement cannot resolve address of overloaded function

Test Plan: make clean check

Reviewers: sheki, emayanke, zshao

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7659
2012-12-27 12:38:20 -08:00
Zheng Shao
0f762ac573 ldb: Add command "ldb query" to support random read from the database
Summary: The queries will come from stdin.  One key per line.  The output will be in stdout, in the format of "<key> ==> <value>" if found, or "<key>" if not found.  "--hex" uses HEX-encoded keys and values in both input and output.

Test Plan: ldb query --db=leveldb_db --hex

Reviewers: dhruba, emayanke, sheki

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7617
2012-12-26 20:37:42 -08:00
Zheng Shao
dcece4707e ldb: Fix incorrect arg parsing
Summary: We were ignoring additional chars at the end of an arg.  This can create confusion, e.g. --disable_wal=0 will act the same as --disable_wal without any warnings.

Test Plan:
Tried this:
[zshao@dev485 ~/git/rocksdb] ./ldb dump --statsAAA
Failed: Unknown argument:--statsAAA

Reviewers: dhruba, sheki, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7635
2012-12-26 20:09:10 -08:00
Zheng Shao
70f0f50731 ldb: file_size and write_buffer_size options.
Summary:
This allows ldb to control the write_buffer_size (which reflects to L0 file size) and file_size (which reflects to L1 file size).  Since the target_file_size_ratio is 1 by default, all other levels will also have the same file size as L1.

As part of the diff, I also cleaned up some unused code and help messages.

Test Plan: ./ldb load --db=/data/users/zshao/test_leveldb --file_size=64000000 --write_buffer_size=32000000 --create_if_missing --input_hex --disable_wal

Reviewers: dhruba, sheki, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7569
2012-12-21 10:40:38 -08:00
Abhishek Kona
1aae609b92 Use CRC32 ss42 instruction. Load it dynamically.
Test Plan: make all check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D7503
2012-12-21 10:20:32 -08:00
Zheng Shao
be9b862d47 ldb: add "ldb load" command
Summary: This command accepts key-value pairs from stdin with the same format of "ldb dump" command.  This allows us to try out different compression algorithms/block sizes easily.

Test Plan: dump, load, dump, verify the data is the same.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7443
2012-12-19 01:45:59 -08:00
Zheng Shao
3d9ff0e921 ldb: fix dump command to pad HEX output chars with 0.
Summary: The old code was omitting the 0 if the char is less than 16.

Test Plan:
Tried the following program:
int main() {
  unsigned char c = 1;
  printf("%X\n", c);
  printf("%02X\n", c);
  return 0;
}
The output is:
1
01

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7437
2012-12-16 16:55:38 -08:00
Zheng Shao
7dc8bb710e ldb: support --block_size=<4096|65536|...> and --auto_compaction=<0|1>
Summary: This allows us to use ldb to do more experiments like block_size changes.

Test Plan: run it by hand.

Reviewers: dhruba, sheki, emayanke

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7431
2012-12-16 09:00:14 -08:00
Dhruba Borthakur
38671c4d54 Fix a race condition while processing tasks by background threads.
Summary:
Suppose you submit 100 background tasks one after another. The first
enqueu task finds that the queue is empty and wakes up one worker thread.
Now suppose that all remaining 99 work items are enqueued, they do not
wake up any worker threads because the queue is already non-empty.
This causes a situation when there are 99 tasks in the task queue but
only one worker thread is processing a task while the remaining
worker threads are waiting.
The fix is to always wakeup one worker thread while enqueuing a task.

I also added a check to count the number of elements in the queue
to help in debugging.

Test Plan: make clean check.

Reviewers: chip

Reviewed By: chip

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7203
2012-12-09 17:15:27 -08:00
Zheng Shao
768edfaaed ldb: Add compression and bloom filter options.
Summary:
Added the following two options:
[--bloom_bits=<int,e.g.:14>]
[--compression_type=<no|snappy|zlib|bzip2>]

These options will be used when ldb opens the leveldb database.

Test Plan: Tried by hand for both success and failure cases. We do need a test framework.

Reviewers: dhruba, emayanke, sheki

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7197
2012-12-07 14:26:37 -08:00
Kosie van der Merwe
f69e9f3e04 Fixed off by 1 in tests.
Summary: Added 1 to indices where I shouldn't have so overrun array.

Test Plan: make check

Reviewers: sheki, emayanke, vamsi, dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7227
2012-12-07 10:48:46 -08:00
Kosie van der Merwe
0eb0c9bb82 Added methods to write small ints to bit streams.
Summary: Added BitStreamPutInt() and BitStreamGetInt() which take a stream of chars and can write integers of arbitrary bit sizes to that stream at arbitrary positions. There are also convenience versions of these functions that take std::strings and leveldb::Slices.

Test Plan: make check

Reviewers: sheki, vamsi, dhruba, emayanke

Reviewed By: vamsi

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7071
2012-12-07 10:42:19 -08:00
sheki
d4627e6de4 Move WAL files to archive directory, instead of deleting.
Summary:
Create a directory "archive" in the DB directory.
During DeleteObsolteFiles move the WAL files (*.log) to the Archive directory,
instead of deleting.

Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move.

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6975
2012-11-28 17:28:08 -08:00
Abhishek Kona
d29f181923 Fix all the lint errors.
Summary:
Scripted and removed all trailing spaces and converted all tabs to
spaces.

Also fixed other lint errors.
All lint errors from this point of time should be taken seriously.

Test Plan: make all check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7059
2012-11-28 17:18:41 -08:00
Chip Turner
6caf3b8e4b Fix broken test; some ldb commands can run without a db_
Summary:
It would appear our unit tests make use of code from ldb_cmd,
and don't always require a valid database handle.  D6855 was not aware
db_ could sometimes be NULL for such commands, and so it broke
reduce_levels_test.

This moves the check elsewhere to (at least) fix the 'ldb dump' case of
segfaulting when it couldn't open a database.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6903
2012-11-26 11:11:30 -08:00
Chip Turner
879e45eb99 Fix ldb segfault and use static libsnappy for all builds
Summary:
Link statically against snappy, using the gvfs one for facebook
environments, and the bundled one otherwise.

In addition, fix a few minor segfaults in ldb when it couldn't open the
database, and update .gitignore to include a few other build artifacts.

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6855
2012-11-21 11:07:19 -08:00
Dhruba Borthakur
7632fdb5cb Support taking a configurable number of files from the same level to compact in a single compaction run.
Summary:
The compaction process takes some files from LevelK and
merges it into LevelK+1. The number of files it picks from
LevelK was capped such a way that the total amount of
data picked does not exceed the maxfilesize of that level.
This essentially meant that only one file from LevelK
is picked for a single compaction.

For bulkloads, we would like to take many many file from
LevelK and compact them using a single compaction run.

This patch introduces a option called the 'source_compaction_factor'
(similar to expanded_compaction_factor). It is a multiplier
that is multiplied by the maxfilesize of that level to arrive
at the limit that is used to throttle the number of source
files from LevelK.  For bulk loads, set source_compaction_factor
to a very high number so that multiple files from the same
level are picked for compaction in a single compaction.

The default value of source_compaction_factor is 1, so that
we can keep backward compatibilty with existing compaction semantics.

Test Plan: make clean check

Reviewers: emayanke, sheki

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6867
2012-11-21 08:37:03 -08:00
Dhruba Borthakur
fbb73a4ac3 Support to disable background compactions on a database.
Summary:
This option is needed for fast bulk uploads. The goal is to load
all the data into files in L0 without any interference from
background compactions.

Test Plan: make clean check

Reviewers: sheki

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D6849
2012-11-20 21:12:06 -08:00
Abhishek Kona
661dc15721 Fix LDB dumpwal to print the messages as in the file.
Summary:
StringStream.clear() does not clear the stream. It sets some flags.
Who knew? Fixing that is not printing the stuff again and again.

Test Plan: ran it on a local db

Reviewers: dhruba, emayanke

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6795
2012-11-19 12:04:35 -08:00
Abhishek Kona
30742e1692 LDB can read WAL.
Summary:
Add option to read WAL and print a summary for each record.
facebook task => #1885013

E.G. Output :
./ldb dump_wal --walfile=/tmp/leveldbtest-5907/dbbench/026122.log --header
Sequence,Count,ByteSize
49981,1,100033
49981,1,100033
49982,1,100033
49981,1,100033
49982,1,100033
49983,1,100033
49981,1,100033
49982,1,100033
49983,1,100033
49984,1,100033
49981,1,100033
49982,1,100033

Test Plan:
Works run
./ldb read_wal --wal-file=/tmp/leveldbtest-5907/dbbench/000078.log --header

Reviewers: dhruba, heyongqiang

Reviewed By: dhruba

CC: emayanke, leveldb, zshao

Differential Revision: https://reviews.facebook.net/D6675
2012-11-19 12:04:34 -08:00
Dhruba Borthakur
6c5a4d646a Merge branch 'master' into performance
Conflicts:
	db/db_impl.h
2012-11-14 21:39:52 -08:00
Dhruba Borthakur
5d16e503a6 Improved CompactionFilter api: pass in a opaque argument to CompactionFilter invocation.
Summary:
There are applications that operate on multiple leveldb instances.
These applications will like to pass in an opaque type for each
leveldb instance and this type should be passed back to the application
with every invocation of the CompactionFilter api.

Test Plan: Enehanced unit test for opaque parameter to CompactionFilter.

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan, sheki, emayanke

Differential Revision: https://reviews.facebook.net/D6711
2012-11-13 16:22:26 -08:00
heyongqiang
c64796fd34 Fix test failure of reduce_num_levels
Summary:
I changed the reduce_num_levels logic to avoid "compactRange()" call if the current number of levels in use (levels that contain files) is smaller than the new num of levels.
And that change breaks the assert in reduce_levels_test

Test Plan: run reduce_levels_test

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: emayanke, sheki

Differential Revision: https://reviews.facebook.net/D6651
2012-11-12 12:05:38 -08:00
Dhruba Borthakur
9c6c232e47 Compilation error while compiling with OPT=-g
Summary:
make clean check OPT=-g fails
leveldb::DBStatistics::getTickerCount(leveldb::Tickers)’:
./db/db_statistics.h:34: error: ‘MAX_NO_TICKERS’ was not declared in this scope
util/ldb_cmd.cc:255: warning: left shift count >= width of type

Test Plan:
make clean check OPT=-g

Reviewers:

CC:

Task ID: #

Blame Rev:
2012-11-11 00:20:40 -08:00
heyongqiang
20d18a89a3 disable size compaction in ldb reduce_levels and added compression and file size parameter to it
Summary:
disable size compaction in ldb reduce_levels, this will avoid compactions rather than the manual comapction,

added --compression=none|snappy|zlib|bzip2 and --file_size= per-file size to ldb reduce_levels command

Test Plan: run ldb

Reviewers: dhruba, MarkCallaghan

Reviewed By: dhruba

CC: sheki, emayanke

Differential Revision: https://reviews.facebook.net/D6597
2012-11-09 10:14:47 -08:00
Dhruba Borthakur
8143062edd Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/version_set.cc
	util/options.cc
2012-11-07 15:11:37 -08:00
Dhruba Borthakur
aa42c66814 Fix all warnings generated by -Wall option to the compiler.
Summary:
The default compilation process now uses "-Wall" to compile.
Fix all compilation error generated by gcc.

Test Plan: make all check

Reviewers: heyongqiang, emayanke, sheki

Reviewed By: heyongqiang

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6525
2012-11-06 14:07:31 -08:00
Dhruba Borthakur
5f91868cee Merge branch 'master' into performance
Conflicts:
	db/version_set.cc
	util/options.cc
2012-11-05 16:51:55 -08:00
Dhruba Borthakur
5273c81483 Ability to invoke application hook for every key during compaction.
Summary:
There are certain use-cases where the application intends to
delete older keys aftre they have expired a certian time period.
One option for those applications is to periodically scan the
entire database and delete appropriate keys.

A better way is to allow the application to hook into the
compaction process. This patch allows the application to set
a method callback for every key that is being compacted. If
this method returns true, then the key is not preserved in
the output of the compaction.

Test Plan:
This is mostly to preview the proposed new public api.
Since it is a public api, please do due diligence on reviewing it.

I will be writing test cases for this api in mynext version of
this patch.

Reviewers: MarkCallaghan, heyongqiang

Reviewed By: heyongqiang

CC: sheki, adsharma

Differential Revision: https://reviews.facebook.net/D6285
2012-11-05 16:02:13 -08:00
heyongqiang
d55c2ba305 Add a tool to change number of levels
Summary: as subject.

Test Plan: manually test it, will add a testcase

Reviewers: dhruba, MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6345
2012-11-05 10:17:39 -08:00
Dhruba Borthakur
81f735d97c Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	util/options.cc
2012-11-05 09:41:38 -08:00
amayank
854c66b089 Make compression options configurable. These include window-bits, level and strategy for ZlibCompression
Summary: Leveldb currently uses windowBits=-14 while using zlib compression.(It was earlier 15). This makes the setting configurable. Related changes here: https://reviews.facebook.net/D6105

Test Plan: make all check

Reviewers: dhruba, MarkCallaghan, sheki, heyongqiang

Differential Revision: https://reviews.facebook.net/D6393
2012-11-02 11:26:39 -07:00
heyongqiang
3096fa7534 Add two more options: disable block cache and make table cache shard number configuable
Summary:

as subject

Test Plan:

run db_bench and db_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6111
2012-11-01 13:23:21 -07:00
Dhruba Borthakur
53e04311b1 Merge branch 'master' into performance
Conflicts:
	db/db_bench.cc
	util/options.cc
2012-10-29 14:18:00 -07:00
Dhruba Borthakur
321dfdc3ae Allow having different compression algorithms on different levels.
Summary:
The leveldb API is enhanced to support different compression algorithms at
different levels.

This adds the option min_level_to_compress to db_bench that specifies
the minimum level for which compression should be done when
compression is enabled. This can be used to disable compression for levels
0 and 1 which are likely to suffer from stalls because of the CPU load
for memtable flushes and (L0,L1) compaction.  Level 0 is special as it
gets frequent memtable flushes. Level 1 is special as it frequently
gets all:all file compactions between it and level 0. But all other levels
could be the same. For any level N where N > 1, the rate of sequential
IO for that level should be the same. The last level is the
exception because it might not be full and because files from it are
not read to compact with the next larger level.

The same amount of time will be spent doing compaction at any
level N excluding N=0, 1 or the last level. By this standard all
of those levels should use the same compression. The difference is that
the loss (using more disk space) from a faster compression algorithm
is less significant for N=2 than for N=3. So we might be willing to
trade disk space for faster write rates with no compression
for L0 and L1, snappy for L2, zlib for L3. Using a faster compression
algorithm for the mid levels also allows us to reclaim some cpu
without trading off much loss in disk space overhead.

Also note that little is to be gained by compressing levels 0 and 1. For
a 4-level tree they account for 10% of the data. For a 5-level tree they
account for 1% of the data.

With compression enabled:
* memtable flush rate is ~18MB/second
* (L0,L1) compaction rate is ~30MB/second

With compression enabled but min_level_to_compress=2
* memtable flush rate is ~320MB/second
* (L0,L1) compaction rate is ~560MB/second

This practicaly takes the same code from https://reviews.facebook.net/D6225
but makes the leveldb api more general purpose with a few additional
lines of code.

Test Plan: make check

Differential Revision: https://reviews.facebook.net/D6261
2012-10-29 11:48:09 -07:00
Mark Callaghan
70c42bf05f Adds DB::GetNextCompaction and then uses that for rate limiting db_bench
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6243
2012-10-29 10:17:43 -07:00
Kai Liu
8965c8d0b9 Add the missing util/auto_split_logger.h
Summary:

Test Plan:

Reviewers:

CC:

Task ID: 1803577

Blame Rev:
2012-10-26 15:23:50 -07:00
Kai Liu
d50f8eb603 Enable LevelDb to create a new log file if current log file is too large.
Summary: Enable LevelDb to create a new log file if current log file is too large.

Test Plan:
Write a script and manually check the generated info LOG.

Task ID: 1803577

Blame Rev:

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: zshao

Differential Revision: https://reviews.facebook.net/D6003
2012-10-26 14:55:02 -07:00
Dhruba Borthakur
e982f5a1d2 Merge branch 'master' into performance
Conflicts:
	util/options.cc
2012-10-19 15:16:42 -07:00
Dhruba Borthakur
cf5adc8016 db_bench was not correctly initializing the value for delete_obsolete_files_period_micros option.
Summary:
The parameter delete_obsolete_files_period_micros controls the
periodicity of deleting obsolete files. db_bench was reading in
this parameter intoa local variable called 'l' but was incorrectly
using another local variable called 'n' while setting it in the
db.options data structure.
This patch also logs the value of delete_obsolete_files_period_micros
in the LOG file at db startup time.

I am hoping that this will improve the overall write throughput drastically.

Test Plan: run db_bench

Reviewers: MarkCallaghan, heyongqiang

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6099
2012-10-19 15:10:12 -07:00
Dhruba Borthakur
1ca0584345 This is the mega-patch multi-threaded compaction
published in https://reviews.facebook.net/D5997.

Summary:
This patch allows compaction to occur in multiple background threads
concurrently.

If a manual compaction is issued, the system falls back to a
single-compaction-thread model. This is done to ensure correctess
and simplicity of code. When the manual compaction is finished,
the system resumes its concurrent-compaction mode automatically.

The updates to the manifest are done via group-commit approach.

Test Plan: run db_bench
2012-10-19 14:00:53 -07:00
Dhruba Borthakur
aa73538f2a The deletion of obsolete files should not occur very frequently.
Summary:
The method DeleteObsolete files is a very costly methind, especially
when the number of files in a system is large. It makes a list of
all live-files and then scans the directory to compute the diff.
By default, this method is executed after every compaction run.

This patch makes it such that DeleteObsolete files is never
invoked twice within a configured period.

Test Plan: run all unit tests

Reviewers: heyongqiang, MarkCallaghan

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6045
2012-10-16 10:26:10 -07:00
Dhruba Borthakur
f7975ac733 Implement RowLocks for assoc schema
Summary:
Each assoc is identified by (id1, assocType). This is the rowkey.
Each row has a read/write rowlock. There is statically allocated array
of 2000 read/write locks. A rowkey is murmur-hashed to one of the
read/write locks.

assocPut and assocDelete acquires the rowlock in Write mode.
The key-updates are done within the rowlock with a atomic nosync
batch write to leveldb. Then the rowlock is released and
a write-with-sync is done to sync leveldb transaction log.

Test Plan: added unit test

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5859
2012-10-03 23:19:01 -07:00
Dhruba Borthakur
c1006d4276 An configurable option to write data using write instead of mmap.
Summary:
We have seen that reading data via the pread call (instead of
mmap) is much faster on Linux 2.6.x kernels. This patch makes
an equivalent option to switch off mmaps for the write path
as well.

db_bench --mmap_write=0 will use write() instead of mmap() to
write data to a file.

This change is backward compatible, the default
option is to continue using mmap for writing to a file.

Test Plan: "make check all"

Differential Revision: https://reviews.facebook.net/D5781
2012-10-03 17:08:13 -07:00
Dhruba Borthakur
a58d48de79 Implement ReadWrite locks for leveldb
Summary:
Implement ReadWrite locks for leveldb. These will be helpful
to implement a read-modify-write operation (e.g. atomic increments).

Test Plan: does not modify any existing code

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5787
2012-10-01 22:37:39 -07:00
Dhruba Borthakur
72c45c66c6 Print the block cache size in the LOG.
Summary: Print the block cache size in the LOG.

Test Plan: run db_bench and look at LOG. This is helpful while I was debugging one use-case.

Reviewers: heyongqiang, MarkCallaghan

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5739
2012-09-29 21:39:19 -07:00
Dhruba Borthakur
ae36e509f8 The BackupAPI should also list the length of the manifest file.
Summary:
The GetLiveFiles() api lists the set of sst files and the current
MANIFEST file. But the database continues to append new data to the
MANIFEST file even when the application is backing it up to the
backup location. This means that the database-version that is
stored in the MANIFEST FILE in the backup location
does not correspond to the sst files returned by GetLiveFiles.

This API adds a new parameter to GetLiveFiles. This new parmeter
returns the current size of the MANIFEST file.

Test Plan: Unit test attached.

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5631
2012-09-25 03:13:25 -07:00
Dhruba Borthakur
9e84834eb4 Allow a configurable number of background threads.
Summary:
The background threads are necessary for compaction.
For slower storage, it might be necessary to have more than
one compaction thread per DB. This patch allows creating
a configurable number of worker threads.
The default reamins at 1 (to maintain backward compatibility).

Test Plan:
run all unit tests. changes to db-bench coming in
a separate patch.

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5559
2012-09-19 15:51:08 -07:00
heyongqiang
a8464ed820 add an option to disable seek compaction
Summary:
as subject. This diff should be good for benchmarking.

will send another diff to make it better in the case the seek compaction is enable.
In that coming diff, will not count a seek if the bloomfilter filters.

Test Plan: build

Reviewers: dhruba, MarkCallaghan

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5481
2012-09-17 13:59:57 -07:00
heyongqiang
b85cdca690 add a global var leveldb::useMmapRead to enable mmap Summary:
Summary:
as subject. this can be used for benchmarking.
If we want it for some cases, we can do more changes to make this part of the option.

Test Plan: db_test

Reviewers: dhruba

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5451
2012-09-16 22:07:35 -07:00
Mark Callaghan
33323f2111 Remove use of mmap for random reads
Summary:
Reads via mmap on concurrent workloads are much slower than pread.
For example on a 24-core server with storage that can do 100k IOPS or more
I can get no more than 10k IOPS with mmap reads and 32+ threads.

Test Plan: db_bench benchmarks

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5433
2012-09-14 16:43:50 -07:00
Dhruba Borthakur
93f4952089 Ability to switch off filesystem read-aheads
Summary:
Ability to switch off filesystem read-aheads. This change is
backward-compatible: the default setting is to allow file
system read-aheads.

Test Plan: run benchmarks

Reviewers: heyongqiang, adsharma

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5391
2012-09-13 12:09:56 -07:00
Dhruba Borthakur
4028ae7d31 Do not cache readahead-pages in the OS cache.
Summary:
When posix_fadvise(offset, offset) is usedm it frees up only those
pages in that specified range. But the filesystem could have done some
read-aheads and those get cached in the OS cache.

Do not cache readahead-pages in the OS cache.

Test Plan: run db_bench benchmark.

Reviewers: vamsi, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5379
2012-09-13 10:56:02 -07:00
Dhruba Borthakur
407727b75f Fix compiler warnings. Use uint64_t instead of uint.
Summary: Fix compiler warnings. Use uint64_t instead of uint.

Test Plan: build using -Wall

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5355
2012-09-12 14:42:36 -07:00
heyongqiang
0f43aa474e put log in a seperate dir
Summary: added a new option db_log_dir, which points the log dir. Inside that dir, in order to make log names unique, the log file name is prefixed with the leveldb data dir absolute path.

Test Plan: db_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D5205
2012-09-06 17:52:08 -07:00
Dhruba Borthakur
fe93631678 Clean up compiler warnings generated by -Wall option.
Summary:
Clean up compiler warnings generated by -Wall option.
make clean all OPT=-Wall

This is a pre-requisite before making a new release.

Test Plan: compile and run unit tests

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5019
2012-08-29 14:24:51 -07:00
Dhruba Borthakur
e5fe80e4e3 The sharding of the block cache is limited to 2*20 pieces.
Summary:
The numbers of shards that the block cache is divided into is
configurable. However, if the user specifies that he/she wants
the block cache to be divided into more than 2**20 pieces, then
the system will rey to allocate a huge array of that size) that
could fail.

It is better to limit the sharding of the block cache to an
upper bound. The default sharding is 16 shards (i.e. 2**4)
and the maximum is now 2 million shards (i.e. 2**20).

Also, fixed a bug with the LRUCache where the numShardBits
should be a private member of the LRUCache object rather than
a static variable.

Test Plan:
run db_bench with --cache_numshardbits=64.

Task ID: #

Blame Rev:

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5013
2012-08-29 12:17:59 -07:00
heyongqiang
a4f9b8b49e merge 1.5
Summary:

as subject

Test Plan:

db_test table_test

Reviewers: dhruba
2012-08-28 11:43:33 -07:00
Dhruba Borthakur
fc20273e73 Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync).
Summary:
Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync).
This is needed for data durability when running on ext3 filesystems.
Added options to the benchmark db_bench to generate performance numbers
with either fsync or fdatasync enabled.

Cleaned up Makefile to build leveldb_shell only when building the thrift
leveldb server.

Test Plan: build and run benchmark

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4911
2012-08-27 21:24:17 -07:00
Dhruba Borthakur
e5a7c8e580 Log the open-options to the LOG.
Summary: Log the open-options to the LOG. Use options_ instead of options because SanitizeOptions could modify the max_file_open limit.

Test Plan: num db_bench

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4833
2012-08-22 12:22:12 -07:00
heyongqiang
21082fa13c regression for trigger compaction logic
Summary: as subject

Test Plan: manually run db_bench confirmed

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D4809
2012-08-21 18:11:21 -07:00
Dhruba Borthakur
f4e7febf22 Record the version of the source repository that was used to build the leveldb library.
Summary: Record the version of the source that we are compiling. We keep a record of the git revision in util/version.cc. This source file is then built as a regular source file as part of the compilation process. One can run "strings executable_filename | grep _build_" to find the version of the source that we used to build the executable file.

Test Plan: none

Differential Revision: https://reviews.facebook.net/D4785
2012-08-21 14:47:15 -07:00
heyongqiang
6ba1f17789 adding a scribe logger in leveldb to log leveldb deploy stats
Summary:
as subject.

A new log is written to scribe via thrift client when a new db is opened and when there is
a compaction.

a new option var scribe_log_db_stats is added.

Test Plan: manually checked using command "ptail -time 0 leveldb_deploy_stats"

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D4659
2012-08-21 11:43:22 -07:00
Dhruba Borthakur
e56b2c5a31 Prevent concurrent multiple opens of leveldb database.
Summary:
The fcntl call cannot detect lock conflicts when invoked multiple times
from the same thread.
Use a static lockedFile Set to record the paths that are locked.
A lockfile request checks to see if htis filename already exists in
lockedFiles, if so, then it triggers an error. Otherwise, it inserts
the filename in the lockedFiles Set.
A unlock file request verifies that the filename is in the lockedFiles
set and removes it from lockedFiles set.

Test Plan: unit test attached

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4755
2012-08-20 23:55:04 -07:00
Dhruba Borthakur
c3096afd61 Introduce a new option disableDataSync for opening the database. If this is set to true, then the data written to newly created data files are not sycned to disk, instead depend on the OS to flush dirty data to stable storage. This option is good for bulk
Test Plan:
manual tests

Task ID: #

Blame Rev:

Differential Revision: https://reviews.facebook.net/D4515
2012-08-03 15:23:53 -07:00
Dhruba Borthakur
d11b637f34 bits_per_key is already configurable. It defines how many bloom bits will be used for every key in the database.
My change in this patch is to make the Hash code that is used for blooms to be confgurable. In fact,
one can specify a modified HashCode that inspects only parts of the Key to generate the Hash (used
by booms).

Test Plan: none

Differential Revision: https://reviews.facebook.net/D4059
2012-07-09 23:06:07 -07:00
Dhruba Borthakur
80c663882a Create leveldb server via Thrift.
Summary:
First draft.
Unit tests pass.

Test Plan: unit tests attached

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D3969
2012-07-07 09:42:39 -07:00
heyongqiang
7600228072 fix compile warning
Summary: as subject

Test Plan: compile

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3957
2012-07-02 17:37:45 -07:00
heyongqiang
4e4b6812ff Make some variables configurable for each db instance
Summary:
Make configurable 'targetFileSize', 'targetFileSizeMultiplier',
'maxBytesForLevelBase', 'maxBytesForLevelMultiplier',
'expandedCompactionFactor', 'maxGrandParentOverlapFactor'

Test Plan: N/A

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3801
2012-06-27 14:36:31 -07:00
Dhruba Borthakur
a35e574344 Make Leveldb save data into HDFS files. You have to set USE_HDFS in your environment variable to compile leveldb with HDFS support.
Test Plan: Run benchmark.

Differential Revision: https://reviews.facebook.net/D3549
2012-06-14 00:29:01 -07:00
Dhruba Borthakur
8f293b68a9 Support --bufferedio=[0,1] from db_bench. If bufferedio = 0, then the read code path clears the OS page cache after the IO is completed. The default remains as bufferedio=1
Summary:
Task ID: #

Blame Rev:

Test Plan: Revert Plan:

Differential Revision: https://reviews.facebook.net/D3429
2012-05-29 13:29:44 -07:00
Dhruba Borthakur
a2a0e358cb Add support to specify the number of shards for the Block cache. By default, the block cache is sharded into 16 parts.
Summary:
Task ID: #

Blame Rev:

Test Plan: Revert Plan:

Differential Revision: https://reviews.facebook.net/D3273
2012-05-16 17:23:49 -07:00
Arun Sharma
95af128225 SSE4 optimization
Summary:
This speeds up CRC computation significantly on
hardware that supports it. Enabled via -msse4.

Note: the binary won't be usable on older CPUs
that don't support the instruction.

Test Plan: crc32c_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3201
2012-05-15 10:10:01 -07:00
Arun Sharma
921a48428e Optimize for lp64
Summary:
Some code reorganization in-preparation for replacing with a hardware
instruction.

* Use u64 for some of the key types
* Use an ALIGN macro so code is easier to read

Test Plan: crc32c_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3135
2012-05-14 15:40:11 -07:00
Sanjay Ghemawat
85584d497e Added bloom filter support.
In particular, we add a new FilterPolicy class.  An instance
of this class can be supplied in Options when opening a
database.  If supplied, the instance is used to generate
summaries of keys (e.g., a bloom filter) which are placed in
sstables.  These summaries are consulted by DB::Get() so we
can avoid reading sstable blocks that are guaranteed to not
contain the key we are looking for.

This change provides one implementation of FilterPolicy
based on bloom filters.

Other changes:
- Updated version number to 1.4.
- Some build tweaks.
- C binding for CompactRange.
- A few more benchmarks: deleteseq, deleterandom, readmissing, seekrandom.
- Minor .gitignore update.
2012-04-17 08:36:46 -07:00
Sanjay Ghemawat
9013f13b15 use mmap on 64-bit machines to speed-up reads; small build fixes 2012-03-15 09:14:00 -07:00
Sanjay Ghemawat
3c8be108bf fixed issues 66 (leaking files on disk error) and 68 (no sync of CURRENT file) 2012-01-25 14:56:52 -08:00
Hans Wennborg
42fb47f6ed Pass system's CFLAGS, remove exit time destructor, sstable bug fix.
- Pass system's values of CFLAGS,LDFLAGS.
  Don't override OPT if it's already set.
  Original patch by Alessio Treglia <alessio@debian.org>:
  http://code.google.com/p/leveldb/issues/detail?id=27#c6

- Remove 1 exit time destructor from leveldb.
  See http://crbug.com/101600

- Fix problem where sstable building code would pass an
  internal key to the user comparator.

(Sync with uptream at 25436817.)
2011-11-14 17:06:16 +00:00
Hans Wennborg
36a5f8ed7f A number of fixes:
- Replace raw slice comparison with a call to user comparator.
  Added test for custom comparators.

- Fix end of namespace comments.

- Fixed bug in picking inputs for a level-0 compaction.

  When finding overlapping files, the covered range may expand
  as files are added to the input set.  We now correctly expand
  the range when this happens instead of continuing to use the
  old range.  For example, suppose L0 contains files with the
  following ranges:

      F1: a .. d
      F2:    c .. g
      F3:       f .. j

  and the initial compaction target is F3.  We used to search
  for range f..j which yielded {F2,F3}.  However we now expand
  the range as soon as another file is added.  In this case,
  when F2 is added, we expand the range to c..j and restart the
  search.  That picks up file F1 as well.

  This change fixes a bug related to deleted keys showing up
  incorrectly after a compaction as described in Issue 44.

(Sync with upstream @25072954)
2011-10-31 17:22:06 +00:00
Gabor Cselle
299ccedfec A number of bugfixes:
- Added DB::CompactRange() method.

  Changed manual compaction code so it breaks up compactions of
  big ranges into smaller compactions.

  Changed the code that pushes the output of memtable compactions
  to higher levels to obey the grandparent constraint: i.e., we
  must never have a single file in level L that overlaps too
  much data in level L+1 (to avoid very expensive L-1 compactions).

  Added code to pretty-print internal keys.

- Fixed bug where we would not detect overlap with files in
  level-0 because we were incorrectly using binary search
  on an array of files with overlapping ranges.

  Added "leveldb.sstables" property that can be used to dump
  all of the sstables and ranges that make up the db state.

- Removing post_write_snapshot support.  Email to leveldb mailing
  list brought up no users, just confusion from one person about
  what it meant.

- Fixing static_cast char to unsigned on BIG_ENDIAN platforms.

  Fixes	Issue 35 and Issue 36.

- Comment clarification to address leveldb Issue 37.

- Change license in posix_logger.h to match other files.

- A build problem where uint32 was used instead of uint32_t.

Sync with upstream @24408625
2011-10-05 16:30:28 -07:00
Hans Wennborg
213a68eb68 Sync with upstream @23860137.
Fix GCC -Wshadow warnings in LevelDB's public header files,
reported by Dustin.

Add in-memory Env implementation (helpers/memenv/*).
This enables users to create LevelDB databases in-memory.

Initialize ShardedLRUCache::last_id_ to zero.
This fixes a Valgrind warning.

(Also delete port/sha1_* which were removed upstream some time ago.)
2011-09-12 10:21:10 +01:00
gabor@google.com
e3584f9c28 Bugfix for issue 33; reduce lock contention in Get(), parallel benchmarks.
- Fix for issue 33 (non-null-terminated result from
  leveldb_property_value())

- Support for running multiple instances of a benchmark in parallel.

- Reduce lock contention on Get():
  (1) Do not hold the lock while searching memtables.
  (2) Shard block and table caches 16-ways.

  Benchmark for evaluating this change:
  $ db_bench --benchmarks=fillseq1,readrandom --threads=$n
  (fillseq1 is a small hack to make sure fillseq runs once regardless
  of number of threads specified on the command line).



git-svn-id: https://leveldb.googlecode.com/svn/trunk@49 62dab493-f737-651d-591e-8d6aee1b9529
2011-08-22 21:08:51 +00:00
gabor@google.com
ab323f7e1e Bugfixes for iterator and documentation.
- Fix bug in Iterator::Prev where it would return the wrong key.
  Fixes issues 29 and 30.

- Added a tweak to testharness to allow running just some tests.

- Fixing two minor documentation errors based on issues 28 and 25.

- Cleanup; fix namespaces of export-to-C code.
  Also fix one "const char*" vs "char*" mismatch.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@48 62dab493-f737-651d-591e-8d6aee1b9529
2011-08-16 01:21:01 +00:00
dgrogan@chromium.org
a05525d13b @23023120
git-svn-id: https://leveldb.googlecode.com/svn/trunk@47 62dab493-f737-651d-591e-8d6aee1b9529
2011-08-06 00:19:37 +00:00
gabor@google.com
f122c6dfbb Adding FreeBSD support, removing Chromium files, adding benchmark.
- LevelDB patch for FreeBSD. This resolves Issue 22.
  Contributed by dforsythe (thanks!).

- Removing Chromium-specific files.
  They are now going to live in the Chromium repository.

- Adding a benchmark page comparing LevelDB performance
  to SQLite and Kyoto Cabinet's TreeDB, along with
  code to generate the benchmarks.
  Thanks to Kevin Tseng for compiling the benchmarks,
  and Scott Hess and Mikio Hirabayashi for their
  help and advice.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@40 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-27 01:46:25 +00:00
gabor@google.com
60bd8015f2 Speed up Snappy uncompression, new Logger interface.
- Removed one copy of an uncompressed block contents changing
  the signature of Snappy_Uncompress() so it uncompresses into a
  flat array instead of a std::string.
        
  Speeds up readrandom ~10%.

- Instead of a combination of Env/WritableFile, we now have a
  Logger interface that can be easily overridden applications
  that want to supply their own logging.

- Separated out the gcc and Sun Studio parts of atomic_pointer.h
  so we can use 'asm', 'volatile' keywords for Sun Studio.




git-svn-id: https://leveldb.googlecode.com/svn/trunk@39 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-21 02:40:18 +00:00
gabor@google.com
6872ace901 Sun Studio support, and fix for test related memory fixes.
- LevelDB patch for Sun Studio
  Based on a patch submitted by Theo Schlossnagle - thanks!
  This fixes Issue 17.

- Fix a couple of test related memory leaks.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@38 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-19 23:36:47 +00:00
gabor@google.com
6699c7ebe6 Small tweaks and bugfixes for Issue 18 and 19.
Slight tweak to the no-overlap optimization: only push to
level 2 to reduce the amount of wasted space when the same
small key range is being repeatedly overwritten.

Fix for Issue 18: Avoid failure on Windows by avoiding
deletion of lock file until the end of DestroyDB().

Fix for Issue 19: Disregard sequence numbers when checking for 
overlap in sstable ranges. This fixes issue 19: when writing 
the same key over and over again, we would generate a sequence 
of sstables that were never merged together since their sequence
numbers were disjoint.

Don't ignore map/unmap error checks.

Miscellaneous fixes for small problems Sanjay found while diagnosing
issue/9 and issue/16 (corruption_testr failures).
- log::Reader reports the record type when it finds an unexpected type.
- log::Reader no longer reports an error when it encounters an expected
  zero record regardless of the setting of the "checksum" flag.
- Added a missing forward declaration.
- Documented a side-effects of larger write buffer sizes
  (longer recovery time).



git-svn-id: https://leveldb.googlecode.com/svn/trunk@37 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-15 00:20:57 +00:00
gabor@google.com
f57e23351f Platform detection during build, plus compatibility patches for machines without <cstdatomic>.
This revision adds two major changes:
1. build_detect_platform which generates build_config.mk
   with platform-dependent flags for the build process
2. /port/atomic_pointer.h with anAtomicPointerimplementation
   for platforms without <cstdatomic>

Some of this code is loosely based on patches submitted to the 
LevelDB mailing list at https://groups.google.com/forum/#!forum/leveldb
Tip of the hat to Dave Smith and Edouard A, who both sent patches.

The presence of Snappy (http://code.google.com/p/snappy/) and
cstdatomic are now both detected in the build_detect_platform
script (1.) which gets executing during make.

For (2.), instead of broadly importing atomicops_* from Chromium or
the Google performance tools, we chose to just implement AtomicPointer 
and the limited atomic load and store operations it needs. 
This resulted in much less code and fewer files - everything is 
contained in atomic_pointer.h.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@34 62dab493-f737-651d-591e-8d6aee1b9529
2011-06-29 00:30:50 +00:00
dgrogan@chromium.org
740d8b3d00 Update from upstream @21551990
* Patch LevelDB to build for OSX and iOS
* Fix race condition in memtable iterator deletion.
* Other small fixes.

git-svn-id: https://leveldb.googlecode.com/svn/trunk@29 62dab493-f737-651d-591e-8d6aee1b9529
2011-05-28 00:53:58 +00:00
dgrogan@chromium.org
da79909507 sync with upstream @ 21409451
Check the NEWS file for details of what changed.

git-svn-id: https://leveldb.googlecode.com/svn/trunk@28 62dab493-f737-651d-591e-8d6aee1b9529
2011-05-21 02:17:43 +00:00
dgrogan@chromium.org
be9f061d2f pull in hans' mac build fix
git-svn-id: https://leveldb.googlecode.com/svn/trunk@26 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-21 01:54:51 +00:00
dgrogan@chromium.org
ba6dac0e80 @20776309
* env_chromium.cc should not export symbols.
* Fix MSVC warnings.
* Removed large value support.
* Fix broken reference to documentation file

git-svn-id: https://leveldb.googlecode.com/svn/trunk@24 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-20 22:48:11 +00:00
dgrogan@chromium.org
69c6d38342 reverting disastrous MOE commit, returning to r21
git-svn-id: https://leveldb.googlecode.com/svn/trunk@23 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-19 23:11:15 +00:00
dgrogan@chromium.org
b743906eea Revision created by MOE tool push_codebase.
MOE_MIGRATION=


git-svn-id: https://leveldb.googlecode.com/svn/trunk@22 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-19 23:01:25 +00:00
dgrogan@chromium.org
b409afe968 chmod a-x
git-svn-id: https://leveldb.googlecode.com/svn/trunk@21 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-18 23:15:58 +00:00
dgrogan@chromium.org
f779e7a5d8 @20602303. Default file permission is now 755.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@20 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-12 19:38:58 +00:00
jorlow@chromium.org
4671a695fc Move include files into a leveldb subdir.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@18 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-30 18:35:40 +00:00
jorlow@chromium.org
4d66fd5af3 Upstream change.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@17 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-29 22:41:11 +00:00
jorlow@chromium.org
e2da744e12 Upstream changes.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@16 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-28 20:43:44 +00:00
jorlow@chromium.org
e11bdf1935 Upstream changes
git-svn-id: https://leveldb.googlecode.com/svn/trunk@15 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-25 20:27:43 +00:00
jorlow@chromium.org
8303bb1b33 Pull from upstream.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@14 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-22 23:24:02 +00:00
jorlow@chromium.org
6d243ebf79 Make GetTestDirectory threadsafe within Chromium and make it work on Windows.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@13 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-22 19:07:54 +00:00
jorlow@chromium.org
4bcb231187 more upstream changes
git-svn-id: https://leveldb.googlecode.com/svn/trunk@10 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-21 21:06:49 +00:00
jorlow@chromium.org
0e38925490 Sync in bug fixes
git-svn-id: https://leveldb.googlecode.com/svn/trunk@9 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-21 19:40:57 +00:00
jorlow@chromium.org
f67e15e50f Initial checkin.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@2 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-18 22:37:00 +00:00