Commit Graph

69 Commits

Author SHA1 Message Date
Mark Callaghan
70c42bf05f Adds DB::GetNextCompaction and then uses that for rate limiting db_bench
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.

Task ID: #

Blame Rev:

Test Plan:
run db_bench

Revert Plan:

Database Impact:

Memcache Impact:

Other Notes:

EImportant:

- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6243
2012-10-29 10:17:43 -07:00
Kai Liu
8965c8d0b9 Add the missing util/auto_split_logger.h
Summary:

Test Plan:

Reviewers:

CC:

Task ID: 1803577

Blame Rev:
2012-10-26 15:23:50 -07:00
Kai Liu
d50f8eb603 Enable LevelDb to create a new log file if current log file is too large.
Summary: Enable LevelDb to create a new log file if current log file is too large.

Test Plan:
Write a script and manually check the generated info LOG.

Task ID: 1803577

Blame Rev:

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

CC: zshao

Differential Revision: https://reviews.facebook.net/D6003
2012-10-26 14:55:02 -07:00
Dhruba Borthakur
cf5adc8016 db_bench was not correctly initializing the value for delete_obsolete_files_period_micros option.
Summary:
The parameter delete_obsolete_files_period_micros controls the
periodicity of deleting obsolete files. db_bench was reading in
this parameter intoa local variable called 'l' but was incorrectly
using another local variable called 'n' while setting it in the
db.options data structure.
This patch also logs the value of delete_obsolete_files_period_micros
in the LOG file at db startup time.

I am hoping that this will improve the overall write throughput drastically.

Test Plan: run db_bench

Reviewers: MarkCallaghan, heyongqiang

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6099
2012-10-19 15:10:12 -07:00
Dhruba Borthakur
aa73538f2a The deletion of obsolete files should not occur very frequently.
Summary:
The method DeleteObsolete files is a very costly methind, especially
when the number of files in a system is large. It makes a list of
all live-files and then scans the directory to compute the diff.
By default, this method is executed after every compaction run.

This patch makes it such that DeleteObsolete files is never
invoked twice within a configured period.

Test Plan: run all unit tests

Reviewers: heyongqiang, MarkCallaghan

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D6045
2012-10-16 10:26:10 -07:00
Dhruba Borthakur
f7975ac733 Implement RowLocks for assoc schema
Summary:
Each assoc is identified by (id1, assocType). This is the rowkey.
Each row has a read/write rowlock. There is statically allocated array
of 2000 read/write locks. A rowkey is murmur-hashed to one of the
read/write locks.

assocPut and assocDelete acquires the rowlock in Write mode.
The key-updates are done within the rowlock with a atomic nosync
batch write to leveldb. Then the rowlock is released and
a write-with-sync is done to sync leveldb transaction log.

Test Plan: added unit test

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5859
2012-10-03 23:19:01 -07:00
Dhruba Borthakur
c1006d4276 An configurable option to write data using write instead of mmap.
Summary:
We have seen that reading data via the pread call (instead of
mmap) is much faster on Linux 2.6.x kernels. This patch makes
an equivalent option to switch off mmaps for the write path
as well.

db_bench --mmap_write=0 will use write() instead of mmap() to
write data to a file.

This change is backward compatible, the default
option is to continue using mmap for writing to a file.

Test Plan: "make check all"

Differential Revision: https://reviews.facebook.net/D5781
2012-10-03 17:08:13 -07:00
Dhruba Borthakur
a58d48de79 Implement ReadWrite locks for leveldb
Summary:
Implement ReadWrite locks for leveldb. These will be helpful
to implement a read-modify-write operation (e.g. atomic increments).

Test Plan: does not modify any existing code

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5787
2012-10-01 22:37:39 -07:00
Dhruba Borthakur
72c45c66c6 Print the block cache size in the LOG.
Summary: Print the block cache size in the LOG.

Test Plan: run db_bench and look at LOG. This is helpful while I was debugging one use-case.

Reviewers: heyongqiang, MarkCallaghan

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5739
2012-09-29 21:39:19 -07:00
Dhruba Borthakur
ae36e509f8 The BackupAPI should also list the length of the manifest file.
Summary:
The GetLiveFiles() api lists the set of sst files and the current
MANIFEST file. But the database continues to append new data to the
MANIFEST file even when the application is backing it up to the
backup location. This means that the database-version that is
stored in the MANIFEST FILE in the backup location
does not correspond to the sst files returned by GetLiveFiles.

This API adds a new parameter to GetLiveFiles. This new parmeter
returns the current size of the MANIFEST file.

Test Plan: Unit test attached.

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5631
2012-09-25 03:13:25 -07:00
Dhruba Borthakur
9e84834eb4 Allow a configurable number of background threads.
Summary:
The background threads are necessary for compaction.
For slower storage, it might be necessary to have more than
one compaction thread per DB. This patch allows creating
a configurable number of worker threads.
The default reamins at 1 (to maintain backward compatibility).

Test Plan:
run all unit tests. changes to db-bench coming in
a separate patch.

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5559
2012-09-19 15:51:08 -07:00
heyongqiang
a8464ed820 add an option to disable seek compaction
Summary:
as subject. This diff should be good for benchmarking.

will send another diff to make it better in the case the seek compaction is enable.
In that coming diff, will not count a seek if the bloomfilter filters.

Test Plan: build

Reviewers: dhruba, MarkCallaghan

Reviewed By: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5481
2012-09-17 13:59:57 -07:00
heyongqiang
b85cdca690 add a global var leveldb::useMmapRead to enable mmap Summary:
Summary:
as subject. this can be used for benchmarking.
If we want it for some cases, we can do more changes to make this part of the option.

Test Plan: db_test

Reviewers: dhruba

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5451
2012-09-16 22:07:35 -07:00
Mark Callaghan
33323f2111 Remove use of mmap for random reads
Summary:
Reads via mmap on concurrent workloads are much slower than pread.
For example on a 24-core server with storage that can do 100k IOPS or more
I can get no more than 10k IOPS with mmap reads and 32+ threads.

Test Plan: db_bench benchmarks

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5433
2012-09-14 16:43:50 -07:00
Dhruba Borthakur
93f4952089 Ability to switch off filesystem read-aheads
Summary:
Ability to switch off filesystem read-aheads. This change is
backward-compatible: the default setting is to allow file
system read-aheads.

Test Plan: run benchmarks

Reviewers: heyongqiang, adsharma

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5391
2012-09-13 12:09:56 -07:00
Dhruba Borthakur
4028ae7d31 Do not cache readahead-pages in the OS cache.
Summary:
When posix_fadvise(offset, offset) is usedm it frees up only those
pages in that specified range. But the filesystem could have done some
read-aheads and those get cached in the OS cache.

Do not cache readahead-pages in the OS cache.

Test Plan: run db_bench benchmark.

Reviewers: vamsi, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5379
2012-09-13 10:56:02 -07:00
Dhruba Borthakur
407727b75f Fix compiler warnings. Use uint64_t instead of uint.
Summary: Fix compiler warnings. Use uint64_t instead of uint.

Test Plan: build using -Wall

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5355
2012-09-12 14:42:36 -07:00
heyongqiang
0f43aa474e put log in a seperate dir
Summary: added a new option db_log_dir, which points the log dir. Inside that dir, in order to make log names unique, the log file name is prefixed with the leveldb data dir absolute path.

Test Plan: db_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D5205
2012-09-06 17:52:08 -07:00
Dhruba Borthakur
fe93631678 Clean up compiler warnings generated by -Wall option.
Summary:
Clean up compiler warnings generated by -Wall option.
make clean all OPT=-Wall

This is a pre-requisite before making a new release.

Test Plan: compile and run unit tests

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5019
2012-08-29 14:24:51 -07:00
Dhruba Borthakur
e5fe80e4e3 The sharding of the block cache is limited to 2*20 pieces.
Summary:
The numbers of shards that the block cache is divided into is
configurable. However, if the user specifies that he/she wants
the block cache to be divided into more than 2**20 pieces, then
the system will rey to allocate a huge array of that size) that
could fail.

It is better to limit the sharding of the block cache to an
upper bound. The default sharding is 16 shards (i.e. 2**4)
and the maximum is now 2 million shards (i.e. 2**20).

Also, fixed a bug with the LRUCache where the numShardBits
should be a private member of the LRUCache object rather than
a static variable.

Test Plan:
run db_bench with --cache_numshardbits=64.

Task ID: #

Blame Rev:

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5013
2012-08-29 12:17:59 -07:00
heyongqiang
a4f9b8b49e merge 1.5
Summary:

as subject

Test Plan:

db_test table_test

Reviewers: dhruba
2012-08-28 11:43:33 -07:00
Dhruba Borthakur
fc20273e73 Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync).
Summary:
Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync).
This is needed for data durability when running on ext3 filesystems.
Added options to the benchmark db_bench to generate performance numbers
with either fsync or fdatasync enabled.

Cleaned up Makefile to build leveldb_shell only when building the thrift
leveldb server.

Test Plan: build and run benchmark

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4911
2012-08-27 21:24:17 -07:00
Dhruba Borthakur
e5a7c8e580 Log the open-options to the LOG.
Summary: Log the open-options to the LOG. Use options_ instead of options because SanitizeOptions could modify the max_file_open limit.

Test Plan: num db_bench

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4833
2012-08-22 12:22:12 -07:00
heyongqiang
21082fa13c regression for trigger compaction logic
Summary: as subject

Test Plan: manually run db_bench confirmed

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D4809
2012-08-21 18:11:21 -07:00
Dhruba Borthakur
f4e7febf22 Record the version of the source repository that was used to build the leveldb library.
Summary: Record the version of the source that we are compiling. We keep a record of the git revision in util/version.cc. This source file is then built as a regular source file as part of the compilation process. One can run "strings executable_filename | grep _build_" to find the version of the source that we used to build the executable file.

Test Plan: none

Differential Revision: https://reviews.facebook.net/D4785
2012-08-21 14:47:15 -07:00
heyongqiang
6ba1f17789 adding a scribe logger in leveldb to log leveldb deploy stats
Summary:
as subject.

A new log is written to scribe via thrift client when a new db is opened and when there is
a compaction.

a new option var scribe_log_db_stats is added.

Test Plan: manually checked using command "ptail -time 0 leveldb_deploy_stats"

Reviewers: dhruba

Differential Revision: https://reviews.facebook.net/D4659
2012-08-21 11:43:22 -07:00
Dhruba Borthakur
e56b2c5a31 Prevent concurrent multiple opens of leveldb database.
Summary:
The fcntl call cannot detect lock conflicts when invoked multiple times
from the same thread.
Use a static lockedFile Set to record the paths that are locked.
A lockfile request checks to see if htis filename already exists in
lockedFiles, if so, then it triggers an error. Otherwise, it inserts
the filename in the lockedFiles Set.
A unlock file request verifies that the filename is in the lockedFiles
set and removes it from lockedFiles set.

Test Plan: unit test attached

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D4755
2012-08-20 23:55:04 -07:00
Dhruba Borthakur
c3096afd61 Introduce a new option disableDataSync for opening the database. If this is set to true, then the data written to newly created data files are not sycned to disk, instead depend on the OS to flush dirty data to stable storage. This option is good for bulk
Test Plan:
manual tests

Task ID: #

Blame Rev:

Differential Revision: https://reviews.facebook.net/D4515
2012-08-03 15:23:53 -07:00
Dhruba Borthakur
d11b637f34 bits_per_key is already configurable. It defines how many bloom bits will be used for every key in the database.
My change in this patch is to make the Hash code that is used for blooms to be confgurable. In fact,
one can specify a modified HashCode that inspects only parts of the Key to generate the Hash (used
by booms).

Test Plan: none

Differential Revision: https://reviews.facebook.net/D4059
2012-07-09 23:06:07 -07:00
Dhruba Borthakur
80c663882a Create leveldb server via Thrift.
Summary:
First draft.
Unit tests pass.

Test Plan: unit tests attached

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D3969
2012-07-07 09:42:39 -07:00
heyongqiang
7600228072 fix compile warning
Summary: as subject

Test Plan: compile

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3957
2012-07-02 17:37:45 -07:00
heyongqiang
4e4b6812ff Make some variables configurable for each db instance
Summary:
Make configurable 'targetFileSize', 'targetFileSizeMultiplier',
'maxBytesForLevelBase', 'maxBytesForLevelMultiplier',
'expandedCompactionFactor', 'maxGrandParentOverlapFactor'

Test Plan: N/A

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3801
2012-06-27 14:36:31 -07:00
Dhruba Borthakur
a35e574344 Make Leveldb save data into HDFS files. You have to set USE_HDFS in your environment variable to compile leveldb with HDFS support.
Test Plan: Run benchmark.

Differential Revision: https://reviews.facebook.net/D3549
2012-06-14 00:29:01 -07:00
Dhruba Borthakur
8f293b68a9 Support --bufferedio=[0,1] from db_bench. If bufferedio = 0, then the read code path clears the OS page cache after the IO is completed. The default remains as bufferedio=1
Summary:
Task ID: #

Blame Rev:

Test Plan: Revert Plan:

Differential Revision: https://reviews.facebook.net/D3429
2012-05-29 13:29:44 -07:00
Dhruba Borthakur
a2a0e358cb Add support to specify the number of shards for the Block cache. By default, the block cache is sharded into 16 parts.
Summary:
Task ID: #

Blame Rev:

Test Plan: Revert Plan:

Differential Revision: https://reviews.facebook.net/D3273
2012-05-16 17:23:49 -07:00
Arun Sharma
95af128225 SSE4 optimization
Summary:
This speeds up CRC computation significantly on
hardware that supports it. Enabled via -msse4.

Note: the binary won't be usable on older CPUs
that don't support the instruction.

Test Plan: crc32c_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3201
2012-05-15 10:10:01 -07:00
Arun Sharma
921a48428e Optimize for lp64
Summary:
Some code reorganization in-preparation for replacing with a hardware
instruction.

* Use u64 for some of the key types
* Use an ALIGN macro so code is easier to read

Test Plan: crc32c_test

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D3135
2012-05-14 15:40:11 -07:00
Sanjay Ghemawat
85584d497e Added bloom filter support.
In particular, we add a new FilterPolicy class.  An instance
of this class can be supplied in Options when opening a
database.  If supplied, the instance is used to generate
summaries of keys (e.g., a bloom filter) which are placed in
sstables.  These summaries are consulted by DB::Get() so we
can avoid reading sstable blocks that are guaranteed to not
contain the key we are looking for.

This change provides one implementation of FilterPolicy
based on bloom filters.

Other changes:
- Updated version number to 1.4.
- Some build tweaks.
- C binding for CompactRange.
- A few more benchmarks: deleteseq, deleterandom, readmissing, seekrandom.
- Minor .gitignore update.
2012-04-17 08:36:46 -07:00
Sanjay Ghemawat
9013f13b15 use mmap on 64-bit machines to speed-up reads; small build fixes 2012-03-15 09:14:00 -07:00
Sanjay Ghemawat
3c8be108bf fixed issues 66 (leaking files on disk error) and 68 (no sync of CURRENT file) 2012-01-25 14:56:52 -08:00
Hans Wennborg
42fb47f6ed Pass system's CFLAGS, remove exit time destructor, sstable bug fix.
- Pass system's values of CFLAGS,LDFLAGS.
  Don't override OPT if it's already set.
  Original patch by Alessio Treglia <alessio@debian.org>:
  http://code.google.com/p/leveldb/issues/detail?id=27#c6

- Remove 1 exit time destructor from leveldb.
  See http://crbug.com/101600

- Fix problem where sstable building code would pass an
  internal key to the user comparator.

(Sync with uptream at 25436817.)
2011-11-14 17:06:16 +00:00
Hans Wennborg
36a5f8ed7f A number of fixes:
- Replace raw slice comparison with a call to user comparator.
  Added test for custom comparators.

- Fix end of namespace comments.

- Fixed bug in picking inputs for a level-0 compaction.

  When finding overlapping files, the covered range may expand
  as files are added to the input set.  We now correctly expand
  the range when this happens instead of continuing to use the
  old range.  For example, suppose L0 contains files with the
  following ranges:

      F1: a .. d
      F2:    c .. g
      F3:       f .. j

  and the initial compaction target is F3.  We used to search
  for range f..j which yielded {F2,F3}.  However we now expand
  the range as soon as another file is added.  In this case,
  when F2 is added, we expand the range to c..j and restart the
  search.  That picks up file F1 as well.

  This change fixes a bug related to deleted keys showing up
  incorrectly after a compaction as described in Issue 44.

(Sync with upstream @25072954)
2011-10-31 17:22:06 +00:00
Gabor Cselle
299ccedfec A number of bugfixes:
- Added DB::CompactRange() method.

  Changed manual compaction code so it breaks up compactions of
  big ranges into smaller compactions.

  Changed the code that pushes the output of memtable compactions
  to higher levels to obey the grandparent constraint: i.e., we
  must never have a single file in level L that overlaps too
  much data in level L+1 (to avoid very expensive L-1 compactions).

  Added code to pretty-print internal keys.

- Fixed bug where we would not detect overlap with files in
  level-0 because we were incorrectly using binary search
  on an array of files with overlapping ranges.

  Added "leveldb.sstables" property that can be used to dump
  all of the sstables and ranges that make up the db state.

- Removing post_write_snapshot support.  Email to leveldb mailing
  list brought up no users, just confusion from one person about
  what it meant.

- Fixing static_cast char to unsigned on BIG_ENDIAN platforms.

  Fixes	Issue 35 and Issue 36.

- Comment clarification to address leveldb Issue 37.

- Change license in posix_logger.h to match other files.

- A build problem where uint32 was used instead of uint32_t.

Sync with upstream @24408625
2011-10-05 16:30:28 -07:00
Hans Wennborg
213a68eb68 Sync with upstream @23860137.
Fix GCC -Wshadow warnings in LevelDB's public header files,
reported by Dustin.

Add in-memory Env implementation (helpers/memenv/*).
This enables users to create LevelDB databases in-memory.

Initialize ShardedLRUCache::last_id_ to zero.
This fixes a Valgrind warning.

(Also delete port/sha1_* which were removed upstream some time ago.)
2011-09-12 10:21:10 +01:00
gabor@google.com
e3584f9c28 Bugfix for issue 33; reduce lock contention in Get(), parallel benchmarks.
- Fix for issue 33 (non-null-terminated result from
  leveldb_property_value())

- Support for running multiple instances of a benchmark in parallel.

- Reduce lock contention on Get():
  (1) Do not hold the lock while searching memtables.
  (2) Shard block and table caches 16-ways.

  Benchmark for evaluating this change:
  $ db_bench --benchmarks=fillseq1,readrandom --threads=$n
  (fillseq1 is a small hack to make sure fillseq runs once regardless
  of number of threads specified on the command line).



git-svn-id: https://leveldb.googlecode.com/svn/trunk@49 62dab493-f737-651d-591e-8d6aee1b9529
2011-08-22 21:08:51 +00:00
gabor@google.com
ab323f7e1e Bugfixes for iterator and documentation.
- Fix bug in Iterator::Prev where it would return the wrong key.
  Fixes issues 29 and 30.

- Added a tweak to testharness to allow running just some tests.

- Fixing two minor documentation errors based on issues 28 and 25.

- Cleanup; fix namespaces of export-to-C code.
  Also fix one "const char*" vs "char*" mismatch.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@48 62dab493-f737-651d-591e-8d6aee1b9529
2011-08-16 01:21:01 +00:00
dgrogan@chromium.org
a05525d13b @23023120
git-svn-id: https://leveldb.googlecode.com/svn/trunk@47 62dab493-f737-651d-591e-8d6aee1b9529
2011-08-06 00:19:37 +00:00
gabor@google.com
f122c6dfbb Adding FreeBSD support, removing Chromium files, adding benchmark.
- LevelDB patch for FreeBSD. This resolves Issue 22.
  Contributed by dforsythe (thanks!).

- Removing Chromium-specific files.
  They are now going to live in the Chromium repository.

- Adding a benchmark page comparing LevelDB performance
  to SQLite and Kyoto Cabinet's TreeDB, along with
  code to generate the benchmarks.
  Thanks to Kevin Tseng for compiling the benchmarks,
  and Scott Hess and Mikio Hirabayashi for their
  help and advice.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@40 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-27 01:46:25 +00:00
gabor@google.com
60bd8015f2 Speed up Snappy uncompression, new Logger interface.
- Removed one copy of an uncompressed block contents changing
  the signature of Snappy_Uncompress() so it uncompresses into a
  flat array instead of a std::string.
        
  Speeds up readrandom ~10%.

- Instead of a combination of Env/WritableFile, we now have a
  Logger interface that can be easily overridden applications
  that want to supply their own logging.

- Separated out the gcc and Sun Studio parts of atomic_pointer.h
  so we can use 'asm', 'volatile' keywords for Sun Studio.




git-svn-id: https://leveldb.googlecode.com/svn/trunk@39 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-21 02:40:18 +00:00
gabor@google.com
6872ace901 Sun Studio support, and fix for test related memory fixes.
- LevelDB patch for Sun Studio
  Based on a patch submitted by Theo Schlossnagle - thanks!
  This fixes Issue 17.

- Fix a couple of test related memory leaks.



git-svn-id: https://leveldb.googlecode.com/svn/trunk@38 62dab493-f737-651d-591e-8d6aee1b9529
2011-07-19 23:36:47 +00:00