Summary:
`MemTableList::Add()` neglected to mention that it took ownership of the reference held by its caller.
The comment in `MemTable::Get()` was wrong in describing the format of the key.
Test Plan: None
Reviewers: dhruba, sheki, emayanke, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7755
Summary:
There was a bug in the ExtendOverlappingInputs method so that
the terminating condition for the backward search was incorrect.
Test Plan: make clean check
Reviewers: sheki, emayanke, MarkCallaghan
Reviewed By: MarkCallaghan
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7725
Summary:
Leveldb has an api OpenForReadOnly() that opens the database
in readonly mode. This call had an option to not process the
transaction log. This patch removes this option and always
processes all transactions that had been committed. It has
been done in such a way that it does not create/write to
any new files in the process. The invariant of "no-writes"
to the leveldb data directory is still true.
This enhancement allows multiple threads to open the same database
in readonly mode and access all trancations that were committed right
upto the OpenForReadOnly call.
I changed the public API to match the new semantics because
there are no users who are currently using this api.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7479
Summary:
1. The OpenForReadOnly() call should not lock the db. This is useful
so that multiple processes can open the same database concurrently
for reading.
2. GetUpdatesSince should not error out if the archive directory
does not exist.
3. A new constructor for WriteBatch that can takes a serialized
string as a parameter of the constructor.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7449
Summary:
Added kMetaDatabase for meta-databases in db/filename.h along with supporting
fuctions.
Fixed switch in DBImpl so that it also handles kMetaDatabase.
Fixed DestroyDB() that it can handle destroying meta-databases.
Test Plan: make check
Reviewers: sheki, emayanke, vamsi, dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D7245
Summary:
C tests would fail sometimes as DestroyDB would return a Failure Status
message when deleting an archival directory which was not created
(WAL_ttl_seconds = 0).
Fix: Ignore the Status returned on Deleting Archival Directory.
Test Plan: * make check
Reviewers: dhruba, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7395
Summary:
WriteBatch is now used by the GetUpdatesSinceAPI. This API is external
and will be used by the rocks server. Rocks Server and others will need
to know about the Sequence Number in the WriteBatch. This public method
will allow for that.
Test Plan: make all check.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7293
Summary:
* Fixed implementation bug in Binary_Searvch introduced in https://reviews.facebook.net/D7119
* Binary search is also overflow safe.
* Delete archive log files and archive dir during DestroyDB
Test Plan: make check
Reviewers: dhruba
CC: kosievdmerwe, emayanke
Differential Revision: https://reviews.facebook.net/D7263
Summary:
Implement a interface to retrieve the most current transaction
id from the database.
Test Plan: Added unit test.
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7269
Summary:
filename.h has functions to do similar things.
Moving code away from db_impl.cc
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D7251
Summary:
How it works:
* GetUpdatesSince takes a SequenceNumber.
* A LogFile with the first SequenceNumber nearest and lesser than the requested Sequence Number is found.
* Seek in the logFile till the requested SeqNumber is found.
* Return an iterator which contains logic to return record's one by one.
Test Plan:
* Test case included to check the good code path.
* Will update with more test-cases.
* Feedback required on test-cases.
Reviewers: dhruba, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7119
Summary:
A compaction is picked based on its score. It is useful to
print the compaction score in the LOG because it aids in
debugging. If one looks at the logs, one can find out why
a compaction was preferred over another.
Test Plan: make clean check
Differential Revision: https://reviews.facebook.net/D7137
Summary:
Create a directory "archive" in the DB directory.
During DeleteObsolteFiles move the WAL files (*.log) to the Archive directory,
instead of deleting.
Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move.
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6975
Summary:
Scripted and removed all trailing spaces and converted all tabs to
spaces.
Also fixed other lint errors.
All lint errors from this point of time should be taken seriously.
Test Plan: make all check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D7059
Summary:
LevelDB should delete almost-new keys when a long-open snapshot exists.
The previous behavior is to keep all versions that were created after the
oldest open snapshot. This can lead to database size bloat for
high-update workloads when there are long-open snapshots and long-open
snapshot will be used for logical backup. By "almost new" I mean that the
key was updated more than once after the oldest snapshot.
If there were two snapshots with seq numbers s1 and s2 (s1 < s2), and if
we find two instances of the same key k1 that lie entirely within s1 and
s2 (i.e. s1 < k1 < s2), then the earlier version
of k1 can be safely deleted because that version is not visible in any snapshot.
Test Plan:
unit test attached
make clean check
Differential Revision: https://reviews.facebook.net/D6999
Summary:
Print out status at the end of a compaction run. This helps in
debugging.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
Differential Revision: https://reviews.facebook.net/D7035
Summary:
When we expand the range of keys for a level 0 compaction, we
need to invoke ParentFilesInCompaction() only once for the
entire range of keys that is being compacted. We were invoking
it for each file that was being compacted, but this triggers
an assertion because each file's range were contiguous but
non-overlapping.
I renamed ParentFilesInCompaction to ParentRangeInCompaction
to adequately represent that it is the range-of-keys and
not individual files that we compact in a single compaction run.
Here is the assertion that is fixed by this patch.
db_test: db/version_set.cc:585: void leveldb::Version::ExtendOverlappingInputs(int, const leveldb::Slice&, const leveldb::Slice&, std::vector<leveldb::FileMetaData*, std::allocator<leveldb::FileMetaData*> >*, int): Assertion `user_cmp->Compare(flimit, user_begin) >= 0' failed.
Test Plan: make clean check OPT=-g
Reviewers: sheki
Reviewed By: sheki
CC: MarkCallaghan, emayanke, leveldb
Differential Revision: https://reviews.facebook.net/D6963
Summary:
On fast filesystems (e.g. /dev/shm and ext4), the flushing
of memstore to disk was fast and quick, and the background compaction
thread was not getting scheduled fast enough to delete obsolete
files before the db was closed. This caused the repair method
to pick up those files that were not part of the db and the unit
test was failing.
The fix is to enhance the unti test to run a compaction before
closing the database so that all files that are not part of the
database are truly deleted from the filesystem.
Test Plan: make c_test; ./c_test
Reviewers: chip, emayanke, sheki
Reviewed By: chip
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6915
Summary:
The compaction process takes some files from LevelK and
merges it into LevelK+1. The number of files it picks from
LevelK was capped such a way that the total amount of
data picked does not exceed the maxfilesize of that level.
This essentially meant that only one file from LevelK
is picked for a single compaction.
For bulkloads, we would like to take many many file from
LevelK and compact them using a single compaction run.
This patch introduces a option called the 'source_compaction_factor'
(similar to expanded_compaction_factor). It is a multiplier
that is multiplied by the maxfilesize of that level to arrive
at the limit that is used to throttle the number of source
files from LevelK. For bulk loads, set source_compaction_factor
to a very high number so that multiple files from the same
level are picked for compaction in a single compaction.
The default value of source_compaction_factor is 1, so that
we can keep backward compatibilty with existing compaction semantics.
Test Plan: make clean check
Reviewers: emayanke, sheki
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6867
Summary:
This option is needed for fast bulk uploads. The goal is to load
all the data into files in L0 without any interference from
background compactions.
Test Plan: make clean check
Reviewers: sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6849
Summary:
The method Finalize() recomputes the compaction score of each
level and then sorts these score from largest to smallest. The
idea is that the level with the largest compaction score will
be a better candidate for compaction. There are usually very
few levels, and a bubble sort code was used to sort these
compaction scores. There existed a bug in the sorting code that
skipped looking at the score for the n-1 level. This meant that
even if the compaction score of the n-1 level is large, it will
not be picked for compaction.
This patch fixes the bug and also introduces "asserts" in the
code to detect any possible inconsistencies caused by future bugs.
This bug existed in the very first code change that introduced
multi-threaded compaction to the leveldb code. That version of
code was committed on Oct 19th via
1ca0584345
Test Plan: make clean check OPT=-g
Reviewers: emayanke, sheki, MarkCallaghan
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6837
Summary:
The manifest file contains a series of edits. If the verbose
option is switched on, then print each individual edit in the
manifest file. This helps in debugging.
Test Plan: make clean manifest_dump
Reviewers: emayanke, sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6807
Summary: The new function MinLevelToCompress in db_test.cc was incomplete. It needs to tell the calling function-TEST whether the test has to be skipped or not
Test Plan: make all;./db_test
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: sheki
Differential Revision: https://reviews.facebook.net/D6771
Summary:
The manifest file contains a series of edits. If the verbose
option is switched on, then print each individual edit in the
manifest file. This helps in debugging.
Test Plan: make clean manifest_dump
Reviewers: emayanke, sheki
Reviewed By: sheki
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6807
Summary:
dbstress has an option to reopen the database. Make it such that the
previous handle is not closed before we reopen, this simulates a
situation similar to a process crash.
Added new api to DMImpl to remove the lock file.
Test Plan: run db_stress
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6777
Summary: The new function MinLevelToCompress in db_test.cc was incomplete. It needs to tell the calling function-TEST whether the test has to be skipped or not
Test Plan: make all;./db_test
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
CC: sheki
Differential Revision: https://reviews.facebook.net/D6771
Summary:
The value specified in max_grandparent_overlap_factor is used to
limit the file size in a compaction run. This patch makes it
configurable when using db_bench.
Test Plan: make clean db_bench
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D6729
Summary:
There are applications that operate on multiple leveldb instances.
These applications will like to pass in an opaque type for each
leveldb instance and this type should be passed back to the application
with every invocation of the CompactionFilter api.
Test Plan: Enehanced unit test for opaque parameter to CompactionFilter.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, sheki, emayanke
Differential Revision: https://reviews.facebook.net/D6711
Summary:
Compilation used to fail with the error:
db/version_set.cc:1773: error: ‘number_of_files_to_sort_’ is not a member of ‘leveldb::VersionSet’
I created a new method called CheckConsistencyForDeletes() so that
all the high cost checking is done only when OPT=-g is specified.
I also fixed a bug in PickCompactionBySize that was triggered when
OPT=-g was switched on. The base_index in the compaction record
was not set correctly.
Test Plan: make check OPT=-g
Differential Revision: https://reviews.facebook.net/D6687
Summary:
The db_bench utility was broken in 1.5.4.fb because of a
signed-unsigned comparision.
The static variable FLAGS_min_level_to_compress was recently
changed from int to 'unsigned in' but it is initilized to a
nagative value -1.
The segfault is of this type:
Program received signal SIGSEGV, Segmentation fault.
Open (this=0x7fffffffdee0) at db/db_bench.cc:939
939 db/db_bench.cc: No such file or directory.
(gdb) where
Test Plan: run db_bench with no options.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6663
Summary:
make clean check OPT=-g fails
leveldb::DBStatistics::getTickerCount(leveldb::Tickers)’:
./db/db_statistics.h:34: error: ‘MAX_NO_TICKERS’ was not declared in this scope
util/ldb_cmd.cc:255: warning: left shift count >= width of type
Test Plan:
make clean check OPT=-g
Reviewers:
CC:
Task ID: #
Blame Rev:
Summary: Record BloomFliter hits and drop off reasons during compaction.
Test Plan: Unit tests work.
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6591
Summary:
disable size compaction in ldb reduce_levels, this will avoid compactions rather than the manual comapction,
added --compression=none|snappy|zlib|bzip2 and --file_size= per-file size to ldb reduce_levels command
Test Plan: run ldb
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: sheki, emayanke
Differential Revision: https://reviews.facebook.net/D6597
Summary:
Prototype stat's collection. Diff is a good estimate of what
the final code will look like.
A few assumptions :
* Used a global static instance of the statistics object. Plan to pass
it to each internal function. Static allows metrics only at app
level.
* In the Ticker's do not do any locking. Depend on the mutex at each
function of LevelDB. If we ever remove the mutex, we should change
here too. The other option is use atomic objects anyways as there
won't be any contention as they will be always acquired only by one
thread.
* The counters are dumb, increment through lifecycle. Plan to use ods
etc to get last5min stat etc.
Test Plan:
made changes in db_bench
Ran ./db_bench --statistics=1 --num=10000 --cache_size=5000
This will print the cache hit/miss stats.
Reviewers: dhruba, heyongqiang
Differential Revision: https://reviews.facebook.net/D6441
Summary:
When a new version is created, we sort all the files at every
level based on their size. This is necessary because we want
to compact the largest file first. The sorting takes quite a
bit of CPU.
Moved the sorting code to be outside the mutex. Also, the
earlier code was sorting files at all levels but we do not
need to sort the highest-number level because those files
are never the cause of any compaction. To reduce sorting
costs, we sort only the first few files in each level
because it is likely that those are the only files in that
level that will be picked for compaction.
At steady state, I have seen that this patch increase
throughout from 1500 writes/sec to 1700 writes/sec at the
end of a 72 hour run. The cpu saving by not sorting the
last level was not distinctive in this test run because
there were only 100K files in the highest numbered level.
I expect the cpu saving to be significant when the number of
files is much higher.
This is mostly an early preview and not ready for rigorous review.
With this patch, the writs/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs.
Test Plan: make check
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D6411
Summary:
The Version::GetOverlappingInputs() is called multiple times in
the compaction code path. Eack invocation does a binary search
for overlapping files in the specified key range.
This patch remembers the offset of an overlapped file when
GetOverlappingInputs() is called the first time within
a compaction run. Suceeding calls to GetOverlappingInputs()
uses the remembered index to avoid the binary search.
I measured that 1000 iterations of GetOverlappingInputs
takes around 4500 microseconds without this patch. If I use
this patch with the hint on every invocation, then 1000
iterations take about 3900 microsecond.
Test Plan: make check OPT=-g
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6513
Summary:
Added a conditional flush in ~DBImpl to flush.
There is still a chance of writes not being persisted if there is a
crash (not a clean shutdown) before the DBImpl instance is destroyed.
Test Plan: modified db_test to meet the new expectations.
Reviewers: dhruba, heyongqiang
Differential Revision: https://reviews.facebook.net/D6519
Summary:
The default compilation process now uses "-Wall" to compile.
Fix all compilation error generated by gcc.
Test Plan: make all check
Reviewers: heyongqiang, emayanke, sheki
Reviewed By: heyongqiang
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6525
Summary:
The method Version::GetOverlappingInputs used a sequential search
to map a kay-range to a set of files. But the files are arranged
in ascending order of key, so a biary search is more effective.
This patch implements Version::GetOverlappingInputsBinarySearch
that finds one file that corresponds to the specified key range
and then iterates backwards and forwards to find all overlapping
files.
This patch is critical for making compactions efficient, especially
when there are thousands of files in a single level.
I measured that 1000 iterations of TEST_MaxNextLevelOverlappingBytes
takes 16000 microseconds without this patch. With this patch, the
same method takes about 4600 microseconds.
Test Plan: Almost all unit tests in db_test uses this method to lookup keys.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6465
Summary:
There are certain use-cases where the application intends to
delete older keys aftre they have expired a certian time period.
One option for those applications is to periodically scan the
entire database and delete appropriate keys.
A better way is to allow the application to hook into the
compaction process. This patch allows the application to set
a method callback for every key that is being compacted. If
this method returns true, then the key is not preserved in
the output of the compaction.
Test Plan:
This is mostly to preview the proposed new public api.
Since it is a public api, please do due diligence on reviewing it.
I will be writing test cases for this api in mynext version of
this patch.
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
CC: sheki, adsharma
Differential Revision: https://reviews.facebook.net/D6285
Summary: as subject.
Test Plan: manually test it, will add a testcase
Reviewers: dhruba, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6345
Summary: Leveldb currently uses windowBits=-14 while using zlib compression.(It was earlier 15). This makes the setting configurable. Related changes here: https://reviews.facebook.net/D6105
Test Plan: make all check
Reviewers: dhruba, MarkCallaghan, sheki, heyongqiang
Differential Revision: https://reviews.facebook.net/D6393
Summary:
as subject
Test Plan:
run db_bench and db_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6111
Summary:
This makes the stall timers in MakeRoomForWrite more accurate by timing
the sleeps. From looking at the logs the real sleep times are usually
about 2000 usecs each when SleepForMicros(1000) is called. The modified LOG messages are:
2012/10/29-12:06:33.271984 2b3cc872f700 delaying write 13 usecs for level0_slowdown_writes_trigger
2012/10/29-12:06:34.688939 2b3cc872f700 delaying write 1728 usecs for rate limits with max score 3.83
Task ID: #
Blame Rev:
Test Plan:
run db_bench, look at DB/LOG
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6297
Summary:
The leveldb API is enhanced to support different compression algorithms at
different levels.
This adds the option min_level_to_compress to db_bench that specifies
the minimum level for which compression should be done when
compression is enabled. This can be used to disable compression for levels
0 and 1 which are likely to suffer from stalls because of the CPU load
for memtable flushes and (L0,L1) compaction. Level 0 is special as it
gets frequent memtable flushes. Level 1 is special as it frequently
gets all:all file compactions between it and level 0. But all other levels
could be the same. For any level N where N > 1, the rate of sequential
IO for that level should be the same. The last level is the
exception because it might not be full and because files from it are
not read to compact with the next larger level.
The same amount of time will be spent doing compaction at any
level N excluding N=0, 1 or the last level. By this standard all
of those levels should use the same compression. The difference is that
the loss (using more disk space) from a faster compression algorithm
is less significant for N=2 than for N=3. So we might be willing to
trade disk space for faster write rates with no compression
for L0 and L1, snappy for L2, zlib for L3. Using a faster compression
algorithm for the mid levels also allows us to reclaim some cpu
without trading off much loss in disk space overhead.
Also note that little is to be gained by compressing levels 0 and 1. For
a 4-level tree they account for 10% of the data. For a 5-level tree they
account for 1% of the data.
With compression enabled:
* memtable flush rate is ~18MB/second
* (L0,L1) compaction rate is ~30MB/second
With compression enabled but min_level_to_compress=2
* memtable flush rate is ~320MB/second
* (L0,L1) compaction rate is ~560MB/second
This practicaly takes the same code from https://reviews.facebook.net/D6225
but makes the leveldb api more general purpose with a few additional
lines of code.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6261
Summary:
Adds the "MB/sec in" and "MB/sec out" to this line:
Amplification: 1.7 rate, 0.01 GB in, 0.02 GB out, 8.24 MB/sec in, 13.75 MB/sec out
Changes all values to be reported per interval and since test start for this line:
... thread 0: (10000,60000) ops and (19155.6,27307.5) ops/second in (0.522041,2.197198) seconds
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6291
Summary:
A previous commit 4c107587ed introduced
the idea that some version updates might not delete obsolete files.
This means that if a unit test blindly counts the number of files
in the db directory it might not represent the true state of the database.
Use GetLiveFiles() insteads to count the number of live files in the database.
Test Plan:
make check
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6243
Summary: Enable LevelDb to create a new log file if current log file is too large.
Test Plan:
Write a script and manually check the generated info LOG.
Task ID: 1803577
Blame Rev:
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: zshao
Differential Revision: https://reviews.facebook.net/D6003
Summary:
I used server uptime to compute per-level IO throughput rates. I
intended to use time spent doing compaction at that level. This fixes that.
Task ID: #
Blame Rev:
Test Plan:
run db_bench, look at results
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6237
Summary:
A previous commit 4c107587ed introduced
the idea that some version updates might not delete obsolete files.
This means that if a unit test blindly counts the number of files
in the db directory it might not represent the true state of the database.
Use GetLiveFiles() insteads to count the number of live files in the database.
Test Plan: make check
Reviewers: heyongqiang, MarkCallaghan
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6207
Summary:
It is best if we pick the largest file to compact in a level.
This reduces the write amplification factor for compactions.
Each level has an auxiliary data structure called files_by_size_
that sorts all files by their size. This data structure is
updated when a new version is created.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6195
Summary:
Prior to multi-threaded compaction, wrap-around would be done by using
current_->files_[level[0]. With this change we should be
using the first file for which f->being_compacted is not true.
1ca0584345 (commitcomment-2041516)
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6165
Summary:
This adds more statistics to be reported by GetProperty("leveldb.stats").
The new stats include time spent waiting on stalls in MakeRoomForWrite.
This also includes the total amplification rate where that is:
(#bytes of sequential IO during compaction) / (#bytes from Put)
This also includes a lot more data for the per-level compaction report.
* Rn(MB) - MB read from level N during compaction between levels N and N+1
* Rnp1(MB) - MB read from level N+1 during compaction between levels N and N+1
* Wnew(MB) - new data written to the level during compaction
* Amplify - ( Write(MB) + Rnp1(MB) ) / Rn(MB)
* Rn - files read from level N during compaction between levels N and N+1
* Rnp1 - files read from level N+1 during compaction between levels N and N+1
* Wnp1 - files written to level N+1 during compaction between levels N and N+1
* NewW - new files written to level N+1 during compaction
* Count - number of compactions done for this level
This is the new output from DB::GetProperty("leveldb.stats"). The old output stopped at Write(MB)
Compactions
Level Files Size(MB) Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count
-------------------------------------------------------------------------------------------------------------------------------------
0 3 6 33 0 576 0 0 576 -1.0 0.0 1.3 0 0 0 0 290
1 127 242 351 5316 5314 570 4747 567 17.0 12.1 12.1 287 2399 2685 286 32
2 161 328 54 822 824 326 496 328 4.0 1.9 1.9 160 251 411 160 161
Amplification: 22.3 rate, 0.56 GB in, 12.55 GB out
Uptime(secs): 439.8
Stalls(secs): 206.938 level0_slowdown, 0.000 level0_numfiles, 24.129 memtable_compaction
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
(cherry picked from commit ecdeead38f86cc02e754d0032600742c4f02fec8)
Reviewers: dhruba
Differential Revision: https://reviews.facebook.net/D6153
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6123
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6123
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.
Test Plan:
Reviewers:
CC:
Task ID: #
Blame Rev:
Summary:
The parameter delete_obsolete_files_period_micros controls the
periodicity of deleting obsolete files. db_bench was reading in
this parameter intoa local variable called 'l' but was incorrectly
using another local variable called 'n' while setting it in the
db.options data structure.
This patch also logs the value of delete_obsolete_files_period_micros
in the LOG file at db startup time.
I am hoping that this will improve the overall write throughput drastically.
Test Plan: run db_bench
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6099
published in https://reviews.facebook.net/D5997.
Summary:
This patch allows compaction to occur in multiple background threads
concurrently.
If a manual compaction is issued, the system falls back to a
single-compaction-thread model. This is done to ensure correctess
and simplicity of code. When the manual compaction is finished,
the system resumes its concurrent-compaction mode automatically.
The updates to the manifest are done via group-commit approach.
Test Plan: run db_bench
Summary:
The method DeleteObsolete files is a very costly methind, especially
when the number of files in a system is large. It makes a list of
all live-files and then scans the directory to compute the diff.
By default, this method is executed after every compaction run.
This patch makes it such that DeleteObsolete files is never
invoked twice within a configured period.
Test Plan: run all unit tests
Reviewers: heyongqiang, MarkCallaghan
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6045
Summary: Enhance db_bench to allow setting the number of levels in a database.
Test Plan: run db_bench and look at LOG
Reviewers: heyongqiang, MarkCallaghan
Reviewed By: MarkCallaghan
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6027
Summary:
We have seen that reading data via the pread call (instead of
mmap) is much faster on Linux 2.6.x kernels. This patch makes
an equivalent option to switch off mmaps for the write path
as well.
db_bench --mmap_write=0 will use write() instead of mmap() to
write data to a file.
This change is backward compatible, the default
option is to continue using mmap for writing to a file.
Test Plan: "make check all"
Differential Revision: https://reviews.facebook.net/D5781
Summary:
The option is zero by default and in that case reporting is unchanged.
By unchanged, the interval at which stats are reported is scaled after each
report and newline is not issued after each report so one line is rewritten.
When non-zero it specifies the constant interval (in operations) at which
statistics are reported and the stats include the rate per interval. This
makes it easier to determine whether QPS changes over the duration of the test.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
CC: heyongqiang
Differential Revision: https://reviews.facebook.net/D5817
Summary:
A simple CLI which calles DB->CompactRange()
Can take String key's as range.
Test Plan:
Inserted data into a table.
Waited for a minute, used compact tool on it. File modification time's
changed so Compact did something on the files.
Existing unit tests work.
Reviewers: heyongqiang, dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D5697
Summary:
In the current code, a Get() call can trigger compaction if it has to look at more than one file. This causes unnecessary compaction because looking at more than one file is a penalty only if the file is not yet in the cache. Also, th current code counts these files before the bloom filter check is applied.
This patch counts a 'seek' only if the file fails the bloom filter
check and has to read in data block(s) from the storage.
This patch also counts a 'seek' if a file is not present in the file-cache, because opening a file means that its index blocks need to be read into cache.
Test Plan: unit test attached. I will probably add one more unti tests.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D5709
Summary:
If ReadCompaction is switched off, then it is better to not even
submit background compaction jobs. I see about 3% increase in
read-throughput on a pure memory database.
Test Plan: run db_bench
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5673
Summary:
The GetLiveFiles() api lists the set of sst files and the current
MANIFEST file. But the database continues to append new data to the
MANIFEST file even when the application is backing it up to the
backup location. This means that the database-version that is
stored in the MANIFEST FILE in the backup location
does not correspond to the sst files returned by GetLiveFiles.
This API adds a new parameter to GetLiveFiles. This new parmeter
returns the current size of the MANIFEST file.
Test Plan: Unit test attached.
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5631
Summary: Print out the compile version in the LOG.
Test Plan: run dbbench and verify LOG
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5529
Summary:
as subject. This diff should be good for benchmarking.
will send another diff to make it better in the case the seek compaction is enable.
In that coming diff, will not count a seek if the bloomfilter filters.
Test Plan: build
Reviewers: dhruba, MarkCallaghan
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D5481
Summary:
A set of apis that allows an application to backup data from the
leveldb database based on a set of files.
Test Plan: unint test attached. more coming soon.
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5439
Summary:
as subject. this can be used for benchmarking.
If we want it for some cases, we can do more changes to make this part of the option.
Test Plan: db_test
Reviewers: dhruba
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D5451
Summary:
FLAGS_cache_size is a long, no need to scan %lld into a size_t
for it (which generates a compiler warning)
Test Plan: run db_bench
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: heyongqiang
Differential Revision: https://reviews.facebook.net/D5427
Summary:
This adds an option to db_bench to specify the compression algorithm to
use for LevelDB
Test Plan: ran db_bench
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D5421
Summary:
Ability to switch off filesystem read-aheads. This change is
backward-compatible: the default setting is to allow file
system read-aheads.
Test Plan: run benchmarks
Reviewers: heyongqiang, adsharma
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5391
Summary: Enable db_bench to specify block size.
Test Plan: compile and run
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5373
Summary: added a new option db_log_dir, which points the log dir. Inside that dir, in order to make log names unique, the log file name is prefixed with the leveldb data dir absolute path.
Test Plan: db_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D5205
Summary: If none of reads or writes are specified by user, then pick the FLAGS_NUM as the number of iterations in the ReadRandomWriteRandom test. If either reads or writes are defined, then use their maximum.
Test Plan: run benchmark
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5217
Summary:
This patch enables the db_bench benchmark to issue both random reads and random writes at the same time. This options can be trigged via
./db_bench --benchmarks=readrandomwriterandom
The default percetage of reads is 90.
One can change the percentage of reads by specifying the --readwritepercent.
./db_bench --benchmarks=readrandomwriterandom=50
This is a feature request from Jeffro asking for leveldb performance with a 90:10 read:write ratio.
Test Plan: run on test machine.
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5067
Summary:
Clean up compiler warnings generated by -Wall option.
make clean all OPT=-Wall
This is a pre-requisite before making a new release.
Test Plan: compile and run unit tests
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5019
Summary:
The numbers of shards that the block cache is divided into is
configurable. However, if the user specifies that he/she wants
the block cache to be divided into more than 2**20 pieces, then
the system will rey to allocate a huge array of that size) that
could fail.
It is better to limit the sharding of the block cache to an
upper bound. The default sharding is 16 shards (i.e. 2**4)
and the maximum is now 2 million shards (i.e. 2**20).
Also, fixed a bug with the LRUCache where the numShardBits
should be a private member of the LRUCache object rather than
a static variable.
Test Plan:
run db_bench with --cache_numshardbits=64.
Task ID: #
Blame Rev:
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D5013
Summary: as subject. ported the change from google code leveldb 1.5
Test Plan: run db_test
Reviewers: dhruba
Differential Revision: https://reviews.facebook.net/D4839
Summary:
Introduce a new method Env->Fsync() that issues fsync (instead of fdatasync).
This is needed for data durability when running on ext3 filesystems.
Added options to the benchmark db_bench to generate performance numbers
with either fsync or fdatasync enabled.
Cleaned up Makefile to build leveldb_shell only when building the thrift
leveldb server.
Test Plan: build and run benchmark
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D4911
Summary:
as subject
add a tool to read sst file
as subject.
./sst_reader --command=check --file=
./sst_reader --command=scan --file=
Test Plan:
db_test
run this command
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D4881
Summary: Log the open-options to the LOG. Use options_ instead of options because SanitizeOptions could modify the max_file_open limit.
Test Plan: num db_bench
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D4833
Summary:
Fixed unit test c_test by initializing logger=NULL.
Removed "atomic" from last_log_ts so that unit tests do not require C11 compiler.
Anyway, last_log_ts is mostly used for logging, so it is ok if it is loosely
accurate.
Test Plan: run c_test
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D4803
Summary: Record the version of the source that we are compiling. We keep a record of the git revision in util/version.cc. This source file is then built as a regular source file as part of the compilation process. One can run "strings executable_filename | grep _build_" to find the version of the source that we used to build the executable file.
Test Plan: none
Differential Revision: https://reviews.facebook.net/D4785
Summary:
as subject.
A new log is written to scribe via thrift client when a new db is opened and when there is
a compaction.
a new option var scribe_log_db_stats is added.
Test Plan: manually checked using command "ptail -time 0 leveldb_deploy_stats"
Reviewers: dhruba
Differential Revision: https://reviews.facebook.net/D4659
Summary:
The fcntl call cannot detect lock conflicts when invoked multiple times
from the same thread.
Use a static lockedFile Set to record the paths that are locked.
A lockfile request checks to see if htis filename already exists in
lockedFiles, if so, then it triggers an error. Otherwise, it inserts
the filename in the lockedFiles Set.
A unlock file request verifies that the filename is in the lockedFiles
set and removes it from lockedFiles set.
Test Plan: unit test attached
Reviewers: heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D4755
Summary: as subject and only maintain 10 log files.
Test Plan: new test in db_test
Reviewers: dhruba
Differential Revision: https://reviews.facebook.net/D4731
Summary: as subject. The flush will flush everything in the db.
Test Plan: new test in db_test.cc
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D4029
Summary:
Added option --writes=xxx to specify the number of keys that we want to overwrite in the benchmark.
Task ID: #
Blame Rev:
Test Plan: Revert Plan:
Reviewers: adsharma
CC: sc
Differential Revision: https://reviews.facebook.net/D3465
Summary:
The db_bench test was not using the specified value for the max-file-open. Fixed.
The fs readhead is switched off.
Gather statistics about the table cache and print it out at the end of the tets run.
Test Plan: Revert Plan:
Reviewers: adsharma, sc
Reviewed By: adsharma
Differential Revision: https://reviews.facebook.net/D3441
Summary:
skiplist doesn't cache the location of the last insert and becomes
CPU bound when the input data has sequential keys.
Notes on thread safety: ::Insert() already requires external
synchronization. So this change is not making it any worse.
Test Plan: skiplist_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D3129
In particular, we add a new FilterPolicy class. An instance
of this class can be supplied in Options when opening a
database. If supplied, the instance is used to generate
summaries of keys (e.g., a bloom filter) which are placed in
sstables. These summaries are consulted by DB::Get() so we
can avoid reading sstable blocks that are guaranteed to not
contain the key we are looking for.
This change provides one implementation of FilterPolicy
based on bloom filters.
Other changes:
- Updated version number to 1.4.
- Some build tweaks.
- C binding for CompactRange.
- A few more benchmarks: deleteseq, deleterandom, readmissing, seekrandom.
- Minor .gitignore update.
- Pass system's values of CFLAGS,LDFLAGS.
Don't override OPT if it's already set.
Original patch by Alessio Treglia <alessio@debian.org>:
http://code.google.com/p/leveldb/issues/detail?id=27#c6
- Remove 1 exit time destructor from leveldb.
See http://crbug.com/101600
- Fix problem where sstable building code would pass an
internal key to the user comparator.
(Sync with uptream at 25436817.)