Summary:
Compilation used to fail with the error:
db/version_set.cc:1773: error: ‘number_of_files_to_sort_’ is not a member of ‘leveldb::VersionSet’
I created a new method called CheckConsistencyForDeletes() so that
all the high cost checking is done only when OPT=-g is specified.
I also fixed a bug in PickCompactionBySize that was triggered when
OPT=-g was switched on. The base_index in the compaction record
was not set correctly.
Test Plan: make check OPT=-g
Differential Revision: https://reviews.facebook.net/D6687
Summary:
The db_bench utility was broken in 1.5.4.fb because of a
signed-unsigned comparision.
The static variable FLAGS_min_level_to_compress was recently
changed from int to 'unsigned in' but it is initilized to a
nagative value -1.
The segfault is of this type:
Program received signal SIGSEGV, Segmentation fault.
Open (this=0x7fffffffdee0) at db/db_bench.cc:939
939 db/db_bench.cc: No such file or directory.
(gdb) where
Test Plan: run db_bench with no options.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6663
Summary: FLAGS_reopen (configurable) specifies the number of times the databse is to be reopened. FLAGS_ops_per_thread is divided into points based on that reopen field. At these points all threads come together to wait for the databse to reopen. Each thread "votes" for the database to reopen and when all have voted, the database reopens.
Test Plan: make all;./db_stress
Reviewers: dhruba, MarkCallaghan, sheki, asad, heyongqiang
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6627
Summary:
I changed the reduce_num_levels logic to avoid "compactRange()" call if the current number of levels in use (levels that contain files) is smaller than the new num of levels.
And that change breaks the assert in reduce_levels_test
Test Plan: run reduce_levels_test
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6651
Summary:
make clean check OPT=-g fails
leveldb::DBStatistics::getTickerCount(leveldb::Tickers)’:
./db/db_statistics.h:34: error: ‘MAX_NO_TICKERS’ was not declared in this scope
util/ldb_cmd.cc:255: warning: left shift count >= width of type
Test Plan:
make clean check OPT=-g
Reviewers:
CC:
Task ID: #
Blame Rev:
Summary: Record BloomFliter hits and drop off reasons during compaction.
Test Plan: Unit tests work.
Reviewers: dhruba, heyongqiang
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6591
Summary:
disable size compaction in ldb reduce_levels, this will avoid compactions rather than the manual comapction,
added --compression=none|snappy|zlib|bzip2 and --file_size= per-file size to ldb reduce_levels command
Test Plan: run ldb
Reviewers: dhruba, MarkCallaghan
Reviewed By: dhruba
CC: sheki, emayanke
Differential Revision: https://reviews.facebook.net/D6597
Summary: Stress test modified to do deletes and later verify them
Test Plan: running the test: db_stress
Reviewers: dhruba, heyongqiang, asad, sheki, MarkCallaghan
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6567
Summary:
Prototype stat's collection. Diff is a good estimate of what
the final code will look like.
A few assumptions :
* Used a global static instance of the statistics object. Plan to pass
it to each internal function. Static allows metrics only at app
level.
* In the Ticker's do not do any locking. Depend on the mutex at each
function of LevelDB. If we ever remove the mutex, we should change
here too. The other option is use atomic objects anyways as there
won't be any contention as they will be always acquired only by one
thread.
* The counters are dumb, increment through lifecycle. Plan to use ods
etc to get last5min stat etc.
Test Plan:
made changes in db_bench
Ran ./db_bench --statistics=1 --num=10000 --cache_size=5000
This will print the cache hit/miss stats.
Reviewers: dhruba, heyongqiang
Differential Revision: https://reviews.facebook.net/D6441
Summary:
When a new version is created, we sort all the files at every
level based on their size. This is necessary because we want
to compact the largest file first. The sorting takes quite a
bit of CPU.
Moved the sorting code to be outside the mutex. Also, the
earlier code was sorting files at all levels but we do not
need to sort the highest-number level because those files
are never the cause of any compaction. To reduce sorting
costs, we sort only the first few files in each level
because it is likely that those are the only files in that
level that will be picked for compaction.
At steady state, I have seen that this patch increase
throughout from 1500 writes/sec to 1700 writes/sec at the
end of a 72 hour run. The cpu saving by not sorting the
last level was not distinctive in this test run because
there were only 100K files in the highest numbered level.
I expect the cpu saving to be significant when the number of
files is much higher.
This is mostly an early preview and not ready for rigorous review.
With this patch, the writs/sec is now bottlenecked not by the sorting code but by GetOverlappingInputs. I am working on a patch to optimize GetOverlappingInputs.
Test Plan: make check
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
Differential Revision: https://reviews.facebook.net/D6411
Summary:
The Version::GetOverlappingInputs() is called multiple times in
the compaction code path. Eack invocation does a binary search
for overlapping files in the specified key range.
This patch remembers the offset of an overlapped file when
GetOverlappingInputs() is called the first time within
a compaction run. Suceeding calls to GetOverlappingInputs()
uses the remembered index to avoid the binary search.
I measured that 1000 iterations of GetOverlappingInputs
takes around 4500 microseconds without this patch. If I use
this patch with the hint on every invocation, then 1000
iterations take about 3900 microsecond.
Test Plan: make check OPT=-g
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6513
Summary:
Added a conditional flush in ~DBImpl to flush.
There is still a chance of writes not being persisted if there is a
crash (not a clean shutdown) before the DBImpl instance is destroyed.
Test Plan: modified db_test to meet the new expectations.
Reviewers: dhruba, heyongqiang
Differential Revision: https://reviews.facebook.net/D6519
Summary:
The default compilation process now uses "-Wall" to compile.
Fix all compilation error generated by gcc.
Test Plan: make all check
Reviewers: heyongqiang, emayanke, sheki
Reviewed By: heyongqiang
CC: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6525
Summary:
The method Version::GetOverlappingInputs used a sequential search
to map a kay-range to a set of files. But the files are arranged
in ascending order of key, so a biary search is more effective.
This patch implements Version::GetOverlappingInputsBinarySearch
that finds one file that corresponds to the specified key range
and then iterates backwards and forwards to find all overlapping
files.
This patch is critical for making compactions efficient, especially
when there are thousands of files in a single level.
I measured that 1000 iterations of TEST_MaxNextLevelOverlappingBytes
takes 16000 microseconds without this patch. With this patch, the
same method takes about 4600 microseconds.
Test Plan: Almost all unit tests in db_test uses this method to lookup keys.
Reviewers: heyongqiang
Reviewed By: heyongqiang
CC: MarkCallaghan, emayanke, sheki
Differential Revision: https://reviews.facebook.net/D6465
Summary:
There are certain use-cases where the application intends to
delete older keys aftre they have expired a certian time period.
One option for those applications is to periodically scan the
entire database and delete appropriate keys.
A better way is to allow the application to hook into the
compaction process. This patch allows the application to set
a method callback for every key that is being compacted. If
this method returns true, then the key is not preserved in
the output of the compaction.
Test Plan:
This is mostly to preview the proposed new public api.
Since it is a public api, please do due diligence on reviewing it.
I will be writing test cases for this api in mynext version of
this patch.
Reviewers: MarkCallaghan, heyongqiang
Reviewed By: heyongqiang
CC: sheki, adsharma
Differential Revision: https://reviews.facebook.net/D6285
Summary: as subject.
Test Plan: manually test it, will add a testcase
Reviewers: dhruba, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6345
Summary: Leveldb currently uses windowBits=-14 while using zlib compression.(It was earlier 15). This makes the setting configurable. Related changes here: https://reviews.facebook.net/D6105
Test Plan: make all check
Reviewers: dhruba, MarkCallaghan, sheki, heyongqiang
Differential Revision: https://reviews.facebook.net/D6393
Summary:
as subject
Test Plan:
run db_bench and db_test
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6111
Summary:
This makes the stall timers in MakeRoomForWrite more accurate by timing
the sleeps. From looking at the logs the real sleep times are usually
about 2000 usecs each when SleepForMicros(1000) is called. The modified LOG messages are:
2012/10/29-12:06:33.271984 2b3cc872f700 delaying write 13 usecs for level0_slowdown_writes_trigger
2012/10/29-12:06:34.688939 2b3cc872f700 delaying write 1728 usecs for rate limits with max score 3.83
Task ID: #
Blame Rev:
Test Plan:
run db_bench, look at DB/LOG
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6297
Summary:
The leveldb API is enhanced to support different compression algorithms at
different levels.
This adds the option min_level_to_compress to db_bench that specifies
the minimum level for which compression should be done when
compression is enabled. This can be used to disable compression for levels
0 and 1 which are likely to suffer from stalls because of the CPU load
for memtable flushes and (L0,L1) compaction. Level 0 is special as it
gets frequent memtable flushes. Level 1 is special as it frequently
gets all:all file compactions between it and level 0. But all other levels
could be the same. For any level N where N > 1, the rate of sequential
IO for that level should be the same. The last level is the
exception because it might not be full and because files from it are
not read to compact with the next larger level.
The same amount of time will be spent doing compaction at any
level N excluding N=0, 1 or the last level. By this standard all
of those levels should use the same compression. The difference is that
the loss (using more disk space) from a faster compression algorithm
is less significant for N=2 than for N=3. So we might be willing to
trade disk space for faster write rates with no compression
for L0 and L1, snappy for L2, zlib for L3. Using a faster compression
algorithm for the mid levels also allows us to reclaim some cpu
without trading off much loss in disk space overhead.
Also note that little is to be gained by compressing levels 0 and 1. For
a 4-level tree they account for 10% of the data. For a 5-level tree they
account for 1% of the data.
With compression enabled:
* memtable flush rate is ~18MB/second
* (L0,L1) compaction rate is ~30MB/second
With compression enabled but min_level_to_compress=2
* memtable flush rate is ~320MB/second
* (L0,L1) compaction rate is ~560MB/second
This practicaly takes the same code from https://reviews.facebook.net/D6225
but makes the leveldb api more general purpose with a few additional
lines of code.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6261
Summary:
Adds the "MB/sec in" and "MB/sec out" to this line:
Amplification: 1.7 rate, 0.01 GB in, 0.02 GB out, 8.24 MB/sec in, 13.75 MB/sec out
Changes all values to be reported per interval and since test start for this line:
... thread 0: (10000,60000) ops and (19155.6,27307.5) ops/second in (0.522041,2.197198) seconds
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6291
Summary:
A previous commit 4c107587ed introduced
the idea that some version updates might not delete obsolete files.
This means that if a unit test blindly counts the number of files
in the db directory it might not represent the true state of the database.
Use GetLiveFiles() insteads to count the number of live files in the database.
Test Plan:
make check
Summary:
Adds a method that returns the score for the next level that most
needs compaction. That method is then used by db_bench to rate limit threads.
Threads are put to sleep at the end of each stats interval until the score
is less than the limit. The limit is set via the --rate_limit=$double option.
The specified value must be > 1.0. Also adds the option --stats_per_interval
to enable additional metrics reported every stats interval.
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6243
Summary: Enable LevelDb to create a new log file if current log file is too large.
Test Plan:
Write a script and manually check the generated info LOG.
Task ID: 1803577
Blame Rev:
Reviewers: dhruba, heyongqiang
Reviewed By: heyongqiang
CC: zshao
Differential Revision: https://reviews.facebook.net/D6003
Summary:
AFAIK proper /bin/sh does not support "+=". Note that only our
changes use "+=". The Google code does A="$A + $B" rather than A+=$B.
Task ID: #
Blame Rev:
Test Plan:
build
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6231
Summary:
I used server uptime to compute per-level IO throughput rates. I
intended to use time spent doing compaction at that level. This fixes that.
Task ID: #
Blame Rev:
Test Plan:
run db_bench, look at results
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6237
Summary:
A previous commit 4c107587ed introduced
the idea that some version updates might not delete obsolete files.
This means that if a unit test blindly counts the number of files
in the db directory it might not represent the true state of the database.
Use GetLiveFiles() insteads to count the number of live files in the database.
Test Plan: make check
Reviewers: heyongqiang, MarkCallaghan
Reviewed By: MarkCallaghan
Differential Revision: https://reviews.facebook.net/D6207
Summary:
It is best if we pick the largest file to compact in a level.
This reduces the write amplification factor for compactions.
Each level has an auxiliary data structure called files_by_size_
that sorts all files by their size. This data structure is
updated when a new version is created.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6195
Summary:
Prior to multi-threaded compaction, wrap-around would be done by using
current_->files_[level[0]. With this change we should be
using the first file for which f->being_compacted is not true.
1ca0584345 (commitcomment-2041516)
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6165
Summary:
This adds more statistics to be reported by GetProperty("leveldb.stats").
The new stats include time spent waiting on stalls in MakeRoomForWrite.
This also includes the total amplification rate where that is:
(#bytes of sequential IO during compaction) / (#bytes from Put)
This also includes a lot more data for the per-level compaction report.
* Rn(MB) - MB read from level N during compaction between levels N and N+1
* Rnp1(MB) - MB read from level N+1 during compaction between levels N and N+1
* Wnew(MB) - new data written to the level during compaction
* Amplify - ( Write(MB) + Rnp1(MB) ) / Rn(MB)
* Rn - files read from level N during compaction between levels N and N+1
* Rnp1 - files read from level N+1 during compaction between levels N and N+1
* Wnp1 - files written to level N+1 during compaction between levels N and N+1
* NewW - new files written to level N+1 during compaction
* Count - number of compactions done for this level
This is the new output from DB::GetProperty("leveldb.stats"). The old output stopped at Write(MB)
Compactions
Level Files Size(MB) Time(sec) Read(MB) Write(MB) Rn(MB) Rnp1(MB) Wnew(MB) Amplify Read(MB/s) Write(MB/s) Rn Rnp1 Wnp1 NewW Count
-------------------------------------------------------------------------------------------------------------------------------------
0 3 6 33 0 576 0 0 576 -1.0 0.0 1.3 0 0 0 0 290
1 127 242 351 5316 5314 570 4747 567 17.0 12.1 12.1 287 2399 2685 286 32
2 161 328 54 822 824 326 496 328 4.0 1.9 1.9 160 251 411 160 161
Amplification: 22.3 rate, 0.56 GB in, 12.55 GB out
Uptime(secs): 439.8
Stalls(secs): 206.938 level0_slowdown, 0.000 level0_numfiles, 24.129 memtable_compaction
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
(cherry picked from commit ecdeead38f86cc02e754d0032600742c4f02fec8)
Reviewers: dhruba
Differential Revision: https://reviews.facebook.net/D6153
Summary:
I still get failures from this. Not sure whether there was a fix in progress.
Task ID: #
Blame Rev:
Test Plan:
compile
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D6147
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6123
Summary:
The compaction process deletes a large number of files. This takes
quite a bit of time and is best done outside the mutex lock.
Test Plan: make check
Differential Revision: https://reviews.facebook.net/D6123