Summary:
@igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's deconstructor can go ahead and deconstruct all the states inside DB, including the info_log object hold in a shared pointer of the options object it keeps. At that point it is not safe anymore to continue using the info logger to write the delayed logs.
With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.
Thanks @igor for this finding!
Test Plan: make all check
Reviewers: haobo, igor
Reviewed By: haobo
CC: dhruba, ljin, yhchiang, igor, leveldb
Differential Revision: https://reviews.facebook.net/D16767
Summary: Add a function to Env so that users can query the waiting queue length of each thread pool
Test Plan: add a test in env_test
Reviewers: haobo
Reviewed By: haobo
CC: dhruba, igor, yhchiang, ljin, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D16755
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.
Test Plan: db_bench && make asan_check
Reviewers: haobo, igor, sdong
Reviewed By: igor
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D16587
Summary: Moved LogBuffer class to an internal header. Removed some unneccesary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.
Test Plan: make check; db_bench compare LOG output
Reviewers: sdong
Reviewed By: sdong
CC: leveldb, igor
Differential Revision: https://reviews.facebook.net/D16707
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.
Test Plan: corruption_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16695
Summary:
Add a check at the end of GetImpl to release SuperVersion if it becomes
obsolete. Also do Scrape() inside InstallSuperVersion so it happens more
frequent.
Test Plan:
make all check
running asan_check now
Reviewers: igor, haobo, sdong, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16641
Summary:
Change to store the return value from ftruncate().
The reason is that ftruncate() has "warn_unused_result" attribute in some environment.
Signed-off-by: Yumikiyo Osanai <yumios.art@gmail.com>
Summary: valgrind reports issues. This patch seems to fix it.
Test Plan: run the tests that fails in valgrind
Reviewers: igor, haobo, kailiu
Reviewed By: kailiu
CC: dhruba, ljin, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D16653
Summary:
Blocks allocated with fallocate will take extra space on disk even if they are unused and the file is close.
Now we remove the extra blocks at the end of the file by calling `ftruncate`.
Test Plan: added a test to env_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16647
Summary:
With the use of tmpfs or ramfs, unit tests related to GetUniqueID()
failed because of the failure from ioctl, which doesn't work with these
fancy file systems at all.
I fixed this issue and make sure all related tests run on the "regular"
storage (disk or flash).
Test Plan: TEST_TMPDIR=/dev/shm make check -j32
Reviewers: igor, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16593
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.
Test Plan:
make all check
check the log lines while running some tests that trigger compactions.
Reviewers: haobo, igor, dhruba
Reviewed By: dhruba
CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-
Differential Revision: https://reviews.facebook.net/D16515
Summary:
Currently, there is no easy way for user to change log level of info log. Add a parameter in options to specify that.
Also make the default level to INFO level. Removing the [INFO] tag if it is INFO level as I don't want to cause performance regression. (add [LOG] means another mem-copy and string formatting).
Test Plan:
make all check
manual check the levels work as expected.
Reviewers: dhruba, yhchiang
Reviewed By: yhchiang
CC: dhruba, igor, i.am.jin.lei, ljin, haobo, leveldb
Differential Revision: https://reviews.facebook.net/D16563
Summary:
Add helper function to print perf context data in db_bench if enabled.
I didn't find any code that actually exports perf context data. Not sure
if I missed anything
Test Plan: ran db_bench
Reviewers: haobo, sdong, igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16575
Summary:
this is the key component extracted from diff: https://reviews.facebook.net/D14271
I separate it to a dedicated patch to make the review easier.
Test Plan: added a unit test and passed it.
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16245
Summary:
This diff does two things:
(1) Log::Reader does not report a corruption when the last record in a log or manifest file is truncated (meaning that log writer died in the middle of the write). Inherited the code from LevelDB: https://code.google.com/p/leveldb/source/detail?r=269fc6ca9416129248db5ca57050cd5d39d177c8#
(2) Turn off mmap writes for all writes to log and manifest files
(2) is necessary because if we use mmap writes, the last record is not truncated, but is actually filled with zeros, making checksum fail. It is hard to recover from checksum failing.
Test Plan:
Added unit tests from LevelDB
Actually recovered a "corrupted" MANIFEST file.
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16119
Summary: Added list_column_family command and also updated dump_manifest
Test Plan: no
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16419
Summary:
InfoLogLevel test now checks the number of lines of the output log file
instead of the number of bytes in the log file.
This diff fixes the issue that the previous InfoLogLevel test in
auto_roll_logger_test passed in make check but fails when valgrind
is used.
Test Plan: run with make check and valgrind.
Reviewers: kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16407
Summary: as title
Test Plan:
asan_check
will post results later
Reviewers: haobo, igor, dhruba, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16257
Summary: fix the memory leak that was captured by jenkin build.
Test Plan: ran the valgrind test locally
Reviewers: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16389
Summary:
* Now each Log related function has a variant that takes an additional
argument indicating its log level, which is one of the following:
- DEBUG, INFO, WARN, ERROR, FATAL.
* To ensure backward-compatibility, old version Log functions are kept
unchanged.
* Logger now has a member variable indicating its log level. Any incoming
Log request which log level is lower than Logger's log level will not
be output.
* The output of the newer version Log will be prefixed by its log level.
Test Plan:
Add a LogType test in auto_roll_logger_test.cc
= Sample log output =
2014/02/11-00:03:07.683895 7feded179840 [DEBUG] this is the message to be written to the log file!!
2014/02/11-00:03:07.683898 7feded179840 [INFO] this is the message to be written to the log file!!
2014/02/11-00:03:07.683900 7feded179840 [WARN] this is the message to be written to the log file!!
2014/02/11-00:03:07.683903 7feded179840 [ERROR] this is the message to be written to the log file!!
2014/02/11-00:03:07.683906 7feded179840 [FATAL] this is the message to be written to the log file!!
Reviewers: dhruba, xjin, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16071
Summary:
This is not a generic thread local implementation in the sense that it
only takes pointer. But it does support multiple instances per thread
and lets user plugin function to perform cleanup when thread exits or an
instance gets destroyed.
Test Plan: unit test for now
Reviewers: haobo, igor, sdong, dhruba
Reviewed By: igor
CC: leveldb, kailiu
Differential Revision: https://reviews.facebook.net/D16131
Summary: A simple benchmark that simulates WAL append. It can be used to test different platform/file system's performance on WAL.
Test Plan: run it.
Reviewers: haobo, kailiu
Reviewed By: haobo
CC: igor, dhruba, i.am.jin.lei, yhchiang, leveldb, nkg-
Differential Revision: https://reviews.facebook.net/D16239
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)
Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up
Test Plan: added some tests to column_family_test
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16101
Summary: Clean up IOErrors so that it only indicates errors talking to device.
Test Plan: make all check
Reviewers: igor, haobo, dhruba, emayanke
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15831
Summary:
This patch optimized Get() code paths by avoiding malloc of iterators. Iterator creation is moved to mem table rep implementations, where a callback is called when any key is found. This is the same practice as what we do in (SST) table readers.
db_bench result for readrandom following a writeseq, with no compression, single thread and tmpfs, we see throughput improved to 144958 from 139027, about 3%.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
Reviewed By: haobo
CC: leveldb, yhchiang
Differential Revision: https://reviews.facebook.net/D14685
Summary: When we open a DB, we should dump only DBOptions and then when we create a new column family, we dump ColumnFamilyOptions for each one.
Test Plan: make check, confirm contents of the LOG
Reviewers: dhruba, haobo, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16011
Summary: Nothing major, just an extra return line and posibility of leaking fb in NewRandomRWFile
Test Plan: make check
Reviewers: kailiu, dhruba
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15993
Summary:
This is not column-family related diff. It is in columnfamily branch because the change is significant and we want to push it with next major release (3.0).
It removes the leveldb notion of one thread pool and expands it to two thread pools by default (HIGH and LOW). Flush process is removed from compaction process and all flush threads are executed on HIGH thread pool, since we don't want long-running compactions to influence flush latency.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15987
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15933
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)
I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.
Test Plan: make check doesn't complain
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15873
Summary: RocksDB doesn't compile on 32-bit architecture apparently. This is attempt to fix some of 32-bit errors. They are reported here: https://gist.github.com/paxos/8789697
Test Plan: RocksDB still compiles on 64-bit :)
Reviewers: kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15825
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D15489
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData
Test Plan: make check
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15801
Summary:
Easy thing goes first. This patch moves arena to internal dir; based
on which, the coming patch will deal with memtable_rep.
Test Plan: make check
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15615
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15579
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.
Test Plan: make check
Reviewers: dhruba, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15459
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.
Should I also add FsyncDir() to new files that get created by a compaction?
Test Plan: Confirmed that FsyncDir is returning Status::OK()
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D14751
Summary: Converting from length prefixed buffer back to internal key costs some CPU but it is not necessary. In this patch, internal keys are pass though the functions so that we don't need to convert back to it.
Test Plan: make all check
Reviewers: haobo, kailiu
Reviewed By: kailiu
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D15393
Summary: By removing some includes form options.h and reply on forward declaration, we can more easily reason the dependencies.
Test Plan: make all check
Reviewers: kailiu, haobo, igor, dhruba
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15411
Summary: On a shutdown, freeing memory takes a long time. If we're shutting down, we don't really care about memory leaks. I added a call to Cache that will avoid freeing all objects in cache.
Test Plan:
I created a script to test the speedup and demonstrate how to use the call: https://phabricator.fb.com/P3864368
Clean shutdown took 7.2 seconds, while fast and dirty one took 6.3 seconds. Unfortunately, the speedup is not that big, but should be bigger with bigger block_cache. I have set up the capacity to 80GB, but the script filled up only ~7GB.
Reviewers: dhruba, haobo, MarkCallaghan, xjin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15069
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
Summary: We have 3 versions of GetLengthPrefixedSlice() and one of them is no longer in use.
Test Plan: make
Reviewers: sdong, igor, haobo, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15399
Summary:
For some reason, D15099 caused a big performance regression: https://fburl.com/16059000
After digging a bit, I figured out that the reason was that std::atomic_uint_fast64_t was allocated in an array. When I switched from an array to vector, the QPS returned to the previous level. I'm not sure why this is happening, but this diff seems to fix the performance regression.
Test Plan: I ran the regression script, observed the performance going back to normal
Reviewers: tnovak, kailiu, haobo
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15375
Summary:
This diff takes an even more aggressive way to inline the functions. A decent rule that I followed is "not inline a function if it is more than 10 lines long."
Normally optimizing code by inline is ugly and hard to control, but since one of our usecase has significant amount of CPU used in functions from coding.cc, I'd like to try this diff out.
Test Plan:
1. the size for some .o file increased a little bit, but most less than 1%. So I think the negative impact of inline is negligible.
2. As the regression test shows (ran for 10 times and I calculated the average number)
Metrics Befor After
========================================================================
rocksdb.build.fillseq.qps 426595 444515 (+4.6%)
rocksdb.build.memtablefillrandom.qps 121739 123110
rocksdb.build.memtablereadrandom.qps 1285103 1280520
rocksdb.build.overwrite.qps 125816 135570 (+9%)
rocksdb.build.readrandom_fillunique_random.qps 285995 296863
rocksdb.build.readrandom_memtable_sst.qps 1027132 1027279
rocksdb.build.readrandom.qps 1041427 1054665
rocksdb.build.readrandom_smallblockcache.qps 1028631 1038433
rocksdb.build.readwhilewriting.qps 918352 914629
Reviewers: haobo, sdong, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15291
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.
Let me know if you have any comments. I will commit tomorrow.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15357
Summary:
In my MacOS, the member variables are populated with random numbers after initialization.
This diff fixes it by fill these arrays with 0.
Test Plan: make && ./table_test
Reviewers: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15315
Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.
Test Plan: make check
Reviewers: haobo, kailiu, sdong, dhruba, tnovak
Reviewed By: tnovak
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15099
Summary:
shared_ptr is slower than unique_ptr (which literally comes with no performance cost compare with raw pointers).
In memtable and memtable rep, we use shared_ptr when we'd actually should use unique_ptr.
According to igor's previous work, we are likely to make quite some performance gain from this diff.
Test Plan: make check
Reviewers: dhruba, igor, sdong, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15213
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()
This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.
I have also slightly changed how VersionSet keeps track of manifest size.
This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.
In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_
Test Plan: make check
Reviewers: haobo, dhruba, kailiu, sdong
Reviewed By: sdong
Differential Revision: https://reviews.facebook.net/D15171
Summary:
In latest leaf's, MayContainHash() consistently consumes 5%~7% CPU usage.
I checked the code and did an experiment with/without inlining this method.
In release mode, with `1024 * 1024 * 256` bits and `1024 * 512` entries, both call 2^30 MayContainHash() with distinctive parameters.
As the result showed, this patch reduced the running time from 9.127 sec to 7.891 sec.
Test Plan: make check
Reviewers: sdong, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15177
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.
This patch fixed the unit test.
Test Plan: Added a failing unit test and a fix, so it's not failing anymore.
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D14421
Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally.
Test Plan: fbmake. Added unit tests. All unit tests pass.
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: sdong, kailiu, xinyaohu, sumeet, leveldb
Differential Revision: https://reviews.facebook.net/D14745
Summary:
Added an option (max_successive_merges) that can be used to specify the
maximum number of successive merge operations on a key in the memtable.
This can be used to improve performance of the "get" operation. If many
successive merge operations are performed on a key, the performance of "get"
operations on the key deteriorates, as the value has to be computed for each
"get" operation by applying all the successive merge operations.
FB Task ID: #3428853
Test Plan:
make all check
db_bench --benchmarks=readrandommergerandom
counter_stress_test
Reviewers: haobo, vamsi, dhruba, sdong
Reviewed By: haobo
CC: zshao
Differential Revision: https://reviews.facebook.net/D14991
Summary: Full list constructed for full iterator can be leaked. This was a bug introduced when I copy the full iterator codes from hash skip list to hash link list. This patch fixes it.
Test Plan: Run valgrind test against db_test and make sure the memory leak is fixed
Reviewers: kailiu, haobo
Reviewed By: kailiu
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D15093
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14703
Conflicts:
db/db_impl.cc
Summary:
Implement a mem table, in which keys are hashed based on prefixes. In each bucket, entries are organized in a sorted linked list. It has the same thread safety guarantee as skip list.
The motivation is to optimize memory usage for the case that prefix hashing is primary way of seeking to the entry. Compared to hash skip list implementation, this implementation is more memory efficient, but inside each bucket, search is always linear. The target scenario is that there are only very limited number of records in each hash bucket.
Test Plan: Add a test case in db_test
Reviewers: haobo, kailiu, dhruba
Reviewed By: haobo
CC: igor, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14979
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14703
Summary: Use two vectors for different types of memory allocation.
Test Plan: run all unit tests.
Reviewers: haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15027
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()
I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.
Test Plan: Added a simple unit test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15033
This seems out of place as it's the only time RocksDB prints to stdout in the
normal course of operations. Thread IDs can still be retrieved from the LOG
file: cut -d ' ' -f2 LOG | sort | uniq | egrep -x '[0-9a-f]+'
Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life.
Test Plan:
make check
I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change.
Reviewers: dhruba, haobo, sdong, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14985
Summary:
In some places we have NotFound status created with empty message, but it doesn't avoid a malloc. With this patch, the malloc is avoided for that case.
The motivation of it is that I found in db_bench readrandom test when all keys are not existing, about 4% of the total running time is spent on malloc of Status, plus a similar amount of CPU spent on free of them, which is not necessary.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14691
Summary:
A vector that leverages pre-allocated stack-based array to achieve better
performance for array with small amount of items.
Test Plan:
Added tests for both correctness and performance
Here is the performance benchmark between vector and autovector
Please note that in the test "Creation and Insertion Test", the test case were designed with the motivation described below:
* no element inserted: internal array of std::vector may not really get
initialize.
* one element inserted: internal array of std::vector must have
initialized.
* kSize elements inserted. This shows the most time we'll spend if we
keep everything in stack.
* 2 * kSize elements inserted. The internal vector of
autovector must have been initialized.
Note: kSize is the capacity of autovector
=====================================================
Creation and Insertion Test
=====================================================
created 100000 vectors:
each was inserted with 0 elements
total time elapsed: 128000 (ns)
created 100000 autovectors:
each was inserted with 0 elements
total time elapsed: 3641000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 0 elements
total time elapsed: 9896000 (ns)
-----------------------------------
created 100000 vectors:
each was inserted with 1 elements
total time elapsed: 11089000 (ns)
created 100000 autovectors:
each was inserted with 1 elements
total time elapsed: 5008000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 1 elements
total time elapsed: 24271000 (ns)
-----------------------------------
created 100000 vectors:
each was inserted with 4 elements
total time elapsed: 39369000 (ns)
created 100000 autovectors:
each was inserted with 4 elements
total time elapsed: 10121000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 4 elements
total time elapsed: 28473000 (ns)
-----------------------------------
created 100000 vectors:
each was inserted with 8 elements
total time elapsed: 75013000 (ns)
created 100000 autovectors:
each was inserted with 8 elements
total time elapsed: 18237000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 8 elements
total time elapsed: 42464000 (ns)
-----------------------------------
created 100000 vectors:
each was inserted with 16 elements
total time elapsed: 102319000 (ns)
created 100000 autovectors:
each was inserted with 16 elements
total time elapsed: 76724000 (ns)
created 100000 VectorWithReserveSizes:
each was inserted with 16 elements
total time elapsed: 68285000 (ns)
-----------------------------------
=====================================================
Sequence Access Test
=====================================================
performed 100000 sequence access against vector
size: 4
total time elapsed: 198000 (ns)
performed 100000 sequence access against autovector
size: 4
total time elapsed: 306000 (ns)
-----------------------------------
performed 100000 sequence access against vector
size: 8
total time elapsed: 565000 (ns)
performed 100000 sequence access against autovector
size: 8
total time elapsed: 512000 (ns)
-----------------------------------
performed 100000 sequence access against vector
size: 16
total time elapsed: 1076000 (ns)
performed 100000 sequence access against autovector
size: 16
total time elapsed: 1070000 (ns)
-----------------------------------
Reviewers: dhruba, haobo, sdong, chip
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14655
Summary:
Some changes to PlainTable format:
(1) support variable key length
(2) use user defined slice transformer to extract prefixes
(3) Run some test cases against PlainTable in db_test and table_test
Test Plan: test db_test
Reviewers: haobo, kailiu
CC: dhruba, igor, leveldb, nkg-
Differential Revision: https://reviews.facebook.net/D14457
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
Summary:
By disassemble the function, we found that the atomic variables do invoke the `lock` that locks the memory bus.
As a tradeoff, we protect the GetUsage by mutex and leave usage_ as plain size_t.
Test Plan: passed `cache_test`
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14667
Summary: make release complains signed unsigned comparison.
Test Plan: make release
Reviewers: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14661
Summary: This diff will help us to figure out the memory usage for the cache part.
Test Plan: added a new memory usage test for cache
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14559
Summary:
I have ran a get benchmark where all the data is in the cache and observed that most of the time is spent on waiting for lock in LRUCache.
This is an effort to optimize LRUCache.
Test Plan:
The data was loaded with fillseq. Then, I ran a benchmark:
/db_bench --db=/tmp/rocksdb_stat_bench --num=1000000 --benchmarks=readrandom --statistics=1 --use_existing_db=1 --threads=16 --disable_seek_compaction=1 --cache_size=20000000000 --cache_numshardbits=8 --table_cache_numshardbits=8
I ran the benchmark three times. Here are the results:
AFTER THE PATCH: 798072, 803998, 811807
BEFORE THE PATCH: 782008, 815593, 763017
Reviewers: dhruba, haobo, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14571
Summary: as title
Test Plan: dynamic_bloom_test
Reviewers: dhruba, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14385
Summary: This will allow us to access constant via `DB::GetOptions().table_cache.GetCapacity()` or `DB::GetOptions().block_cache.GetCapacity()` since GetOptions() is also constant method.
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.
Test Plan: ./db_test
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14535
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.
There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions
Here is the directory structure I use:
backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files
All the files are ref counted and deleted immediatelly when they get out of scope.
Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.
Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.
Also, `make asan_check`
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D14295
Summary:
This will help me a lot! When we hit an assertion in unittest, we get the whole stack trace now.
Also, changed stack trace a bit, we now include actual demangled C++ class::function symbols!
Test Plan: Added ASSERT_TRUE(false) to a test, observed a stack trace
Reviewers: haobo, dhruba, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14499
Summary:
This adds 2 options for compression to db_bench:
* universal_compression_size_percent
* compression_level - to set zlib compression level
It also logs compression_size_percent at startup in LOG
Task ID: #
Blame Rev:
Test Plan:
make check, run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14439
Summary:
Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around.
This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release.
I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14397
Summary: The preprocessor does not follow normal rules of && evaluation, tries to evaluate __GLIBC_PREREQ(2, 12) even though the defined() check fails. This breaks the build if __GLIBC_PREREQ is absent.
Test Plan: Try adding #undef __GLIBC_PREREQ above the offending line, build no longer breaks
Reviewed By: igor
Blame Rev: 4c81383628
Summary: Makes it easier to monitor performance with top
Test Plan: ./manual_compaction_test with `top -H` running. Previously was two `manual_compacti`, now one shows `rocksdb:bg0`.
Reviewers: igor, dhruba
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14367
Summary: As title. Especially, HashSkipListRepFactory will be able to specify a relatively small height, to reduce the memory overhead of one skiplist per bucket.
Test Plan: make check and test it on leaf4
Reviewers: dhruba, sdong, kailiu
CC: reconnect.grayhat, leveldb
Differential Revision: https://reviews.facebook.net/D14307
Summary:
Fix a stupid bug I just introduced in b59d4d5a50, which I didn't even mean to include.
GCC might remove the munmap.
Test Plan: Run it and make sure munmap succeeds
Reviewers: haobo, kailiu
Reviewed By: kailiu
CC: dhruba, reconnect.grayhat, leveldb
Differential Revision: https://reviews.facebook.net/D14361
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.
Test Plan: make check
Reviewers: dhruba, MarkCallaghan, igor
Reviewed By: dhruba
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14289
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.
Test Plan: Enable this profiling in seveal existing tests.
Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14001
Conflicts:
table/merger.cc
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.
Test Plan: Enable this profiling in seveal existing tests.
Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14001
Summary:
For prefix mem tables, encoding mem table key may be unnecessary if the prefix doesn't have any key. This patch is a little bit hacky but I want to try out the performance gain of removing this lazy initialization.
In longer term, we might want to revisit the way we abstract mem tables implementations.
Test Plan: make all check
Reviewers: haobo, igor, kailiu
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14265
Summary:
A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes.
Test Plan:Add unit test
Reviewers:haobo,dhruba,kailiu
CC:
Task ID: #
Blame Rev:
Summary:
Previously we introduce a `flush_block_policy_factory` in Options, however, that options is strongly releated to Table based tables.
It will make more sense to move it to block based table's own factory class.
Test Plan: make check to pass existing tests
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14211
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.
* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.
Test Plan:
Passed all existing test. and the sample output of table stats:
==================================================================
Basic Properties
------------------------------------------------------------------
# data blocks: 1
# entries: 1
raw key size: 9
raw average key size: 9
raw value size: 9
raw average value size: 0
data block size: 25
index block size: 27
filter block size: 18
(estimated) table size: 70
filter policy: rocksdb.BuiltinBloomFilter
==================================================================
User collected properties: InternalKeyPropertiesCollector
------------------------------------------------------------------
kDeletedKeys: 1
==================================================================
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14187
Summary:
1. Moved the compiler back to 4.8.1 and uses Centos 5.2 binaries if OS is Centos 5.2.
2. Fixes this issue: https://github.com/facebook/rocksdb/issues/7
3. We use lot of c++11 features, so we can't pretend we can compile without them. Makes it a first class dependency.
4. Fix blob_store_test, which failes on Ubuntu with "too many files opened" error
5. Removed dependency on port/port_chromium.h, which does not even exist on our system
Test Plan: make clean; make check
Reviewers: dhruba, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14145
Summary: This diff invoves some more complicated issues in the posix environment.
Test Plan: works under mac os. will need to verify dev box.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14061
Summary:
mac and our dev server has totally differnt definition of uint64_t, therefore fixing the warning in mac has actually made code in linux uncompileable.
Test Plan:
make clean && make -j32
Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os..
Test Plan: ran make in mac os
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14049
Summary:
@haobo's suggestions from https://reviews.facebook.net/D13827
Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop.
I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though.
Test Plan: Ran deletefile_test
Reviewers: haobo
Reviewed By: haobo
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D14025
Summary: In our project, when writing to the database, we want to form the value as the concatenation of a small header and a larger payload. It's a shame to have to copy the payload just so we can give RocksDB API a linear view of the value. Since RocksDB makes a copy internally, it's easy to support gather writes.
Test Plan: write_batch_test, new test case
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13947
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.
It might speed things up a bit, but makes code uglier. I don't really like it.
Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.
Let's discuss offline today.
Test Plan: Ran ./db_stress, verified that files are getting deleted
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D13827
Summary: What @haobo done with TransformRep, now in TransformRepNoLock. Similar implementation, except that I made DynamicIterator a subclass of Iterator which makes me have less iterator initializations.
Test Plan: ./prefix_test. Seeing huge savings vs. TransformRep again!
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: haobo
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D13953
Summary: Allow block based table to configure the way flushing the blocks. This feature will allow us to add support for prefix-aligned block.
Test Plan: make check
Reviewers: dhruba, haobo, sdong, igor
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13875
Summary:
Added a new call LogFlush() that flushes the log contents to the OS buffers. We never call it with lock held.
We call it once for every Read/Write and often in compaction/flush process so the frequency should not be a problem.
Test Plan: db_test
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13935
Summary: Added a prefix_seek flag in ReadOptions to indicate that Seek is prefix aware(might not return data with different prefix), and also not bound to a specific prefix. Multiple Seeks and range scans can be invoked on the same iterator. If a specific prefix is specified, this flag will be ignored. Just a quick prototype that works for PrefixHashRep, the new lockless memtable could be easily extended with this support too.
Test Plan: test it on Leaf
Reviewers: dhruba, kailiu, sdong, igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13929
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.
Test Plan:
1. Unit tests pass.
2. Benchmark.
Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13869
Summary: We have to be able to catch last few log outputs before a crash
Test Plan: no
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13917
Summary:
I'm sending this diff together with https://reviews.facebook.net/D13881 because it didn't allow me to send only the array one.
Here I also replaced unordered_map with just an array of shared_ptrs. This elminated all the locks.
I will run the new benchmark and post the results here.
Test Plan: db_test
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13893
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.
It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.
Test Plan: Unit test case attached.
Reviewers: kailiu, sdong, haobo, leveldb
Reviewed By: haobo
CC: xjin, haobo
Differential Revision: https://reviews.facebook.net/D12675
Summary: Added an option --count_delim=<char> which takes the given character as delimiter ('.' by default) and reports count of each row type found in the db
Test Plan:
1. Created test in file (for DBDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",".
2. Created test in file (for InternalDumperCommand) rocksdb/tools/ldb_test.py which puts various key value pair in db and checks the output using dump --count_delim ,--count_delim="." and --count_delim=",".
3. Manually created a database with several keys of several type and verified by running the command
./ldb db=<path> dump --count_delim="<char>"
./ldb db=<path> idump --count_delim="<char>"
Reviewers: vamsi, dhruba, emayanke, kailiu
Reviewed By: vamsi
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13815
Summary: This might help with p99 performance, but does not solve the real problem. More discussion on #2947135
Test Plan: make check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13809
Summary:
Currently for each put, a fresh memory is allocated, and a new entry is added to the memtable with a new sequence number irrespective of whether the key already exists in the memtable. This diff is an attempt to update the value inplace for existing keys. It currently handles a very simple case:
1. Key already exists in the current memtable. Does not inplace update values in immutable memtable or snapshot
2. Latest value type is a 'put' ie kTypeValue
3. New value size is less than existing value, to avoid reallocating memory
TODO: For a put of an existing key, deallocate memory take by values, for other value types till a kTypeValue is found, ie. remove kTypeMerge.
TODO: Update the transaction log, to allow consistent reload of the memtable.
Test Plan: Added a unit test verifying the inplace update. But some other unit tests broken due to invalid sequence number checks. WIll fix them next.
Reviewers: xinyaohu, sumeet, haobo, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12423
Automatic commit by arc
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.
Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.
Reviewers: dhruba, haobo, kailiu, emayanke, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13521
Summary:
1. Added a new option that support user-defined table stats collection.
2. Added a deleted key stats collector in `utilities`
Test Plan:
Added a unit test for newly added code.
Also ran make check to make sure other tests are not broken.
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13491
Summary:
I added max_size option in blobstore. Since we now know the maximum number of buckets we'll ever use, we can allocate an array of buckets and access its elements without use of any locks! Common case Get doesn't lock anything now.
Benchmarks on 16KB block size show no impact on speed, though.
Test Plan: unittests + benchmark
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13641
Summary: as title
Test Plan: make check; ./perf_context_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13629
Summary:
I have implemented a FreeList version that supports fragmented blob chunks. Each block gets allocated and freed in FIFO order. Since the idea for the blocks to be big, we will not take a big hit of non-sequential IO. Free list is also faster, taking only O(k) size in both free and allocate instead of O(N) as before.
See more info on the task: https://our.intern.facebook.com/intern/tasks/?t=2990558
Also, I'm taking Slice instead of const char * and size in Put function.
Test Plan: unittests
Reviewers: haobo, kailiu, dhruba, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13569
Summary:
Create a new type of file on startup if it doesn't already exist called DBID.
This will store a unique number generated from boost library's uuid header file.
The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery)
the key point to note is that DBID is not stored in a backup or db snapshot
It's preferable to use Boost for uuid because:
1) A non-standard way of generating uuid is not good
2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean
3) c++ doesn't have any direct way to get a uuid
4) Boost is a very good library that was already having linkage in rocksdb from third-party
Note: I had to update the TOOLCHAIN_REV in build files to get latest verison of boost from third-party as the older version had a bug.
I had to put Wno-uninitialized in Makefile because boost-1.51 has an unitialized variable and rocksdb would not comiple otherwise. Latet open-source for boost is 1.54 but is not there in third-party. I have notified the concerned people in fbcode about it.
@kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify.
Test Plan:
Expand db_test to test 2 cases
1) Restarting db with Id file present - verify that no change to Id
2)Restarting db with Id file deleted - verify that a different Id is there after reopen
Also run make all check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13587
Summary: Per @haobo's request, rephrasing the comment for allocate
Test Plan: It's a comment!
Reviewers: haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13575
Summary: This caused Siying's unit test to fail.
Test Plan: Unittest
Reviewers: dhruba, kailiu, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13539
Summary: Allow ttl flag
Test Plan:
tested on my database that has merge operations and ttl
Revert Plan: OK
Task ID: #3038186
Reviewers: emayanke, dhruba, haobo
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13503
Summary:
Developing a capability for storing values on external backing file(s).
This is just a highly unoptimized first pass - supports:
1) Allocating some portion of external file to be used to store value
2) Freeing the range, enabling it to be reused by other values
As next steps, I plan to:
1) Create some kind of stress testing. Once I can measure stuff, I can focus on optimizing.
2) Optimize locking.
3) Optimize freelist data structure. Currently we have O(n) for both freeing and allocation.
4) Figure out how to do recovery.
Test Plan: Created a unit test.
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13389
Summary:
tmpfs might not support fallocate(). Fix unit test so that this
does not cause a unit test to fail.
Test Plan: ./env_test
Reviewers: emayanke, igor, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13455
Summary:
This is needed to make existing dbs be able to open and also because BytewiseComparator was not changed since leveldb.
The inverted order in the error message caused confusion prebiously
Test Plan: make; open existing db
Reviewers: leveldb, dhruba
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D13449
Summary:
With this patch, when LRUCache.Insert() is called and the cache is full, it will first try to free up entries whose reference counter is 1 (would become 0 after remo\
ving from the cache). We do it in two passes, in the first pass, we only try to release those unreferenced entries. If we cannot free enough space after traversing t\
he first remove_scan_cnt_ entries, we start from the beginning again and remove those entries being used.
Test Plan: add two unit tests to cover the codes
Reviewers: dhruba, haobo, emayanke
Reviewed By: emayanke
CC: leveldb, emayanke, xjin
Differential Revision: https://reviews.facebook.net/D13377
Summary:
As title. Fix an lint error:
Lint: CppLint Error
Single-argument constructor 'Value(int v)' may inadvertently be used as a type conversion constructor. Prefix the function with the 'explicit' keyword to avoid this, or add an /* implicit */ comment to suppress this warning.
Test Plan: N/A
Reviewers: emayanke, haobo, dhruba
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13401
Summary: virtual NewRandomRWFile is not implemented on EnvHdfs, causing build failure.
Test Plan: make clean; make all check
Reviewers: dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13383
Summary: I have implemented basic simple use case that I need for External Value Store I'm working on. There is a potential for making this prettier by refactoring/combining WritableFile and RandomAccessFile, avoiding some copypasta. However, I decided to implement just the basic functionality, so I can continue working on the other diff.
Test Plan: Added a unittest
Reviewers: dhruba, haobo, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13365
Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory.
Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases.
Reviewers: dhruba, haobo, leveldb
CC: xinyaohu, sumeet
Differential Revision: https://reviews.facebook.net/D13239
Summary: Split Unref into two parts -> cheap and expensive. Try to call expensive Unref outside of critical section to decrease lock contention.
Test Plan: unittests
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, kailiu
Differential Revision: https://reviews.facebook.net/D13299
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.
Test Plan: compile rocksdb
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13287
Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow .
Test Plan: Try it in shadow test.
Reviewers: dhruba, xjin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12933
Summary: as title. unit test not polished. this is for a quick live test
Test Plan: live
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13221
Summary: as title
Test Plan: make check
Reviewers: dhruba, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13179
Summary:
The constructor for Vector memtable has a parameter called 'count'
that specifies the capacity of the vector to be reserved at allocation
time. It was incorrectly used to initialize the size of the vector.
Test Plan: Enhanced db_test.
Reviewers: haobo, xjin, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13083
Summary:
Added a new api to the Environment that allows clearing out not-needed
pages from the OS cache. This will be helpful when the compressed
block cache replaces the OS cache.
Test Plan: EnvPosixTest.InvalidateCache
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13041
Summary:
There is a use-case where we want to insert data into rocksdb as
fast as possible. Vector rep is used for this purpose.
The background flush thread needs to flush the vectorrep to
storage. It acquires the dblock then sorts the vector, releases
the dblock and then writes the sorted vector to storage. This is
suboptimal because the lock is held during the sort, which
prevents new writes for occuring.
This patch moves the sorting of the vector rep to outside the
db mutex. Performance is now as fastas the underlying storage
system. If you are doing buffered writes to rocksdb files, then
you can observe throughput upwards of 200 MB/sec writes.
This is an early draft and not yet ready to be reviewed.
Test Plan:
make check
Task ID: #
Blame Rev:
Reviewers: haobo
Reviewed By: haobo
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D12987
Summary: move the TwoPools test to the end of thread related tests. Otherwise, the SetBackgroundThreads call would increase the Low pool size and affect the result of other tests.
Test Plan: make env_test; ./env_test
Reviewers: dhruba, emayanke, xjin
Reviewed By: xjin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12939
Summary:
Added a new field called max_size_amplification_ratio in the
CompactionOptionsUniversal structure. This determines the maximum
percentage overhead of space amplification.
The size amplification is defined to be the ratio between the size of
the oldest file to the sum of the sizes of all other files. If the
size amplification exceeds the specified value, then min_merge_width
and max_merge_width are ignored and a full compaction of all files is done.
A value of 10 means that the size a database that stores 100 bytes
of user data could occupy 110 bytes of physical storage.
Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added.
Reviewers: haobo, emayanke, xjin
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12825
Summary:
this is the ground work for separating memtable flush jobs to their own thread pool.
Both SetBackgroundThreads and Schedule take a third parameter Priority to indicate which thread pool they are working on. The names LOW and HIGH are just identifiers for two different thread pools, and does not indicate real difference in 'priority'. We can set number of threads in the pools independently.
The thread pool implementation is refactored.
Test Plan: make check
Reviewers: dhruba, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12885