Summary: This patch fixed a race condition where a log file is moved to archived dir in the middle of GetSortedWalFiles. Without the fix, the log file would be missed in the result, which leads to transaction log iterator gap. A test utility SyncPoint is added to help reproducing the race condition.
Test Plan: TransactionLogIteratorRace; make check
Reviewers: dhruba, ljin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17121
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.
typedef std::vector<Slice> SliceVector;
virtual std::vector<bool> Filter(int level,
const SliceVector& keys,
const SliceVector& existing_values,
std::vector<std::string>* new_values,
std::vector<bool>* values_changed
) const = 0;
Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.
Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.
Reviewers: haobo
CCs: leveldb
Differential Revision: https://reviews.facebook.net/D15087
Summary:
Everytime a client opens a DB, we do a sanity check that:
* checks the existance of all the necessary files
* verifies that file sizes are correct
Some of the code was stolen from https://reviews.facebook.net/D16935
Test Plan: added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17097
Summary: Add a property to calculate number of background errors encountered to help users build their monitoring
Test Plan: Add a unit test. make all check
Reviewers: haobo, igor, dhruba
Reviewed By: igor
CC: ljin, nkg-, yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D16959
Summary:
When the manifest is getting rolled the following happens:
1) manifest_file_number_ is assigned to a new manifest number (even though the old one is still current)
2) mutex is unlocked
3) SetCurrentFile() creates temporary file manifest_file_number_.dbtmp
4) SetCurrentFile() renames manifest_file_number_.dbtmp to CURRENT
5) mutex is locked
If FindObsoleteFiles happens between (3) and (4) it will:
1) Delete manifest_file_number_.dbtmp (because it's not in pending_outputs_)
2) Delete old manifest (because the manifest_file_number_ already points to a new one)
I introduce the concept of prev_manifest_file_number_ that will avoid the race condition.
However, we should discuss the future of MANIFEST file rolling. We found some race conditions with it last week and who knows how many more are there. Nobody is using it in production because we don't trust the implementation. Should we even support it?
Test Plan: make check
Reviewers: ljin, dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16929
Summary:
@igor pointed out that there is a potential data race because of the way we use the newly introduced LogBuffer. After "bg_compaction_scheduled_--" or "bg_flush_scheduled_--", they can both become 0. As soon as the lock is released after that, DBImpl's deconstructor can go ahead and deconstruct all the states inside DB, including the info_log object hold in a shared pointer of the options object it keeps. At that point it is not safe anymore to continue using the info logger to write the delayed logs.
With the patch, lock is released temporarily for log buffer to be flushed before "bg_compaction_scheduled_--" or "bg_flush_scheduled_--". In order to make sure we don't miss any pending flush or compaction, a new flag bg_schedule_needed_ is added, which is set to be true if there is a pending flush or compaction but not scheduled because of the max thread limit. If the flag is set to be true, the scheduling function will be called before compaction or flush thread finishes.
Thanks @igor for this finding!
Test Plan: make all check
Reviewers: haobo, igor
Reviewed By: haobo
CC: dhruba, ljin, yhchiang, igor, leveldb
Differential Revision: https://reviews.facebook.net/D16767
Summary:
I had this diff for a while to test column families implementation. Last night, I ran it sucessfully for 10 hours with the command:
time ./db_stress --threads=30 --ops_per_thread=200000000 --max_key=5000 --column_families=20 --clear_column_family_one_in=3000000 --verify_before_write=1 --reopen=50 --max_background_compactions=10 --max_background_flushes=10 --db=/tmp/db_stress
It is ready to be committed :)
Test Plan: Ran it for 10 hours
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16797
Summary: KSVObsolete is no longer nullptr and needs to be checked explicitly. Also did some minor code cleanup and added a stat counter to track superversion cleanups incurred in the foreground.
Test Plan: make check
Reviewers: ljin
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16701
Summary: Moved LogBuffer class to an internal header. Removed some unneccesary indirection. Enabled log buffer for BackgroundCallFlush. Forced log buffer flush right after Unlock to improve time ordering of info log.
Test Plan: make check; db_bench compare LOG output
Reviewers: sdong
Reviewed By: sdong
CC: leveldb, igor
Differential Revision: https://reviews.facebook.net/D16707
Summary: Adding the last missing function -- NewIterators(). Pretty simple implementation
Test Plan: added a unit test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16689
Summary:
Add a check at the end of GetImpl to release SuperVersion if it becomes
obsolete. Also do Scrape() inside InstallSuperVersion so it happens more
frequent.
Test Plan:
make all check
running asan_check now
Reviewers: igor, haobo, sdong, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16641
Summary: Now while the background thread is picking compactions, it writes out multiple info_logs, especially for universal compaction, which introduces a chance of waiting log writing in mutex, which is bad. To remove this risk, write all those info logs to a buffer and flush it after releasing the mutex.
Test Plan:
make all check
check the log lines while running some tests that trigger compactions.
Reviewers: haobo, igor, dhruba
Reviewed By: dhruba
CC: i.am.jin.lei, dhruba, yhchiang, leveldb, nkg-
Differential Revision: https://reviews.facebook.net/D16515
Summary:
Add an optional input parameter ReadOptions to DB::GetUpdateSince(),
which allows the verification of checksums to be disabled by setting
ReadOptions::verify_checksums to false.
Test Plan: Tests are done off-line and will not be included in the regular unit test.
Reviewers: igor
Reviewed By: igor
CC: leveldb, xjin, dhruba
Differential Revision: https://reviews.facebook.net/D16305
Summary: as title
Test Plan:
asan_check
will post results later
Reviewers: haobo, igor, dhruba, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16257
Summary: Big CF diff uncovered some lint errors. This diff fixes some of them. Not much to see here
Test Plan: make check
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16347
Summary: DBImpl now keeps a list of alive_log_files_. On every FindObsoleteFiles, it deletes all alive log files that are smaller than versions_->MinLogNumber()
Test Plan:
make check passes
no specific unit tests yet, will add
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16293
Summary: Adapt table properties to column family world
Test Plan: make check
Reviewers: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16161
Summary:
This is a huge diff and it was hectic, but the idea is actually quite simple. Every operation (Put, Get, etc.) done on default column family in DBTest is now forwarded to non-default ("pikachu"). The good news is that we had zero test failures! Column families look stable so far.
One interesting test that I adapted for column families is MultiThreadedTest. I replaced every Put() with a WriteBatch writing to all column families concurrently. Every Put in the write batch contains unique_id. Instead of Get() I do a multiget across all column families with the same key. If atomicity holds, I expect to see the same unique_id in all column families.
Test Plan: This is a test!
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16149
Summary: Provide a public API for users to access the table properties for each SSTable.
Test Plan: Added a unit tests to test the function correctness under differnet conditions.
Reviewers: haobo, dhruba, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16083
Summary:
The change to the public behavior:
* When opening a DB or creating new column family client gets a ColumnFamilyHandle.
* As long as column family handle is alive, client can do whatever he wants with it, even drop it
* Dropped column family can still be read from (using the column family handle)
* Added a new call CloseColumnFamily(). Client has to close all column families that he has opened before deleting the DB
* As soon as column family is closed, any calls to DB using that column family handle will fail (also any outstanding calls)
Internally:
* Ref-counting ColumnFamilyData
* New thread-safety for ColumnFamilySet
* Dropped column families are now completely dropped and their memory cleaned-up
Test Plan: added some tests to column_family_test
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16101
Summary: This covers existing table files before DB open happens and avoids contention on table cache
Test Plan: db_test
Reviewers: haobo, sdong, igor, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16089
Summary: Replaced most of occurrences of Options with more specific DBOptions. This brings us very close to supporting different configuration options for each column family.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15933
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
Summary: InternalStats is a messy thing, keeping both DB data and column family data. However, it's better off living in ColumnFamilyData than in DBImpl. For now, at least.
Test Plan: make check
Reviewers: dhruba, kailiu, haobo, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15879
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)
I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.
Test Plan: make check doesn't complain
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15873
Summary: user_comparator() is a Column Family property, not DBImpl
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15855
Summary: Support for different column families in Iterator and MultiGet code path.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15849
Summary:
Use super_version insider NewIterator to avoid Ref() each component
separately under mutex
The new added bench shows NewIterator QPS increases from 515K to 719K
No meaningful improvement for multiget I guess due to its relatively small
cost comparing to 90 keys fetch in the test.
Test Plan: unit test and db_bench
Reviewers: igor, sdong
Reviewed By: igor
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D15609
Summary: This diff enables non-default column families to get compacted both automatically and also by calling CompactRange()
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15813
Summary: Removed default_cfd_ from all flush code paths. This means we can now flush memtables from arbitrary column families!
Test Plan: Added a new unit test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15789
Summary: Making room for write will be the hardest part of the column family implementation. For now, I just iterate through all column families and run MakeRoomForWrite() for every one.
Test Plan: make check does not complain
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15597
Summary:
In DBImpl we keep track of some statistics internally and expose them via GetProperty(). This diff encapsulates all the internal statistics into a class InternalStatisics. Most of it is copy/paste.
Apart from cleaning up db_impl.cc, this diff is also necessary for Column families, since every column family should have its own CompactionStats, MakeRoomForWrite-stall stats, etc. It's much easier to keep track of it in every column family if it's nicely encapsulated in its own class.
Test Plan: make check
Reviewers: dhruba, kailiu, haobo, sdong, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15273
Summary: ColumnFamilyData grew a lot, there's much more data that it holds now. It makes more sense to encapsulate it better by making it a class.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15579
Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber()
Test Plan: Added a unit test in column_family_test
Reviewers: dhruba, haobo, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15537
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.
Test Plan: make check
Reviewers: dhruba, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15459
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.
Should I also add FsyncDir() to new files that get created by a compaction?
Test Plan: Confirmed that FsyncDir is returning Status::OK()
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D14751
Summary:
MemTableListVersion is to MemTableList what Version is to VersionSet. I took almost the same ideas to develop MemTableListVersion. The reason is to have copying std::list done in background, while flushing, rather than in foreground (MultiGet() and NewIterator()) under a mutex! Also, whenever we copied MemTableList, we copied also some MemTableList metadata (flush_requested_, commit_in_progress_, etc.), which was wasteful.
This diff avoids std::list copy under a mutex in both MultiGet() and NewIterator(). I created a small database with some number of immutable memtables, and creating 100.000 iterators in a single-thread (!) decreased from {188739, 215703, 198028} to {154352, 164035, 159817}. A lot of the savings come from code under a mutex, so we should see much higher savings with multiple threads. Creating new iterator is very important to LogDevice team.
I also think this diff will make SuperVersion obsolete for performance reasons. I will try it in the next diff. SuperVersion gave us huge savings on Get() code path, but I think that most of the savings came from copying MemTableList under a mutex. If we had MemTableListVersion, we would never need to copy the entire object (like we still do in NewIterator() and MultiGet())
Test Plan: `make check` works. I will also do `make valgrind_check` before commit
Reviewers: dhruba, haobo, kailiu, sdong, emayanke, tnovak
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15255
Summary: as title
Test Plan:
make all check
What else tests shall I cover?
Reviewers: igor, haobo
CC:
Differential Revision: https://reviews.facebook.net/D15339
Summary:
This diff implements a special type of iterator that doesn't create a snapshot
(can be used to read newly inserted data) and is optimized for doing sequential
reads.
TailingIterator uses current superversion number to determine whether to
invalidate its internal iterators. If the version hasn't changed, it can often
avoid doing expensive seeks over immutable structures (sst files and immutable
memtables).
Test Plan:
* new unit tests
* running LD with this patch
Reviewers: igor, dhruba, haobo, sdong, kailiu
Reviewed By: sdong
CC: leveldb, lovro, march
Differential Revision: https://reviews.facebook.net/D15285
Summary:
This diff does two things:
* Rethinks how we call Recover() with read_only option. Before, we call it with pointer to memtable where we'd like to apply those changes to. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive.
* Changes when we apply updates to manifest. Before, the process is to recover all the logs, flush it to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works good enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change:
1. Recover log 5, add new sst files to edit
2. Recover log 7, add new sst files to edit
3. Recover log 8, add new sst files to edit
4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recoverd (via SetLogNumber(9) function)
After the change, we'll do:
1. Recover log 5, commit the new sst files and set log 5 as recovered
2. Recover log 7, commit the new sst files and set log 7 as recovered
3. Recover log 8, commit the new sst files and set log 8 as recovered
The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In previous case, we'll have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in Recovery code path.
[1] I'm happy to dicuss this decison, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15237
Summary:
This had a few bugs.
1) bottom and top were reversed. top is for the max value but the callers were passing the max
value to bottom. The result is that the max sleep is used when n >= bottom.
2) one of the callers passed values with type double and these values are frequently between
1.0 and 2.0 so rounding will do some bad things
3) sometimes the function returned 0 when there should be a stall
With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload.
With the fix.
Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown
Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown
Without the fix.
Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown
Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown
Task ID: #
Blame Rev:
Test Plan:
run db_bench for --benchmarks=overwrite with IO-bound database
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15243
Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.
The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files
The diff continues on (and depends on) D15171
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong, emayanke
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15183
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.
This patch fixed the unit test.
Test Plan: Added a failing unit test and a fix, so it's not failing anymore.
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D14421
Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held.
Test Plan:
`make check`
I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them.
Reviewers: dhruba, haobo, kailiu
Reviewed By: haobo
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D15063
Summary:
Code was always compressing L0 files written by a memtable flush
when compression was enabled. Now this is done when
min_level_to_compress=0 for leveled compaction and when
universal_compaction_size_percent=-1 for universal compaction.
Task ID: #3416472
Blame Rev:
Test Plan:
ran db_bench with compression options
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba, igor, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14757
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()
I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.
Test Plan: Added a simple unit test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15033
Summary:
The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions
can cause it to deadlock when CompactRange() is called concurrently from
multiple threads. Imagine a following scenario with only two threads
(max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0):
1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets
bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction
(bg_compaction_scheduled_ is now 10) and waits for it to complete.
2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_
(now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to
drop to LargeNumber.
3. BG thread completes the first manual compaction, decrements
bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_.
Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again
(now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue,
thread #2 also goes to sleep (waiting on bg_cv_), without resetting
bg_compaction_scheduled_.
This diff attempts to address the problem by introducing a new counter
bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only
schedule manual compactions).
Test Plan:
I could pretty much consistently reproduce the deadlock with a program that
calls CompactRange(nullptr, nullptr) immediately after Write() from multiple
threads. This no longer happens with this patch.
Tests (make check) pass.
Reviewers: dhruba, igor, sdong, haobo
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14799
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733
It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.
Test Plan: Added unit test
Reviewers: dhruba, haobo, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15021
Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life.
Test Plan:
make check
I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change.
Reviewers: dhruba, haobo, sdong, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14985
Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.
Test Plan: ran `make check`
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15009
Summary:
We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions().
However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now.
Test Plan: make check
Reviewers: dhruba, haobo, sanketh
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14781
Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.
Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:
overwrite 56345 -> 63001
fillseq 193730 -> 185296
readrandom 771301 -> 1219803 (58% improvement!)
readrandom_smallblockcache 677609 -> 862850
readrandom_memtable_sst 710440 -> 1109223
readrandom_fillunique_random 221589 -> 247869
memtablefillrandom 105286 -> 92643
memtablereadrandom 763033 -> 1288862
Test Plan:
make asan_check
I am also running db_stress
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14679
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.
There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions
Here is the directory structure I use:
backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files
All the files are ref counted and deleted immediatelly when they get out of scope.
Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.
Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.
Also, `make asan_check`
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D14295
Summary: This would enable rocksdb users to get the db identity without depending on implementation details(storing that in IDENTITY file)
Test Plan: db/db_test (has identity checks)
Reviewers: dhruba, haobo, igor, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14463
Summary:
I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins.
The ones that are left are accessed infrequently and I think we're fine with keeping them.
Test Plan: make asan_check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14427
Summary:
The commit at 27bbef1180 had a memory leak
that was detected by valgrind. The memtable that has a refcount decrement
in MemTableList::InstallMemtableFlushResults was not freed.
Test Plan: valgrind ./db_test --leak-check=full
Reviewers: igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14391
Summary: We need access to options for BackupableDB
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14331
Summary: This is part of https://reviews.facebook.net/D14295 -- smaller diff that is easier to review
Test Plan: make asan_check
Reviewers: dhruba, haobo, emayanke
Reviewed By: emayanke
CC: leveldb, kailiu, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14301
Summary:
Created a unittest that verifies that automatic deletion performed by PurgeObsoleteFiles() works correctly.
Also, few small fixes on the logic part -- call version_set_->GetObsoleteFiles() in FindObsoleteFiles() instead of on some arbitrary positions.
Test Plan: Created a unit test
Reviewers: dhruba, haobo, nkg-
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14079
Summary:
@haobo's suggestions from https://reviews.facebook.net/D13827
Renaming some variables, deprecating purge_log_after_flush, changing for loop into auto for loop.
I have not implemented deleting objects outside of mutex yet because it would require a big code change - we would delete object in db_impl, which currently does not know anything about object because it's defined in version_edit.h (FileMetaData). We should do it at some point, though.
Test Plan: Ran deletefile_test
Reviewers: haobo
Reviewed By: haobo
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D14025
Summary:
Here's one solution we discussed on speeding up FindObsoleteFiles. Keep a set of all files in DBImpl and update the set every time we create a file. I probably missed few other spots where we create a file.
It might speed things up a bit, but makes code uglier. I don't really like it.
Much better approach would be to abstract all file handling to a separate class. Think of it as layer between DBImpl and Env. Having a separate class deal with file namings and deletion would benefit both code cleanliness (especially with huge DBImpl) and speed things up. It will take a huge effort to do this, though.
Let's discuss offline today.
Test Plan: Ran ./db_stress, verified that files are getting deleted
Reviewers: dhruba, haobo, kailiu, emayanke
Reviewed By: dhruba
Differential Revision: https://reviews.facebook.net/D13827
Summary:
Archive cleaning will still happen every WAL_ttl seconds
but archived logs will be deleted only if archive size
is greater then a WAL_size_limit value.
Empty archived logs will be deleted evety WAL_ttl.
Test Plan:
1. Unit tests pass.
2. Benchmark.
Reviewers: emayanke, dhruba, haobo, sdong, kailiu, igor
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13869
Summary:
strict essentially means that we MUST find the startsequence. Thus we should return if starteSequence is not found in the first file in case strict is set. This will take care of ending the iterator in case of permanent gaps due to corruptions in the log files
Also created NextImpl function that will have internal variable to distinguish whether Next is being called from StartSequence or by application.
Set NotFoudn::gaps status to give an indication of gaps happeneing.
Polished the inline documentation at various places
Test Plan:
* db_repl_stress test
* db_test relating to transaction log iterator
* fbcode/wormhole/rocksdb/rocks_log_iterator
* sigma production machine sigmafio032.prn1
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13689