rocksdb

Author	SHA1	Message	Date
Yi Wu	7c27f3ddc6	Blob DB: fix snapshot handling Summary: Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is 1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish 2. otherwise, it is the latest sequence of the file Closes https://github.com/facebook/rocksdb/pull/3087 Differential Revision: D6182519 Pulled By: yiwu-arbug fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45	2017-11-19 19:07:15 -08:00
Mikhail Antonov	7fe3b32896	Added support for differential snapshots Summary: The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2). This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages. From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff". This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR. For now, what's done here according to initial discussions: Preserving deletes: - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion. - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum. - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum. Iterator changes: - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum. TableCache changes: - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span. What's left: - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type. Closes https://github.com/facebook/rocksdb/pull/2999 Differential Revision: D6175602 Pulled By: mikhail-antonov fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900	2017-11-01 18:56:43 -07:00
Maysam Yabandeh	17731a43a6	WritePrepared Txn: Optimize for recoverable state Summary: GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit. This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush. Closes https://github.com/facebook/rocksdb/pull/3071 Differential Revision: D6148023 Pulled By: maysamyabandeh fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7	2017-11-01 17:26:46 -07:00
Maysam Yabandeh	63822eb761	Enable two write queues for transactions Summary: Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config. Closes https://github.com/facebook/rocksdb/pull/3046 Differential Revision: D6106534 Pulled By: maysamyabandeh fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c	2017-10-23 14:27:04 -07:00
Maysam Yabandeh	7e38238981	WritePrepared Txn: Disable GC during recovery Summary: Disables GC during recovery of a WritePrepared txn db to avoid GCing uncommitted key values. Closes https://github.com/facebook/rocksdb/pull/2980 Differential Revision: D6000191 Pulled By: maysamyabandeh fbshipit-source-id: fc4d522c643d24ebf043f811fe4ecd0dd0294675	2017-10-18 09:11:50 -07:00
Yi Wu	eaaef91178	Blob DB: Store blob index as kTypeBlobIndex in base db Summary: Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db. Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it. Closes https://github.com/facebook/rocksdb/pull/3000 Differential Revision: D6050044 Pulled By: yiwu-arbug fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632	2017-10-17 17:28:11 -07:00
Yi Wu	fb4ae4d810	fix DBImpl::NewInternalIterator super-version leak on failure Summary: Close #2955 Closes https://github.com/facebook/rocksdb/pull/2960 Differential Revision: D5962872 Pulled By: yiwu-arbug fbshipit-source-id: a6472d5c015bea3dc476c572ff5a5c90259e6059	2017-10-11 14:57:43 -07:00
Andrew Kryczka	70aa942153	fix file numbers after repair Summary: The file numbers assigned post-repair were sometimes smaller than older files' numbers due to `LogAndApply` saving the wrong next file number in the manifest. - Mark the highest file seen during repair as used before `LogAndApply` so the correct next file number will be stored. - Renamed `MarkFileNumberUsedDuringRecovery` to `MarkFileNumberUsed` since now it's used during repair in addition to during recovery - Added `TEST_Current_Next_FileNo` to expose the next file number for the unit test. Closes https://github.com/facebook/rocksdb/pull/2988 Differential Revision: D6018083 Pulled By: ajkr fbshipit-source-id: 3f25cbf74439cb8f16dd12af90b67f9f9f75e718	2017-10-10 13:12:37 -07:00
Yi Wu	8c392a31d7	WritePrepared Txn: Iterator Summary: On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed. Closes https://github.com/facebook/rocksdb/pull/2981 Differential Revision: D6001471 Pulled By: yiwu-arbug fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe	2017-10-09 17:15:28 -07:00
Yi Wu	d1b74b0c82	WritePrepared Txn: Compaction/Flush Summary: Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator: * check if key has been committed. If not, output uncommitted keys AS-IS. * use SnapshotChecker to check if key is visible to a snapshot when in need. * do not output key with seq = 0 if the key is not committed. Closes https://github.com/facebook/rocksdb/pull/2926 Differential Revision: D5902907 Pulled By: yiwu-arbug fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320	2017-10-06 10:41:53 -07:00
Adrien Schildknecht	01542400a8	Inform caller when rocksdb is stalling writes Summary: Add a new function in Listener to let the caller know when rocksdb is stalling writes. Closes https://github.com/facebook/rocksdb/pull/2897 Differential Revision: D5860124 Pulled By: schischi fbshipit-source-id: ee791606169aa64f772c86f817cebf02624e05e1	2017-10-05 18:11:43 -07:00
Yi Wu	d1cab2b64e	Add ValueType::kTypeBlobIndex Summary: Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to 1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex. 2. Make rocksdb able to detect if the db contains value written by blob db, if so return error. 3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type). The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob(). Changes on blob db side will be in a separate patch. Closes https://github.com/facebook/rocksdb/pull/2886 Differential Revision: D5838431 Pulled By: yiwu-arbug fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca	2017-10-03 09:11:23 -07:00
Maysam Yabandeh	385049baf2	WritePrepared Txn: Recovery Summary: Recover txns from the WAL. Also added some unit tests. Closes https://github.com/facebook/rocksdb/pull/2901 Differential Revision: D5859596 Pulled By: maysamyabandeh fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0	2017-09-28 16:56:45 -07:00
Quinn Jarrell	6a541afcc4	Make bytes_per_sync and wal_bytes_per_sync mutable Summary: SUMMARY Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed. TEST PLAN ran make check all passed Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value. Closes https://github.com/facebook/rocksdb/pull/2893 Reviewed By: yiwu-arbug Differential Revision: D5845814 Pulled By: TheRushingWookie fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd	2017-09-27 17:49:45 -07:00
Maysam Yabandeh	60beefd6e0	WritePrepared Txn: Advance seq one per batch Summary: By default the seq number in DB is increased once per written key. WritePrepared txns requires the seq to be increased once per the entire batch so that the seq would be used as the prepare timestamp by which the transaction is identified. Also we need to increase seq for the commit marker since it would give a unique id to the commit timestamp of transactions. Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch. Closes https://github.com/facebook/rocksdb/pull/2885 Differential Revision: D5837843 Pulled By: maysamyabandeh fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c	2017-09-18 14:45:08 -07:00
Maysam Yabandeh	f46464d383	write-prepared txn: call IsInSnapshot Summary: This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot. Closes https://github.com/facebook/rocksdb/pull/2850 Differential Revision: D5787375 Pulled By: maysamyabandeh fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c	2017-09-11 09:14:48 -07:00
Siying Dong	0e99323ac2	Fix CLANG Analyze Summary: clang analyze shows warnings after we upgrade the CLANG version. Fix them. Closes https://github.com/facebook/rocksdb/pull/2839 Differential Revision: D5769060 Pulled By: siying fbshipit-source-id: 3f8e4df715590d8984f6564b608fa08cfdfa5f14	2017-09-07 14:28:06 -07:00
Artem Danilov	8a6708f5f2	Extend property map with compaction stats Summary: This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch. Closes https://github.com/facebook/rocksdb/pull/2794 Differential Revision: D5717676 Pulled By: Tema fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a	2017-08-30 15:26:55 -07:00
Maysam Yabandeh	eb6425303e	Update WritePrepared with the pseudo code Summary: Implement the main body of WritePrepared pseudo code. This includes PrepareInternal and CommitInternal, as well as AddCommitted which updates the commit map. It also provides a IsInSnapshot method that could be later called form the read path to decide if a version is in the read snapshot or it should other be skipped. This patch lacks unit tests and does not attempt to offer an efficient implementation. The idea is that to have the API specified so that we can work on related tasks in parallel. Closes https://github.com/facebook/rocksdb/pull/2713 Differential Revision: D5640021 Pulled By: maysamyabandeh fbshipit-source-id: bfa7a05e8d8498811fab714ce4b9c21530514e1c	2017-08-16 16:57:47 -07:00
Aaron G	7848f0b24c	add VerifyChecksum() to db.h Summary: We need a tool to check any sst file corruption in the db. It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status. Closes https://github.com/facebook/rocksdb/pull/2498 Differential Revision: D5324269 Pulled By: lightmark fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b	2017-08-09 15:58:13 -07:00
Maysam Yabandeh	bdc056f8aa	Refactor PessimisticTransaction Summary: This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl. Closes https://github.com/facebook/rocksdb/pull/2691 Differential Revision: D5569464 Pulled By: maysamyabandeh fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325	2017-08-07 16:12:29 -07:00
Andrew Kryczka	cc01985db0	Introduce bottom-pri thread pool for large universal compactions Summary: When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs. This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions. - Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads. - Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default. - Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic. - Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`). - Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free. Closes https://github.com/facebook/rocksdb/pull/2580 Differential Revision: D5422916 Pulled By: ajkr fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333	2017-08-03 15:43:29 -07:00
Maysam Yabandeh	c3d5c4d38a	Refactor TransactionImpl Summary: This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation. Closes https://github.com/facebook/rocksdb/pull/2676 Differential Revision: D5549998 Pulled By: maysamyabandeh fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec	2017-08-03 08:57:22 -07:00
Siying Dong	21696ba502	Replace dynamic_cast<> Summary: Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available. Some nontrivial changes: 1. Add Comparator::GetRootComparator() to get around the internal comparator hack 2. Add the two experiemental functions to DB 3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string 4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric. Closes https://github.com/facebook/rocksdb/pull/2645 Differential Revision: D5502723 Pulled By: siying fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c	2017-07-28 16:27:16 -07:00
Siying Dong	e67b35c076	Add Iterator::Refresh() Summary: Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily. Closes https://github.com/facebook/rocksdb/pull/2621 Differential Revision: D5464500 Pulled By: siying fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9	2017-07-24 10:54:37 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
Maysam Yabandeh	1e34d07e18	Simplify and document sync rules for logs_ etc Summary: Adding/Correcting inline comments and clarify the sync rules. To make it simple to reason, the rules are a big general which ended up to some extra synchronizations. However such synchronizations are not on the fast path, and they are worth the simplicity. Closes https://github.com/facebook/rocksdb/pull/2517 Differential Revision: D5348239 Pulled By: maysamyabandeh fbshipit-source-id: ff2e59fb1e568c122d2cdbf598310f3613b7d212	2017-06-30 09:42:28 -07:00
Maysam Yabandeh	e9f91a5176	Add a fetch_add variation to AddDBStats Summary: AddDBStats is in two steps of load and store, which is more efficient than fetch_add. This is however not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex which is less efficient that fetch_add. This patch adds the option to do fetch_add when AddDBStats. The results for my 2pc benchmark on sysbench is: - vanilla: 68618 tps - removing mutex on AddDBStats (unsafe): 69767 tps - fetch_add for all AddDBStats: 69200 tps - fetch_add only for concurrently access AddDBStats (this patch): 69579 tps Closes https://github.com/facebook/rocksdb/pull/2505 Differential Revision: D5330656 Pulled By: maysamyabandeh fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383	2017-06-29 17:12:19 -07:00
Maysam Yabandeh	499ebb3ab5	Optimize for serial commits in 2PC Summary: Throughput: 46k tps in our sysbench settings (filling the details later) The idea is to have the simplest change that gives us a reasonable boost in 2PC throughput. Major design changes: 1. The WAL file internal buffer is not flushed after each write. Instead it is flushed before critical operations (WAL copy via fs) or when FlushWAL is called by MySQL. Flushing the WAL buffer is also protected via mutex_. 2. Use two sequence numbers: last seq, and last seq for write. Last seq is the last visible sequence number for reads. Last seq for write is the next sequence number that should be used to write to WAL/memtable. This allows to have a memtable write be in parallel to WAL writes. 3. BatchGroup is not used for writes. This means that we can have parallel writers which changes a major assumption in the code base. To accommodate for that i) allow only 1 WriteImpl that intends to write to memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes come via group commit phase which is serial anyway, ii) make all the parts in the code base that assumed to be the only writer (via EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are protected via a stat_mutex_. Note: the first commit has the approach figured out but is not clean. Submitting the PR anyway to get the early feedback on the approach. If we are ok with the approach I will go ahead with this updates: 0) Rebase with Yi's pipelining changes 1) Currently batching is disabled by default to make sure that it will be consistent with all unit tests. Will make this optional via a config. 2) A couple of unit tests are disabled. They need to be updated with the serial commit of 2PC taken into account. 3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires releasing mutex_ beforehand (the same way EnterUnbatched does). This needs to be cleaned up. Closes https://github.com/facebook/rocksdb/pull/2345 Differential Revision: D5210732 Pulled By: maysamyabandeh fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4	2017-06-24 14:11:29 -07:00
Andrew Kryczka	71f5bcb730	Introduce OnBackgroundError callback Summary: Some users want to prevent rocksdb from entering read-only mode in certain error cases. This diff gives them a callback, `OnBackgroundError`, that they can use to achieve it. - call `OnBackgroundError` every time we consider setting `bg_error_`. Use its result to assign `bg_error_` but not to change the function's return status. - classified calls using `BackgroundErrorReason` to give the callback some info about where the error happened - renamed `ParanoidCheck` to something more specific so we can provide a clear `BackgroundErrorReason` - unit tests for the most common cases: flush or compaction errors Closes https://github.com/facebook/rocksdb/pull/2477 Differential Revision: D5300190 Pulled By: ajkr fbshipit-source-id: a0ea4564249719b83428e3f4c6ca2c49e366e9b3	2017-06-22 19:41:50 -07:00
Siying Dong	52a7f38b19	WriteOptions.low_pri which can throttle low pri writes if needed Summary: If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown. Closes https://github.com/facebook/rocksdb/pull/2369 Differential Revision: D5127619 Pulled By: siying fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0	2017-06-05 15:02:35 -07:00
Andrew Kryczka	bb01c1880c	Introduce max_background_jobs mutable option Summary: - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure. - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions. - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled. Closes https://github.com/facebook/rocksdb/pull/2205 Differential Revision: D4937461 Pulled By: ajkr fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9	2017-05-24 11:29:08 -07:00
Yi Wu	07bdcb91fe	New WriteImpl to pipeline WAL/memtable write Summary: PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allow to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queue is maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join memtable writer queue if needed. The benefit of this approach is that 1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit write thread once WAL write is finish. They don't need to wait for memtable writes in case of group commit. 2. Pending writers only need to wait for previous WAL writer finish to be able to join the write thread, instead of wait also for previous memtable writes. Merging #2056 and #2058 into this PR. Closes https://github.com/facebook/rocksdb/pull/2286 Differential Revision: D5054606 Pulled By: yiwu-arbug fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d	2017-05-19 14:26:42 -07:00
yizhu.sun	f5ba131bf8	Fixed some spelling mistakes Summary: Closes https://github.com/facebook/rocksdb/pull/2314 Differential Revision: D5079601 Pulled By: sagar0 fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349	2017-05-17 23:12:36 -07:00
hyunwoo	0ebdd70579	fixed typo Summary: fixed typo Closes https://github.com/facebook/rocksdb/pull/2312 Differential Revision: D5079631 Pulled By: sagar0 fbshipit-source-id: e4c8d1d89b244ee69e9dea1dd013227cc5241026	2017-05-17 16:41:49 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
Yi Wu	2cd00773c7	Add bulk create/drop column family API Summary: Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time. The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO. Closes https://github.com/facebook/rocksdb/pull/2248 Differential Revision: D5001578 Pulled By: yiwu-arbug fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03	2017-05-07 23:20:46 -07:00
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	2017-04-27 18:06:12 -07:00
Siying Dong	c49d704656	Add DB:ResetStats() Summary: Add a function to allow users to reset internal stats without restarting the DB. Closes https://github.com/facebook/rocksdb/pull/2167 Differential Revision: D4907939 Pulled By: siying fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68	2017-04-18 16:56:48 -07:00
Yi Wu	0fcdccc33e	Blob storage helper methods Summary: Split out interfaces needed for blob storage from #1560, including * CompactionEventListener and OnFlushBegin listener interfaces. * Blob filename support. Closes https://github.com/facebook/rocksdb/pull/2169 Differential Revision: D4905463 Pulled By: yiwu-arbug fbshipit-source-id: 564e73448f1b7a367e5e46216a521e57ea9011b5	2017-04-18 12:42:38 -07:00
Yi Wu	e9e6e53247	Simplify write thread logic Summary: The concept about early exit in write thread implementation is a confusing one. It means that if early exit is allowed, batch group leader will not responsible to exit the batch group, but the last finished writer do. In case we need to mark log synced, or encounter memtable insert error, early exit is disallowed. This patch remove such a concept by: * In all cases, the last finished writer (not necessary leader) is responsible to exit batch group. * In case of parallel memtable write, leader will also mark log synced after memtable insert and before signal finish (call `CompleteParallelWorker()`). The purpose is to allow mark log synced (which require locking mutex) can run in parallel to memtable insert in other writers. * The last finish writer should handle memtable insert error (update bg_error_) before exiting batch group. Closes https://github.com/facebook/rocksdb/pull/2134 Differential Revision: D4869667 Pulled By: yiwu-arbug fbshipit-source-id: aec170847c85b90f4179d6a4608a4fe1361544e3	2017-04-13 16:12:04 -07:00
Siying Dong	d2dce5611a	Move some files under util/ to separate dirs Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef	2017-04-05 19:09:16 -07:00
Siying Dong	ce64b8b719	Divide db/db_impl.cc Summary: db_impl.cc is too large to manage. Divide db_impl.cc into db/db_impl.cc, db/db_impl_compaction_flush.cc, db/db_impl_files.cc, db/db_impl_open.cc and db/db_impl_write.cc. Closes https://github.com/facebook/rocksdb/pull/2095 Differential Revision: D4838188 Pulled By: siying fbshipit-source-id: c5f3059	2017-04-05 17:24:19 -07:00
Yi Wu	9e44531803	Refactor WriteImpl (pipeline write part 1) Summary: Refactor WriteImpl() so when I plug-in the pipeline write code (which is an alternative approach for WriteThread), some of the logic can be reuse. I split out the following methods from WriteImpl(): * PreprocessWrite() * HandleWALFull() (previous MaybeFlushColumnFamilies()) * HandleWriteBufferFull() * WriteToWAL() Also adding a constructor to WriteThread::Writer, and move WriteContext into db_impl.h. No real logic change in this patch. Closes https://github.com/facebook/rocksdb/pull/2042 Differential Revision: D4781014 Pulled By: yiwu-arbug fbshipit-source-id: d45ca18	2017-04-04 10:24:32 -07:00
Siying Dong	67d7623794	Expose the stalling information through DB::GetProperty() Summary: Add two DB properties: rocksdb.actual_delayed_write_rate and rocksdb.is_write_stooped, for people to know whether current writes are being throttled. Closes https://github.com/facebook/rocksdb/pull/2043 Differential Revision: D4782975 Pulled By: siying fbshipit-source-id: 6b2f5cf	2017-03-29 11:54:20 -07:00
Aaron Gao	3e56c7e0c4	make total_log_size_ atomic Summary: make total_log_size_ atomic to avoid overflow caused by data race. Closes https://github.com/facebook/rocksdb/pull/2019 Differential Revision: D4751391 Pulled By: siying fbshipit-source-id: fac01dd	2017-03-22 11:54:40 -07:00
Maysam Yabandeh	11526252cc	Pinnableslice (2nd attempt) Summary: PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: value 100 byte: 1.8% regular, 1.2% merge values value 1k byte: 11.5% regular, 7.5% merge values value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The d Closes https://github.com/facebook/rocksdb/pull/1756 Differential Revision: D4391738 Pulled By: maysamyabandeh fbshipit-source-id: 6f3edd3	2017-03-13 11:54:10 -07:00
Aaron Gao	e877afa08b	Remove bulk loading and auto_roll_logger in rocksdb_lite Summary: shrink lite size Closes https://github.com/facebook/rocksdb/pull/1929 Differential Revision: D4622059 Pulled By: siying fbshipit-source-id: 050b796	2017-02-28 11:09:11 -08:00
Yi Wu	c2247dc1c7	Make DBImpl::has_unpersisted_data_ atomic Summary: Seems to me `has_unpersisted_data_` is read from read thread and write from write thread concurrently without synchronization. Making it an atomic. I update the logic not because seeing any problem with it, but it just feel confusing. Closes https://github.com/facebook/rocksdb/pull/1869 Differential Revision: D4555837 Pulled By: yiwu-arbug fbshipit-source-id: eff2ab8	2017-02-13 18:54:13 -08:00
Vitaliy Liptchinsky	1aaa898cf1	Adding GetApproximateMemTableStats method Summary: Added method that returns approx num of entries as well as size for memtables. Closes https://github.com/facebook/rocksdb/pull/1841 Differential Revision: D4511990 Pulled By: VitaliyLi fbshipit-source-id: 9a4576e	2017-02-06 14:54:16 -08:00

1 2 3 4 5 ...

440 Commits