rocksdb

Author	SHA1	Message	Date
Yi Wu	f90ced92f5	Blob DB: fix snapshot handling Summary: Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is 1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish 2. otherwise, it is the latest sequence of the file Closes https://github.com/facebook/rocksdb/pull/3087 Differential Revision: D6182519 Pulled By: yiwu-arbug fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45	2017-11-02 23:40:01 -07:00
Yi Wu	11bacd5787	Blob DB: Fix flaky BlobDBTest::GCExpiredKeyWhileOverwriting test Summary: The test intent to wait until key being overwritten until proceed with garbage collection. It failed to wait for `PutUntil` finally finish. Fixing it. Closes https://github.com/facebook/rocksdb/pull/3116 Differential Revision: D6222833 Pulled By: yiwu-arbug fbshipit-source-id: fa9b57a772b92a66cf250b44e7975c43f62f45c5	2017-11-02 23:39:36 -07:00
Sagar Vemuri	f98efcb1e3	Blob DB: Evict oldest blob file when close to blob db size limit Summary: Evict oldest blob file and put it in obsolete_files list when close to blob db size limit. The file will be delete when the `DeleteObsoleteFiles` background job runs next time. For now I set `kEvictOldestFileAtSize` constant, which controls when to evict the oldest file, at 90%. It could be tweaked or made into an option if really needed; I didn't want to expose it as an option pre-maturely as there are already too many :) . Closes https://github.com/facebook/rocksdb/pull/3094 Differential Revision: D6187340 Pulled By: sagar0 fbshipit-source-id: 687f8262101b9301bf964b94025a2fe9d8573421	2017-11-02 23:39:21 -07:00
Yi Wu	c1e99eddc8	Blob DB: cleanup unused options Summary: * cleanup num_concurrent_simple_blobs. We don't do concurrent writes (by taking write_mutex_) so it doesn't make sense to have multiple non TTL files open. We can revisit later when we want to improve writes. * cleanup eviction callback. we don't have plan to use it now. * rename s/open_simple_blob_files_/open_non_ttl_file_/ and s/open_blob_files_/open_ttl_files_/ to avoid confusion. Closes https://github.com/facebook/rocksdb/pull/3088 Differential Revision: D6182598 Pulled By: yiwu-arbug fbshipit-source-id: 99e6f5e01fa66d31309cdb06ce48502464bac6ad	2017-11-02 23:39:05 -07:00
Yi Wu	9e82540901	Blob DB: update blob file format Summary: Changing blob file format and some code cleanup around the change. The change with blob log format are: * Remove timestamp field in blob file header, blob file footer and blob records. The field is not being use and often confuse with expiration field. * Blob file header now come with column family id, which always equal to default column family id. It leaves room for future support of column family. * Compression field in blob file header now is a standalone byte (instead of compact encode with flags field) * Blob file footer now come with its own crc. * Key length now being uint64_t instead of uint32_t * Blob CRC now checksum both key and value (instead of value only). * Some reordering of the fields. The list of cleanups: * Better inline comments in blob_log_format.h * rename ttlrange_t and snrange_t to ExpirationRange and SequenceRange respectively. * simplify blob_db::Reader * Move crc checking logic to inside blob_log_format.cc Closes https://github.com/facebook/rocksdb/pull/3081 Differential Revision: D6171304 Pulled By: yiwu-arbug fbshipit-source-id: e4373e0d39264441b7e2fbd0caba93ddd99ea2af	2017-11-02 23:37:56 -07:00
Yi Wu	d66bb21e18	Blob DB: Inline small values in base DB Summary: Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values. Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches. See blob_index.h for the new blob index format. There are 4 cases when writing a new key: * small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue) * small value w/ TTL: put (type, expiration, value) to base db. * large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db. * large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db. Closes https://github.com/facebook/rocksdb/pull/3066 Differential Revision: D6142115 Pulled By: yiwu-arbug fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0	2017-11-02 23:37:16 -07:00
Sagar Vemuri	05d5c575ac	Return write error on reaching blob dir size limit Summary: I found that we continue accepting writes even when the blob db goes beyond the configured blob directory size limit. Now, we return an error for writes on reaching `blob_dir_size` limit and if `is_fifo` is set to false. (We cannot just drop any file when `is_fifo` is true.) Deleting the oldest file when `is_fifo` is true will be handled in a later PR. Closes https://github.com/facebook/rocksdb/pull/3060 Differential Revision: D6136156 Pulled By: sagar0 fbshipit-source-id: 2f11cb3f2eedfa94524fbfa2613dd64bfad7a23c	2017-11-02 23:37:16 -07:00
Yi Wu	2b8893b9e4	Blob DB: Store blob index as kTypeBlobIndex in base db Summary: Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db. Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it. Closes https://github.com/facebook/rocksdb/pull/3000 Differential Revision: D6050044 Pulled By: yiwu-arbug fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632	2017-11-02 23:37:07 -07:00
Yi Wu	419b93c56f	Blob DB: not writing sequence number as blob record footer Summary: Previously each time we write a blob we write blog_record_header + key + value + blob_record_footer to blob log. The footer only contains a sequence and a crc for the sequence number. The sequence number was used in garbage collection to verify the value is recent. After #2703 we moved to use optimistic transaction and no longer use sequence number from the footer. Remove the footer altogether. There's another usage of sequence number and we are keeping it: Each blob log file keep track of sequence number range of keys in it, and use it to check if it is reference by a snapshot, before being deleted. Closes https://github.com/facebook/rocksdb/pull/3005 Differential Revision: D6057585 Pulled By: yiwu-arbug fbshipit-source-id: d6da53c457a316e9723f359a1b47facfc3ffe090	2017-11-02 23:07:27 -07:00
Zhongyi Xie	3747361235	add GetLiveFiles and GetLiveFilesMetaData for BlobDB Summary: Closes https://github.com/facebook/rocksdb/pull/2976 Differential Revision: D5994759 Pulled By: miasantreble fbshipit-source-id: 985c31dccb957cb970c302f813cd07a1e8cb6438	2017-11-02 23:03:54 -07:00
Yi Wu	c293472908	Add ValueType::kTypeBlobIndex Summary: Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to 1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex. 2. Make rocksdb able to detect if the db contains value written by blob db, if so return error. 3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type). The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob(). Changes on blob db side will be in a separate patch. Closes https://github.com/facebook/rocksdb/pull/2886 Differential Revision: D5838431 Pulled By: yiwu-arbug fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca	2017-11-02 23:02:50 -07:00
Yi Wu	eae53de3b5	Make it explicit blob db doesn't support CF Summary: Blob db doesn't currently support column families. Return NotSupported status explicitly. Closes https://github.com/facebook/rocksdb/pull/2825 Differential Revision: D5757438 Pulled By: yiwu-arbug fbshipit-source-id: 44de9408fd032c98e8ae337d4db4ed37169bd9fa	2017-11-02 22:29:42 -07:00
Yi Wu	88595c882a	Add DB::Properties::kEstimateOldestKeyTime Summary: With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property. My plan is to override the property with a more accurate value with blob db, where we actually have timestamp. Closes https://github.com/facebook/rocksdb/pull/2842 Differential Revision: D5770600 Pulled By: yiwu-arbug fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a	2017-10-23 22:12:30 -07:00
Yi Wu	503db684f7	make blob file close synchronous Summary: Fixing flaky blob_db_test. To close a blob file, blob db used to add a CloseSeqWrite job to the background thread to close it. Changing file close to be synchronous in order to simplify logic, and fix flaky blob_db_test. Closes https://github.com/facebook/rocksdb/pull/2787 Differential Revision: D5699387 Pulled By: yiwu-arbug fbshipit-source-id: dd07a945cd435cd3808fce7ee4ea57817409474a	2017-08-25 10:41:49 -07:00
yiwu-arbug	5b68b114f1	Blob db create a snapshot before every read Summary: If GC kicks in between * A Get() reads index entry from base db. * The Get() read from a blob file The GC can delete the corresponding blob file, making the key not found. Fortunately we have existing logic to avoid deleting a blob file if it is referenced by a snapshot. So the fix is to explicitly create a snapshot before reading index entry from base db. Closes https://github.com/facebook/rocksdb/pull/2754 Differential Revision: D5655956 Pulled By: yiwu-arbug fbshipit-source-id: e4ccbc51331362542e7343175bbcbdea5830f544	2017-08-20 18:26:19 -07:00
yiwu-arbug	4624ae52c9	GC the oldest file when out of space Summary: When out of space, blob db should GC the oldest file. The current implementation GC the newest one instead. Fixing it. Closes https://github.com/facebook/rocksdb/pull/2757 Differential Revision: D5657611 Pulled By: yiwu-arbug fbshipit-source-id: 56c30a4c52e6ab04551dda8c5c46006d4070b28d	2017-08-20 17:11:06 -07:00
follitude	ac8fb77afd	fix some misspellings Summary: PTAL ajkr Closes https://github.com/facebook/rocksdb/pull/2750 Differential Revision: D5648052 Pulled By: ajkr fbshipit-source-id: 7cd1ddd61364d5a55a10fdd293fa74b2bf89dd98	2017-08-16 21:57:20 -07:00
yiwu-arbug	e5a1b727c0	Fix blob DB transaction usage while GC Summary: While GC, blob DB use optimistic transaction to delete or replace the index entry in LSM, to guarantee correctness if there's a normal write writing to the same key. However, the previous implementation doesn't call SetSnapshot() nor use GetForUpdate() of transaction API, instead it do its own sequence number checking before beginning the transaction. A normal write can sneak in after the sequence number check and overwrite the key, and the GC will delete or relocate the old version of the key by mistake. Update the code to property use GetForUpdate() to check the existing index entry. After the patch the sequence number store with each blob record is useless, So I'm considering remove the sequence number from blob record, in another patch. Closes https://github.com/facebook/rocksdb/pull/2703 Differential Revision: D5589178 Pulled By: yiwu-arbug fbshipit-source-id: 8dc960cd5f4e61b36024ba7c32d05584ce149c24	2017-08-11 12:43:17 -07:00
Yi Wu	92afe830f9	Update all blob db TTL and timestamps to uint64_t Summary: The current blob db implementation use mix of int32_t, uint32_t and uint64_t for TTL and expiration. Update all timestamps to uint64_t for consistency. Closes https://github.com/facebook/rocksdb/pull/2683 Differential Revision: D5557103 Pulled By: yiwu-arbug fbshipit-source-id: e4eab2691629a755e614e8cf1eed9c3a681d0c42	2017-08-03 17:57:30 -07:00
Yi Wu	0b814ba92d	Allow concurrent writes to blob db Summary: I'm going with brute-force solution, just letting Put() and Write() holding a mutex before writing. May improve concurrent writing with finer granularity locking later. Closes https://github.com/facebook/rocksdb/pull/2682 Differential Revision: D5552690 Pulled By: yiwu-arbug fbshipit-source-id: 039abd675b5d274a7af6428198d1733cafecef4c	2017-08-03 15:11:26 -07:00
Yi Wu	2c45ada4c4	Blob DB garbage collection should keep keys with newer version Summary: Fix the bug where if blob db garbage collection revmoe keys with newer version. It shouldn't delete the key from base db when sequence number in base db is not equal to the one in blob log. Closes https://github.com/facebook/rocksdb/pull/2678 Differential Revision: D5549752 Pulled By: yiwu-arbug fbshipit-source-id: abb8649260963b5c389748023970fd746279d227	2017-08-03 13:12:12 -07:00
Yi Wu	1900771bd2	Dump Blob DB options to info log Summary: * Dump blob db options to info log * Remove BlobDBOptionsImpl to disallow dynamic cast BlobDBOptions into BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645 * Change some of the default options * Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon. Closes https://github.com/facebook/rocksdb/pull/2671 Differential Revision: D5529912 Pulled By: yiwu-arbug fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7	2017-08-01 13:01:47 -07:00
Yi Wu	6083bc79f8	Blob DB TTL extractor Summary: Introducing blob_db::TTLExtractor to replace extract_ttl_fn. The TTL extractor can be use to extract TTL from keys insert with Put or WriteBatch. Change over existing extract_ttl_fn are: * If value is changed, it will be return via std::string* (rather than Slice). With Slice the new value has to be part of the existing value. With std::string* the limitation is removed. * It can optionally return TTL or expiration. Other changes in this PR: * replace `std::chrono::system_clock` with `Env::NowMicros` so that I can mock time in tests. * add several TTL tests. * other minor naming change. Closes https://github.com/facebook/rocksdb/pull/2659 Differential Revision: D5512627 Pulled By: yiwu-arbug fbshipit-source-id: 0dfcb00d74d060b8534c6130c808e4d5d0a54440	2017-07-27 23:26:04 -07:00
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	2017-07-15 16:11:23 -07:00
foolenough	21b17d7686	Fix BlobDB::Get which only get out the value offset Summary: Blob db use StackableDB::get which only get out the value offset, but not the value. Fix by making BlobDB::Get override the designated getter. Closes https://github.com/facebook/rocksdb/pull/2553 Differential Revision: D5396823 Pulled By: yiwu-arbug fbshipit-source-id: 5a7d1cf77ee44490f836a6537225955382296878	2017-07-12 17:57:40 -07:00
Siying Dong	18c63af6ef	Make "make analyze" happy Summary: "make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way. Closes https://github.com/facebook/rocksdb/pull/2508 Differential Revision: D5341710 Pulled By: siying fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6	2017-06-28 15:42:27 -07:00
Dmitri Smirnov	a21db161c9	Implement ReopenWritibaleFile on Windows and other fixes Summary: Make default impl return NoSupported so the db_blob tests exist in a meaningful manner. Replace std::thread to port::Thread Closes https://github.com/facebook/rocksdb/pull/2465 Differential Revision: D5275563 Pulled By: yiwu-arbug fbshipit-source-id: cedf1a18a2c05e20d768c1308b3f3224dbd70ab6	2017-06-20 10:31:13 -07:00
Yi Wu	ae8571f5c2	Fix blob db compression bug Summary: `CompressBlock()` will return the uncompressed slice (i.e. `Slice(value_unc)`) if compression ratio is not good enough. This is undesired. We need to always assign the compressed slice to `value`. Closes https://github.com/facebook/rocksdb/pull/2447 Differential Revision: D5244682 Pulled By: yiwu-arbug fbshipit-source-id: 6989dd8852c9622822ba9acec9beea02007dff09	2017-06-14 13:56:42 -07:00
Yi Wu	7a380deff7	Update blob_db_test Summary: I'm trying to improve unit test of blob db. I'm rewriting blob db test. In this patch: * Rewrite tests of basic put/write/delete operations. * Add disable_background_tasks to BlobDBOptionsImpl to allow me not running any background job for basic unit tests. * Move DestroyBlobDB out from BlobDBImpl to be a standalone function. * Remove all garbage collection related tests. Will rewrite them in following patch. * Disabled compression test since it is failing. Will fix in a followup patch. Closes https://github.com/facebook/rocksdb/pull/2446 Differential Revision: D5243306 Pulled By: yiwu-arbug fbshipit-source-id: 157c71ad3b699307cb88baa3830e9b6e74f8e939	2017-06-14 13:12:34 -07:00
Yi Wu	91e2aa3ce2	write exact sequence number for each put in write batch Summary: At the beginning of write batch write, grab the latest sequence from base db and assume sequence number will increment by 1 for each put and delete, and write the exact sequence number with each put. This is assuming we are the only writer to increment sequence number (no external file ingestion, etc) and there should be no holes in the sequence number. Also having some minor naming changes. Closes https://github.com/facebook/rocksdb/pull/2402 Differential Revision: D5176134 Pulled By: yiwu-arbug fbshipit-source-id: cb4712ee44478d5a2e5951213a10b72f08fe8c88	2017-06-13 12:42:36 -07:00
Yi Wu	ad19eb8686	Fixing blob db sequence number handling Summary: Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch. Stacking on #2375. Closes https://github.com/facebook/rocksdb/pull/2385 Differential Revision: D5148358 Pulled By: yiwu-arbug fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33	2017-05-31 10:56:45 -07:00
Yi Wu	345878a7fb	update blob_db_test Summary: Re-enable blob_db_test with some update: * Commented out delay at the end of GC tests. Will update the logic later with sync point to properly trigger GC. * Added some helper functions. Also update make files to include blob_dump tool. Closes https://github.com/facebook/rocksdb/pull/2375 Differential Revision: D5133793 Pulled By: yiwu-arbug fbshipit-source-id: 95470b26d0c1f9592ba4b7637e027fdd263f425c	2017-05-30 22:26:13 -07:00
Anirban Rahut	d85ff4953c	Blob storage pr Summary: The final pull request for Blob Storage. Closes https://github.com/facebook/rocksdb/pull/2269 Differential Revision: D5033189 Pulled By: yiwu-arbug fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06	2017-05-10 15:14:44 -07:00
sdong	8b79422b52	[Proof-Of-Concept] RocksDB Blob Storage with a blob log file. Summary: This is a proof of concept of a RocksDB blob log file. The actual value of the Put() is appended to a blob log using normal data block format, and the handle of the block is written as the value of the key in RocksDB. The prototype only supports Put() and Get(). It doesn't support DB restart, garbage collection, Write() call, iterator, snapshots, etc. Test Plan: Add unit tests. Reviewers: arahut Reviewed By: arahut Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D61485	2016-08-10 17:05:17 -07:00

34 Commits