Commit Graph

865 Commits

Author SHA1 Message Date
Yi Wu
78279350aa Blob DB: Add statistics
Summary:
Adding a list of blob db counters.

Also remove WaStats() which doesn't expose the stats and can be substitute by (BLOB_DB_BYTES_WRITTEN / BLOB_DB_BLOB_FILE_BYTES_WRITTEN).
Closes https://github.com/facebook/rocksdb/pull/3193

Differential Revision: D6394216

Pulled By: yiwu-arbug

fbshipit-source-id: 017508c8ff3fcd7ea7403c64d0f9834b24816803
2017-11-28 11:58:49 -08:00
Maysam Yabandeh
b72b3c6f51 WritePrepared Txn: Add MultiGet to DB
Summary:
This patch implements MultiGet API for WritePreparedTxnDB and update the existing unit tests.
Closes https://github.com/facebook/rocksdb/pull/3196

Differential Revision: D6401493

Pulled By: maysamyabandeh

fbshipit-source-id: 51501a1e32645fc2da8680e77a50035f6530f2cc
2017-11-27 08:56:21 -08:00
Yi Wu
f0dde49cda Blob DB: Fix GC handling for inlined blob
Summary:
Garbage collection checks if the offset in blob index matches the offset of the blob value in the file. If it is a mismatch, the value is the current version. However it failed to check if the blob index is an inlined type, which don't even have an offset. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3194

Differential Revision: D6394270

Pulled By: yiwu-arbug

fbshipit-source-id: 7c2b9d795f1116f55f4d728086980f9b6e88ea78
2017-11-24 11:56:47 -08:00
Sagar Vemuri
578f36e431 Blob DB: Remove some redundant log lines
Summary:
Saw some redundant log lines when trying to benchmark blob db. So, removed the lines from blob_file.cc, and let the lines in blob_db_impl.cc take the lead.
Closes https://github.com/facebook/rocksdb/pull/3189

Differential Revision: D6381726

Pulled By: sagar0

fbshipit-source-id: 5f0b1e56fe4bc3b715d89ea9b5749bd935cd0606
2017-11-20 21:11:34 -08:00
Maysam Yabandeh
53863b76f9 WritePrepared Txn: fix bug with Rollback seq
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157

Differential Revision: D6304291

Pulled By: maysamyabandeh

fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
2017-11-15 08:27:06 -08:00
Yi Wu
42564ada53 Blob DB: not using PinnableSlice move assignment
Summary:
The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky.
Closes https://github.com/facebook/rocksdb/pull/3164

Differential Revision: D6319201

Pulled By: yiwu-arbug

fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48
2017-11-13 18:12:20 -08:00
Maysam Yabandeh
2515266725 WritePrepared Txn: Refactoring TrackKeys
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140

Differential Revision: D6290258

Pulled By: maysamyabandeh

fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
2017-11-11 13:14:20 -08:00
Maysam Yabandeh
2edc92bc28 WritePrepared Txn: cross-compatibility test
Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118

Differential Revision: D6227247

Pulled By: maysamyabandeh

fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
2017-11-11 11:28:37 -08:00
Maysam Yabandeh
857adf388f WritePrepared Txn: Refactor conf params
Summary:
Summary of changes:
- Move seq_per_batch out of Options
- Rename concurrent_prepare to two_write_queues
- Add allocate_seq_only_for_data_
Closes https://github.com/facebook/rocksdb/pull/3136

Differential Revision: D6304458

Pulled By: maysamyabandeh

fbshipit-source-id: 08e685bfa82bbc41b5b1c5eb7040a8ca6e05e58c
2017-11-10 17:28:12 -08:00
Dmitri Smirnov
f8e2db0717 Fix crashes, address test issues and adjust windows test script
Summary:
Add per-exe execution capability
  Add fix parsing of groups/tests
  Add timer test exclusion

 Fix unit tests
  Ifdef threadpool specific tests that do not pass on Vista threadpool.
  Remove spurious outout from prefix_test so test case listing works
  properly.
  Fix not using standard test directories results in file creation errors
  in sst_dump_test.

  BlobDb fixes:
    In C++ end() iterators can not be dereferenced. They are not valid.
	When deleting blob_db_ set it to nullptr before any other code executes.
	Not fixed:. On Windows you can not delete a file while it is open.
	[ RUN      ] BlobDBTest.ReadWhileGC
	d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
	IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
	d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
	IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied

  write_batch
    Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152

Differential Revision: D6293274

Pulled By: sagar0

fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
2017-11-10 10:41:57 -08:00
Yi Wu
5e9e5a4702 Blob DB: Fix race condition between flush and write
Summary:
A race condition will happen when:
* a user thread writes a value, but it hits the write stop condition because there are too many un-flushed memtables, while holding blob_db_impl.write_mutex_.
* Flush is triggered and call flush begin listener and try to acquire blob_db_impl.write_mutex_.

Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3149

Differential Revision: D6279805

Pulled By: yiwu-arbug

fbshipit-source-id: 0e3c58afb78795ebe3360a2c69e05651e3908c40
2017-11-08 19:42:22 -08:00
Yi Wu
ca75f0a64a Blob DB: Fix release build
Summary:
`compression` shadow the method name in `BlobFile`. Rename it.
Closes https://github.com/facebook/rocksdb/pull/3148

Differential Revision: D6274498

Pulled By: yiwu-arbug

fbshipit-source-id: 7d293596530998b23b6b8a8940f983f9b6343a98
2017-11-08 13:14:20 -08:00
Yi Wu
4f9f124347 Blob DB: use compression in file header instead of global options
Summary:
To fix the issue of failing to decompress existing value after reopen DB with a different compression settings.
Closes https://github.com/facebook/rocksdb/pull/3142

Differential Revision: D6267260

Pulled By: yiwu-arbug

fbshipit-source-id: c7cf7f3e33b0cd25520abf4771cdf9180cc02a5f
2017-11-07 17:42:17 -08:00
Manuel Ung
e03377c7fd Add lock wait time as a perf context counter
Summary:
Adds two new counters:

`key_lock_wait_count` counts how many times a lock was blocked by another transaction and had to wait, instead of being granted the lock immediately.
`key_lock_wait_time` counts the time spent acquiring locks.
Closes https://github.com/facebook/rocksdb/pull/3107

Differential Revision: D6217332

Pulled By: lth

fbshipit-source-id: 55d4f46da5550c333e523263422fd61d6a46deb9
2017-11-06 10:57:19 -08:00
Yi Wu
be410dede8 Fix PinnableSlice move assignment
Summary:
After move assignment, we need to re-initialized the moved PinnableSlice.

Also update blob_db_impl.cc to not reuse the moved PinnableSlice since it is supposed to be in an undefined state after move.
Closes https://github.com/facebook/rocksdb/pull/3127

Differential Revision: D6238585

Pulled By: yiwu-arbug

fbshipit-source-id: bd99f2e37406c4f7de160c7dee6a2e8126bc224e
2017-11-03 18:13:21 -07:00
Yi Wu
2581c0a5a1 Blob DB: Fix BlobDBTest::SnapshotAndGarbageCollection asan failure
Summary:
Fix unreleased snapshot at the end of the test.
Closes https://github.com/facebook/rocksdb/pull/3126

Differential Revision: D6232867

Pulled By: yiwu-arbug

fbshipit-source-id: 651ca3144fc573ea2ab0ab20f0a752fb4a101d26
2017-11-03 10:26:59 -07:00
Yi Wu
62578d80c1 Blob DB: Add compaction filter to remove expired blob index entries
Summary:
After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries.
Closes https://github.com/facebook/rocksdb/pull/3090

Differential Revision: D6183812

Pulled By: yiwu-arbug

fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83
2017-11-02 17:27:38 -07:00
Yi Wu
7bfa88037e Blob DB: fix snapshot handling
Summary:
Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is
1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish
2. otherwise, it is the latest sequence of the file
Closes https://github.com/facebook/rocksdb/pull/3087

Differential Revision: D6182519

Pulled By: yiwu-arbug

fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
2017-11-02 15:58:27 -07:00
Yi Wu
f662f8f0b6 Blob DB: option to enable garbage collection
Summary:
Add an option to enable/disable auto garbage collection, where we keep counting how many keys have been evicted by either deletion or compaction and decide whether to garbage collect a blob file.

Default disable auto garbage collection for now since the whole logic is not fully tested and we plan to make major change to it.
Closes https://github.com/facebook/rocksdb/pull/3117

Differential Revision: D6224756

Pulled By: yiwu-arbug

fbshipit-source-id: cdf53bdccec96a4580a2b3a342110ad9e8864dfe
2017-11-02 15:58:27 -07:00
Yi Wu
167ba599ec Blob DB: Fix flaky BlobDBTest::GCExpiredKeyWhileOverwriting test
Summary:
The test intent to wait until key being overwritten until proceed with garbage collection. It failed to wait for `PutUntil` finally finish. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3116

Differential Revision: D6222833

Pulled By: yiwu-arbug

fbshipit-source-id: fa9b57a772b92a66cf250b44e7975c43f62f45c5
2017-11-02 13:27:34 -07:00
Sagar Vemuri
25ac1697b4 Blob DB: Evict oldest blob file when close to blob db size limit
Summary:
Evict oldest blob file and put it in obsolete_files list when close to blob db size limit. The file will be delete when the `DeleteObsoleteFiles` background job runs next time.
For now I set `kEvictOldestFileAtSize` constant, which controls when to evict the oldest file, at 90%. It could be tweaked or made into an option if really needed; I didn't want to expose it as an option pre-maturely as there are already too many :) .
Closes https://github.com/facebook/rocksdb/pull/3094

Differential Revision: D6187340

Pulled By: sagar0

fbshipit-source-id: 687f8262101b9301bf964b94025a2fe9d8573421
2017-11-02 12:11:21 -07:00
Maysam Yabandeh
60d83df23d WritePrepared Txn: Move DB class to its own file
Summary:
Move  WritePreparedTxnDB from pessimistic_transaction_db.h to its own header, write_prepared_txn_db.h
Closes https://github.com/facebook/rocksdb/pull/3114

Differential Revision: D6220987

Pulled By: maysamyabandeh

fbshipit-source-id: 18893fb4fdc6b809fe117dabb544080f9b4a301b
2017-11-02 11:14:30 -07:00
Maysam Yabandeh
02693f64fc WritePrepared Txn: ValidateSnapshot
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101

Differential Revision: D6199405

Pulled By: maysamyabandeh

fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
2017-11-01 19:11:09 -07:00
Maysam Yabandeh
17731a43a6 WritePrepared Txn: Optimize for recoverable state
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071

Differential Revision: D6148023

Pulled By: maysamyabandeh

fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
2017-11-01 17:26:46 -07:00
Maysam Yabandeh
c1cf94c787 WritePrepared Txn: sort indexes before batch collapse
Summary:
The collapse of duplicate keys in write batch needs to sort the indexes of duplicate keys since it only checks the index in the batch with the head of the list of duplicate keys.
Closes https://github.com/facebook/rocksdb/pull/3093

Differential Revision: D6186800

Pulled By: maysamyabandeh

fbshipit-source-id: abc9ae8c2f1840445a5584f925cf86ecc6f37154
2017-11-01 08:56:57 -07:00
Yi Wu
f6082d1944 Blob DB: cleanup unused options
Summary:
* cleanup num_concurrent_simple_blobs. We don't do concurrent writes (by taking write_mutex_) so it doesn't make sense to have multiple non TTL files open. We can revisit later when we want to improve writes.
* cleanup eviction callback. we don't have plan to use it now.
* rename s/open_simple_blob_files_/open_non_ttl_file_/ and s/open_blob_files_/open_ttl_files_/ to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/3088

Differential Revision: D6182598

Pulled By: yiwu-arbug

fbshipit-source-id: 99e6f5e01fa66d31309cdb06ce48502464bac6ad
2017-10-31 16:42:08 -07:00
Sagar Vemuri
f5078dde2d Blob DB: Initialize all fields in Blob Header, Footer and Record structs
Summary:
Fixing un-itializations caught by valgrind.
Closes https://github.com/facebook/rocksdb/pull/3103

Differential Revision: D6200195

Pulled By: sagar0

fbshipit-source-id: bf35a3fb03eb1d308e4c5ce30dee1e345d7b03b3
2017-10-31 16:42:08 -07:00
Yi Wu
3ebb7ba7b9 Blob DB: update blob file format
Summary:
Changing blob file format and some code cleanup around the change. The change with blob log format are:
* Remove timestamp field in blob file header, blob file footer and blob records. The field is not being use and often confuse with expiration field.
* Blob file header now come with column family id, which always equal to default column family id. It leaves room for future support of column family.
* Compression field in blob file header now is a standalone byte (instead of compact encode with flags field)
* Blob file footer now come with its own crc.
* Key length now being uint64_t instead of uint32_t
* Blob CRC now checksum both key and value (instead of value only).
* Some reordering of the fields.

The list of cleanups:
* Better inline comments in blob_log_format.h
* rename ttlrange_t and snrange_t to ExpirationRange and SequenceRange respectively.
* simplify blob_db::Reader
* Move crc checking logic to inside blob_log_format.cc
Closes https://github.com/facebook/rocksdb/pull/3081

Differential Revision: D6171304

Pulled By: yiwu-arbug

fbshipit-source-id: e4373e0d39264441b7e2fbd0caba93ddd99ea2af
2017-10-27 13:27:12 -07:00
Yi Wu
5a2a6483dc Blob DB: Inline small values in base DB
Summary:
Adding the `min_blob_size` option to allow storing small values in base db (in LSM tree) together with the key. The goal is to improve performance for small values, while taking advantage of blob db's low write amplification for large values.

Also adding expiration timestamp to blob index. It will be useful to evict stale blob indexes in base db by adding a compaction filter. I'll work on the compaction filter in future patches.

See blob_index.h for the new blob index format. There are 4 cases when writing a new key:
* small value w/o TTL: put in base db as normal value (i.e. ValueType::kTypeValue)
* small value w/ TTL: put (type, expiration, value) to base db.
* large value w/o TTL: write value to blob log and put (type, file, offset, size, compression) to base db.
* large value w/TTL: write value to blob log and put (type, expiration, file, offset, size, compression) to base db.
Closes https://github.com/facebook/rocksdb/pull/3066

Differential Revision: D6142115

Pulled By: yiwu-arbug

fbshipit-source-id: 9526e76e19f0839310a3f5f2a43772a4ad182cd0
2017-10-26 12:30:54 -07:00
Sagar Vemuri
96e3a600ba Return write error on reaching blob dir size limit
Summary:
I found that we continue accepting writes even when the blob db goes beyond the configured blob directory size limit. Now, we return an error for writes on reaching `blob_dir_size` limit and if `is_fifo` is set to false. (We cannot just drop any file when `is_fifo` is true.)

Deleting the oldest file when `is_fifo` is true will be handled in a later PR.
Closes https://github.com/facebook/rocksdb/pull/3060

Differential Revision: D6136156

Pulled By: sagar0

fbshipit-source-id: 2f11cb3f2eedfa94524fbfa2613dd64bfad7a23c
2017-10-25 16:30:37 -07:00
zach shipko
386a57e6ef Fix build on OpenBSD
Summary:
A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed.
Closes https://github.com/facebook/rocksdb/pull/3061

Differential Revision: D6138800

Pulled By: ajkr

fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c
2017-10-24 13:27:38 -07:00
Yi Wu
66a2c44ef4 Add DB::Properties::kEstimateOldestKeyTime
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
2017-10-23 15:27:27 -07:00
Dmitri Smirnov
d2a65c59e1 Fix unused var warnings in Release mode
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048

Differential Revision: D6126272

Pulled By: maysamyabandeh

fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
2017-10-23 14:27:04 -07:00
Maysam Yabandeh
63822eb761 Enable two write queues for transactions
Summary:
Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config.
Closes https://github.com/facebook/rocksdb/pull/3046

Differential Revision: D6106534

Pulled By: maysamyabandeh

fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c
2017-10-23 14:27:04 -07:00
Dmitri Smirnov
ebab2e2d42 Enable MSVC W4 with a few exceptions. Fix warnings and bugs
Summary: Closes https://github.com/facebook/rocksdb/pull/3018

Differential Revision: D6079011

Pulled By: yiwu-arbug

fbshipit-source-id: 988a721e7e7617967859dba71d660fc69f4dff57
2017-10-19 10:57:12 -07:00
Maysam Yabandeh
7e38238981 WritePrepared Txn: Disable GC during recovery
Summary:
Disables GC during recovery of a WritePrepared txn db to avoid GCing uncommitted key values.
Closes https://github.com/facebook/rocksdb/pull/2980

Differential Revision: D6000191

Pulled By: maysamyabandeh

fbshipit-source-id: fc4d522c643d24ebf043f811fe4ecd0dd0294675
2017-10-18 09:11:50 -07:00
Yi Wu
eaaef91178 Blob DB: Store blob index as kTypeBlobIndex in base db
Summary:
Blob db insert blob index to base db as kTypeBlobIndex type, to tell apart values written by plain rocksdb or blob db. This is to make it possible to migrate from existing rocksdb to blob db.

Also with the patch blob db garbage collection get away from OptimisticTransaction. Instead it use a custom write callback to achieve similar behavior as OptimisticTransaction. This is because we need to pass the is_blob_index flag to DBImpl::Get but OptimisticTransaction don't support it.
Closes https://github.com/facebook/rocksdb/pull/3000

Differential Revision: D6050044

Pulled By: yiwu-arbug

fbshipit-source-id: 61dc72ab9977625e75f78cd968e7d8a3976e3632
2017-10-17 17:28:11 -07:00
Yi Wu
0552029b5c Blob DB: not writing sequence number as blob record footer
Summary:
Previously each time we write a blob we write blog_record_header + key + value + blob_record_footer to blob log. The footer only contains a sequence and a crc for the sequence number. The sequence number was used in garbage collection to verify the value is recent. After #2703 we moved to use optimistic transaction and no longer use sequence number from the footer. Remove the footer altogether.

There's another usage of sequence number and we are keeping it: Each blob log file keep track of sequence number range of keys in it, and use it to check if it is reference by a snapshot, before being deleted.
Closes https://github.com/facebook/rocksdb/pull/3005

Differential Revision: D6057585

Pulled By: yiwu-arbug

fbshipit-source-id: d6da53c457a316e9723f359a1b47facfc3ffe090
2017-10-17 12:13:08 -07:00
Yi Wu
10ba50e9eb Blob DB: Move BlobFile definition to a separate file
Summary:
simply move BlobFile definition from blob_db_impl.h to blob_file.h.
Closes https://github.com/facebook/rocksdb/pull/3002

Differential Revision: D6050143

Pulled By: yiwu-arbug

fbshipit-source-id: a8fb6e094fe39bdeace6279569834bc65aa64a34
2017-10-13 14:42:26 -07:00
Zhongyi Xie
e2548366e1 add GetLiveFiles and GetLiveFilesMetaData for BlobDB
Summary: Closes https://github.com/facebook/rocksdb/pull/2976

Differential Revision: D5994759

Pulled By: miasantreble

fbshipit-source-id: 985c31dccb957cb970c302f813cd07a1e8cb6438
2017-10-09 19:56:04 -07:00
Yi Wu
8c392a31d7 WritePrepared Txn: Iterator
Summary:
On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed.
Closes https://github.com/facebook/rocksdb/pull/2981

Differential Revision: D6001471

Pulled By: yiwu-arbug

fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe
2017-10-09 17:15:28 -07:00
Yi Wu
17c6325e8a WritePrepare Txn: Cancel flush/compaction before destruction
Summary:
On WritePreparedTxnDB destruct there could be running compaction/flush holding a SnapshotChecker, which holds a pointer back to WritePreparedTxnDB. Make sure those jobs finished before destructing WritePreparedTxnDB.

This is caught by TransactionTest::SeqAdvanceTest.
Closes https://github.com/facebook/rocksdb/pull/2982

Differential Revision: D6002957

Pulled By: yiwu-arbug

fbshipit-source-id: f1e70390c9798d1bd7959f5c8e2a1c14100773c3
2017-10-06 20:55:53 -07:00
Maysam Yabandeh
ec6c5383d0 WritePrepared Txn: end-to-end tests
Summary:
Enable WritePrepared policy for existing transaction tests.
Closes https://github.com/facebook/rocksdb/pull/2972

Differential Revision: D5993614

Pulled By: maysamyabandeh

fbshipit-source-id: d1eb53e2920c4e2a56434bb001231c98426f3509
2017-10-06 14:26:45 -07:00
Sagar Vemuri
da29eba43b Enable WAL for blob index
Summary:
Enabled WAL, during GC, for blob index which is stored on regular RocksDB.
Closes https://github.com/facebook/rocksdb/pull/2975

Differential Revision: D5997384

Pulled By: sagar0

fbshipit-source-id: b76c1487d8b5be0e36c55e8d77ffe3d37d63d85b
2017-10-06 10:59:31 -07:00
Yi Wu
d1b74b0c82 WritePrepared Txn: Compaction/Flush
Summary:
Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator:
* check if key has been committed. If not, output uncommitted keys AS-IS.
* use SnapshotChecker to check if key is visible to a snapshot when in need.
* do not output key with seq = 0 if the key is not committed.
Closes https://github.com/facebook/rocksdb/pull/2926

Differential Revision: D5902907

Pulled By: yiwu-arbug

fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
2017-10-06 10:41:53 -07:00
Yi Wu
cc20ec3689 WritePrepared Txn: Test sequence number 0 is visible
Summary:
Compaction will output keys with sequence number 0, if it is visible to
earliest snapshot. Adding a test to make sure IsInSnapshot() report sequence number 0 is
visible to any snapshot.
Closes https://github.com/facebook/rocksdb/pull/2974

Differential Revision: D5990665

Pulled By: yiwu-arbug

fbshipit-source-id: ef50ebc777ff8ca688771f3ab598c7a609b0b65e
2017-10-05 16:26:44 -07:00
Maysam Yabandeh
4e3c3d8c6a WritePrepared Txn: duplicate keys
Summary:
With WriteCommitted, when the write batch has duplicate keys, the txn db simply inserts them to the db with different seq numbers and let the db ignore/merge the duplicate values at the read time. With WritePrepared all the entries of the batch are inserted with the same seq number which prevents us from benefiting from this simple solution.

This patch applies a hackish solution to unblock the end-to-end testing. The hack is to be replaced with a proper solution soon. The patch simply detects the duplicate key insertions, and mark the previous one as obsolete. Then before writing to the db it rewrites the batch eliminating the obsolete keys. This would incur a memcpy cost. Furthermore handing duplicate merge would require to do FullMerge instead of simply ignoring the previous value, which is not handled by this patch.
Closes https://github.com/facebook/rocksdb/pull/2969

Differential Revision: D5976337

Pulled By: maysamyabandeh

fbshipit-source-id: 114e65b66f137d8454ff2d1d782b8c05da95f989
2017-10-05 07:41:02 -07:00
Maysam Yabandeh
283d60761e fix valgrind leak report in unit test
Summary:
I cannot locally reproduce the valgrind leak report but based on my code inspection not deleting txn1 might be the reason.
```
==197848== 2,990 (544 direct, 2,446 indirect) bytes in 1 blocks are definitely lost in loss record 15 of 16
==197848==    at 0x4C2D06F: operator new(unsigned long) (in /usr/local/fbcode/gcc-5-glibc-2.23/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==197848==    by 0x7D5B31: rocksdb::WritePreparedTxnDB::BeginTransaction(rocksdb::WriteOptions const&, rocksdb::TransactionOptions const&, rocksdb::Transaction*) (pessimistic_transaction_db.cc:173)
==197848==    by 0x7D80C1: rocksdb::PessimisticTransactionDB::Initialize(std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> > const&) (pessimistic_transaction_db.cc:115)
==197848==    by 0x7DC42F: rocksdb::WritePreparedTxnDB::Initialize(std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> > const&) (pessimistic_transaction_db.cc:151)
==197848==    by 0x7D8CA0: rocksdb::TransactionDB::WrapDB(rocksdb::DB*, rocksdb::TransactionDBOptions const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> > const&, rocksdb::TransactionDB**) (pessimistic_transaction_db.cc:275)
==197848==    by 0x7D9F26: rocksdb::TransactionDB::Open(rocksdb::DBOptions const&, rocksdb::TransactionDBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::TransactionDB**) (pessimistic_transaction_db.cc:227)
==197848==    by 0x7DB349: rocksdb::TransactionDB::Open(rocksdb::Options const&, rocksdb::TransactionDBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::TransactionDB**) (pessimistic_transaction_db.cc:198)
==197848==    by 0x52ABD2: rocksdb::TransactionTest::ReOpenNoDelete() (transaction_test.h:87)
==197848==    by 0x51F7B8: rocksdb::WritePreparedTransactionTest_BasicRecoveryTest_Test::TestBody() (write_prepared_transaction_test.cc:843)
==197848==    by 0x857557: HandleSehExceptionsInMethodIfSupported<testing::Test, void> (gtest-all.cc:3824)
==197848==    by 0x857557: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3860)
==197848==    by 0x84E7EB: testing::Test::Run() [clone .part.485] (gtest-all.cc:3897)
==197848==    by 0x84E9BC: Run (gtest-all.cc:3888)
==197848==    by 0x84E9BC: testing::TestInfo::Run() [clone .part.486] (gtest-all.cc:4072)
```
Closes https://github.com/facebook/rocksdb/pull/2963

Differential Revision: D5968856

Pulled By: maysamyabandeh

fbshipit-source-id: 2ac512bbcad37dc8eeeffe4f363978913354180c
2017-10-03 14:58:07 -07:00
Maysam Yabandeh
d27258d3a6 WritePrepared Txn: Rollback
Summary:
Implement the rollback of WritePrepared txns. For each modified value, it reads the value before the txn and write it back. This would cancel out the effect of transaction. It also remove the rolled back txn from prepared heap.
Closes https://github.com/facebook/rocksdb/pull/2946

Differential Revision: D5937575

Pulled By: maysamyabandeh

fbshipit-source-id: a6d3c47f44db3729f44b287a80f97d08dc4e888d
2017-10-02 19:59:27 -07:00
Sagar Vemuri
bb38cd03a9 Limit number of merge operands in Cassandra merge operator
Summary:
Now that RocksDB supports conditional merging during point lookups (introduced in #2923), Cassandra value merge operator can be updated to pass in a limit. The limit needs to be passed in from the Cassandra code.
Closes https://github.com/facebook/rocksdb/pull/2947

Differential Revision: D5938454

Pulled By: sagar0

fbshipit-source-id: d64a72d53170d8cf202b53bd648475c3952f7d7f
2017-10-02 16:11:40 -07:00
Andrew Kryczka
5df172da2f fix deletion-triggered compaction in table builder
Summary:
It was broken when `NotifyCollectTableCollectorsOnFinish` was introduced. That function called `Finish` on each of the `TablePropertiesCollector`s, and `CompactOnDeletionCollector::Finish()` was resetting all its internal state. Then, when we checked whether compaction is necessary, the flag had already been cleared.

Fixed above issue by avoiding resetting internal state during `Finish()`. Multiple calls to `Finish()` are allowed, but callers cannot invoke `AddUserKey()` on the collector after any finishes.
Closes https://github.com/facebook/rocksdb/pull/2936

Differential Revision: D5918659

Pulled By: ajkr

fbshipit-source-id: 4f05e9d80e50ee762ba1e611d8d22620029dca6b
2017-09-28 18:17:30 -07:00
Maysam Yabandeh
385049baf2 WritePrepared Txn: Recovery
Summary:
Recover txns from the WAL. Also added some unit tests.
Closes https://github.com/facebook/rocksdb/pull/2901

Differential Revision: D5859596

Pulled By: maysamyabandeh

fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0
2017-09-28 16:56:45 -07:00
Quinn Jarrell
6a541afcc4 Make bytes_per_sync and wal_bytes_per_sync mutable
Summary:
SUMMARY
Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed.
TEST PLAN
ran make check
all passed

Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value.
Closes https://github.com/facebook/rocksdb/pull/2893

Reviewed By: yiwu-arbug

Differential Revision: D5845814

Pulled By: TheRushingWookie

fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
2017-09-27 17:49:45 -07:00
Yi Wu
ec48e5c77f Add TransactionDB::SingleDelete()
Summary:
Looks like the API is simply missing. Adding it.
Closes https://github.com/facebook/rocksdb/pull/2937

Differential Revision: D5919955

Pulled By: yiwu-arbug

fbshipit-source-id: 6e2e9c96c29882b0bb4113d1f8efb72bffc57878
2017-09-27 10:27:26 -07:00
Yi Wu
be97dbb15c Fix WritePreparedTransactionTest::SeqAdvanceTest ASAN failure
Summary: Closes https://github.com/facebook/rocksdb/pull/2922

Differential Revision: D5895310

Pulled By: yiwu-arbug

fbshipit-source-id: 52c635a25d22478ec1eca49b6817551202babac2
2017-09-22 15:26:42 -07:00
Yi Wu
1480e6f7cf Fix TransactionTest::SeqAdvanceTest ASAN failure
Summary:
The test didn't delete txn before creating a new one.
Closes https://github.com/facebook/rocksdb/pull/2913

Differential Revision: D5880236

Pulled By: yiwu-arbug

fbshipit-source-id: 7a4fcaada3d86332292754502cd8f4341143bf4f
2017-09-21 09:56:54 -07:00
Pengchao Wang
e4234fbdcf collecting kValue type tombstone
Summary:
In our testing cluster, we found large amount tombstone has been promoted to kValue type from kMerge after reaching the top level of compaction. Since we used to only collecting tombstone in merge operator, those tombstones can never be collected.

This PR addresses the issue by adding a GC step in compaction filter, which is only for kValue type records. Since those record already reached the top of compaction (no earlier data exists) we can safely remove them in compaction filter without worrying old data appears.

This PR also removes an old optimization in cassandra merge operator for single merge operands.  We need to do GC even on a single operand, so the optimation does not make sense anymore.
Closes https://github.com/facebook/rocksdb/pull/2855

Reviewed By: sagar0

Differential Revision: D5806445

Pulled By: wpc

fbshipit-source-id: 6eb25629d4ce917eb5e8b489f64a6aa78c7d270b
2017-09-18 16:27:12 -07:00
Maysam Yabandeh
60beefd6e0 WritePrepared Txn: Advance seq one per batch
Summary:
By default the seq number in DB is increased once per written key. WritePrepared txns requires the seq to be increased once per the entire batch so that the seq would be used as the prepare timestamp by which the transaction is identified. Also we need to increase seq for the commit marker since it would give a unique id to the commit timestamp of transactions.

Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch.
Closes https://github.com/facebook/rocksdb/pull/2885

Differential Revision: D5837843

Pulled By: maysamyabandeh

fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c
2017-09-18 14:45:08 -07:00
Yi Wu
9a970c81af Fix WriteBatchWithIndex::GetFromBatchAndDB not allowing StackableDB
Summary: Closes https://github.com/facebook/rocksdb/pull/2881

Differential Revision: D5829682

Pulled By: yiwu-arbug

fbshipit-source-id: abb8fa14b58cea7c416282f9be19e8b1a7961c6e
2017-09-13 17:26:35 -07:00
Maysam Yabandeh
09713a64b3 WritePrepared Txn: Lock-free CommitMap
Summary:
We had two proposals for lock-free commit maps. This patch implements the latter one that was simpler. We can later experiment with both proposals.

In this impl each entry is an std::atomic of uint64_t, which are accessed via memory_order_acquire/release. In x86_64 arch this is compiled to simple reads and writes from memory.
Closes https://github.com/facebook/rocksdb/pull/2861

Differential Revision: D5800724

Pulled By: maysamyabandeh

fbshipit-source-id: 41abae9a4a5df050a8eb696c43de11c2770afdda
2017-09-13 12:12:11 -07:00
Amy Xu
5785b1fcb8 Fix naming in InternalKey
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868

Differential Revision: D5804152

Pulled By: axxufb

fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
2017-09-12 17:17:42 -07:00
Andrew Kryczka
f5148ade10 support opening zero backups during engine init
Summary:
There are internal users who open BackupEngine for writing new backups only, and they don't care whether old backups can be read or not. The condition `BackupableDBOptions::max_valid_backups_to_open == 0` should be supported (previously in df74b775e6 I made the mistake of choosing 0 as a special value to disable the limit).
Closes https://github.com/facebook/rocksdb/pull/2819

Differential Revision: D5751599

Pulled By: ajkr

fbshipit-source-id: e73ac19eb5d756d6b68601eae8e43407ee4f2752
2017-09-12 13:26:34 -07:00
Siying Dong
64b6452e0c Make InternalKeyComparator final and directly use it in merging iterator
Summary:
Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress.

I have to disable the final key word in debug build, as a hack test class depends on overriding the class.
Closes https://github.com/facebook/rocksdb/pull/2860

Differential Revision: D5800461

Pulled By: siying

fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53
2017-09-11 12:04:21 -07:00
Maysam Yabandeh
f46464d383 write-prepared txn: call IsInSnapshot
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850

Differential Revision: D5787375

Pulled By: maysamyabandeh

fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
2017-09-11 09:14:48 -07:00
Maysam Yabandeh
9a4df72994 WritePrepared Txn: CommitBatch
Summary:
Implements CommitBatch and CommitWithoutPrepare for WritePreparedTxn
Closes https://github.com/facebook/rocksdb/pull/2854

Differential Revision: D5793999

Pulled By: maysamyabandeh

fbshipit-source-id: d8b9858221162c6ac7a1f6912cbd3481d0d8a503
2017-09-08 15:56:39 -07:00
Maysam Yabandeh
fce6c892ab Advance max evicted seq in coarser granularity
Summary:
This patch advances the max_evicted_seq_ is larger granularities to reduce the overhead of updating the relevant data structures.

It also refactor the related code and adds testing to that. As part of this patch some of the TODOs for removing usage of non-static const members are also addressed.
Closes https://github.com/facebook/rocksdb/pull/2844

Differential Revision: D5772928

Pulled By: maysamyabandeh

fbshipit-source-id: f4fcc2948be69c034f10812cf922ce5ab82ef98c
2017-09-08 14:41:22 -07:00
Yi Wu
dcd36a6aee Make it explicit blob db doesn't support CF
Summary:
Blob db doesn't currently support column families. Return NotSupported status explicitly.
Closes https://github.com/facebook/rocksdb/pull/2825

Differential Revision: D5757438

Pulled By: yiwu-arbug

fbshipit-source-id: 44de9408fd032c98e8ae337d4db4ed37169bd9fa
2017-09-08 11:11:04 -07:00
Siying Dong
0e99323ac2 Fix CLANG Analyze
Summary:
clang analyze shows warnings after we upgrade the CLANG version. Fix them.
Closes https://github.com/facebook/rocksdb/pull/2839

Differential Revision: D5769060

Pulled By: siying

fbshipit-source-id: 3f8e4df715590d8984f6564b608fa08cfdfa5f14
2017-09-07 14:28:06 -07:00
Maysam Yabandeh
7e19a571e9 Remove unused TransactionCallback
Summary:
TransactionCallback was never used. Remove it to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/2853

Differential Revision: D5787219

Pulled By: maysamyabandeh

fbshipit-source-id: e2b6a89537e3770a269ad38be71c4b0b160a88ac
2017-09-07 12:17:18 -07:00
Maysam Yabandeh
79810e2d49 Skip write_prepared_transaction_test in travis
Summary:
The patch skips write_prepared_transaction_test from travis as they time out there. They are still covered in daily runs of tests.
Closes https://github.com/facebook/rocksdb/pull/2836

Differential Revision: D5767203

Pulled By: maysamyabandeh

fbshipit-source-id: 51045ef98a745197136e14b2ec02fc6f38081b75
2017-09-05 15:29:52 -07:00
Yi Wu
ab95e293d2 Fix memory leak on blob db open
Summary:
Fixes #2820
Closes https://github.com/facebook/rocksdb/pull/2826

Differential Revision: D5757527

Pulled By: yiwu-arbug

fbshipit-source-id: f495b63700495aeaade30a1da5e3675848f3d72f
2017-09-01 14:13:51 -07:00
Maysam Yabandeh
37ae8cc60f Signal progress of the test to avoid timeout
Summary: Closes https://github.com/facebook/rocksdb/pull/2824

Differential Revision: D5756457

Pulled By: maysamyabandeh

fbshipit-source-id: dff53e945d8ac4ffe6775a2176424fd1a27fc189
2017-09-01 11:28:52 -07:00
Dmitri Smirnov
0ec90a7cc2 Add -DPORTABLE=1 to MSVC CI build
Summary:
Add -DPORTABLE=1
  port::cacheline_aligned_alloc() has arguments swapped which prevents every single test from running.
Closes https://github.com/facebook/rocksdb/pull/2815

Differential Revision: D5751661

Pulled By: siying

fbshipit-source-id: e0857d6e138ec46035b3c23d7c3c751901a0a4a0
2017-08-31 16:42:48 -07:00
Andrew Kryczka
b97685aef6 fix backup engine when latest backup corrupt
Summary:
Backup engine is intentionally openable even when some backups are corrupt. Previously the engine could write new backups as long as the most recent backup wasn't corrupt. This PR makes the backup engine able to create new backups even when the most recent one is corrupt.

We now maintain two ID instance variables:

- `latest_backup_id_` is used when creating backup to choose the new ID
- `latest_valid_backup_id_` is used when restoring latest backup since we want most recent valid one
Closes https://github.com/facebook/rocksdb/pull/2804

Differential Revision: D5734148

Pulled By: ajkr

fbshipit-source-id: db440707b31df2c7b084188aa5f6368449e10bcf
2017-08-31 15:41:49 -07:00
Pengchao Wang
825a22c00c garbage collect tombstones in merge operator
Summary:
Remove cassandra tombstone when reaching the max compaction level (full merge). if all columns collected key will be removed in next compaction via compaction filter
Closes https://github.com/facebook/rocksdb/pull/2791

Reviewed By: sagar0

Differential Revision: D5722465

Pulled By: wpc

fbshipit-source-id: 61e9898a5686551653a16383255aeaab3197e65e
2017-08-31 10:11:54 -07:00
Maysam Yabandeh
26ac24f199 Add more unit test to write_prepared txns
Summary: Closes https://github.com/facebook/rocksdb/pull/2798

Differential Revision: D5724173

Pulled By: maysamyabandeh

fbshipit-source-id: fb6b782d933fb4be315b1a231a6a67a66fdc9c96
2017-08-31 09:41:27 -07:00
Maysam Yabandeh
fbfa3e7a43 WriteAtPrepare: Efficient read from snapshot list
Summary:
Divide the old snapshots to two lists: a few that fit into a cached array and the rest in a vector, which is expected to be empty in normal cases. The former is to optimize concurrent reads from snapshots without requiring locks. It is done by an array of std::atomic, from which std::memory_order_acquire reads are compiled to simple read instructions in most of the x86_64 architectures.
Closes https://github.com/facebook/rocksdb/pull/2758

Differential Revision: D5660504

Pulled By: maysamyabandeh

fbshipit-source-id: 524fcf9a8e7f90a92324536456912a99aaa6740c
2017-08-26 01:00:38 -07:00
Yi Wu
503db684f7 make blob file close synchronous
Summary:
Fixing flaky blob_db_test.

To close a blob file, blob db used to add a CloseSeqWrite job to the background thread to close it. Changing file close to be synchronous in order to simplify logic, and fix flaky blob_db_test.
Closes https://github.com/facebook/rocksdb/pull/2787

Differential Revision: D5699387

Pulled By: yiwu-arbug

fbshipit-source-id: dd07a945cd435cd3808fce7ee4ea57817409474a
2017-08-25 10:41:49 -07:00
Maysam Yabandeh
cd26af3476 Add unit test for WritePrepared skeleton
Summary: Closes https://github.com/facebook/rocksdb/pull/2756

Differential Revision: D5660516

Pulled By: maysamyabandeh

fbshipit-source-id: f3f3d3b5f544007a7fbdd78e49e4738b4437c7ee
2017-08-23 13:56:03 -07:00
Andrew Kryczka
234f33a3f9 allow nullptr Slice only as sentinel
Summary:
Allow `Slice` holding nullptr as a sentinel value but not in comparisons. This new restriction eliminates the need for the manual checks in 39ef900551, while still conforming to glibc's `memcmp` API. Thanks siying for the idea. Users may need to migrate, so mentioned it in HISTORY.md.
Closes https://github.com/facebook/rocksdb/pull/2777

Differential Revision: D5686016

Pulled By: ajkr

fbshipit-source-id: 03a2ca3fd9a0ebade9d0d5686c81d59a9534f563
2017-08-23 10:56:06 -07:00
Maysam Yabandeh
ccf7f833e3 Use PinnableSlice in Transactions
Summary:
The ::Get from DB is not augmented with an overload method that takes a PinnableSlice instead of a string. Transactions however are not yet upgraded to use the new API. As a result, transaction users such as MyRocks cannot benefit from it. This patch updates the transactional API with a PinnableSlice overload.
Closes https://github.com/facebook/rocksdb/pull/2736

Differential Revision: D5645770

Pulled By: maysamyabandeh

fbshipit-source-id: f6af520df902f842de1bcf99bed3e8dfc43ad96d
2017-08-23 10:11:45 -07:00
yiwu-arbug
5b68b114f1 Blob db create a snapshot before every read
Summary:
If GC kicks in between

* A Get() reads index entry from base db.
* The Get() read from a blob file

The GC can delete the corresponding blob file, making the key not found. Fortunately we have existing logic to avoid deleting a blob file if it is referenced by a snapshot. So the fix is to explicitly create a snapshot before reading index entry from base db.
Closes https://github.com/facebook/rocksdb/pull/2754

Differential Revision: D5655956

Pulled By: yiwu-arbug

fbshipit-source-id: e4ccbc51331362542e7343175bbcbdea5830f544
2017-08-20 18:26:19 -07:00
yiwu-arbug
4624ae52c9 GC the oldest file when out of space
Summary:
When out of space, blob db should GC the oldest file. The current implementation GC the newest one instead. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/2757

Differential Revision: D5657611

Pulled By: yiwu-arbug

fbshipit-source-id: 56c30a4c52e6ab04551dda8c5c46006d4070b28d
2017-08-20 17:11:06 -07:00
Archit Mishra
bddd5d3630 Added mechanism to track deadlock chain
Summary:
Changes:
* extended the wait_txn_map to track additional information
* designed circular buffer to store n latest deadlocks' information
* added test coverage to verify the additional information tracked is accurately stored in the buffer
Closes https://github.com/facebook/rocksdb/pull/2630

Differential Revision: D5478025

Pulled By: armishra

fbshipit-source-id: 2b138de7b5a73f5ca554fc3ff8220a3be49f39e7
2017-08-17 18:56:21 -07:00
yiwu-arbug
29877ec7b4 Fix blob db crash during calculating write amp
Summary:
On initial call to BlobDBImpl::WaStats() `all_periods_write_` would be empty, so it will crash when we call pop_front() at line 1627. Apparently it is mean to pop only when `all_periods_write_.size() > kWriteAmplificationStatsPeriods`.

The whole write amp calculation doesn't seems to be correct and it is not being exposed. Will work on it later.

Test Plan
Change kWriteAmplificationStatsPeriodMillisecs to 1000 (1 second) and run db_bench --use_blob_db for 5 minutes.
Closes https://github.com/facebook/rocksdb/pull/2751

Differential Revision: D5648269

Pulled By: yiwu-arbug

fbshipit-source-id: b843d9a09bb5f9e1b713d101ec7b87e54b5115a4
2017-08-17 15:01:09 -07:00
Sagar Vemuri
8f2598ac9d Enable Cassandra merge operator to be called with a single merge operand
Summary:
Updating Cassandra merge operator to make use of a single merge operand when needed. Single merge operand support has been introduced in #2721.
Closes https://github.com/facebook/rocksdb/pull/2753

Differential Revision: D5652867

Pulled By: sagar0

fbshipit-source-id: b9fbd3196d3ebd0b752626dbf9bec9aa53e3e26a
2017-08-17 15:01:09 -07:00
follitude
ac8fb77afd fix some misspellings
Summary:
PTAL ajkr
Closes https://github.com/facebook/rocksdb/pull/2750

Differential Revision: D5648052

Pulled By: ajkr

fbshipit-source-id: 7cd1ddd61364d5a55a10fdd293fa74b2bf89dd98
2017-08-16 21:57:20 -07:00
Maysam Yabandeh
eb6425303e Update WritePrepared with the pseudo code
Summary:
Implement the main body of WritePrepared pseudo code. This includes PrepareInternal and CommitInternal, as well as AddCommitted which updates the commit map. It also provides a IsInSnapshot method that could be later called form the read path to decide if a version is in the read snapshot or it should other be skipped.

This patch lacks unit tests and does not attempt to offer an efficient implementation. The idea is that to have the API specified so that we can work on related tasks in parallel.
Closes https://github.com/facebook/rocksdb/pull/2713

Differential Revision: D5640021

Pulled By: maysamyabandeh

fbshipit-source-id: bfa7a05e8d8498811fab714ce4b9c21530514e1c
2017-08-16 16:57:47 -07:00
Sagar Vemuri
132306fbf0 Remove PartialMerge implementation from Cassandra merge operator
Summary:
`PartialMergeMulti` implementation is enough for Cassandra, and `PartialMerge` is not required. Implementing both will just duplicate the code.
As per https://github.com/facebook/rocksdb/blob/master/include/rocksdb/merge_operator.h#L130-L135 :

```
  // The default implementation of PartialMergeMulti will use this function
  // as a helper, for backward compatibility.  Any successor class of
  // MergeOperator should either implement PartialMerge or PartialMergeMulti,
  // although implementing PartialMergeMulti is suggested as it is in general
  // more effective to merge multiple operands at a time instead of two
  // operands at a time.
```
Closes https://github.com/facebook/rocksdb/pull/2737

Reviewed By: scv119

Differential Revision: D5633073

Pulled By: sagar0

fbshipit-source-id: ef4fa102c22fec6a0175ed12f5c44c15afe3c8ca
2017-08-15 14:59:34 -07:00
yiwu-arbug
e5a1b727c0 Fix blob DB transaction usage while GC
Summary:
While GC, blob DB use optimistic transaction to delete or replace the index entry in LSM, to guarantee correctness if there's a normal write writing to the same key. However, the previous implementation doesn't call SetSnapshot() nor use GetForUpdate() of transaction API, instead it do its own sequence number checking before beginning the transaction. A normal write can sneak in after the sequence number check and overwrite the key, and the GC will delete or relocate the old version of the key by mistake. Update the code to property use GetForUpdate() to check the existing index entry.

After the patch the sequence number store with each blob record is useless, So I'm considering remove the sequence number from blob record, in another patch.
Closes https://github.com/facebook/rocksdb/pull/2703

Differential Revision: D5589178

Pulled By: yiwu-arbug

fbshipit-source-id: 8dc960cd5f4e61b36024ba7c32d05584ce149c24
2017-08-11 12:43:17 -07:00
Daniel Black
64f8484356 block_cache_tier: fix gcc-7 warnings
Summary:
Error was:

utilities/persistent_cache/block_cache_tier.cc: In instantiation of ‘void rocksdb::Add(std::map<std::__cxx11::basic_string<char>, double>*, const string&, const T&) [with T = double; std::__cxx11::string = std::__cxx11::basic_string<char>]’:
utilities/persistent_cache/block_cache_tier.cc:147:40:   required from here
utilities/persistent_cache/block_cache_tier.cc:141:23: error: type qualifiers ignored on cast result type [-Werror=ignored-qualifiers]

   stats->insert({key, static_cast<const double>(t)});

fixing like #2562
Closes https://github.com/facebook/rocksdb/pull/2603

Differential Revision: D5600910

Pulled By: yiwu-arbug

fbshipit-source-id: 891a5ec7e451d2dec6ad1b6b7fac545657f87363
2017-08-10 11:58:53 -07:00
Maysam Yabandeh
bdc056f8aa Refactor PessimisticTransaction
Summary:
This patch splits Commit and Prepare into lock-related logic and db-write-related logic. It moves lock-related logic to PessimisticTransaction to be reused by all children classes and movies the existing impl of db-write-related to PrepareInternal, CommitSingleInternal, and CommitInternal in WriteCommittedTxnImpl.
Closes https://github.com/facebook/rocksdb/pull/2691

Differential Revision: D5569464

Pulled By: maysamyabandeh

fbshipit-source-id: d1b8698e69801a4126c7bc211745d05c636f5325
2017-08-07 16:12:29 -07:00
Maysam Yabandeh
c9804e007a Refactor TransactionDBImpl
Summary:
This opens space for the new implementations of TransactionDBImpl such as WritePreparedTxnDBImpl that has a different policy of how to write to DB.
Closes https://github.com/facebook/rocksdb/pull/2689

Differential Revision: D5568918

Pulled By: maysamyabandeh

fbshipit-source-id: f7eac866e175daf3793ae79da108f65cc7dc7b25
2017-08-05 17:26:15 -07:00
Yi Wu
0d4a2b7330 Avoid blob db call Sync() while writing
Summary:
The FsyncFiles background job call Fsync() periodically for blob files. However it can access WritableFileWriter concurrently with a Put() or Write(). And WritableFileWriter does not support concurrent access. It will lead to WritableFileWriter buffer being flush with same content twice, and blob file end up corrupted. Fixing by simply let FsyncFiles hold write_mutex_.
Closes https://github.com/facebook/rocksdb/pull/2685

Differential Revision: D5561908

Pulled By: yiwu-arbug

fbshipit-source-id: f0bb5bcab0e05694e053b8c49eab43640721e872
2017-08-04 13:12:07 -07:00
Yi Wu
92afe830f9 Update all blob db TTL and timestamps to uint64_t
Summary:
The current blob db implementation use mix of int32_t, uint32_t and uint64_t for TTL and expiration. Update all timestamps to uint64_t for consistency.
Closes https://github.com/facebook/rocksdb/pull/2683

Differential Revision: D5557103

Pulled By: yiwu-arbug

fbshipit-source-id: e4eab2691629a755e614e8cf1eed9c3a681d0c42
2017-08-03 17:57:30 -07:00
Yi Wu
0b814ba92d Allow concurrent writes to blob db
Summary:
I'm going with brute-force solution, just letting Put() and Write() holding a mutex before writing. May improve concurrent writing with finer granularity locking later.
Closes https://github.com/facebook/rocksdb/pull/2682

Differential Revision: D5552690

Pulled By: yiwu-arbug

fbshipit-source-id: 039abd675b5d274a7af6428198d1733cafecef4c
2017-08-03 15:11:26 -07:00
Yi Wu
2c45ada4c4 Blob DB garbage collection should keep keys with newer version
Summary:
Fix the bug where if blob db garbage collection revmoe keys with newer version. It shouldn't delete the key from base db when sequence number in base db is not equal to the one in blob log.
Closes https://github.com/facebook/rocksdb/pull/2678

Differential Revision: D5549752

Pulled By: yiwu-arbug

fbshipit-source-id: abb8649260963b5c389748023970fd746279d227
2017-08-03 13:12:12 -07:00
Maysam Yabandeh
c3d5c4d38a Refactor TransactionImpl
Summary:
This patch refactors TransactionImpl by separating the logic for pessimistic concurrency control from the implementation of how to write the data to rocksdb. The existing implementation is named WriteCommittedTxnImpl as it writes committed data to the db. A template named WritePreparedTxnImpl is also added which will be later completed to provide a an alternative implementation.
Closes https://github.com/facebook/rocksdb/pull/2676

Differential Revision: D5549998

Pulled By: maysamyabandeh

fbshipit-source-id: 16298e86b43ca4849324c1f35c731913c6d17bec
2017-08-03 08:57:22 -07:00
Yi Wu
1900771bd2 Dump Blob DB options to info log
Summary:
* Dump blob db options to info log
* Remove BlobDBOptionsImpl to disallow dynamic cast *BlobDBOptions into *BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645
* Change some of the default options
* Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon.
Closes https://github.com/facebook/rocksdb/pull/2671

Differential Revision: D5529912

Pulled By: yiwu-arbug

fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7
2017-08-01 13:01:47 -07:00
Siying Dong
21696ba502 Replace dynamic_cast<>
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645

Differential Revision: D5502723

Pulled By: siying

fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
2017-07-28 16:27:16 -07:00
Yi Wu
aaf42fe775 Move blob_db/ttl_extractor.h into blob_db/blob_db.h
Summary:
Move blob_db/ttl_extractor.h into blob_db/blob_db.h
Also exclude TTLExtractor from LITE build.
Closes https://github.com/facebook/rocksdb/pull/2665

Differential Revision: D5520009

Pulled By: yiwu-arbug

fbshipit-source-id: 4813dcc272c7cc4bf2cdac285256d9a17d78c7b7
2017-07-28 14:28:21 -07:00
Sagar Vemuri
aace46516b Fix license headers in Cassandra related files
Summary:
I might have missed these while doing some recent cassandra code reviews.
Closes https://github.com/facebook/rocksdb/pull/2663

Differential Revision: D5520138

Pulled By: sagar0

fbshipit-source-id: 340930afe9efe03c75f535a1da1f89bd3e53c1f9
2017-07-28 13:56:56 -07:00
Islam AbdelRahman
50a969131f CacheActivityLogger, component to log cache activity into a file
Summary:
Simple component that will add a new entry in a log file every time we lookup/insert a key in SimCache.
API:
```
SimCache::StartActivityLogging(<file_name>, <env>, <optional_max_size>)
SimCache::StopActivityLogging()
```

Sending for review, Still need to add more comments.

I was thinking about a better approach, but I ended up deciding I will use a mutex to sync the writes to the file, since this feature should not be heavily used and only used to collect info that will be analyzed offline. I think it's okay to hold the mutex every time we lookup/add to the SimCache.
Closes https://github.com/facebook/rocksdb/pull/2295

Differential Revision: D5063826

Pulled By: IslamAbdelRahman

fbshipit-source-id: f3b5daed8b201987c9a071146ddd5c5740a2dd8c
2017-07-28 12:36:48 -07:00
Yi Wu
6083bc79f8 Blob DB TTL extractor
Summary:
Introducing blob_db::TTLExtractor to replace extract_ttl_fn. The TTL
extractor can be use to extract TTL from keys insert with Put or
WriteBatch. Change over existing extract_ttl_fn are:
* If value is changed, it will be return via std::string* (rather than Slice*). With Slice* the new value has to be part of the existing value. With std::string* the limitation is removed.
* It can optionally return TTL or expiration.

Other changes in this PR:
* replace `std::chrono::system_clock` with `Env::NowMicros` so that I can mock time in tests.
* add several TTL tests.
* other minor naming change.
Closes https://github.com/facebook/rocksdb/pull/2659

Differential Revision: D5512627

Pulled By: yiwu-arbug

fbshipit-source-id: 0dfcb00d74d060b8534c6130c808e4d5d0a54440
2017-07-27 23:26:04 -07:00
Maysam Yabandeh
2b259c9d49 Lower num of iterations in DeadlockCycle test
Summary:
Currently this test times out with tsan. This is likely due to decreased speed with tsan. By lowering the number of iterations we can still catch a bug as the test is run regularly and multiple runs of the test is equivalent with running the test with more iterations.
Closes https://github.com/facebook/rocksdb/pull/2639

Differential Revision: D5490549

Pulled By: maysamyabandeh

fbshipit-source-id: bd69c42a9728d337ac95a06a401088384e51731a
2017-07-25 11:42:26 -07:00
Siying Dong
e67b35c076 Add Iterator::Refresh()
Summary:
Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily.
Closes https://github.com/facebook/rocksdb/pull/2621

Differential Revision: D5464500

Pulled By: siying

fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
2017-07-24 10:54:37 -07:00
Sagar Vemuri
72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c598, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao
1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Pengchao Wang
534c255c7a Cassandra compaction filter for purge expired columns and rows
Summary:
Major changes in this PR:
* Implement CassandraCompactionFilter to remove expired columns and rows (if all column expired)
* Move cassandra related code from utilities/merge_operators/cassandra to utilities/cassandra/*
* Switch to use shared_ptr<> from uniqu_ptr for Column membership management in RowValue. Since columns do have multiple owners in Merge and GC process, use shared_ptr helps make RowValue immutable.
* Rename cassandra_merge_test to cassandra_functional_test and add two TTL compaction related tests there.
Closes https://github.com/facebook/rocksdb/pull/2588

Differential Revision: D5430010

Pulled By: wpc

fbshipit-source-id: 9566c21e06de17491d486a68c70f52d501f27687
2017-07-21 14:57:44 -07:00
Yi Wu
0302da47a7 Reduce blob db noisy logging
Summary:
Remove some of the per-key logging by blob db to reduce noise.
Closes https://github.com/facebook/rocksdb/pull/2587

Differential Revision: D5429115

Pulled By: yiwu-arbug

fbshipit-source-id: b89328282fb8b3c64923ce48738c16017ce7feaf
2017-07-20 15:02:31 -07:00
Yedidya Feldblum
f1a056e005 CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera
Summary:
CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera.

The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures.

Reviewed By: Orvid

Differential Revision:
D5432398

Tags: codemod, codemod-opensource

fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914
2017-07-16 21:26:02 -07:00
Siying Dong
3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Yi Wu
26ce69b195 Update blob db to use ROCKS_LOG_* macro
Summary:
Update blob db to use the newer ROCKS_LOG_* macro.
Closes https://github.com/facebook/rocksdb/pull/2574

Differential Revision: D5414526

Pulled By: yiwu-arbug

fbshipit-source-id: e428753aa5917e8b435cead2db26df586e5d1def
2017-07-13 10:14:04 -07:00
foolenough
21b17d7686 Fix BlobDB::Get which only get out the value offset
Summary:
Blob db use StackableDB::get which only get out the
value offset, but not the value.
Fix by making BlobDB::Get override the designated getter.
Closes https://github.com/facebook/rocksdb/pull/2553

Differential Revision: D5396823

Pulled By: yiwu-arbug

fbshipit-source-id: 5a7d1cf77ee44490f836a6537225955382296878
2017-07-12 17:57:40 -07:00
Andrew Kryczka
33042573db Fix GetCurrentTime() initialization for valgrind
Summary:
Valgrind had false positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to only overwrite it in success case.
Closes https://github.com/facebook/rocksdb/pull/2526

Differential Revision: D5358689

Pulled By: ajkr

fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
2017-07-05 12:12:00 -07:00
Mike Kolupaev
397ab11152 Improve Status message for block checksum mismatches
Summary:
We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.

It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
Closes https://github.com/facebook/rocksdb/pull/2507

Differential Revision: D5345702

Pulled By: al13n321

fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
2017-06-28 21:27:01 -07:00
Siying Dong
18c63af6ef Make "make analyze" happy
Summary:
"make analyze" is reporting some errors. It's complicated to look but it seems to me that they are all false positive. Anyway, I think cleaning them up is a good idea. Some of the changes are hacky but I don't know a better way.
Closes https://github.com/facebook/rocksdb/pull/2508

Differential Revision: D5341710

Pulled By: siying

fbshipit-source-id: 6070e430e0e41a080ef441e05e8ec827d45efab6
2017-06-28 15:42:27 -07:00
Yi Wu
982cec22af Fix TARGETS file tests list
Summary:
1. The buckifier script assume each test "foo" comes with a .cc file of the same name (i.e. foo.cc). Update cassandra tests to follow this pattern so that the buckifier script can recognize them.
2. add blob_db_test
Closes https://github.com/facebook/rocksdb/pull/2506

Differential Revision: D5331517

Pulled By: yiwu-arbug

fbshipit-source-id: 86f3eba471fc621186ab44cbd073b6162cde8e57
2017-06-27 14:12:02 -07:00
Maysam Yabandeh
499ebb3ab5 Optimize for serial commits in 2PC
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)

The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.

Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.

Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345

Differential Revision: D5210732

Pulled By: maysamyabandeh

fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
2017-06-24 14:11:29 -07:00
Siying Dong
88cd2d96e7 Downgrade option sanitiy check level for prefix_extractor
Summary:
With c7004840d2, it's safe to open a DB with different prefix extractor. So it's safe to skip prefix extractor check.
Closes https://github.com/facebook/rocksdb/pull/2474

Differential Revision: D5294700

Pulled By: siying

fbshipit-source-id: eeb500da795eecb29b8c9c56a14cfd4afda12ecc
2017-06-22 16:26:36 -07:00
Andrew Kryczka
048446fc74 Fix cassandra ASAN use-after-free
Summary:
When we create a column based on the `string::c_str()`, we need to make sure that char array doesn't get deleted when calls to `string::append()` cause the string to expand.
Closes https://github.com/facebook/rocksdb/pull/2470

Differential Revision: D5285049

Pulled By: ajkr

fbshipit-source-id: f918dd426ff3c024e7a293dcb10448f10b6c98e8
2017-06-20 13:27:16 -07:00
Dmitri Smirnov
a21db161c9 Implement ReopenWritibaleFile on Windows and other fixes
Summary:
Make default impl return NoSupported so the db_blob
  tests exist in a meaningful manner.
  Replace std::thread to port::Thread
Closes https://github.com/facebook/rocksdb/pull/2465

Differential Revision: D5275563

Pulled By: yiwu-arbug

fbshipit-source-id: cedf1a18a2c05e20d768c1308b3f3224dbd70ab6
2017-06-20 10:31:13 -07:00
Chen Shen
cbd825deea Create a MergeOperator for Cassandra Row Value
Summary:
This PR implements the MergeOperator for Cassandra Row Values.
Closes https://github.com/facebook/rocksdb/pull/2289

Differential Revision: D5055464

Pulled By: scv119

fbshipit-source-id: 45f276ef8cbc4704279202f6a20c64889bc1adef
2017-06-16 14:27:00 -07:00
Yi Wu
ae8571f5c2 Fix blob db compression bug
Summary:
`CompressBlock()` will return the uncompressed slice (i.e. `Slice(value_unc)`) if compression ratio is not good enough. This is undesired. We need to always assign the compressed slice to `value`.
Closes https://github.com/facebook/rocksdb/pull/2447

Differential Revision: D5244682

Pulled By: yiwu-arbug

fbshipit-source-id: 6989dd8852c9622822ba9acec9beea02007dff09
2017-06-14 13:56:42 -07:00
Yi Wu
7a380deff7 Update blob_db_test
Summary:
I'm trying to improve unit test of blob db. I'm rewriting blob db test. In this patch:
* Rewrite tests of basic put/write/delete operations.
* Add disable_background_tasks to BlobDBOptionsImpl to allow me not running any background job for basic unit tests.
* Move DestroyBlobDB out from BlobDBImpl to be a standalone function.
* Remove all garbage collection related tests. Will rewrite them in following patch.
* Disabled compression test since it is failing. Will fix in a followup patch.
Closes https://github.com/facebook/rocksdb/pull/2446

Differential Revision: D5243306

Pulled By: yiwu-arbug

fbshipit-source-id: 157c71ad3b699307cb88baa3830e9b6e74f8e939
2017-06-14 13:12:34 -07:00
Sagar Vemuri
89ad9f3adb Allow ignoring unknown options when loading options from a file
Summary:
Added a flag, `ignore_unknown_options`, to skip unknown options when loading an options file (using `LoadLatestOptions`/`LoadOptionsFromFile`) or while verifying options (using `CheckOptionsCompatibility`). This will help in downgrading the db to an older version.

Also added `--ignore_unknown_options` flag to ldb

**Example Use case:**
In MyRocks, if copying from newer version to older version, it is often impossible to start because of new RocksDB options that don't exist in older version, even though data format is compatible.
MyRocks uses these load and verify functions in [ha_rocksdb.cc::check_rocksdb_options_compatibility](e004fd9f41/storage/rocksdb/ha_rocksdb.cc (L3348-L3401)).

**Test Plan:**
Updated the unit tests.
`make check`

ldb:
$ ./ldb --db=/tmp/test_db --create_if_missing put a1 b1
OK

Now edit /tmp/test_db/<OPTIONS-file> and add an unknown option.

Try loading the options now, and it fails:
$ ./ldb --db=/tmp/test_db --try_load_options get a1
Failed: Invalid argument: Unrecognized option DBOptions:: abcd

Passes with the new --ignore_unknown_options flag
$ ./ldb --db=/tmp/test_db --try_load_options --ignore_unknown_options get a1
b1
Closes https://github.com/facebook/rocksdb/pull/2423

Differential Revision: D5212091

Pulled By: sagar0

fbshipit-source-id: 2ec17636feb47dc0351b53a77e5f15ef7cbf2ca7
2017-06-13 16:58:01 -07:00
Andrew Kryczka
c217e0b9c7 Call RateLimiter for compaction reads
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433

Differential Revision: D5216946

Pulled By: ajkr

fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
2017-06-13 14:56:46 -07:00
Yi Wu
91e2aa3ce2 write exact sequence number for each put in write batch
Summary:
At the beginning of write batch write, grab the latest sequence from base db and assume sequence number will increment by 1 for each put and delete, and write the exact sequence number with each put. This is assuming we are the only writer to increment sequence number (no external file ingestion, etc) and there should be no holes in the sequence number.

Also having some minor naming changes.
Closes https://github.com/facebook/rocksdb/pull/2402

Differential Revision: D5176134

Pulled By: yiwu-arbug

fbshipit-source-id: cb4712ee44478d5a2e5951213a10b72f08fe8c88
2017-06-13 12:42:36 -07:00
Maysam Yabandeh
550a1df72c Fix clang errors by asserting the precondition
Summary:
USE_CLANG=1 make -j32 analyze
The two errors would disappear after the assertion.
Closes https://github.com/facebook/rocksdb/pull/2416

Differential Revision: D5193526

Pulled By: maysamyabandeh

fbshipit-source-id: 16a21f18f68023f862764dd3ab9e00ca60b0eefa
2017-06-06 12:56:52 -07:00
hyunwoo
c7662a44a4 fixed typo
Summary:
fixed typo
Closes https://github.com/facebook/rocksdb/pull/2376

Differential Revision: D5183630

Pulled By: ajkr

fbshipit-source-id: 133cfd0445959e70aa2cd1a12151bf3c0c5c3ac5
2017-06-05 11:27:34 -07:00
Aaron Gao
7f6c02dda1 using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public…
Summary:
… headers

https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB.

We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term.

make check -j64
Closes https://github.com/facebook/rocksdb/pull/2380

Differential Revision: D5177896

Pulled By: lightmark

fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
2017-06-02 17:26:19 -07:00
Yi Wu
ad19eb8686 Fixing blob db sequence number handling
Summary:
Blob db rely on base db returning sequence number through write batch after DB::Write(). However after recent changes to the write path, DB::Writ()e no longer return sequence number in some cases. Fixing it by have WriteBatchInternal::InsertInto() always encode sequence number into write batch.

Stacking on #2375.
Closes https://github.com/facebook/rocksdb/pull/2385

Differential Revision: D5148358

Pulled By: yiwu-arbug

fbshipit-source-id: 8bda0aa07b9334ed03ed381548b39d167dc20c33
2017-05-31 10:56:45 -07:00
Yi Wu
345878a7fb update blob_db_test
Summary:
Re-enable blob_db_test with some update:
* Commented out delay at the end of GC tests. Will update the logic later with sync point to properly trigger GC.
* Added some helper functions.

Also update make files to include blob_dump tool.
Closes https://github.com/facebook/rocksdb/pull/2375

Differential Revision: D5133793

Pulled By: yiwu-arbug

fbshipit-source-id: 95470b26d0c1f9592ba4b7637e027fdd263f425c
2017-05-30 22:26:13 -07:00
Tamir Duberstein
103d0692ea Avoid unsupported attributes when not building with UBSAN
Summary:
yiwu-arbug see individual commits.
Closes https://github.com/facebook/rocksdb/pull/2318

Differential Revision: D5141520

Pulled By: yiwu-arbug

fbshipit-source-id: 7987c92ab4461eef36afce5a133d3a0ee0c96300
2017-05-30 11:13:01 -07:00
Sagar Vemuri
02594b5f11 Fix build errors in blob_dump_tool with GCC 4.8
Summary:
Fixing the build errors seen with GCC 4.8.1.
```
Makefile:105: Warning: Compiling in debug mode. Don't use the resulting binary in production
utilities/blob_db/blob_dump_tool.cc: In member function ‘rocksdb::Status rocksdb::blob_db::BlobDumpTool::DumpBlobLogFooter(uint64_t, uint64_t*)’:
utilities/blob_db/blob_dump_tool.cc:149:42: error: expected ‘)’ before ‘PRIu64’
   fprintf(stdout, "  Blob count     : %" PRIu64 "\n", footer.GetBlobCount());
                                          ^
utilities/blob_db/blob_dump_tool.cc:149:76: error: spurious trailing ‘%’ in format [-Werror=format=]
   fprintf(stdout, "  Blob count     : %" PRIu64 "\n", footer.GetBlobCount());
                                                                            ^
utilities/blob_db/blob_dump_tool.cc:149:76: error: too many arguments for format [-Werror=format-extra-args]
utilities/blob_db/blob_dump_tool.cc: In member function ‘rocksdb::Status rocksdb::blob_db::BlobDumpTool::DumpRecord(rocksdb::blob_db::BlobDumpTool::DisplayType, rocksdb::blob_db::BlobDumpTool::DisplayType, uint64_t*)’:
utilities/blob_db/blob_dump_tool.cc:161:49: error: expected ‘)’ before ‘PRIx64’
   fprintf(stdout, "Read record with offset 0x%" PRIx64 " (%" PRIu64 "):\n",
                                                 ^
utilities/blob_db/blob_dump_tool.cc:162:27: error: spurious trailing ‘%’ in format [-Werror=format=]
           *offset, *offset);
                           ^
utilities/blob_db/blob_dump_tool.cc:162:27: error: too many arguments for format [-Werror=format-extra-args]
utilities/blob_db/blob_dump_tool.cc:176:38: error: expected ‘)’ before ‘PRIu64’
   fprintf(stdout, "  blob size  : %" PRIu64 "\n", record.GetBlobSize());
                                      ^
utilities/blob_db/blob_dump_tool.cc:176:71: error: spurious trailing ‘%’ in format [-Werror=format=]
   fprintf(stdout, "  blob size  : %" PRIu64 "\n", record.GetBlobSize());
                                                                       ^
utilities/blob_db/blob_dump_tool.cc:176:71: error: too many arguments for format [-Werror=format-extra-args]
utilities/blob_db/blob_dump_tool.cc:178:38: error: expected ‘)’ before ‘PRIu64’
   fprintf(stdout, "  time       : %" PRIu64 "\n", record.GetTimeVal());
                                      ^
utilities/blob_db/blob_dump_tool.cc:178:70: error: spurious trailing ‘%’ in format [-Werror=format=]
   fprintf(stdout, "  time       : %" PRIu64 "\n", record.GetTimeVal());
                                                                      ^
utilities/blob_db/blob_dump_tool.cc:178:70: error: too many arguments for format [-Werror=format-extra-args]
utilities/blob_db/blob_dump_tool.cc:214:38: error: expected ‘)’ before ‘PRIu64’
   fprintf(stdout, "  sequence   : %" PRIu64 "\n", record.GetSN());
                                      ^
utilities/blob_db/blob_dump_tool.cc:214:65: error: spurious trailing ‘%’ in format [-Werror=format=]
   fprintf(stdout, "  sequence   : %" PRIu64 "\n", record.GetSN());
```
Closes https://github.com/facebook/rocksdb/pull/2359

Differential Revision: D5117684

Pulled By: sagar0

fbshipit-source-id: 7480346bcd96205fcae890927c5e68cf004e87be
2017-05-24 00:11:36 -07:00
Yi Wu
578fb0b1dc Simple blob file dumper
Summary:
A simple blob file dumper.
Closes https://github.com/facebook/rocksdb/pull/2242

Differential Revision: D5097553

Pulled By: yiwu-arbug

fbshipit-source-id: c6e00d949fcd3658f9f68da9352f06339fac418d
2017-05-23 10:42:59 -07:00
Aaron Gao
3e86c0f07c disable direct reads for log and manifest and add direct io to tests
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337

Differential Revision: D5100261

Pulled By: lightmark

fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
2017-05-22 18:41:28 -07:00
Sagar Vemuri
228f49d20a Fix data races caught by tsan
Summary:
This fixes the tsan build failures in:
- write_callback_test
- persistent_cache_test.*
Closes https://github.com/facebook/rocksdb/pull/2339

Differential Revision: D5101190

Pulled By: sagar0

fbshipit-source-id: 537e19ed05272b1f34cfbf793aa822b2264a1643
2017-05-22 10:27:23 -07:00
Yi Wu
d746aead1a Suppress clang-analyzer false positive
Summary:
Fixing two types of clang-analyzer false positives:
* db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated.
* ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running.
Closes https://github.com/facebook/rocksdb/pull/2334

Differential Revision: D5093801

Pulled By: yiwu-arbug

fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d
2017-05-19 10:56:28 -07:00
yizhu.sun
f5ba131bf8 Fixed some spelling mistakes
Summary: Closes https://github.com/facebook/rocksdb/pull/2314

Differential Revision: D5079601

Pulled By: sagar0

fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349
2017-05-17 23:12:36 -07:00
hyunwoo
0ebdd70579 fixed typo
Summary:
fixed typo
Closes https://github.com/facebook/rocksdb/pull/2312

Differential Revision: D5079631

Pulled By: sagar0

fbshipit-source-id: e4c8d1d89b244ee69e9dea1dd013227cc5241026
2017-05-17 16:41:49 -07:00
Yi Wu
445f1235bf s/std::snprintf/snprintf
Summary:
Looks like std::snprintf is not available on all platforms (e.g. MSVC 2010). Change it back to snprintf, where we have a macro in port.h to workaround compatibility.
Closes https://github.com/facebook/rocksdb/pull/2308

Differential Revision: D5070988

Pulled By: yiwu-arbug

fbshipit-source-id: bedfc1660bab0431c583ad434b7e68265e1211b1
2017-05-16 12:01:04 -07:00
Yi Wu
86d5492530 Fix build error with blob DB.
Summary:
snprintf is in <stdio.h> and not in namespace std.
Closes https://github.com/facebook/rocksdb/pull/2287

Reviewed By: anirbanr-fb

Differential Revision: D5054752

Pulled By: yiwu-arbug

fbshipit-source-id: 356807ec38f3c7d95951cdb41f31a3d3ae0714d4
2017-05-15 14:05:46 -07:00
Andrew Kryczka
3fa9a39c68 Add GetAllKeyVersions API
Summary:
- Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
- Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type.
- Migrated the "ldb idump" subcommand to use this function
- The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
Closes https://github.com/facebook/rocksdb/pull/2232

Differential Revision: D4976007

Pulled By: ajkr

fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
2017-05-12 15:54:06 -07:00
Anirban Rahut
d85ff4953c Blob storage pr
Summary:
The final pull request for Blob Storage.
Closes https://github.com/facebook/rocksdb/pull/2269

Differential Revision: D5033189

Pulled By: yiwu-arbug

fbshipit-source-id: 6356b683ccd58cbf38a1dc55e2ea400feecd5d06
2017-05-10 15:14:44 -07:00
siddontang
b551104e04 support PopSavePoint for WriteBatch
Summary:
Try to fix https://github.com/facebook/rocksdb/issues/1969
Closes https://github.com/facebook/rocksdb/pull/2170

Differential Revision: D4907333

Pulled By: yiwu-arbug

fbshipit-source-id: 417b420ff668e6c2fd0dad42a94c57385012edc5
2017-05-03 10:57:45 -07:00
Yi Wu
da4b2070b3 Fix WriteBatchWithIndex address use after scope error
Summary:
Fix use after scope error caught by ASAN.
Closes https://github.com/facebook/rocksdb/pull/2228

Differential Revision: D4968028

Pulled By: yiwu-arbug

fbshipit-source-id: a2a266c98634237494ab4fb2d666bc938127aeb2
2017-04-28 13:12:10 -07:00
Siying Dong
d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Dmitri Smirnov
cdad04b051 Remove double buffering on RandomRead on Windows.
Summary:
Remove double buffering on RandomRead on Windows.
  With more logic appear in file reader/write Read no longer
  obeys forwarding calls to Windows implementation.
  Previously direct_io (unbuffered) was only available on Windows
  but now is supported as generic.
  We remove intermediate buffering on Windows.
  Remove random_access_max_buffer_size option which was windows specific.
  Non-zero values for that opton introduced unnecessary lock contention.
  Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are
  no longer necessary.
  Add aligned buffer reads for cases when requested reads exceed read ahead size.
Closes https://github.com/facebook/rocksdb/pull/2105

Differential Revision: D4847770

Pulled By: siying

fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f
2017-04-27 12:30:05 -07:00
Andrew Kryczka
e5e545a021 Reunite checkpoint and backup core logic
Summary:
These code paths forked when checkpoint was introduced by copy/pasting the core backup logic. Over time they diverged and bug fixes were sometimes applied to one but not the other (like fix to include all relevant WALs for 2PC), or it required extra effort to fix both (like fix to forge CURRENT file). This diff reunites the code paths by extracting the core logic into a function, CreateCustomCheckpoint(), that is customizable via callbacks to implement both checkpoint and backup.

Related changes:

- flush_before_backup is now forcibly enabled when 2PC is enabled
- Extracted CheckpointImpl class definition into a header file. This is so the function, CreateCustomCheckpoint(), can be called by internal rocksdb code but not exposed to users.
- Implemented more functions in DummyDB/DummyLogFile (in backupable_db_test.cc) that are used by CreateCustomCheckpoint().
Closes https://github.com/facebook/rocksdb/pull/1932

Differential Revision: D4622986

Pulled By: ajkr

fbshipit-source-id: 157723884236ee3999a682673b64f7457a7a0d87
2017-04-24 15:06:46 -07:00
Maysam Yabandeh
4c9447d889 Add erase option to release cache
Summary:
This is useful when we put the entries in the block cache for accounting
purposes and do not expect it to be used after it is released. If the cache does not
erase the item in such cases not only the performance of cache is
negatively affected but the item's destructor not being called at the
time of release might violate the assumptions about the lifetime of the
object.

The new change adds a force_erase option to the Release method and
returns a boolean to indicate whehter the item is successfully deleted.
Closes https://github.com/facebook/rocksdb/pull/2180

Differential Revision: D4916032

Pulled By: maysamyabandeh

fbshipit-source-id: 94409a346069923cac9de8e57adc313b4ed46f28
2017-04-24 11:28:36 -07:00
Tomas Kolda
04d58970cb AIX and Solaris Sparc Support
Summary:
Replacement of #2147

The change was squashed due to a lot of conflicts.
Closes https://github.com/facebook/rocksdb/pull/2194

Differential Revision: D4929799

Pulled By: siying

fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581
2017-04-21 20:48:04 -07:00
Siying Dong
7534ba7bde StackableDB should pass ResetStats()
Summary: Closes https://github.com/facebook/rocksdb/pull/2190

Differential Revision: D4922688

Pulled By: siying

fbshipit-source-id: eaa3d122f8d389ae0508ec8b61f7780fd8b0a7ef
2017-04-20 16:11:56 -07:00
Andrew Kryczka
df74b775e6 Limit backups opened
Summary:
This was requested by a customer who wants to proactively monitor whether any valid backups are available. The existing performance was poor because Open() serially reads every small meta-file (one per backup), which was slow on HDFS.

Now we only read the minimum number of meta-files to find `max_valid_backups_to_open` valid backups. The customer mentioned above can just set it to one.
Closes https://github.com/facebook/rocksdb/pull/2151

Differential Revision: D4882564

Pulled By: ajkr

fbshipit-source-id: cb0edf9e8ac693e4d5f24902e725a011ed8c0c2f
2017-04-19 13:26:47 -07:00
Siying Dong
ca96654d85 Change Build Env to gcc-5
Summary:
Default to build using gcc-5. Only apply to Facebook-only environments.
Closes https://github.com/facebook/rocksdb/pull/2158

Differential Revision: D4887568

Pulled By: siying

fbshipit-source-id: 53496c9af3273ccd44441bd0bef9d29beefbc00b
2017-04-14 11:12:56 -07:00
Manuel Ung
9300ef5455 Fix shared lock upgrades
Summary:
Upgrading a shared lock was silently succeeding because the actual locking code was skipped. This is because if the keys are tracked, it is assumed that they are already locked and do not require locking. Fix this by recording in tracked keys whether the key was locked exclusively or not.

Note that lock downgrades are impossible, which is the behaviour we expect.

This fixes facebook/mysql-5.6#587.
Closes https://github.com/facebook/rocksdb/pull/2122

Differential Revision: D4861489

Pulled By: IslamAbdelRahman

fbshipit-source-id: 58c7ebe7af098bf01b9774b666d3e9867747d8fd
2017-04-10 16:06:00 -07:00
Manuel Ung
1f8b119ed6 Limit maximum memory used in the WriteBatch representation
Summary:
Extend TransactionOptions to include max_write_batch_size which determines the maximum size of the writebatch representation. If memory limit is exceeded, the operation will abort with subcode kMemoryLimit.
Closes https://github.com/facebook/rocksdb/pull/2124

Differential Revision: D4861842

Pulled By: lth

fbshipit-source-id: 46fd172ea67cc90bbba829bf0d70cfab2261c161
2017-04-10 15:42:26 -07:00
Sagar Vemuri
7124268a09 Reduce the number of params needed to construct DBIter
Summary:
DBIter, and in-turn NewDBIterator and NewArenaWrappedDBIterator, take a  bunch of params. They can be reduced by passing in ReadOptions directly instead of passing in every new param separately. It also seems much cleaner as a bunch of the params towards the end seem to be optional.

(Recently I introduced max_skippable_internal_keys, which added one more to the already huge count).

Idea courtesy IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/2116

Differential Revision: D4857128

Pulled By: sagar0

fbshipit-source-id: 7d239df094b94bd9ea79d145cdf825478ac037a8
2017-04-10 11:14:14 -07:00
Sagar Vemuri
343b59d6ee Move various string utility functions into string_util
Summary:
This is an effort to club all string related utility functions into one common place, in string_util, so that it is easier for everyone to know what string processing functions are available. Right now they seem to be spread out across multiple modules, like logging and options_helper.

Check the sub-commits for easier reviewing.
Closes https://github.com/facebook/rocksdb/pull/2094

Differential Revision: D4837730

Pulled By: sagar0

fbshipit-source-id: 344278a
2017-04-06 14:54:12 -07:00
Yi Wu
df6f5a3772 Move memtable related files into memtable directory
Summary:
Move memtable related files into memtable directory.
Closes https://github.com/facebook/rocksdb/pull/2087

Differential Revision: D4829242

Pulled By: yiwu-arbug

fbshipit-source-id: ca70ab6
2017-04-06 14:09:13 -07:00
Siying Dong
d2dce5611a Move some files under util/ to separate dirs
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090

Differential Revision: D4833681

Pulled By: siying

fbshipit-source-id: 2fd8bef
2017-04-05 19:09:16 -07:00
Andrew Kryczka
d659faad54 Level-based L0->L0 compaction
Summary:
Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach.

- L0->L0 is triggered when base level is unavailable due to pending compactions
- L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes.
- Subcompactions are disabled for L0->L0 since we want to output one file.
- Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0.
Closes https://github.com/facebook/rocksdb/pull/2027

Differential Revision: D4760318

Pulled By: ajkr

fbshipit-source-id: 9d07183
2017-04-04 18:09:11 -07:00
Andrew Kryczka
e2c6c06366 add TimedEnv
Summary:
I've needed Env timing measurements a few times now, so finally built something for it.
Closes https://github.com/facebook/rocksdb/pull/2073

Differential Revision: D4811231

Pulled By: ajkr

fbshipit-source-id: 218a249
2017-04-04 11:24:12 -07:00
Yi Wu
9e44531803 Refactor WriteImpl (pipeline write part 1)
Summary:
Refactor WriteImpl() so when I plug-in the pipeline write code (which is
an alternative approach for WriteThread), some of the logic can be
reuse. I split out the following methods from WriteImpl():

* PreprocessWrite()
* HandleWALFull() (previous MaybeFlushColumnFamilies())
* HandleWriteBufferFull()
* WriteToWAL()

Also adding a constructor to WriteThread::Writer, and move WriteContext into db_impl.h.
No real logic change in this patch.
Closes https://github.com/facebook/rocksdb/pull/2042

Differential Revision: D4781014

Pulled By: yiwu-arbug

fbshipit-source-id: d45ca18
2017-04-04 10:24:32 -07:00
Siying Dong
6ef8c620d3 Move auto_roll_logger and filename out of db/
Summary:
It is confusing to have auto_roll_logger to stay under db/, which has nothing to do with database. Move filename together as it is a dependency.
Closes https://github.com/facebook/rocksdb/pull/2080

Differential Revision: D4821141

Pulled By: siying

fbshipit-source-id: ca7d768
2017-04-03 18:39:14 -07:00
Andrew Kryczka
4e0065015d make all DB::Get overloads virtual
Summary:
some fbcode services override it, we need to keep it virtual.

original change: #1756
Closes https://github.com/facebook/rocksdb/pull/2065

Differential Revision: D4808123

Pulled By: ajkr

fbshipit-source-id: 5eaeea7
2017-03-30 23:39:14 -07:00
Orgad Shaneh
6401a8b76b Fix build with MinGW
Summary:
There still are many warnings (most of them about invalid printf format
for long long), but it builds if FAIL_ON_WARNINGS is disabled.
Closes https://github.com/facebook/rocksdb/pull/2052

Differential Revision: D4807355

Pulled By: siying

fbshipit-source-id: ef03786
2017-03-30 16:54:52 -07:00
Andrew Kryczka
a9c86f51b7 backup garbage collect shared_checksum tmp files
Summary:
previously we only cleaned up .tmp files under "shared/" and "private/" directories in case the previous backup failed. we need to do the same for "shared_checksum/"; otherwise, the subsequent backup will fail if it tries to backup at least one of the same files.
Closes https://github.com/facebook/rocksdb/pull/2062

Differential Revision: D4805599

Pulled By: ajkr

fbshipit-source-id: eaa6088
2017-03-30 14:54:12 -07:00
Sharan Suryanarayanan
8d3cb4f207 Added naming of backup engine threads
Summary:
Changed the naming of backup engine threads from "ldb" to "backup_engine"
Closes https://github.com/facebook/rocksdb/pull/2053

Differential Revision: D4799325

Pulled By: ajkr

fbshipit-source-id: 046893f
2017-03-29 17:10:46 -07:00
Siying Dong
9ef3627fd3 Allow checkpointing without flushing
Summary:
Add a parameter to Checkpoint::CreateCheckpoint() so that flush can be skipped if total log file size is within a threshold.
Closes https://github.com/facebook/rocksdb/pull/1993

Differential Revision: D4719842

Pulled By: siying

fbshipit-source-id: 4f9d9e1
2017-03-21 18:09:13 -07:00
Islam AbdelRahman
e19163688b Add macros to include file name and line number during Logging
Summary:
current logging
```
2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25
2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2.
2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK
2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK
2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0.
2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6
2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2017/03/14-14:20:31.
Closes https://github.com/facebook/rocksdb/pull/1990

Differential Revision: D4708695

Pulled By: IslamAbdelRahman

fbshipit-source-id: cb8968f
2017-03-15 19:39:12 -07:00
Maysam Yabandeh
11526252cc Pinnableslice (2nd attempt)
Summary:
PinnableSlice

    Summary:
    Currently the point lookup values are copied to a string provided by the
    user. This incures an extra memcpy cost. This patch allows doing point lookup
    via a PinnableSlice which pins the source memory location (instead of
    copying their content) and releases them after the content is consumed
    by the user. The old API of Get(string) is translated to the new API
    underneath.

    Here is the summary for improvements:

    value 100 byte: 1.8% regular, 1.2% merge values
    value 1k byte: 11.5% regular, 7.5% merge values
    value 10k byte: 26% regular, 29.9% merge values
    The improvement for merge could be more if we extend this approach to
    pin the merge output and delay the full merge operation until the user
    actually needs it. We have put that for future work.

    PS:
    Sometimes we observe a small decrease in performance when switching from
    t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756

Differential Revision: D4391738

Pulled By: maysamyabandeh

fbshipit-source-id: 6f3edd3
2017-03-13 11:54:10 -07:00
Andrew Kryczka
7c80a6d7d1 Statistic for how often rate limiter is drained
Summary:
This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case.

The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor.
Closes https://github.com/facebook/rocksdb/pull/1946

Differential Revision: D4646489

Pulled By: ajkr

fbshipit-source-id: d8e0161
2017-03-02 17:54:15 -08:00
Islam AbdelRahman
be3e5568be Fix unaligned reads in read cache
Summary:
- Fix unaligned reads in read cache by using RandomAccessFileReader
- Allow read cache flags in db_bench
Closes https://github.com/facebook/rocksdb/pull/1916

Differential Revision: D4610885

Pulled By: IslamAbdelRahman

fbshipit-source-id: 2aa1dc8
2017-02-27 13:09:12 -08:00
Siying Dong
1ba2804b7f Remove XFunc tests
Summary:
Xfunc is hardly used. Remove it to keep the code simple.
Closes https://github.com/facebook/rocksdb/pull/1905

Differential Revision: D4603220

Pulled By: siying

fbshipit-source-id: 731f96d
2017-02-23 12:09:11 -08:00
Andrew Kryczka
ed50308d20 check backup directory exists before listing children
Summary:
InsertPathnameToSizeBytes() is called on shared/ and shared_checksum/ directories, which only exist for certain configurations. If we try to list a non-existent directory's contents, some Envs will dump an error message. Let's avoid this by checking whether the directory exists before listing its contents.
Closes https://github.com/facebook/rocksdb/pull/1895

Differential Revision: D4596301

Pulled By: ajkr

fbshipit-source-id: c809679
2017-02-23 10:54:10 -08:00
Giuseppe Ottaviano
4d7c06cedf Make WriteBatchWithIndex moveble
Summary:
`WriteBatchWithIndex` has an incorrect implicitly-generated move constructor (it will copy the pointer causing a double-free on destruction). Just switch to `unique_ptr` so we get correct move semantics for free.
Closes https://github.com/facebook/rocksdb/pull/1899

Differential Revision: D4598896

Pulled By: ajkr

fbshipit-source-id: 2373d47
2017-02-22 17:54:11 -08:00
Andrew Kryczka
5040414e6f Gracefully handle previous backup interrupted
Summary:
As the last step in backup creation, the .tmp directory is renamed omitting the .tmp suffix. In case the process terminates before this, the .tmp directory will be left behind. Even if this happens, we want future backups to succeed, so I added some checks/cleanup for this case.
Closes https://github.com/facebook/rocksdb/pull/1896

Differential Revision: D4597323

Pulled By: ajkr

fbshipit-source-id: 48900d8
2017-02-22 17:39:12 -08:00
Sagar Vemuri
eb912a927e Remove disableDataSync option
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859

Differential Revision: D4541292

Pulled By: sagar0

fbshipit-source-id: 5b3a6ca
2017-02-13 11:09:13 -08:00
Dmitri Smirnov
0a4cdde50a Windows thread
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
  an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread.  On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823

Differential Revision: D4492902

Pulled By: siying

fbshipit-source-id: c74cb11
2017-02-06 14:54:18 -08:00
Islam AbdelRahman
574b543f80 Rename merger.h -> merging_iterator.h
Summary:
merger.h was always a confusing name for me, simply give the file a better name
Closes https://github.com/facebook/rocksdb/pull/1836

Differential Revision: D4505357

Pulled By: IslamAbdelRahman

fbshipit-source-id: 07b28d8
2017-02-02 16:54:19 -08:00
Andrew Kryczka
17c1180603 Generalize Env registration framework
Summary:
The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config.

Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file.

This diff generalizes the Env registration logic to work with arbitrary types.

- Generalized registration and instantiation code by templating them
- The entire implementation is in a header file as that's Google style guide's recommendation for template definitions
- Pattern match with std::regex_match rather than checking prefix, which was the previous behavior
- Rename functions/files to be non-Env-specific
Closes https://github.com/facebook/rocksdb/pull/1776

Differential Revision: D4421933

Pulled By: ajkr

fbshipit-source-id: 34647d1
2017-01-25 16:09:14 -08:00
Hyeonseok Oh
f2b4939da4 fixed typo
Summary:
I fixed exisit -> exist
Closes https://github.com/facebook/rocksdb/pull/1799

Differential Revision: D4451466

Pulled By: yiwu-arbug

fbshipit-source-id: b447c3a
2017-01-23 12:54:13 -08:00
jsteemann
aebfd1703b fix non-portable behavior in encoder
Summary:
using ~0UL for mask uses a uint32_t at least in MSVC, but a uint64_t is required for it to work properly
Closes https://github.com/facebook/rocksdb/pull/1777

Differential Revision: D4444004

Pulled By: yiwu-arbug

fbshipit-source-id: 057cc42
2017-01-20 16:39:22 -08:00
Reid Horuff
5cf176ca15 Fix for 2PC causing WAL to grow too large
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*

To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768

Differential Revision: D4403265

Pulled By: reidHoruff

fbshipit-source-id: ce800ff
2017-01-19 15:39:12 -08:00
Aaron Gao
3e6899d116 change UseDirectIO() to use_direct_io()
Summary:
also change variable name `direct_io_` to `use_direct_io_` in WritableFile to make it consistent with read path.
Closes https://github.com/facebook/rocksdb/pull/1770

Differential Revision: D4416435

Pulled By: lightmark

fbshipit-source-id: 4143c53
2017-01-13 12:09:15 -08:00
Andrew Kryczka
fe395fb63d Allow incrementing refcount on cache handles
Summary:
Previously the only way to increment a handle's refcount was to invoke Lookup(), which (1) did hash table lookup to get cache handle, (2) incremented that handle's refcount. For a future DeleteRange optimization, I added a function, Ref(), for when the caller already has a cache handle and only needs to do (2).
Closes https://github.com/facebook/rocksdb/pull/1761

Differential Revision: D4397114

Pulled By: ajkr

fbshipit-source-id: 9addbe5
2017-01-10 16:54:20 -08:00
Sunpoet Po-Chuan Hsieh
2172b660eb Fix build on FreeBSD
Summary:
```
  CC       utilities/column_aware_encoding_exp.o
utilities/column_aware_encoding_exp.cc:149:5: error: use of undeclared identifier 'exit'
    exit(1);
    ^
utilities/column_aware_encoding_exp.cc:154:5: error: use of undeclared identifier 'exit'
    exit(1);
    ^
utilities/column_aware_encoding_exp.cc:158:5: error: use of undeclared identifier 'exit'
    exit(1);
    ^
3 errors generated.
```
Closes https://github.com/facebook/rocksdb/pull/1754

Differential Revision: D4399044

Pulled By: IslamAbdelRahman

fbshipit-source-id: fbab5a2
2017-01-10 11:39:12 -08:00
Dmitri Smirnov
3c233ca4ea Fix Windows environment issues
Summary:
Enable directIO on WritableFileImpl::Append
     with offset being current length of the file.
     Enable UniqueID tests on Windows, disable others but
     leeting them to compile. Unique tests are valuable to
     detect failures on different filesystems and upcoming
     ReFS.
     Clear output in WinEnv Getchildren.This is different from
     previous strategy, do not touch output on failure.
     Make sure DBTest.OpenWhenOpen works with windows error message
Closes https://github.com/facebook/rocksdb/pull/1746

Differential Revision: D4385681

Pulled By: IslamAbdelRahman

fbshipit-source-id: c07b702
2017-01-09 15:54:12 -08:00
Maysam Yabandeh
7631734563 Fix the error in ColumnFamiliesTest
Summary:
In the test the last change to AAAZZZ in handles[1] is deleting it. The
result of the get must be NotFound then. Previosuly the test did not
check for the return value of Get and assumed that the status is ok. It
then move ahead asserting the returned value. The passed-by-reference
string value however was not changed (since the key was not found) and
the asserted value is what it contained before doing the Get.
Closes https://github.com/facebook/rocksdb/pull/1753

Differential Revision: D4390982

Pulled By: maysamyabandeh

fbshipit-source-id: dd55a34
2017-01-09 14:09:13 -08:00
Siying Dong
60c509ff18 Fix valgrind failure in test CurrentFileModifiedWhileCheckpointing2PC
Summary:
Fix some memory leaks in the test. Also rename the test class name from DBTest to CheckpointTest to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/1752

Differential Revision: D4390355

Pulled By: siying

fbshipit-source-id: 0fa388a
2017-01-09 11:54:13 -08:00
Maysam Yabandeh
d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh
54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Andrew Kryczka
33c86d677f Fix backupable db test
Summary:
#1733 started using SizeFileBytes(), so our dummy log file implementation should stop asserting that this function isn't called.
Closes https://github.com/facebook/rocksdb/pull/1740

Differential Revision: D4376055

Pulled By: ajkr

fbshipit-source-id: 2854d89
2017-01-01 11:24:14 -08:00
Vincent Lee
e425ec1162 utilities/backupable: backup should limit the copy size of wal.
Summary:
Since the backup work as snapshot, we should only copy
 the bytes of the wal while we get the alive files.
Closes https://github.com/facebook/rocksdb/pull/1733

Differential Revision: D4373457

Pulled By: ajkr

fbshipit-source-id: 389318f
2016-12-31 10:54:20 -08:00
Siying Dong
17a4b75cc3 Always fsync the file after file copying
Summary:
File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large.
Closes https://github.com/facebook/rocksdb/pull/1728

Differential Revision: D4371083

Pulled By: siying

fbshipit-source-id: 579e14c
2016-12-28 19:09:16 -08:00
Siying Dong
438f22bc56 Fix bug of Checkpoint loses recent transactions with 2PC
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724

Differential Revision: D4368319

Pulled By: siying

fbshipit-source-id: cc2c746
2016-12-28 12:24:16 -08:00
Yi Wu
ab48c165a9 Print cache options to info log
Summary:
Improve cache options logging to info log.
Also print the value of
cache_index_and_filter_blocks_with_high_priority.
Closes https://github.com/facebook/rocksdb/pull/1709

Differential Revision: D4358776

Pulled By: yiwu-arbug

fbshipit-source-id: 8f030a0
2016-12-22 14:54:19 -08:00
Aaron Gao
972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Siying Dong
3d692822f8 persistent_cache: fix two timer
Summary:
In persistent_cache/block_cache_tier.cc, timers are never restarted, so the latency measured is not correct.
Closes https://github.com/facebook/rocksdb/pull/1707

Differential Revision: D4355828

Pulled By: siying

fbshipit-source-id: cd5f9e1
2016-12-21 13:39:16 -08:00