Commit Graph

584 Commits

Author SHA1 Message Date
Islam AbdelRahman
d062328977 Revert "Support SST files with Global sequence numbers"
This reverts commit ab01da5437.
2016-10-07 14:05:12 -07:00
Reid Horuff
37737c3a6b Expose Transaction State Publicly
Summary:
This exposes a transactions state through a public api rather than through a public member variable. I also do some name refactoring.
ExecutionStatus => TransactionState
exec_status_ => trx_state_

Test Plan: It compiles and transaction_test passes.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, mung, dhruba, sdong

Differential Revision: https://reviews.facebook.net/D64689
2016-10-07 11:58:53 -07:00
Reid Horuff
2c1f95291d Add facility to write only a portion of WriteBatch to WAL
Summary:
When constructing a write batch a client may now call MarkWalTerminationPoint() on that batch. No batch operations after this call will be added written to the WAL but will still be inserted into the Memtable. This facility is used to remove one of the three WriteImpl calls in 2PC transactions. This produces a ~1% perf improvement.

```
RocksDB - unoptimized 2pc, sync_binlog=1, disable_2pc=off
INFO 2016-08-31 14:30:38,814 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2619 seconds. Requests/second = 28628

RocksDB - optimized 2pc , sync_binlog=1, disable_2pc=off
INFO 2016-08-31 16:26:59,442 [main]: REQUEST PHASE COMPLETED. 75000000 requests done in 2581 seconds. Requests/second = 29054
```

Test Plan: Two unit tests added.

Reviewers: sdong, yiwu, IslamAbdelRahman

Reviewed By: yiwu

Subscribers: hermanlee4, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D64599
2016-10-07 11:32:10 -07:00
Islam AbdelRahman
ab01da5437 Support SST files with Global sequence numbers
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file

Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks

Test Plan: unit tests

Reviewers: andrewkr, yhchiang, yiwu, sdong

Reviewed By: sdong

Subscribers: hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62523
2016-10-03 16:12:39 -07:00
krad
e91b4d0cf6 Add factory method for creating persistent cache that is accessible from public
Summary:
Currently there is no mechanism to create persistent cache from
headers. Adding a simple factory method to create a simple persistent cache with
default or NVM optimized settings.

note: Any idea to test this factory is appreciated.

Test Plan: None

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64527
2016-10-03 10:55:46 -07:00
Manuel Ung
be1f1092c9 Expose transaction id, lock state information and transaction wait information
Summary:
This diff does 3 things:

Expose TransactionID so that we can identify transactions when we retrieve locking and lock wait information. This is exposed as `Transaction::GetID`.

Expose lock state information by locking all stripes in all column families and copying their contents to a data structure. This is exposed as `TransactionDB::GetLockStatusData`.

Adds support for tracking the transaction and the key being waited on, and exposes this as `Transaction::GetWaitingTxn`.

Test Plan: unit tests

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: vasilep, hermanlee4, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D64413
2016-09-30 11:41:21 -07:00
Kefu Chai
21f4bb5a89 cmake support for linux and osx (#1358)
* enable cmake to work on linux and osx also

* port part of build_detect_platform not covered by thirdparty.inc
  to cmake.
  - detect fallocate()
  - detect malloc_usable_size()
  - detect JeMalloc
  - detect snappy
* check for asan,tsan,ubsan
* create 'build_version.cc' in build directory.
* add `check` target to support 'make check'.
* add `tools` target to match its counterpart in Makefile.
* use `date` on non-win32 platforms.
* pass different cflags on non-win32 platforms
* detect pthead library using FindThread cmake module.
* enable CMP0042 to silence the cmake warning on osx
* reorder the linked libraries. because testutillib references gtest, to
  enable the linker to find the referenced symbols, we need to put gtest
  after testutillib.

Signed-off-by: Marcus Watts <mwatts@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>

* hash_table_bench.cc: fix build without gflags

Signed-off-by: Kefu Chai <kchai@redhat.com>

* remove gtest from librocksdb linkage

testharness.cc is included in librocksdb sources, and it uses gtest. but
gtest is not supposed to be part of the public API of librocksdb. so, in
this change, the testharness.cc is moved out out librocksdb, and is
built as an object target, then linked with the tools and tests instead.

Signed-off-by: Marcus Watts <mwatts@redhat.com>
Signed-off-by: Kefu Chai <kchai@redhat.com>
2016-09-28 11:53:15 -07:00
wenduo
4defe306fb fix typo in comments (#1360)
* fix typo in option.h's comment

* fix typo in checkpoint's comment
2016-09-27 20:39:15 -07:00
Aaron Gao
f517d9dd09 Add SeekForPrev() to Iterator
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()

Also add tests in db_iter_test and db_iterator_test

Pass all tests
Cheers!

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64149
2016-09-27 18:20:57 -07:00
Yi Wu
17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
sdong
607628d349 Support ZSTD with finalized format
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.

Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63141
2016-09-06 12:22:16 -07:00
Yi Wu
a88677d2cf Remove ImmutableCFOptions from public API
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63177
2016-09-02 14:16:31 -07:00
sdong
32149059f9 Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.

Test Plan: Add two new unit tests. Run all existing tests, including jtest.

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59829
2016-09-01 14:33:24 -07:00
Aaron Gao
4590b53a4b add stats to Cache::LookUp()
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.

Test Plan: make all check

Reviewers: andrewkr, sdong, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D62193
2016-09-01 13:50:39 -07:00
Kien-hung Li
380e651afc Fix Mac build failure (#1309) 2016-08-31 15:03:10 -07:00
Yi Wu
72f8cc703c LRU cache mid-point insertion
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.

Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.

In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.

Test Plan: unit test.

Reviewers: IslamAbdelRahman, sdong, lightmark

Reviewed By: lightmark

Subscribers: andrewkr, dhruba, march, leveldb

Differential Revision: https://reviews.facebook.net/D61977
2016-08-19 16:43:31 -07:00
Wanning Jiang
78837f5d61 TableBuilder / TableReader support for range deletion
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader

Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )

Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61473
2016-08-19 15:10:31 -07:00
Andrew Kryczka
3771e37970 WriteBatch support for range deletion
Summary:
Add API to WriteBatch to store range deletions in its buffer
which are later added to memtable. In the WriteBatch buffer, a range
deletion is encoded as "<optype><CF ID (optional)><begin key><end key>".

With this diff, the range tombstones are stored inline with the data in
the memtable. It's useful for now because the test cases rely on the
data being accessible via memtable. My next step is to store range
tombstones in a separate area in the memtable.

Test Plan: unit tests

Reviewers: IslamAbdelRahman, sdong, wanning

Reviewed By: wanning

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61401
2016-08-16 08:16:04 -07:00
Aaron Gao
76a67cf741 support stackableDB as the baseDB of transactionDB
Summary: make transactionDB working with StackableDB

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60705
2016-08-11 14:19:33 -07:00
krad
67c1ae8831 Travis build break fix
Summary:
Travis times out with persistent_cache_test, there is no way to increase the
timeout, so lower the test case for Travis

Test Plan: Run on travis and sandcastle

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D61863
2016-08-11 13:02:32 -07:00
krad
87c91bd876 Persistent Read Cache (8) Benchmark tooling
Summary:
Adding benchmark tool for persistent read cache.

TODO: Add integration to db_bench

Test Plan: Compile

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57711
2016-08-10 17:49:22 -07:00
Aaron Gao
2914de64e8 add sim_cache stats to Statistics
Summary:
add SIM_BLOCK_CACHE_HIT and SIM_BLOCK_CACHE_MISS tickers.
maybe can be combined with Histograms like DB_GET to evaluate the current setting of the size of block cache.

Test Plan: make all check

Reviewers: sdong, andrewkr, IslamAbdelRahman, yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61803
2016-08-10 17:42:24 -07:00
sdong
8b79422b52 [Proof-Of-Concept] RocksDB Blob Storage with a blob log file.
Summary:
This is a proof of concept of a RocksDB blob log file. The actual value of the Put() is appended to a blob log using normal data block format, and the handle of the block is written as the value of the key in RocksDB.

The prototype only supports Put() and Get(). It doesn't support DB restart, garbage collection, Write() call, iterator, snapshots, etc.

Test Plan: Add unit tests.

Reviewers: arahut

Reviewed By: arahut

Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61485
2016-08-10 17:05:17 -07:00
krad
59ddb50591 Fix travis build break
Summary:
Travis provides a very low capability resource for testing and the
persistent cache tests are too aggressive. It causes the kernel to not have
enough time to do file-system meta data sync operation, causing open file
descriptors to increase and available space to decrease.

This change skips certain tests

Test Plan: Run on travis

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D61755
2016-08-08 18:04:45 -07:00
omegaga
44f5cc57a5 Add time series database (resubmitted)
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).

Test Plan: Add `date_tiered_test`.

Reviewers: dhruba, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61653
2016-08-05 15:56:22 -07:00
sdong
7c4615cf1f A utility function to help users migrate DB after options change
Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options.

Test Plan: Add unit tests for it.

Reviewers: andrewkr, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60783
2016-08-05 15:39:55 -07:00
omegaga
8234faabfd Fix failed test
Summary: Fix failed persistent cache test introduced by recent PR to support cross-platform build.

Test Plan: Run persistent_cache_test (on multiple platforms)

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61467
2016-08-03 14:23:21 -07:00
omegaga
7323e4c8ab Fix clang on macOS
Summary: Fix problem when casting `size_t` to `uint64_t`. Avoid usage of `size_t` when possible.

Test Plan: Pass related test on Travis CI.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61443
2016-08-03 10:41:27 -07:00
Dmitri Smirnov
e4609a749b Fix Windows build issues (#1253) 2016-08-02 17:15:18 -07:00
omegaga
2306167d30 Fix clang build failure and refactor unit test
Summary: <endian.h> is not platform independent. Switch to our own endianness transformation function instead.

Test Plan: Pass Travis CI. Refactor tests to make sure endianness transformation runs properly.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61389
2016-08-02 15:16:39 -07:00
Jay Edgar
cdc4eb6892 Add a GetComparator() function to the ColumnFamilyHandle base class so that the user's comparator can be retrieved.
Summary: MyRocks is adding support for the user of the SstFileWriter which needs a comparator.  It would be more convenient to get the comparator from the column family (which already has to have it) than to have caller keep track of it.

Test Plan: Standard tests (adding one for the new method)

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61155
2016-08-02 14:34:57 -07:00
krad
712dd27e67 Build break fixes
Summary: Regex support for c++ is very inconsistent across compilers, converting
the logic to simple string manipulation.

Test Plan: Local test

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61377
2016-08-02 10:30:29 -07:00
krad
726c2f7e5b Build break fix
Summary:
Fixing build break on Mac

(1) uint64_t fix
(2) O_DIRECT works only for Linux

Test Plan: Build and test on Mac and Unix

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61353
2016-08-01 15:24:13 -07:00
omegaga
d51dc96a79 Experiments on column-aware encodings
Summary:
Experiments on column-aware encodings. Supported features: 1) extract data blocks from SST file and encode with specified encodings; 2) Decode encoded data back into row format; 3) Directly extract data blocks and write in row format (without prefix encoding); 4) Get column distribution statistics for column format; 5) Dump data blocks separated by columns in human-readable format.

There is still on-going work on this diff. More refactoring is necessary.

Test Plan: Wrote tests in `column_aware_encoding_test.cc`. More tests should be added.

Reviewers: sdong

Reviewed By: sdong

Subscribers: arahut, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60027
2016-08-01 14:50:19 -07:00
krad
c116b47804 Persistent Read Cache (part 6) Block Cache Tier Implementation
Summary:
The patch is a continuation of part 5. It glues the abstraction for
file layout and metadata, and flush out the implementation of the API. It
adds unit tests for the implementation.

Test Plan: Run unit tests

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57549
2016-08-01 14:15:14 -07:00
ryneli
663afef884 Add EnvLibrados - RocksDB Env of RADOS (#1222)
EnvLibrados is a customized RocksDB Env to use RADOS as the backend file system of RocksDB. It overrides all file system related API of default Env. The easiest way to use it is just like following:

	std::string db_name = "test_db";
	std::string config_path = "path/to/ceph/config";
	DB* db;
	Options options;
	options.env = EnvLibrados(db_name, config_path);
	Status s = DB::Open(options, kDBPath, &db);

Then EnvLibrados will forward all file read/write operation to the RADOS cluster assigned by config_path. Default pool is db_name+"_pool".

There are some options that users could set for EnvLibrados.
- write_buffer_size. This variable is the max buffer size for WritableFile. After reaching the buffer_max_size, EnvLibrados will sync buffer content to RADOS, then clear buffer.
- db_pool. Rather than using default pool, users could set their own db pool name
- wal_dir. The dir for WAL files. Because RocksDB only has 2-level structure (dir_name/file_name), the format of wal_dir is "/dir_name"(CAN'T be "/dir1/dir2"). Default wal_dir is "/wal".
- wal_pool. Corresponding pool name for WAL files. Default value is db_name+"_wal_pool"

The example of setting options looks like following:

	db_name = "test_db";
	db_pool = db_name+"_pool";
	wal_dir = "/wal";
	wal_pool = db_name+"_wal_pool";
	write_buffer_size = 1 << 20;
	env_ = new EnvLibrados(db_name, config, db_pool, wal_dir, wal_pool, write_buffer_size);

	DB* db;
	Options options;
	options.env = env_;
	// The last level dir name should match the dir name in prefix_pool_map
	options.wal_dir = "/tmp/wal";

	// open DB
	Status s = DB::Open(options, kDBPath, &db);

Librados is required to compile EnvLibrados. Then use "$make LIBRADOS=1" to compile RocksDB. If you want to only compile EnvLibrados test, just run "$ make env_librados_test LIBRADOS=1". To run env_librados_test, you need to have a running RADOS cluster with the configure file located in "../ceph/src/ceph.conf" related to "rocksdb/".
2016-07-21 11:16:34 -07:00
John Alexander
9ab38c45ad Remove %z Format Specifier and Fix Windows Build of sim_cache.cc (#1224)
* Replace %zu format specifier with Windows-compatible macro 'ROCKSDB_PRIszt'

* Added "port/port.h" include to sim_cache.cc for call to snprintf().

* Applied cleaner fix to windows build, reverting part of 7bedd94
2016-07-20 15:28:04 -07:00
omegaga
e70020e4f6 Only cache level 0 indexes and filter when opening table reader
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).

Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59913
2016-07-20 11:23:31 -07:00
krad
7bedd94406 Build break fix
Summary:
(1) Integer size correction (mac build break)
(2) snprint usage in Windows (windows build break)

Test Plan: Build in windows and mac

Reviewers: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60927
2016-07-20 10:56:43 -07:00
Islam AbdelRahman
68a8e6b8fa Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>

This diff is stacked on top of D56493 and D56511

In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand

```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s

[master]
readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```

```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000

[FullMergeV2]
readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s

[master]
readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s

```

```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]

[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s

[master]
readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```

```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]

DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions

[FullMergeV2]
readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s

[master]
readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: yhchiang, andrewkr, sdong

Reviewed By: sdong

Subscribers: lovro, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57075
2016-07-20 09:49:03 -07:00
krad
d9cfaa2b16 Persistent Read Cache (6) Persistent cache tier implentation - File layout
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.

Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.

We have specifically tested and adapted the IO path for NVM and SSD.

Persistent cache tier consists of there parts :

1) File layout

Provides the implementation for handling IO path for reading and writing data
(key/value pair).

2) Meta-data
Provides the implementation for handling the index for persistent read cache.

3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface

This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.

Test Plan: Compile and run check

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57117
2016-07-19 12:01:46 -07:00
Yi Wu
4b95253587 Refactor cache.cc
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.

Test Plan:
    make check -j64

Reviewers: lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59655
2016-07-15 10:41:36 -07:00
Yi Wu
296545a2c7 Fix clang analyzer errors
Summary:
Fixing erros reported by clang static analyzer.
* Removing some unused variables.
* Adding assertions to fix false positives reported by clang analyzer.
* Adding `__clang_analyzer__` macro to suppress false positive warnings.

Test Plan:
    USE_CLANG=1 OPT=-g make analyze -j64

Reviewers: andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60549
2016-07-08 17:50:51 -07:00
Aaron Gao
7c190070b4 delete unnessary pointer cast in beginInternalTransaction() function
Summary: use of dynamic_cast<TransactionImpl*> is unnecessary and also introduce difficulty for fbrocksdb support of TransactionDB

Test Plan: ./transaction_test

Reviewers: sdong, IslamAbdelRahman, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60501
2016-07-07 16:34:41 -07:00
Zun Wang
eb53c05a35 Add comment for GetBackupInfo about returned BackupInfos order
Summary: #title

Test Plan: n/a

Reviewers: uddipta, ldemailly, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60393
2016-07-06 12:54:01 -07:00
Andrew Kryczka
1a11c934d2 Disable some persistent cache tests on linux/clang 2016-07-04 15:42:33 -07:00
Wanning Jiang
95d96eeeb1 remove LockFile
Summary: LockFile is unnecessary in unit test

Test Plan: env_basic_test.cc

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60285
2016-07-01 15:28:08 -07:00
Wanning Jiang
ff45d1b547 if read only backup engine can't find meta dirs, return NotFound() instead of IOError()
Summary: Read only backup engine return NotFound() on missing meta dir (for e2e test)

Test Plan: backupable_db_test

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60273
2016-07-01 11:36:33 -07:00
Wanning Jiang
95c192475d writable file close before reset
Summary: during backup, writable file should call close() before reset()

Test Plan: backupable_db_test.cc

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60195
2016-06-29 15:33:27 -07:00
Reid Horuff
892e9d3047 make transaction WriteOptions modifiable 2016-06-27 12:53:30 -07:00
charsyam
4f2b0946d1 fix simple typos (#1183) 2016-06-25 08:29:40 +01:00
Islam AbdelRahman
7c919deccc Reuse TimedFullMerge instead of FullMerge + instrumentation
Summary:
We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code
This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge

Test Plan: COMPILE_WITH_ASAN=1 make check -j64

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59577
2016-06-13 16:17:26 -07:00
krad
81f6b33d93 Fix tsan error
Summary:
The tsan error was because the random implementation we have is not
thread safe, using Random::GetTLSInstance

Test Plan: Run tests in Linux

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59559
2016-06-13 12:59:38 -07:00
Wanning Jiang
56887f6cb8 Backup Options
Summary: Backup options file to private directory

Test Plan:
backupable_db_test.cc, BackupOptions
	   Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D59373
2016-06-09 19:03:10 -07:00
krad
5b197bff41 Enabled Windows build for volatile tier implementation
Summary: Enabled build in Windows and corresponding fixes

Test Plan:
Compile and run persistent_cache_test in Windows and make check in
Linux

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59307
2016-06-09 11:10:35 -07:00
Islam AbdelRahman
281fbdddc2 Temporarily remove PersistentCacheOptions from persistent_cache_tier.h to fix unity build
Summary:
PersistentCacheOptions class in persistent_cache_tier.h is not used any where yet in the code base
but it break the unity build because it have the same name as PersistentCacheOptions in table/persistent_cache_helper.h

Remove it temporarily, and the @krad can add it again with a different name when we start using it

Test Plan:
make unity_test -j64
make check -j64

Reviewers: kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59349
2016-06-09 10:39:42 -07:00
Islam AbdelRahman
8ff59b2b40 Disable PersistentCacheTierTest.VolatileCacheInsertWithEviction test under TSAN temporarily
Summary: Disable the test under TSAN temporary to temporarily the build

Test Plan: run the test under TSAN

Reviewers: kradhakrishnan, andrewkr, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59337
2016-06-08 12:42:33 -07:00
sdong
098da83483 Fix CLANG build break caused by the recent Persistent Cache change
Test Plan: USE_CLANG=1 make all check -j32

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59313
2016-06-07 17:05:12 -07:00
krad
d755c62f92 Persistent Read Cache (5) Volatile cache tier implementation
Summary:
This provides provides an implementation of PersistentCacheTier that is
specialized for RAM. This tier does not persist data though.

Why do we need this tier ?

This is ideal as tier 0. This tier can host data that is too hot.

Why can't we use Cache variants ?

Yes you can use them instead. This tier can potentially outperform BlockCache
in RAW mode by virtue of compression and compressed cache in block cache doesn't
seem very popular. Potentially this tier can be modified to under stand the
disadvantage of the tier below and retain data that the tier below is bad at
handling (for example index and bloom data that is huge in size)

Test Plan: Run unit tests added

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57069
2016-06-07 11:10:44 -07:00
krad
0fee896841 Fix Windows build
Summary:
Fix 2 issues that was breaking Windows build
1) double to size_t potential downcast warning
2) port_posix is not ready for windows, avoiding building hash_table_bench to
avoid build break

Test Plan: compile in Windoes and make check

Reviewers: sdong, andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59265
2016-06-06 18:38:12 -07:00
krad
3070ed9021 Persistent Read Cache (4) Interface definitions
Summary:
This diff provides the basic interface definitions of persistent read
cache system

PersistentCacheOptions captures the persistent read cache options used to
configure and control the system
PersistentCacheTier provides the basic building block for constructing tiered
cache
PersistentTieredCache provides a logical abstraction of tiers of cache layered
over one another

Test Plan: Compile

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57051
2016-06-06 13:17:09 -07:00
Uddipta Maity
1147e5b05a Adding support for sharing throttler between multiple backup and restores
Summary:
Rocksdb backup and restore rate limiting is currently done per backup/restore.
So, it is difficult to control rate across multiple backup/restores. With this
change, a throttler can be provided. If a throttler is provided, it is used.
Otherwise, a new throttler is created based on the actual rate limits specified
in the options.

Test Plan: Added unit tests

Reviewers: ldemailly, andrewkr, sdong

Reviewed By: andrewkr

Subscribers: igor, yiwu, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56265
2016-06-03 17:02:07 -07:00
Andrew Kryczka
af0c9ac01d Env registry for URI-based Env selection [pluggable Env part 1]
Summary:
This enables configurable Envs without recompiling. For example, my
next diff will make env_test test an Env created by NewEnvFromUri(). Then,
users can determine which Env is tested simply by providing the URI for
NewEnvFromUri() (e.g., through a CLI argument or environment variable).

The registration process allows us to register any Env that is linked with the
RocksDB library, so we can register our internal Envs as well.

The registration code is inspired by our internal InitRegistry.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, lightmark, ldemailly, sdong

Reviewed By: sdong

Subscribers: leveldb, dhruba, andrewkr

Differential Revision: https://reviews.facebook.net/D58449
2016-06-03 08:15:16 -07:00
Aaron Gao
630b732cb3 fix flaky sim_cache_test
Summary: fix flaky test

Test Plan: `make all check`

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59157
2016-06-02 18:39:25 -07:00
krad
62d5480984 Add persistent cache to Windows build system
Summary: Add hash table (under persistent cache) to CMake list

Test Plan: Run hash_test in windows and make check in Linux

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59151
2016-06-02 17:48:25 -07:00
Zuoyan Qin
590e2617ee fix delete file bug when do checkpoint (#1138) 2016-05-29 14:26:55 -07:00
Aaron Gao
5d660258e7 add simulator Cache as class SimCache/SimLRUCache(with test)
Summary: add class SimCache(base class with instrumentation api) and SimLRUCache(derived class with detailed implementation) which is used as an instrumented block cache that can predict hit rate for different cache size

Test Plan:
Add a test case in `db_block_cache_test.cc` called `SimCacheTest` to test basic logic of SimCache.
Also add option `-simcache_size` in db_bench. if set with a value other than -1, then the benchmark will use this value as the size of the simulator cache and finally output the simulation result.
```
[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 1000000
RocksDB:    version 4.8
Date:       Tue May 17 16:56:16 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       6.809 micros/op 146874 ops/sec;   16.2 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.343 micros/op 157665 ops/sec;   17.4 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 986559
SimCache HITs:    264760
SimCache HITRATE: 26.84%

[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 10000000
RocksDB:    version 4.8
Date:       Tue May 17 16:57:10 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       5.066 micros/op 197394 ops/sec;   21.8 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.457 micros/op 154870 ops/sec;   17.1 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 1059764
SimCache HITs:    374501
SimCache HITRATE: 35.34%

[gzh@dev9927.prn1 ~/local/rocksdb] ./db_bench -benchmarks "fillseq,readrandom" -cache_size 1000000 -simcache_size 100000000
RocksDB:    version 4.8
Date:       Tue May 17 16:57:32 2016
CPU:        32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
CPUCache:   20480 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 0
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       5.632 micros/op 177572 ops/sec;   19.6 MB/s
DB path: [/tmp/rocksdbtest-112628/dbbench]
readrandom   :       6.892 micros/op 145094 ops/sec;   16.1 MB/s (1000000 of 1000000 found)

SIMULATOR CACHE STATISTICS:
SimCache LOOKUPs: 1150767
SimCache HITs:    1034535
SimCache HITRATE: 89.90%
```

Reviewers: IslamAbdelRahman, andrewkr, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57999
2016-05-23 23:35:23 -07:00
krad
4e7e41ba77 Disable lite build/testing for persistent read cache
Summary:
Persistent read cache isn't very applicable for lite builds. Wrapping
the code with #ifndef ROCKSDB_LITE .. #endif

Test Plan: Run unit, lite, lite_test

Reviewers: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58563
2016-05-22 21:39:17 -07:00
sdong
0e77246ba9 backupable_db.cc: lambada to explictly caputre "this" when escaping scope
Summary:
Google C++ Style writes: In particular, prefer to write lambda captures explicitly when capturing this or if the lambda will escape the current scope.
Here it is the case for both.

Test Plan: Run all test suites.

Reviewers: andrewkr, dhruba

Reviewed By: andrewkr, dhruba

Subscribers: yhchiang, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58515
2016-05-20 10:16:49 -07:00
Aaron Orenstein
2073cf3775 Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions.
Summary: Reduce use of argument-dependent name lookup in RocksDB.

Test Plan: 'make check' passed.

Reviewers: andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58203
2016-05-20 07:42:18 -07:00
Islam AbdelRahman
1f2dca0eaa Add MaxOperator to utilities/merge_operators/
Summary:
Introduce MaxOperator a simple merge operator that return the max of all operands.
This merge operand help me in benchmarking

Test Plan: Add new unitttests

Reviewers: sdong, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57873
2016-05-19 15:51:29 -07:00
Islam AbdelRahman
05c5c39a7c Fix build 2016-05-18 00:41:14 -07:00
Reid Horuff
a6254f2bd4 Long outstanding prepare test
Summary: This tests that a prepared transaction is not lost after several crashes, restarts, and memtable flushes.

Test Plan: TwoPhaseLongPrepareTest

Reviewers: sdong

Subscribers: hermanlee4, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58185
2016-05-17 18:57:06 -07:00
Islam AbdelRahman
2ead115116 Fix TransactionTest.TwoPhaseMultiThreadTest under TSAN
Summary:
TransactionTest.TwoPhaseMultiThreadTest runs forever under TSAN and our CI builds time out
looks like the reason is that some threads keep running and other threads dont get a chance to increment the counter

Test Plan: run the test under TSAN

Reviewers: sdong, horuff

Reviewed By: horuff

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58359
2016-05-17 18:54:27 -07:00
krad
1f0142ce19 Persistent Read Cache (Part 2) Data structure for building persistent read cache index
Summary:
We expect the persistent read cache to perform at speeds upto 8 GB/s. In order
to accomplish that, we need build a index mechanism which operate in the order
of multiple millions per sec rate.

This patch provide the basic data structure to accomplish that:

(1) Hash table implementation with lock contention spread
    It is based on the StripedHashSet<T> implementation in
    The Art of multiprocessor programming by Maurice Henry & Nir Shavit
(2) LRU implementation
    Place holder algorithm for further optimizing
(3) Evictable Hash Table implementation
    Building block for building index data structure that evicts data like files
    etc

TODO:
(1) Figure if the sharded hash table and LRU can be used instead
(2) Figure if we need to support configurable eviction algorithm for
EvictableHashTable

Test Plan: Run unit tests

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55785
2016-05-17 13:18:47 -07:00
Islam AbdelRahman
f6aedb62c0 Fix Transaction memory leak
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest

Test Plan: Run transaction_test under valgrind

Reviewers: sdong, andrewkr, yhchiang, horuff

Reviewed By: horuff

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58263
2016-05-16 16:32:55 -07:00
Reid Horuff
40123b3805 signed vs unsigned comparison fix 2016-05-11 14:22:43 -07:00
Islam AbdelRahman
c1af07ce84 Disable backupable_db_test.cc on Windows
Summary: Disable backupable_db_test.cc on Windows since EnvChroot is not supported

Test Plan: check ROCKSDB_LITE

Reviewers: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D58035
2016-05-11 12:25:32 -07:00
Andrew Kryczka
e61ba052b3 Isolate db env and backup Env in unit tests
Summary:
- Used ChrootEnv so the database and backup Envs are isolated in the filesystem.
- Removed DifferentEnvs test since now every test uses different Envs

Depends on D57543

Test Plan:
- ran backupable_db_test
- verified backupable_db_test now catches the bug when D57159 is backed out (this bug previously passed through the test cases, which motivated this change)

Reviewers: sdong, lightmark, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57615
2016-05-11 08:18:44 -07:00
Reid Horuff
c27061dae7 [rocksdb] 2PC double recovery bug fix
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost

This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.

The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.

Test Plan: Provided test.

Reviewers: sdong

Subscribers: andrewkr, hermanlee4, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57729
2016-05-10 14:06:07 -07:00
Reid Horuff
a657ee9a9c [rocksdb] Recovery path sequence miscount fix
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]

The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.

Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.

We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))

So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?

Test Plan: provided test.

Reviewers: sdong

Subscribers: andrewkr, leveldb, dhruba, hermanlee4

Differential Revision: https://reviews.facebook.net/D57645
2016-05-10 14:06:07 -07:00
Reid Horuff
8a66c85e90 [rocksdb] Two Phase Transaction
Summary:
Two Phase Commit addition to RocksDB.

See wiki: https://github.com/facebook/rocksdb/wiki/Two-Phase-Commit-Implementation
Quip: https://fb.quip.com/pxZrAyrx53r3

Depends on:
WriteBatch modification: https://reviews.facebook.net/D54093
Memtable Log Referencing and Prepared Batch Recovery: https://reviews.facebook.net/D56919

Test Plan:
- SimpleTwoPhaseTransactionTest
- PersistentTwoPhaseTransactionTest.
- TwoPhaseRollbackTest
- TwoPhaseMultiThreadTest
- TwoPhaseLogRollingTest
- TwoPhaseEmptyWriteTest
- TwoPhaseExpirationTest

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: leveldb, hermanlee4, andrewkr, vasilep, dhruba, santoshb

Differential Revision: https://reviews.facebook.net/D56925
2016-05-10 14:06:07 -07:00
Reid Horuff
0460e9dcce Modification of WriteBatch to support two phase commit
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.

Test Plan: single unit test in write_batch_test.

Reviewers: hermanlee4, sdong, anthony

Subscribers: leveldb, dhruba, vasilep, andrewkr

Differential Revision: https://reviews.facebook.net/D57867
2016-05-10 14:06:07 -07:00
Andrew Kryczka
269f6b2e2d Revert "Modification of WriteBatch to support two phase commit"
Summary: Revert D54093 and D57453

Test Plan: running make check

Reviewers: horuff, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57819
2016-05-06 16:58:24 -07:00
Reid Horuff
6e56a114be Modification of WriteBatch to support two phase commit
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.

Test Plan: single unit test in write_batch_test.

Reviewers: hermanlee4, sdong, anthony

Subscribers: andrewkr, vasilep, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54093
2016-04-29 11:50:30 -07:00
Li Peng
6d4832a998 Merge pull request #1101 from flyd1005/wip-fix-typo
fix typos and remove duplicated words
2016-04-28 02:30:44 -07:00
Islam AbdelRahman
eb73980853 Fix BackupableDBTest
Summary: Fix BackupableDBTest.NoDoubleCopy and BackupableDBTest.DifferentEnvs by mocking the db files in db_env instead of backup_env_

Test Plan: make check -j64

Reviewers: sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D57273
2016-04-26 16:07:04 -07:00
Andrew Kryczka
1995e34d6a Retrieve file size from proper Env
Summary:
When db_env_ != backup_env_, InsertPathnameToSizeBytes() would
use the wrong Env during backup creation. This happened because this function
used backup_env_ instead of db_env_ to get WAL/data file sizes.

This diff adds an argument to InsertPathnameToSizeBytes() indicating which Env
to use.

Test Plan: ran @anirbanb's BackupTestTool

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57159
2016-04-26 12:33:30 -07:00
dx9
b71c4e613f Alpine Linux Build (#990)
* Musl libc does not provide adaptive mutex. Added feature test for PTHREAD_MUTEX_ADAPTIVE_NP.

* Musl libc does not provide backtrace(3). Added a feature check for backtrace(3).

* Fixed compiler error.

* Musl libc does not implement backtrace(3). Added platform check for libexecinfo.

* Alpine does not appear to support gcc -pg option. By default (gcc has PIE option enabled) it fails with:

gcc: error: -pie and -pg|p|profile are incompatible when linking

When -fno-PIE and -nopie are used it fails with:

/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find gcrt1.o: No such file or directory

Added gcc -pg platform test and output PROFILING_FLAGS accordingly. Replaced pg var in Makefile with PROFILING_FLAGS.

* fix segfault when TEST_IOCTL_FRIENDLY_TMPDIR is undefined and default candidates are not suitable

* use ASSERT_DOUBLE_EQ instead of ASSERT_EQ

* When compiled with ROCKSDB_MALLOC_USABLE_SIZE UniversalCompactionFourPaths and UniversalCompactionSecondPathRatio tests fail due to premature memtable flushes on systems with 16-byte alignment. Arena runs out of block space before GenerateNewFile() completes.

Increased options.write_buffer_size.
2016-04-22 16:49:12 -07:00
Andrew Kryczka
40b840f294 Delete deprecated *BackupableDB interface for backups
Summary:
This interface is redundant and has been deprecated for a while.
It's also unused internally. Let's delete it.

I moved the comments to the corresponding functions in BackupEngine/
BackupEngineReadOnly. This caused the diff tool to not work cleanly.

Test Plan:
unit tests

  $ ./backupable_db_test

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56331
2016-04-18 09:04:14 -07:00
sdong
cea8ed9702 Fix backupable_db_test test cases that can't run by itself
Summary:
Several of backupable_db_test fails if running standalone, because of directory missing. Fix it by:
(1) garbage collector skips shared directory if it doesn't exit
(2) BackupableDBTest.Issue921Test to create the parent directory of the backup directory fist.

Test Plan: Run the tests individually and make sure they pass

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56829
2016-04-15 15:48:57 -07:00
Andrew Kryczka
114a1b8792 Fix build errors for windows
Summary:
- Need to use unsigned long long for 64-bit literals on windows
- Need size_t for backup meta-file length since clang doesn't let us assign size_t to int

Test Plan: backupable_db_test and options_test

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56391
2016-04-08 13:09:19 -07:00
flabby
ada88b63f5 fix wrong assignment of level0_stop_writes_trigger in spatialdb (#1061) 2016-04-07 09:02:28 -07:00
Islam AbdelRahman
d02eb8d00b Fix unused variable warning
Summary:
My last diff introduced a warning when compiling under release mode
https://reviews.facebook.net/D55539

fix the warning

Test Plan:
DEBUG_LEVEL=0 make db_bench
make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56295
2016-04-04 11:10:46 -07:00
Islam AbdelRahman
f38540b12a WriteBatchWithIndex micro optimization
Summary:
  - Put key offset and key size in WriteBatchIndexEntry
  - Use vector for comparators in WriteBatchEntryComparator

I use a slightly modified version of @yoshinorim code to benchmark
https://gist.github.com/IslamAbdelRahman/b120f4fba8d6ff7d58d2

For Put I create a transaction that put a 1000000 keys and measure the time spent without commit.
For GetForUpdate I read the keys that I added in the Put transaction.

Original time:

```
 rm -rf /dev/shm/rocksdb-example/
 ./txn_bench put 1000000
 1000000 OK Ops | took      3.679 seconds
 ./txn_bench get_for_update 1000000
 1000000 OK Ops | took      3.940 seconds
```

New Time

```
  rm -rf /dev/shm/rocksdb-example/
 ./txn_bench put 1000000
 1000000 OK Ops | took      2.727 seconds
 ./txn_bench get_for_update 1000000
 1000000 OK Ops | took      3.880 seconds
```

It looks like there is no significant improvement in GetForUpdate() but we can see ~30% improvement in Put()

Test Plan: unittests

Reviewers: yhchiang, anthony, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D55539
2016-04-01 15:23:46 -07:00
Uddipta Maity
b55e2165be Rocksdb backup can store optional application specific metadata
Summary:
Rocksdb backup engine maintains metadata about backups in separate files. But,
there was no way to add extra application specific data to it. Adding support
for that.
In some use cases, applications decide to restore a backup based on some
metadata. This will help those cases to cheaply decide whether to restore or
not.

Test Plan:
Added a unit test. Existing ones are passing

Sample meta file for BinaryMetadata test-

```

1459454043
0
metadata 6162630A64656600676869
2
private/1/MANIFEST-000001 crc32 1184723444
private/1/CURRENT crc32 3505765120

```

Reviewers: sdong, ldemailly, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, ldemailly

Differential Revision: https://reviews.facebook.net/D56007
2016-04-01 10:56:52 -07:00
Andrew Kryczka
72224104d3 Forge current file for checkpoint
Summary:
This fixes a similar issue as D54711: "CURRENT" file can mutate between
GetLiveFiles() and copy to the tmp directory, in which case it would reference
the wrong manifest filename. To fix this, I forge the "CURRENT" file such that
it simply contains the filename for the manifest returned by GetLiveFiles().

- Changed CreateCheckpoint() to forge current file
- Added CreateFile() utility function
- Added test case that rolls manifest during checkpoint creation

Test Plan:
  $ ./checkpoint_test

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55065
2016-03-17 10:07:21 -07:00
SherlockNoMad
f11b0df121 Fix AppVeyor build error 2016-03-15 10:57:33 -07:00
agiardullo
790252805d Add multithreaded transaction test
Summary: Refactored db_bench transaction stress tests so that they can be called from unit tests as well.

Test Plan: run new unit test as well as db_bench

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55203
2016-03-11 15:16:52 -08:00
Andrew Kryczka
d9620239d2 Cleanup stale manifests outside of full purge
Summary:
- Keep track of obsolete manifests in VersionSet
- Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles()
- Added test case that verifies a stale manifest is deleted by a non-full purge

Test Plan:
  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation

Reviewers: IslamAbdelRahman, yoshinorim, sdong

Reviewed By: sdong

Subscribers: andrewkr, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55269
2016-03-10 18:16:21 -08:00
agiardullo
2200295ee1 optimistic transactions support for reinitialization
Summary: Extend optimization in D53835 to optimistic transactions for completeness.

Test Plan: added test

Reviewers: sdong, IslamAbdelRahman, horuff, jkedgar

Reviewed By: horuff

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55059
2016-03-07 19:03:09 -08:00
agiardullo
200080ed72 Improve snapshot handling for Transaction reinitialization
Summary: Previously, reusing a transaction (by passing it as an argument to BeginTransaction) would not clear the transaction's snapshot.  This is not a clear, well-definited behavior.

Test Plan: improved test

Reviewers: sdong, IslamAbdelRahman, horuff, jkedgar

Reviewed By: jkedgar

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55053
2016-03-07 13:28:11 -08:00
SherlockNoMad
58ecd91326 Fix Windows build 2016-03-03 15:08:24 -08:00
Andrew Kryczka
501927ffc4 [backupable db] Remove file size embedded in name workaround
Summary:
Now that we get sizes efficiently, we no longer need the workaround to
embed file size in filename.

Test Plan:
  $ ./backupable_db_test

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D55035
2016-03-03 13:32:20 -08:00
Andrew Kryczka
f8e90e8753 Get file attributes in bulk for VerifyBackup and CreateNewBackup
Summary:
For VerifyBackup(), backup files can be spread across "shared/",
"shared_checksum/", and "private/" subdirectories, so we have to
bulk get all three.

For CreateNewBackup(), we make two separate bulk calls: one for the
data files and one for WAL files.

There is also a new helper function, ExtendPathnameToSizeBytes(),
that translates the file attributes vector to a map. I decided to leave
GetChildrenFileAttributes()'s (from D53781) return type as vector to
keep it consistent with GetChildren().

Depends on D53781.

Test Plan:
verified relevant unit tests

  $ ./backupable_db_test

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53919
2016-03-01 19:33:33 -08:00
agiardullo
5ea9aa3c14 TransactionDB:ReinitializeTransaction
Summary: Add function to reinitialize a transaction object so that it can be reused.  This is an optimization so users can potentially avoid reallocating transaction objects.

Test Plan: added tests

Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: jkedgar, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53835
2016-02-29 16:27:32 -08:00
Andrew Kryczka
69c471bd9b Handle concurrent manifest update and backup creation
Summary:
Fixed two related race conditions in backup creation.

(1) CreateNewBackup() uses DB::DisableFileDeletions() to prevent table files
from being deleted while it is copying; however, the MANIFEST file could still
rotate during this time. The fix is to stop deleting the old manifest in the
rotation logic. It will be deleted safely later when PurgeObsoleteFiles() runs
(can only happen when file deletions are enabled).

(2) CreateNewBackup() did not account for the CURRENT file being mutable.
This is significant because the files returned by GetLiveFiles() contain a
particular manifest filename, but the manifest to which CURRENT refers can
change at any time. This causes problems when CURRENT changes between the call
to GetLiveFiles() and when it's copied to the backup directory. To workaround this, I
manually forge a CURRENT file referring to the manifest filename returned in
GetLiveFiles().

(2) also applies to the checkpointing code, so let me know if this approach is
good and I'll make the same change there.

Test Plan:
new test for roll manifest during backup creation.

running the test before this change:

  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
  ...
  IO error: /tmp/rocksdbtest-9383/backupable_db/MANIFEST-000001: No such file or directory

running the test after this change:

  $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation
  ...
  [ RUN      ] BackupableDBTest.ChangeManifestDuringBackupCreation
  [       OK ] BackupableDBTest.ChangeManifestDuringBackupCreation (2836 ms)

Reviewers: IslamAbdelRahman, anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54711
2016-02-29 12:56:55 -08:00
Andrew Kryczka
69c98f0431 Reorder instance variables in backup test for proper destruction order
Summary:
As titled. This fixes the tsan error caused by logger_ being used in
backup_engine_'s destructor. It does not fix the transient unit test failure,
which is caused by MANIFEST file changing while backup is happening.

Test Plan:
verified the tsan error no longer happens on either success or
failure.

  $ COMPILE_WITH_TSAN=1 make -j32 backupable_db_test
  $ while ./backupable_db_test --gtest_filter=BackupableDBTest.CorruptionsTest ; do : ; done

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54669
2016-02-24 10:32:11 -08:00
Yueh-Hsuan Chiang
79ca039eb4 Relax the check condition of prefix_extractor in CheckOptionsCompatibility
Summary:
Relax the check condition of prefix_extractor in CheckOptionsCompatibility
by allowing changing value from non-nullptr to nullptr or nullptr to
non-nullptr.

Test Plan:
options_test
options_util_test

Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, gunnarku

Reviewed By: gunnarku

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54477
2016-02-19 14:42:24 -08:00
agiardullo
d08d50295c Fix transaction locking
Summary: Broke transaction locking in 4.4 in D52197.  Will cherry-pick this change into 4.4 (which hasn't yet been fully released).  Repro'd using db_bench.

Test Plan: unit tests and db_Bench

Reviewers: sdong, yhchiang, kradhakrishnan, ngbronson

Reviewed By: ngbronson

Subscribers: ngbronson, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54021
2016-02-16 17:15:05 -08:00
reid horuff
5bcf952a87 Fix WriteImpl empty batch hanging issue
Summary: There is an issue in DBImpl::WriteImpl where if an empty writebatch comes in and sync=true then the logs will be marked as being synced yet the sync never actually happens because there is no data in the writebatch. This causes the next incoming batch to hang while waiting for the logs to complete syncing. This fix syncs logs even if the writebatch is empty.

Test Plan: DoubleEmptyBatch unit test in transaction_test.

Reviewers: yoshinorim, hermanlee4, sdong, ngbronson, anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54057
2016-02-16 12:21:33 -08:00
Igor Canadi
c90d63a23d can_unlock set but not used
Test Plan: I couldn't repro, but I hope this fixes it. See the error here: https://evergreen.mongodb.com/task_log_raw/mongodb_mongo_master_ubuntu1404_rocksdb_compile_6e9fd902d5cb25aef992363efa128640affd5196_16_02_11_04_33_37/0?type=T

Reviewers: yhchiang, andrewkr, sdong, anthony

Reviewed By: anthony

Subscribers: meyering, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54123
2016-02-16 11:24:40 -08:00
Yueh-Hsuan Chiang
3a67bffaa8 Fix an ASAN error in transaction_test.cc
Summary:
One test in transaction_test.cc forgets to call SyncPoint::DisableProcessing().
As a result, a program might to access the SyncPoint singleton after it
already goes out of scope.

This patch fix this error by calling SyncPoint::DisableProcessing().

Test Plan: transaction_test

Reviewers: sdong, IslamAbdelRahman, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54033
2016-02-10 12:06:59 -08:00
Yueh-Hsuan Chiang
2a04268be3 Temporarily disable unstable tests in memory_test.cc
Summary:
memory_test.cc has some tests that are not unstable but
hard to reproduce, and the cause is the test itself not
the code.  Temporarily disable the tests until
we have a good fix.

Test Plan: memory_test

Reviewers: sdong, anthony, IslamAbdelRahman, rven, kradhakrishnan

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54009
2016-02-09 16:28:34 -08:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
agiardullo
fe93bf9b5d Transaction::UndoGetForUpdate
Summary: MyRocks wants to be able to un-lock a key that was just locked by GetForUpdate().  To do this safely, I am now keeping track of the number of reads(for update) and writes for each key in a transaction.  UndoGetForUpdate() will only unlock a key if it hasn't been written and the read count reaches 0.

Test Plan: more unit tests

Reviewers: igor, rven, yhchiang, spetrunia, sdong

Reviewed By: spetrunia, sdong

Subscribers: spetrunia, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47043
2016-02-09 10:46:11 -08:00
reid horuff
6f71d3b68b Improve perf of Pessimistic Transaction expirations (and optimistic transactions)
Summary:
copy from task 8196669:

1) Optimistic transactions do not support batching writes from different threads.
2) Pessimistic transactions do not support batching writes if an expiration time is set.

In these 2 cases, we currently do not do any write batching in DBImpl::WriteImpl() because there is a WriteCallback that could decide at the last minute to abort the write.  But we could support batching write operations with callbacks if we make sure to process the callbacks correctly.

To do this, we would first need to modify write_thread.cc to stop preventing writes with callbacks from being batched together.  Then we would need to change DBImpl::WriteImpl() to call all WriteCallback's in a batch, only write the batches that succeed, and correctly set the state of each batch's WriteThread::Writer.

Test Plan: Added test WriteWithCallbackTest to write_callback_test.cc which creates multiple client threads and verifies that writes are batched and executed properly.

Reviewers: hermanlee4, anthony, ngbronson

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52863
2016-02-05 10:44:13 -08:00
Gabriela Jacques da Silva
0c2bd5cb4b Removing data race from expirable transactions
Summary:
Doing inline checking of transaction expiration instead of
using a callback.

Test Plan: To be added

Reviewers: anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53673
2016-02-02 18:37:44 -08:00
agiardullo
45768ade4f transaction allocation perf improvements
Summary: Removed a couple of memory allocations

Test Plan: changes covered by existing tests

Reviewers: rven, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53523
2016-01-28 17:32:28 -08:00
Andrew Kryczka
167bd8856d [directory includes cleanup] Finish removing util->db dependencies 2016-01-26 10:49:24 -08:00
David Bernard
df7c2f3b59 As per google coding standard replace "using" in option_builder.cc and
geodb_impl.cc
2016-01-20 00:27:29 +00:00
David Bernard
d78c6b28c4 Changes for build on solaris
Makefile adjust paths for solaris build
Makefile enable _GLIBCXX_USE_C99 so that std::to_string is available
db_compaction_test.cc Initialise a variable to avoid a compilation error
db_impl.cc Include <alloca.h>
db_test.cc Include <alloca.h>
Environment.java recognise solaris envrionment
options_bulder.cc Make log unambiguous
geodb_impl.cc Make log and floor unambiguous
2016-01-19 04:45:21 +00:00
agiardullo
9760c842ce fix valgrind failure in backupable_db_test
Summary: fix memory leak in test code

Test Plan: ran test

Reviewers: rven, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52617
2016-01-06 18:04:58 -08:00
Igor Canadi
e541dcc8fa Fix issue #921
Summary:
See a bug report here: https://github.com/facebook/rocksdb/issues/921
The fix is to not check the shared/ directory if share_table_files is false. We could also check FileExists() before GetChildren(), but that will add extra latency when Env is Hdfs :(

Test Plan: added a unit test

Reviewers: rven, sdong, IslamAbdelRahman, yhchiang, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52593
2016-01-06 13:05:24 -08:00
Siying Dong
298ba27ae2 Merge pull request #846 from yuslepukhin/enble_c4244_lossofdata
Enable MS compiler warning c4244.
2015-12-23 22:59:42 -08:00
Siying Dong
7810aa802a Merge pull request #899 from zhipeng-jia/fix_clang_warning
Fix clang warnings
2015-12-23 22:58:52 -08:00
sdong
3280ae9a29 Fix warning in release
Summary: Warning in release build.

Test Plan: Make release and make all

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52305
2015-12-23 22:38:12 -08:00
Zhipeng Jia
73b175a773 Fix clang warnings regarding unnecessary std::move 2015-12-24 04:10:00 +08:00
agiardullo
eff309867e Do not use timed_mutex in TransactionDB
Summary: Stopped using std::timed_mutex as it has known issues in older versiong of gcc.  Ran into these problems when testing MongoRocks.

Test Plan: unit tests.  Manual mongo testing on gcc 4.8.

Reviewers: igor, yhchiang, rven, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52197
2015-12-18 17:26:02 -08:00
Dmitri Smirnov
b6d19adcf7 Use port size_t formatting 2015-12-15 11:34:22 -08:00
Venkatesh Radhakrishnan
030215bf01 Running manual compactions in parallel with other automatic or manual compactions in restricted cases
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.

Test Plan: Rocksdb regression + new tests + valgrind

Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong

Reviewed By: sdong

Subscribers: yoshinorim, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47973
2015-12-14 11:20:34 -08:00
Dmitri Smirnov
236fe21c92 Enable MS compiler warning c4244.
Mostly due to the fact that there are differences in sizes of int,long
  on 64 bit systems vs GNU.
2015-12-11 16:47:34 -08:00
agiardullo
84f98792d6 Transaction::SetWriteOptions()
Summary: Add support to change write options after creating a transaction.  This is needed for MongoRocks.

Test Plan: added test

Reviewers: sdong, rven, kradhakrishnan, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51867
2015-12-11 16:08:25 -08:00
agiardullo
3bfd3d39a3 Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict.  This is due to relying on only the memtables to check for write-conflicts.  Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.

With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts.  This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot.  Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).

Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread.  Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.

Test Plan: unit tests, db bench

Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb, yoshinorim

Differential Revision: https://reviews.facebook.net/D50475
2015-12-11 12:34:11 -08:00
Igor Canadi
64fa43843b Merge pull request #862 from ceph/wip-env
implement EnvMirror
2015-12-10 18:45:07 -08:00
Sage Weil
2074ddd625 env: add EnvMirror
This is an Env implementation that mirrors all storage-related methods on
two different backend Env's and verifies that they return the same
results (return status and read results).  This is useful for implementing
a new Env and verifying its correctness.

Signed-off-by: Sage Weil <sage@redhat.com>
2015-12-10 21:32:45 -05:00
charsyam
c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
Siying Dong
fa3dbf203f Merge pull request #853 from Vaisman/enable_C4267_warning
Enable C4267 warning
2015-12-08 17:59:24 -08:00
agiardullo
e5c5f23814 Support marking snapshots for write-conflict checking - Take 2
Summary:
D51183 was reverted due to breaking the LITE build.

This diff is the same as D51183 but with a fix for the LITE BUILD(D51693)

Test Plan: run all unit tests

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51711
2015-12-08 16:47:31 -08:00
sdong
1d63c3d610 Revert "Support marking snapshots for write-conflict checking"
This reverts commit ec704aafdc for it broke RocksDB LITE build.
2015-12-08 09:27:17 -08:00
agiardullo
ec704aafdc Support marking snapshots for write-conflict checking
Summary:
D50475 enables using SST files for transaction write-conflict checking.  In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295).  If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization.

This diff allows Transactions to mark snapshots as being used for write-conflict checking.  Then, during compaction, we will be able to optimize SingleDeletes better in the future.

This diff adds a flag to SnapshotImpl which is used by Transactions.  This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator.  This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information).

Test Plan: no behavior change, ran existing tests

Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51183
2015-12-07 19:40:51 -08:00
Jay Edgar
b28b7c6dd9 Added callback notification when a snapshot is created
Summary: When SetSnapshot() is used the caller immediately knows a snapshot has been created, but when SetSnapshotOnNextOperation() is used the caller needs a way to get notified when that snapshot has been generated.  This creates an interface that the client can implement that will be called at the time the snapshot is created.

Test Plan: Added a new SetSnapshotOnNextOperationWithNotification test into the transaction_test.

Reviewers: sdong, anthony

Reviewed By: anthony

Subscribers: yoshinorim, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51177
2015-12-04 10:20:36 -08:00
Alex Yang
e8180f9901 added public api to schedule flush/compaction, code to prevent race with db::open
Summary:
Fixes T8781168.

Added a new function EnableAutoCompactions in db.h to be publicly
avialable.  This allows compaction to be re-enabled after disabling it via
SetOptions

Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open
Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to
prevent race condition.

Test Plan:
Ran make all check

verified fix on myrocks side:
was able to reproduce the seg fault with
../tools/mysqltest.sh --mem --force rocksdb.drop_table

method was to manually sleep the thread after DB::Open but before TransactionDB ptr was
assigned in transaction_db_impl.cc:
  DB::Open(db_options, dbname, column_families_copy, handles, &db);
  clock_t goal = (60000 * 10) + clock();
  while (goal > clock());
  ...dbptr(aka rdb) gets assigned below

verified my changes fixed the issue.

Also added unit test 'ToggleAutoCompaction' in transaction_test.cc

Reviewers: hermanlee4, anthony

Reviewed By: anthony

Subscribers: alex, dhruba

Differential Revision: https://reviews.facebook.net/D51147
2015-12-03 22:59:44 -08:00
SherlockNoMad
b4efaebff0 Fix ms version Appveyor build error 2015-11-30 11:07:47 -08:00
sdong
6bbfa1874b BackupDB to have a mode to use file size in file name
Summary: Getting file size from all the backup files can take a long time. In some cases, the sizes are available in file names. We allow a mode to get those sizes from file name.

Test Plan:
Make some unit tests in backupable_db_test to run in such a mode.
Make sure RocksDB Lite builds too.

Reviewers: IslamAbdelRahman, rven, yhchiang, kradhakrishnan, anthony, igor

Reviewed By: igor

Subscribers: muthu, asameet, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51243
2015-11-25 11:55:37 -08:00
Vasili Svirski
41b32c6059 Enable C4267 warning
* conversion from 'size_t' to 'type', by add static_cast

Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
2015-11-24 16:33:09 +03:00
sdong
6170fec251 Fix build broken by previous commit of "option helper refactor"
Summary:
The commit of option helper refactor broken the build:
(1) a git merge problem
(2) some uncaught compiler warning
Fix it.

Test Plan: Make sure "make all" passes

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D50943
2015-11-17 16:52:54 -08:00
Islam AbdelRahman
a163cc2d5a Lint everything
Summary:
```
arc2 lint --everything
```

run the linter on the whole code repo to fix exisitng lint issues

Test Plan: make check -j64

Reviewers: sdong, rven, anthony, kradhakrishnan, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50769
2015-11-16 12:56:21 -08:00
Yueh-Hsuan Chiang
d781da8164 Add CheckOptionsCompatibility() API to options_util
Summary:
Add CheckOptionsCompatibility() API to options_util that returns
Status::OK if the input DBOptions and ColumnFamilyDescriptors
are compatible with the latest options stored in the specified DB path.

Test Plan: Added tests in options_util_test

Reviewers: igor, anthony, IslamAbdelRahman, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50649
2015-11-12 16:52:51 -08:00
Yueh-Hsuan Chiang
5ac16300b0 Fixed valgrind error in options_util_test
Summary:
Fixed valgrind error in options_util_test by deleting the
compaction_filter allocated from RandomInitCFOptions().

Test Plan: valgrind --error-exitcode=2 --leak-check=full ./options_util_test

Reviewers: anthony, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D50661
2015-11-12 14:12:27 -08:00