rocksdb

Author	SHA1	Message	Date
Dmitri Smirnov	236fe21c92	Enable MS compiler warning c4244. Mostly due to the fact that there are differences in sizes of int,long on 64 bit systems vs GNU.	2015-12-11 16:47:34 -08:00
agiardullo	3bfd3d39a3	Use SST files for Transaction conflict detection Summary: Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings. With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged). Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files. Test Plan: unit tests, db bench Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb, yoshinorim Differential Revision: https://reviews.facebook.net/D50475	2015-12-11 12:34:11 -08:00
agiardullo	9e44629061	Change SingleDelete to support conflict checking Summary: For Transactions, we want to start using the SST files to do write conflict checking. To do this, we need to make sure that compaction never removes all writes if an earlier snapshot exists. So I had to change the way we process SingleDeletes to sometimes leave a SingleDelete behind when we encounter a Put followed by a SingleDelete. See the comments in this diff for a more detailed explanation. Test Plan: added more unit tests Reviewers: rven, igor, kradhakrishnan, IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50295	2015-12-10 11:35:38 -08:00
Yueh-Hsuan Chiang	774b80e99e	Resubmit the fix for a race condition in persisting options Summary: This patch fixes a race condition in persisting options which will cause a crash when: * Thread A obtains cf options and starts to persist options based on those cf options. * Thread B kicks in, finishes DropColumnFamily, and deletes cf_handle. * Thread A wakes up, tries to finish persisting the options, and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51717	2015-12-08 17:01:02 -08:00
agiardullo	e5c5f23814	Support marking snapshots for write-conflict checking - Take 2 Summary: D51183 was reverted due to breaking the LITE build. This diff is the same as D51183 but with a fix for the LITE BUILD(D51693) Test Plan: run all unit tests Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51711	2015-12-08 16:47:31 -08:00
sdong	1d63c3d610	Revert "Support marking snapshots for write-conflict checking" This reverts commit `ec704aafdc` for it broke RocksDB LITE build.	2015-12-08 09:27:17 -08:00
agiardullo	ec704aafdc	Support marking snapshots for write-conflict checking Summary: D50475 enables using SST files for transaction write-conflict checking. In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295). If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization. This diff allows Transactions to mark snapshots as being used for write-conflict checking. Then, during compaction, we will be able to optimize SingleDeletes better in the future. This diff adds a flag to SnapshotImpl which is used by Transactions. This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator. This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information). Test Plan: no behavior change, ran existing tests Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51183	2015-12-07 19:40:51 -08:00
sdong	f307036bde	Revert "Fix a race condition in persisting options" This reverts commit `2fa3ed5180`. It breaks RocksDB lite build	2015-12-07 17:09:12 -08:00
Yueh-Hsuan Chiang	2fa3ed5180	Fix a race condition in persisting options Summary: This patch fixes a race condition in persisting options which will cause a crash when: * Thread A obtains cf options and starts to persist options based on those cf options. * Thread B kicks in, finishes DropColumnFamily, and deletes cf_handle. * Thread A wakes up, tries to finish persisting the options, and crashes. Test Plan: Add a test in column_family_test that can reproduce the crash Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51609	2015-12-07 15:25:12 -08:00
Alex Yang	e8180f9901	added public api to schedule flush/compaction, code to prevent race with db::open Summary: Fixes T8781168. Added a new function EnableAutoCompactions in db.h to be publicly available. This allows compaction to be re-enabled after disabling it via SetOptions. Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open. Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to prevent race condition. Test Plan: Ran make all check. Verified fix on myrocks side: was able to reproduce the seg fault with ../tools/mysqltest.sh --mem --force rocksdb.drop_table. Method was to manually sleep the thread after DB::Open but before TransactionDB ptr was assigned in transaction_db_impl.cc: DB::Open(db_options, dbname, column_families_copy, handles, &db); clock_t goal = (60000 * 10) + clock(); while (goal > clock()); ...dbptr(aka rdb) gets assigned below. Verified my changes fixed the issue. Also added unit test 'ToggleAutoCompaction' in transaction_test.cc Reviewers: hermanlee4, anthony Reviewed By: anthony Subscribers: alex, dhruba Differential Revision: https://reviews.facebook.net/D51147	2015-12-03 22:59:44 -08:00
sdong	db320b1b82	DB to only flush the column family with the largest memtable while option.db_write_buffer_size is hit Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memtable instead. In this way, we can avoid too many small files in some cases. Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor Reviewed By: igor Subscribers: march, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D51291	2015-11-30 13:36:57 -08:00
Reid Horuff	3381e2c3e7	Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork() Summary: Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork() Test Plan: rocksdb.information_schema handles this case. Reviewers: igor Reviewed By: igor Subscribers: hermanlee4, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D50781	2015-11-16 14:20:18 -08:00
Venkatesh Radhakrishnan	2ae4d7d708	Make sure that CompactFiles does not run two parallel Level 0 compactions Summary: Since level 0 files can overlap, two level 0 compactions cannot run in parallel. Compact files needs to check this before running a compaction. Test Plan: CompactFilesTest.L0ConflictsFiles Reviewers: igor, IslamAbdelRahman, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50079	2015-11-13 12:01:00 -08:00
Nathan Bronson	6ce42dd075	Don't merge WriteBatch-es if WAL is disabled Summary: There's no need for WriteImpl to flatten the write batch group into a single WriteBatch if the WAL is disabled. This diff moves the flattening into the WAL step, and skips flattening entirely if it isn't needed. It's good for about 5% speedup on a multi-threaded workload with no WAL. This diff also adds clarifying comments about the chance for partial failure of WriteBatchInternal::InsertInto, and always sets bg_error_ if the memtable state diverges from the logged state or if a WriteBatch succeeds only partially. Benchmark for speedup: db_bench -benchmarks=fillrandom -threads=16 -batch_size=1 -memtablerep=skip_list -value_size=0 --num=200000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 Test Plan: asserts + make check Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D50583	2015-11-12 10:50:38 -08:00
Yueh-Hsuan Chiang	e114f0abb8	Enable RocksDB to persist Options file. Summary: This patch allows rocksdb to persist options into a file on DB::Open, SetOptions, and Create / Drop ColumnFamily. Options files are created under the same directory as the rocksdb instance. In addition, this patch also adds a fail_if_missing_options_file in DBOptions that makes any function call return non-ok status when it is not able to persist options properly. // If true, then DB::Open / CreateColumnFamily / DropColumnFamily // / SetOptions will fail if options file is not detected or properly // persisted. // // DEFAULT: false bool fail_if_missing_options_file; Options file names are formatted as OPTIONS-<number>, and RocksDB will always keep the latest two options files. Test Plan: Add options_file_test. options_test column_family_test Reviewers: igor, IslamAbdelRahman, sdong, anthony Reviewed By: anthony Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48285	2015-11-10 22:58:01 -08:00
Venkatesh Radhakrishnan	9d50afc3b9	Prefix-based iterating only shows keys in prefix Summary: MyRocks testing found an issue that while iterating over keys that are outside the prefix, sometimes wrong results were seen for keys outside the prefix. We now tighten the range of keys seen with a new read option called prefix_seen_at_start. This remembers the starting prefix and then compares it on a Next for equality of prefix. If they are from a different prefix, it sets valid to false. Test Plan: PrefixTest.PrefixValid Reviewers: IslamAbdelRahman, sdong, yhchiang, anthony Reviewed By: anthony Subscribers: spetrunia, hermanlee4, yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D50211	2015-11-05 13:24:05 -08:00
Yueh-Hsuan Chiang	3ecbab0040	Add GetAggregatedIntProperty(): returns the aggregated value from all CFs Summary: This patch adds GetAggregatedIntProperty() that returns the aggregated value from all CFs Test Plan: Added a test in db_test Reviewers: igor, sdong, anthony, IslamAbdelRahman, rven Reviewed By: rven Subscribers: rven, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D49497	2015-11-03 15:54:18 -08:00
Islam AbdelRahman	ff4499e297	Update DB::AddFile() to have less restrictions Summary: Update DB::AddFile() restrictions to be - Key range in loaded table file doesn't overlap with existing keys or tombstones in DB. - No other writes happen during AddFile call. The updated AddFile() will verify that the file key range doesn't overlap with any keys or tombstones in the DB, and then add the file to L0 Test Plan: unit tests Reviewers: igor, rven, anthony, kradhakrishnan, sdong Reviewed By: sdong Subscribers: adsharma, ameyag, dhruba Differential Revision: https://reviews.facebook.net/D49233	2015-10-30 16:38:10 -07:00
Islam AbdelRahman	2872e0c8c2	Clean and expose CreateLoggerFromOptions Summary: CreateLoggerFromOptions has some parameters like db_log_dir and env; these parameters are redundant since they already exist in DBOptions. This patch removes the redundant parameters and exposes CreateLoggerFromOptions to users Test Plan: make check Reviewers: igor, anthony, yhchiang, rven, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D49713	2015-10-29 18:07:37 -07:00
sdong	296c3a1f94	"make format" in some recent commits Summary: Run "make format" for some recent commits. Test Plan: Build and run tests Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D49707	2015-10-29 17:11:14 -07:00
Herman Lee	0d720dfc17	Use the correct variable when fetching table properties. Summary: An uninitialized parameter was being passed into the call to fetch the table properties during the compaction notification callbacks. Test Plan: Build it with myrocks and verify unit test passed. Run unit tests. Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D49635	2015-10-28 16:28:11 -07:00
Praveen Rao	4ce117c4d5	Merge branch 'master' into wal_filter	2015-10-26 19:03:34 -07:00
Praveen Rao	32cdec634e	Fail recovery if filter provides more records than original and corresponding unit-test, fix naming conventions	2015-10-26 18:11:18 -07:00
Siying Dong	138876a62c	Merge pull request #746 from ceph/wip-recycle Add Options.recycle_log_file_num for Recycling WAL Files	2015-10-26 15:01:28 -07:00
Praveen Rao	2938c5c137	merge upstream changes	2015-10-19 15:21:33 -07:00
Praveen Rao	0c59691dde	Handle multiple batches in single log record - allow app to return a new batch + allow app to return corrupted record status	2015-10-19 13:27:40 -07:00
Alexey Maykov	f18acd8875	Fixed the clang compilation failure Summary: As above. Test Plan: USE_CLANG=1 make check -j Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48981	2015-10-19 10:38:50 -07:00
Sage Weil	9c33f64d19	log_reader: pass in WALRecoveryMode instead of bool report_eof_inconsistency Soon our behavior will depend on more than just whether we are in kAbsoluteConsistency or not. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	3ac13c99d1	log_reader: pass log_number and optional info_log to ctor We will need the log number to validate the recycle-style CRCs. The log is helpful for debugging, but optional, as not all callers have it. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	5830c699f2	log_writer: pass log number and whether recycling is enabled to ctor When we recycle log files, we need to mix the log number into the CRC for each record. Note that for logs that don't get recycled (like the manifest), we always pass a log_number of 0 and false. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	666376150c	db_impl: recycle log files If log recycling is enabled, put old WAL files on a recycle queue instead of deleting them. When we need a new log file, take a recycled file off the list if one is available. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:24:32 -04:00
Sage Weil	d666225a0a	db_impl: disable recycle_log_files if WAL archive is enabled We can't recycle the files if they are being archived. Signed-off-by: Sage Weil <sage@redhat.com>	2015-10-18 21:21:24 -04:00
Alexey Maykov	e1a09a7703	Implementation for GetPropertiesOfTablesInRange Summary: In MyRocks, it is sometimes important to get properties only for a subset of the database. This diff implements the API in RocksDB. Test Plan: ran the GetPropertiesOfTablesInRange Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48651	2015-10-17 13:34:43 -07:00
Yueh-Hsuan Chiang	ad471453e8	Allow GetProperty to report the number of currently running flushes / compactions. Summary: Add rocksdb.num-running-compactions and rocksdb.num-running-flushes to GetIntProperty() that reports the number of currently running compactions / flushes. Test Plan: augmented existing tests in db_test Reviewers: igor, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48693	2015-10-17 00:16:36 -07:00
sdong	277dea78f0	Add more kill points Summary: Add kill points in: 1. after creating a file 2. before writing a manifest record 3. before syncing manifest 4. before creating a new current file 5. after creating a new current file Test Plan: Run all current tests. Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan Reviewed By: kradhakrishnan Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48855	2015-10-16 14:35:12 -07:00
Venkatesh Radhakrishnan	a98fbacfa0	Moving memtable related files from util to a new directory memtable Summary: We are cleaning up dependencies. This diff takes a first step at moving memtable files to their own directory called memtable. In future diffs, we will move other memtable files from db to memtable. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48915	2015-10-16 14:10:33 -07:00
Islam AbdelRahman	f55d3009c0	Make db_test_util compile under ROCKSDB_LITE Summary: db_test_util is used in multiple test files but it doesn't compile under ROCKSDB_LITE Test Plan: make check make static_lib OPT=-DROCKSDB_LITE make db_wal_test Reviewers: igor, yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48579	2015-10-13 17:33:23 -07:00
sdong	35ad531be3	Separate InternalIterator from Iterator Summary: Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type. This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's. At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it. Test Plan: Run all existing tests. Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48549	2015-10-13 15:32:13 -07:00
Praveen Rao	cc4d13e0a8	Put wal_filter under #ifndef ROCKSDB_LITE	2015-10-13 11:10:14 -07:00
Praveen Rao	eb24178553	merge from master	2015-10-12 17:24:21 -07:00
Islam AbdelRahman	c64ae05b1c	Move TEST_NewInternalIterator to NewInternalIterator Summary: A long time ago we added InternalDumpCommand to ldb_tool (https://reviews.facebook.net/D11517). This command uses TEST_NewInternalIterator although it's not a test. This patch moves TEST_NewInternalIterator outside of db_impl_debug.cc Test Plan: make check make static_lib Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48561	2015-10-12 17:22:37 -07:00
Praveen Rao	59a0c219bb	Adding log filter to inspect and filter log records on recovery	2015-10-12 17:03:03 -07:00
Alexey Maykov	3d07b815f6	Passing table properties to compaction callback Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online. Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats. Reviewers: igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D48267	2015-10-09 18:10:55 -07:00
sdong	776bd8d5eb	Pass column family ID to table property collector Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently. Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct. Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48411	2015-10-09 14:36:51 -07:00
dyniusz	0267502655	Support for LevelDB SST with .ldb suffix Summary: Handle SST files with both ".sst" and ".ldb" suffix. This enables user to migrate from leveldb to rocksdb. Test Plan: Added unit test with DB operating on SSTs with names schema. See db/dc_test.cc:SSTsWithLdbSuffixHandling for details Reviewers: yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D48003	2015-10-06 17:46:22 -07:00
Igor Canadi	115427ef63	Add APIs PauseBackgroundWork() and ContinueBackgroundWork() Summary: To support a new MongoDB capability, we need to make sure that we don't do any IO for a short period of time. For background, see: * https://jira.mongodb.org/browse/SERVER-20704 * https://jira.mongodb.org/browse/SERVER-18899 To implement that, I add a new API calls PauseBackgroundWork() and ContinueBackgroundWork() which reuse the capability we already have in place for RefitLevel() function. Test Plan: Added a new test in db_test. Made sure that test fails when PauseBackgroundWork() is commented out. Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47901	2015-10-02 13:17:34 -07:00
Islam AbdelRahman	f03b5c987b	Add experimental DB::AddFile() to plug sst files into empty DB Summary: This is an initial version of the bulk load feature. This diff allows us to create sst files and then bulk load them later. Right now the restrictions for loading an sst file are (1) Memtables are empty (2) Added sst files have sequence number = 0, and existing values in database have sequence number = 0 (3) Added sst files values are not overlapping Test Plan: unit testing Reviewers: igor, ott, sdong Reviewed By: sdong Subscribers: leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D39081	2015-09-23 12:42:43 -07:00
sdong	d0c31641d2	Internal stats WAL file synced to match meaning of the stats of the same name Summary: https://reviews.facebook.net/D23343 changed WAL sync bytes to extra fsync. This change does the same for internal stats. Test Plan: Run all existing unit tests and verify results in db_bench. Reviewers: anthony, rven, igor, MarkCallaghan, kradhakrishnan, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47349	2015-09-22 14:23:11 -07:00
Andres Noetzli	014fd55adc	Support for SingleDelete() Summary: This patch fixes #7460559. It introduces SingleDelete as a new database operation. This operation can be used to delete keys that were never overwritten (no put following another put of the same key). If an overwritten key is single deleted the behavior is undefined. Single deletion of a non-existent key has no effect but multiple consecutive single deletions are not allowed (see limitations). In contrast to the conventional Delete() operation, the deletion entry is removed along with the value when the two are lined up in a compaction. Note: The semantics are similar to @igor's prototype that allowed to have this behavior on the granularity of a column family ( https://reviews.facebook.net/D42093 ). This new patch, however, is more aggressive when it comes to removing tombstones: It removes the SingleDelete together with the value whenever there is no snapshot between them while the older patch only did this when the sequence number of the deletion was older than the earliest snapshot. Most of the complex additions are in the Compaction Iterator, all other changes should be relatively straightforward. The patch also includes basic support for single deletions in db_stress and db_bench. Limitations: - Not compatible with cuckoo hash tables - Single deletions cannot be used in combination with merges and normal deletions on the same key (other keys are not affected by this) - Consecutive single deletions are currently not allowed (an older version of this patch supported this so it could be resurrected if needed) Test Plan: make all check Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor Reviewed By: igor Subscribers: maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43179	2015-09-17 11:42:56 -07:00
Venkatesh Radhakrishnan	51e1c11254	Do not flag error if file to be deleted does not exist Summary: Some users have observed errors in the log file when the log file or sst file is already deleted. Test Plan: Make sure that the errors do not appear for already deleted files. Reviewers: sdong Reviewed By: sdong Subscribers: anthony, kradhakrishnan, yhchiang, rven, igor, IslamAbdelRahman, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47115	2015-09-17 10:21:34 -07:00
sdong	9aca7cd6d8	DB::Open() to flush info log after printing DB pointer Summary: Now DB::Open() flushes info log before printing DB pointer, so it may not show up if no activity after DB open. Move log flushing from after printing options to printing DB pointer. Test Plan: make commit-prereq Reviewers: igor, IslamAbdelRahman, yhchiang, kradhakrishnan, anthony, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D47121	2015-09-16 16:33:39 -07:00
Yueh-Hsuan Chiang	f21c7415a7	Change the log level of DB start-up log from Warn to Header. Summary: Change the log level of DB start-up log from Warn to Header. Test Plan: db_bench and observe the LOG header Reviewers: igor, anthony, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D47067	2015-09-16 11:31:45 -07:00
Alexey Maykov	3ebf11ed16	Adding the increment for a counter for a number of WAL syncs Summary: This will unblock the corresponding change in MyRocks Test Plan: ran rocksdb.write_sync test Reviewers: sdong, kolmike Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D46911	2015-09-16 11:00:49 -07:00
sdong	f3170b6f6c	DBImpl::FindObsoleteFiles() shouldn't release mutex between getting min_pending_output and scanning files Summary: Releasing mutex between getting min_pending_output and scanning files may cause min_pending_output to be max but some non-final files are found in file scanning, ending up with deleting wrong files. As a recent regression, mutex can be released while waiting for log sync. We move it to after file scanning. Test Plan: Run all existing tests. Don't think it is easy to write a unit test. Maybe we should find a way to assert lock not released so that we can have some test verification for similar cases. Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, kolmike, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D46899	2015-09-14 23:39:30 -07:00
krad	1126644082	Relaxing consistency detection to include errors while inserting to memtable as WAL recovery error. Summary: The current code considers data to be consistent if the record checksum passes. We do have customer issues where the record checksum passed but the data was incomprehensible. There is no way to get out of this error case since all WAL recovery modes will consider this error as unrelated to WAL. Relaxing the definition and including errors while inserting to memtable as WAL errors and handling them as per the recovery level. Test Plan: Used customer dump to verify the fix for different level. The db opens for kSkipAnyCorruptedRecords and kPointInTimeRecovery, but fails for kAbsoluteConsistency and kTolerateCorruptedTailRecords. Reviewers: sdon igor CC: leveldb@ Task ID: #7918721 Blame Rev:	2015-09-10 12:56:17 -07:00
Igor Canadi	ac9bcb55ce	Set max_open_files based on ulimit Summary: We should never set max_open_files to be bigger than the system's ulimit. Otherwise we will get "Too many open files" errors. See an example in this Travis run: https://travis-ci.org/facebook/rocksdb/jobs/79591566 Test Plan: make check I will also verify that max_max_open_files is reasonable. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46551	2015-09-10 10:49:28 -07:00
Andres Noetzli	3c9cef1eed	Unified maps with Comparator for sorting, other cleanup Summary: This diff is a collection of cleanups that were initially part of D43179. Additionally it adds a unified way of defining key-value maps that use a Comparator for sorting (this was previously implemented in four different places). Test Plan: make clean check all Reviewers: rven, anthony, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45993	2015-09-02 13:58:22 -07:00
Andres Noetzli	effd9dd1e1	Fix deadlock in WAL sync Summary: MarkLogsSynced() was doing `logs_.erase(it++);`. The standard is saying: ``` all iterators and references are invalidated, unless the erased members are at an end (front or back) of the deque (in which case only iterators and references to the erased members are invalidated) ``` Because `it` is an iterator to the first element of the container, it is invalidated, only one iteration is executed and `log.getting_synced = false;` is not being done, so `while (logs_.front().getting_synced)` in `WriteImpl()` is not terminating. Test Plan: make db_bench && ./db_bench --benchmarks=fillsync Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong, tnovak Reviewed By: tnovak Subscribers: kolmike, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45807	2015-08-28 18:06:32 -07:00
Igor Canadi	5f4166c90e	ReadaheadRandomAccessFile -- userspace readahead Summary: ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS. We add ReadaheadRandomAccessFile layer only when file is read during compactions. D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff. Test Plan: make check Reviewers: MarkCallaghan, sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D45123	2015-08-26 15:25:59 -07:00
Andres Notzli	09d982f9e0	Fix compact_files_example Summary: See task #7983654. The example was triggering an assert in compaction job because the compaction was not marked as manual. With this patch, CompactionPicker::FormCompaction() marks compactions as manual. This patch also fixes a couple of typos, adds optimistic_transaction_example to .gitignore and librocksdb as a dependency for examples. Adding librocksdb as a dependency makes sure that the examples are built with the latest changes in librocksdb. Test Plan: make clean && cd examples && make all && ./compact_files_example Reviewers: rven, sdong, anthony, igor, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45117	2015-08-25 12:29:44 -07:00
Andres Noetzli	2050832974	Fixing race condition in DBTest.DynamicMemtableOptions Summary: This patch fixes a race condition in DBTest.DynamicMemtableOptions. In rare cases, it was possible that the main thread would fill up both memtables before the flush job acquired its work. Then, the flush job was flushing both memtables together, producing only one L0 file while the test expected two. Now, the test waits for flushes to finish earlier, to make sure that the memtables are flushed in separate flush jobs. Test Plan: Insert "usleep(10000);" after "IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);" in BGWorkFlush() to make the issue more likely. Then test with: make db_test && time while ./db_test --gtest_filter=*DynamicMemtableOptions; do true; done Reviewers: rven, sdong, yhchiang, anthony, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D45429	2015-08-24 17:04:18 -07:00
Igor Canadi	4ab26c5ad1	Smarter purging during flush Summary: Currently, we only purge duplicate keys and deletions during flush if `earliest_seqno_in_memtable <= newest_snapshot`. This means that the newest snapshot happened before we first created the memtable. This is almost never true for MyRocks and MongoRocks. This patch makes purging during flush able to understand snapshots. The main logic is copied from compaction_job.cc, although the logic over there is much more complicated and extensive. However, we should try to merge the common functionality at some point. I need this patch to implement no_overwrite_i_promise functionality for flush. We'll also need this to support SingleDelete() during Flush(). @yoshinorim requested the feature. Test Plan: make check I had to adjust some unit tests to understand this new behavior Reviewers: yhchiang, yoshinorim, anthony, sdong, noetzli Reviewed By: noetzli Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42087	2015-08-24 11:11:12 -07:00
Islam AbdelRahman	3fd70b05b8	Rate limit deletes issued by DestroyDB Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately Test Plan: added a unittest Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong Reviewed By: sdong Subscribers: jeanxu2012, dhruba Differential Revision: https://reviews.facebook.net/D44955	2015-08-19 15:02:17 -07:00
Ari Ekmekji	f0da6977a3	[Parallel L0-L1 Compaction Prep]: Giving Subcompactions Their Own State Summary: In preparation for running multiple threads at the same time during a compaction job, this patch assigns each subcompaction its own state (instead of sharing the one global CompactionState). Each subcompaction then uses this state to update its statistics, keep track of its snapshots, etc. during the course of execution. Then at the end of all the executions the statistics are aggregated across the subcompactions so that the final result is the same as if only one larger compaction had run. Test Plan: ./db_test ./db_compaction_test ./compaction_job_test Reviewers: sdong, anthony, igor, noetzli, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43239	2015-08-18 11:06:23 -07:00
sdong	72613657f0	Measure file read latency histogram per level Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled. Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44193	2015-08-14 17:32:42 -07:00
Nathan Bronson	b7198c3afe	reduce db mutex contention for write batch groups Summary: This diff allows a Writer to join the next write batch group without acquiring any locks. Waiting is performed via a per-Writer mutex, so all of the non-leader writers never need to acquire the db mutex. It is now possible to join a write batch group after the leader has been chosen but before the batch has been constructed. This diff doesn't increase parallelism, but reduces synchronization overheads. For some CPU-bound workloads (no WAL, RAM-sized working set) this can substantially reduce contention on the db mutex in a multi-threaded environment. With T=8 N=500000 in a CPU-bound scenario (see the test plan) this is good for a 33% perf win. Not all scenarios see such a win, but none show a loss. This code is slightly faster even for the single-threaded case (about 2% for the CPU-bound scenario below). Test Plan: 1. unit tests 2. COMPILE_WITH_TSAN=1 make check 3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 Reviewers: sdong, igor, rven, ljin, yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43887	2015-08-14 10:55:43 -07:00
sdong	603b6da8b8	Add options.compaction_measure_io_stats to print write I/O stats in compactions Summary: Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs: 2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]} Add two more counters in iostats_context. Also add a parameter of db_bench. Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44115	2015-08-13 16:52:26 -07:00
agiardullo	0db807ec28	Transaction error statuses Summary: Based on feedback from spetrunia, we should better differentiate error statuses for transaction failures. https://github.com/MySQLOnRocksDB/mysql-5.6/issues/86#issuecomment-124605954 Test Plan: unit tests Reviewers: rven, kradhakrishnan, spetrunia, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D43323	2015-08-11 17:52:56 -07:00
agiardullo	c2f2cb0214	Pessimistic Transactions Summary: Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it. MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues. Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable. Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing. Test Plan: Unit tests, db_bench parallel testing. Reviewers: igor, rven, sdong, yhchiang, yoshinorim Reviewed By: sdong Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40869	2015-08-11 17:52:23 -07:00
sdong	6a4aaadcd7	Avoid type unique_ptr in LogWriterNumber::writer for Windows build break Summary: Visual Studio complains about deque<LogWriterNumber> because LogWriterNumber is non-copyable for its unique_ptr member writer. Move away from it, and do explicit free. It is less safe but I can't think of a better way to unblock it. Test Plan: valgrind check test Reviewers: anthony, IslamAbdelRahman, kolmike, rven, yhchiang Reviewed By: yhchiang Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D43647	2015-08-06 10:52:41 -07:00
Mike Kolupaev	e06cf1a098	[wal changes 3/3] method in DB to sync WAL without blocking writers Summary: Subj. We really need this feature. Previous diff D40899 has most of the changes to make this possible, this diff just adds the method. Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind. Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong Reviewed By: sdong Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba Differential Revision: https://reviews.facebook.net/D40905	2015-08-05 06:06:39 -07:00
Islam AbdelRahman	c45a57b41e	Support delete rate limiting Summary: Introduce DeleteScheduler, which allows enforcing a rate limit on file deletion. Instead of deleting files immediately, files are moved to a trash directory and deleted in a background thread that applies a sleep penalty between deletes if needed. I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile. Test Plan: added delete_scheduler_test, existing unit tests Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D43221	2015-08-04 20:45:27 -07:00
Poornima Chozhiyath Raman	1bdfcef7bf	Fix when output level is 0 of universal compaction with trivial move Summary: Fix for universal compaction with trivial move, when the output level is 0. The tests were failing. Fixed by allowing normal compaction when output level is 0. Test Plan: modified test cases run successfully. Reviewers: sdong, yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: anthony, kradhakrishnan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42933	2015-07-27 14:25:57 -07:00
Mike Kolupaev	fe09a6dae3	[wal changes 2/3] write with sync=true syncs previous unsynced wals to prevent illegal data loss Summary: I'll just copy internal task summary here: " This sequence will cause data loss in the middle after a sync write: non-sync write key 1 flush triggered, not yet scheduled sync write key 2 system crash After rebooting, users might see key 2 but not key 1, which violates the API of sync write. This can be reproduced using unit test FaultInjectionTest::DISABLED_WriteOptionSyncTest. One way to fix it is for a sync write, if there are outstanding unsynced log files, we need to sync them too. " This diff should be considered together with the next diff D40905; in isolation this fix probably could be a little simpler. Test Plan: `make check`; added a test for that (DBTest.SyncingPreviousLogs) before noticing FaultInjectionTest.WriteOptionSyncTest (keeping both since mine asserts a bit more); both tests fail without this diff; for D40905 stacked on top of this diff, ran tests with ASAN, TSAN and valgrind Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40899	2015-07-22 03:28:08 -07:00
agiardullo	064294081b	Improved FileExists API Summary: Add new CheckFileExists method. Considered changing the FileExists api but didn't want to break anyone's builds. Test Plan: unit tests Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42003	2015-07-20 17:20:40 -07:00
sdong	6e9fbeb27c	Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env Summary: We want to keep Env a thin layer for better portability. Less platform-dependent code should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future. Test Plan: Run all existing unit tests. Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42321	2015-07-17 16:58:18 -07:00
Igor Canadi	35ca59364c	Don't let flushes preempt compactions Summary: When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler. We have a special case that when you set max_background_flushes to 0, we schedule the flush to execute on the compaction thread. Test Plan: make check (there might be some unit tests that depend on this behavior) Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41931	2015-07-17 12:02:52 -07:00
Igor Canadi	a96fcd09b7	Deprecate CompactionFilterV2 Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature. Test Plan: make check Reviewers: noetzli, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42405	2015-07-17 18:59:11 +02:00
sdong	6c0c8dee7b	Fix data loss after DB recovery by not allowing flush/compaction to be scheduled until DB opened Summary: Previous run may leave some SST files with higher file numbers than manifest indicates. Compaction or flush may start to run while DB::Open() is still going on. SST file garbage collection may happen interleaving with compaction or flush, and overwrite files generated by compaction or flushes after they are generated. This might cause data loss. This possibility of interleaving is recently introduced. Fix it by not allowing compaction or flush to be scheduled before DB::Open() finishes. Test Plan: Add a unit test. This verification will have a chance to fail without the fix but doesn't fail with the fix. Reviewers: kradhakrishnan, anthony, yhchiang, IslamAbdelRahman, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42399	2015-07-16 10:57:41 -07:00
Poornima Chozhiyath Raman	beb19ad0dd	Fixing delete files in Trivial move of universal compaction Summary: Trivial move in universal compaction was failing when trying to move files from levels other than 0. This was because the DeleteFile while trivially moving, was only deleting files of level 0 which caused duplication of same file in different levels. This is fixed by passing the right level as argument in the call of DeleteFile while doing trivial move. Test Plan: ./db_test ran successfully with the new test cases. Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D42135	2015-07-15 12:28:22 -07:00
Igor Canadi	5aea98ddd8	Deprecate WriteOptions::timeout_hint_us Summary: In one of our recent meetings, we discussed deprecating features that are not being actively used. One of those features, at least within Facebook, is timeout_hint. The feature is really nicely implemented, but if nobody needs it, we should remove it from our code-base (until we get a valid use-case). Some arguments: * Less code == better icache hit rate, smaller builds, simpler code * The motivation for adding timeout_hint_us was to work-around RocksDB's stall issue. However, we're currently addressing the stall issue itself (see @sdong's recent work on stall write_rate), so we should never see sharp lock-ups in the future. * Nobody is using the feature within Facebook's code-base. Googling for `timeout_hint_us` also doesn't yield any users. Test Plan: make check Reviewers: anthony, kradhakrishnan, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41937	2015-07-14 09:35:48 +02:00
sdong	f9728640f3	"make format" against last 10 commits Summary: This helps Windows port to format their changes, as discussed. Might have formatted some other code too because last 10 commits include more. Test Plan: Build it. Reviewers: anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41961	2015-07-13 13:50:18 -07:00
sdong	5fd11853cb	Print Fast CRC32 support information in DB LOG Summary: Print whether fast CRC32 is supported in DB info LOG Test Plan: Run db_bench and see it prints out correctly. Reviewers: yhchiang, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: MarkCallaghan, yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41733	2015-07-10 17:59:36 -07:00
Dmitri Smirnov	c903ccc4c2	Merge from github/master	2015-07-09 18:01:08 -07:00
Poornima Chozhiyath Raman	4bed00a44b	Fix function name format according to google style Summary: Change the naming style of getter and setters according to Google C++ style in compaction.h file Test Plan: Compilation success Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D41265	2015-07-08 15:21:10 -07:00
Poornima Chozhiyath Raman	c0b23dd5b0	Enabling trivial move in universal compaction Summary: This change enables trivial move if all the input files are non-overlapping while doing Universal Compaction. Test Plan: ./compaction_picker_test and db_test ran successfully with the new testcases. Reviewers: sdong Reviewed By: sdong Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40875	2015-07-07 14:18:55 -07:00
Yueh-Hsuan Chiang	4ce5be4255	fixed leaking log::Writers Summary: Fixes valgrind errors in column_family_test. Test Plan: `make check`, `make valgrind_check` Reviewers: igor, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D41181	2015-07-07 12:10:10 -07:00
Mike Kolupaev	218487d8dc	[wal changes 1/3] fixed unbounded wal growth in some workloads Summary: This fixes the following scenario we've hit: - we reached max_total_wal_size, created a new wal and scheduled flushing all memtables corresponding to the old one, - before the last of these flushes started its column family was dropped; the last background flush call was a no-op; no one removed the old wal from alive_logs_, - hours have passed and no flushes happened even though lots of data was written; data is written to different column families, compactions are disabled; old column families are dropped before memtable grows big enough to trigger a flush; the old wal still sits in alive_logs_ preventing max_total_wal_size limit from kicking in, - a few more hours pass and we run out disk space because of one huge .log file. Test Plan: `make check`; backported the new test, checked that it fails without this diff Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40893	2015-07-02 14:27:00 -07:00
Dmitri Smirnov	9dbde7277c	Merge remote-tracking branch 'origin' into ms_win_port	2015-07-02 11:34:22 -07:00
Dmitri Smirnov	18285c1e2f	Windows Port from Microsoft Summary: Make RocksDb build and run on Windows to be functionally complete and performant. All existing test cases run with no regressions. Performance numbers are in the pull-request. Test plan: make all of the existing unit tests pass, obtain perf numbers. Co-authored-by: Praveen Rao praveensinghrao@outlook.com Co-authored-by: Sherlock Huang baihan.huang@gmail.com Co-authored-by: Alex Zinoviev alexander.zinoviev@me.com Co-authored-by: Dmitri Smirnov dmitrism@microsoft.com	2015-07-01 16:13:56 -07:00
Venkatesh Radhakrishnan	c9cd404bcd	Make flush check for shutdown Summary: Fixes task 7156865 where a compaction causes a hang in flush memtable if CancelAllBackgroundWork was called prior to it. Stack trace is in : https://phabricator.fb.com/P19848829 We end up waiting for a flush which will never happen because there are no background threads. Test Plan: PreShutdownFlush Reviewers: sdong, igor Reviewed By: sdong, igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40617	2015-06-25 14:43:25 -07:00
Islam AbdelRahman	674b1181cf	Bottommost level compaction option Summary: Replace force_bottommost_level_compaction in CompactRangeOption with an option that allow the user to (always skip, always compact, compact if compaction filter is present) the bottommost level for level based compaction. Test Plan: make check Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40527	2015-06-23 13:32:40 -07:00
Giuseppe Ottaviano	782a1590f9	Implement a table-level row cache Summary: Implementation of a table-level row cache. It only caches point queries done through the `DB::Get` interface, queries done through the `Iterator` interface will completely skip the cache. Supports snapshots and merge operations. Test Plan: Ran `make valgrind_check commit-prereq` Reviewers: igor, philipp, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39849	2015-06-23 10:25:45 -07:00
krad	de85e4cadf	Introduce WAL recovery consistency levels Summary: The "one size fits all" approach with WAL recovery will only introduce inconvenience for our varied clients as we go forward. The current recovery is a bit heuristic. We introduce the following levels of consistency while replaying the WAL. 1. RecoverAfterRestart (kTolerateCorruptedTailRecords) This mocks the current recovery mode. 2. RecoverAfterCleanShutdown (kAbsoluteConsistency) This is ideal for unit test and cases where the store is shut down cleanly. We tolerate no corruption or incomplete writes. 3. RecoverPointInTime (kPointInTimeRecovery) This is ideal when using devices with controller cache or file systems which can lose data on restart. We recover up to the point where there is no corruption or incomplete write. 4. RecoverAfterDisaster (kSkipAnyCorruptRecord) This is ideal mode to recover data. We tolerate corruption and incomplete writes, and we hop over those sections that we cannot make sense of, salvaging as many records as possible. Test Plan: (1) Run added unit test to cover all levels. (2) Run make check. Reviewers: leveldb, sdong, igor Subscribers: yoshinorim, dhruba Differential Revision: https://reviews.facebook.net/D38487	2015-06-22 15:28:12 -07:00
Islam AbdelRahman	530534fceb	Fix trivial move merge Summary: Fixing bad merge Test Plan: make -j64 check (this is not enough to verify the fix) Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40521	2015-06-22 15:20:30 -07:00
Igor Canadi	760e9a94de	Fail DB::Open() when the requested compression is not available Summary: Currently RocksDB silently ignores this issue and doesn't compress the data. Based on discussion, we agree that this is pretty bad because it can cause confusion for our users. This patch fails DB::Open() if we don't support the compression that is specified in the options. Test Plan: make check with LZ4 not present. If Snappy is not present all tests will just fail because Snappy is our default library. We should make Snappy the requirement, since without it our default DB::Open() fails. Reviewers: sdong, MarkCallaghan, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39687	2015-06-18 14:55:05 -07:00
Islam AbdelRahman	4eabbdb7ec	Skip bottommost level compaction if possible Summary: This is https://reviews.facebook.net/D39999 but after introducing an option to force compacting the bottommost level Changes in this patch - Introduce force_bottommost_level_compaction to CompactRangeOptions that forces compacting the bottommost level during compaction - Skip bottommost level compaction if we don't have a compaction filter and the force_bottommost_level_compaction option is not set. Although tests pass on my machine, I suspect that there may be some tests that I am not aware of that should use force_bottommost_level_compaction to pass in a deterministic way Test Plan: make check adding new tests Reviewers: igor, sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40059	2015-06-18 11:03:31 -07:00
Yueh-Hsuan Chiang	bb1c74ce18	Fixed a bug of CompactionStats in multi-level universal compaction case Summary: Universal compaction can involve multiple levels. However, the current implementation of bytes_readn and bytes_readnp1 (and some other stats with postfix `n` and `np1`) assumes compaction can only have two levels. This patch fixes this bug and redefines bytes_readn and bytes_readnp1: * bytes_readnp1: the number of bytes read in the compaction output level. * bytes_readn: the total number of bytes read minus bytes_readnp1 Test Plan: Add a test in compaction_job_stats_test Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40239	2015-06-17 23:40:34 -07:00
Islam AbdelRahman	12e030a992	Use CompactRangeOptions for CompactRange Summary: This diff updates DB::CompactRange to use CompactRangeOptions instead of using multiple parameters. Old CompactRange is still available but deprecated Test Plan: make all check make rocksdbjava USE_CLANG=1 make all OPT=-DROCKSDB_LITE make release Reviewers: sdong, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40209	2015-06-17 14:36:14 -07:00
Igor Canadi	25d600569d	Clean up InstallSuperVersion Summary: We go to great lengths to make sure MaybeScheduleFlushOrCompaction() is called outside of write thread. But anyway, it's still called in the mutex, so it's not that much cheaper. This diff removes the "optimization" and cleans up the code a bit. Test Plan: make check Reviewers: rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40113	2015-06-17 12:37:59 -07:00
sdong	40f562e747	Allow GetApproximateSize() to include mem table size if it is skip list memtable Summary: Add an option in GetApproximateSize() so that the result will include estimated sizes in mem tables. To implement it, implement an estimated count from the beginning to a key in skip list. The approach is to count to find the entry, how many Next() is issued from each level, and sum them with a weight that is <branching factor> ^ <level>. Test Plan: Add a test case Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D40119	2015-06-16 18:13:23 -07:00
Islam AbdelRahman	cccd2199a6	Revert skip bottommost compaction Summary: Reverting this diff https://reviews.facebook.net/D39999 Will add an option to force bottom most level compaction and then re submit it Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D40041	2015-06-12 10:43:33 -07:00
Islam AbdelRahman	20f2b54252	Skip bottom most level compaction if no compaction filter Summary: If we don't have a compaction filter then we can skip compacting the bottom most level Test Plan: make check added unit tests Reviewers: yhchiang, sdong, igor Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39999	2015-06-12 09:56:08 -07:00
sdong	7842920be5	Slow down writes by bytes written Summary: We slow down data into the database to the rate of options.delayed_write_rate (a new option) with this patch. The thread synchronization approach I take is to still synchronize write controller by DB mutex and GetDelay() is inside DB mutex. Try to minimize the frequency of getting time in GetDelay(). I verified it through db_bench and it seems to work. hard_rate_limit is deprecated. options.delayed_write_rate is still not dynamically changeable. Need to work on it as a follow-up. Test Plan: Add new unit tests in db_test Reviewers: yhchiang, rven, kradhakrishnan, anthony, MarkCallaghan, igor Reviewed By: igor Subscribers: ikabiljo, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36351	2015-06-11 20:42:18 -07:00
Islam AbdelRahman	d6ce0f7c61	Add largest sequence to FlushJobInfo Summary: Adding largest sequence number to FlushJobInfo and passing flushed file metadata to NotifyOnFlushCompleted, which includes a lot of other values that we may want to expose in FlushJobInfo Test Plan: make check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39927	2015-06-11 15:22:22 -07:00
Yueh-Hsuan Chiang	3eddd1abe9	Add Env::GetThreadID(), which returns the ID of the current thread. Summary: Add Env::GetThreadID(), which returns the ID of the current thread. In addition, make GetThreadList() and InfoLog use same unique ID for the same thread. Test Plan: db_test listener_test Reviewers: igor, rven, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39735	2015-06-11 14:18:02 -07:00
Islam AbdelRahman	73faa3d41d	Handling edge cases for ReFitLevel Summary: Right now the level we pass to ReFitLevel is the maximum level with files (before compaction), there are multiple cases where this maximum level has changed after compaction - all files were in L0 (now maximum level is L1) - using kCompactionStyleUniversal (now maximum level is the last level) - level_compaction_dynamic_level_bytes ?? We can handle each of these cases individually, but I felt it's safer to calculate max_level_with_files again if we want to do a ReFitLevel Test Plan: adding some tests make -j64 check Reviewers: igor, sdong Reviewed By: sdong Subscribers: ott, dhruba Differential Revision: https://reviews.facebook.net/D39663	2015-06-11 14:15:52 -07:00
Venkatesh Radhakrishnan	406a5682eb	Fix hang when closing a DB after doing loads with WAL disabled. Summary: There is a hang during DB close in the following scenario: a) a load with WAL disabled was done, b) CancelAllBackgroundWork was called, c) DB Close was called. This was because in that case we will wait for a flush but we cannot do a background flush because we have called CancelAllBackgroundWork, which marks the DB as shutting down. Test Plan: Added DBTest FlushOnDestroy Reviewers: sdong Reviewed By: sdong Subscribers: yoshinorim, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39747	2015-06-09 10:39:49 -07:00
sdong	d8c8f08c12	GetSnapshot() and ReleaseSnapshot() to move new and free out of DB mutex Summary: We currently issue malloc and free inside DB mutex in GetSnapshot() and ReleaseSnapshot(). Move them out. Test Plan: Go through all tests make valgrind_check Reviewers: yhchiang, rven, IslamAbdelRahman, anthony, igor Reviewed By: igor Subscribers: maykov, hermanlee4, MarkCallaghan, yoshinorim, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39753	2015-06-08 21:57:02 -07:00
sdong	6df589b446	Add TablePropertiesCollector::NeedCompact() to suggest DB to further compact output files Summary: It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file. It can be used to allow users to suggest DB to clear up delete tombstones faster. Test Plan: Add a unit test. Reviewers: igor, yhchiang, kradhakrishnan, rven Reviewed By: rven Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39585	2015-06-05 20:18:21 -07:00
Yueh-Hsuan Chiang	2e764f06ea	[API Change] Improve EventListener::OnFlushCompleted interface Summary: EventListener::OnFlushCompleted() now passes a structure instead of a list of parameters. This minimizes the API change in the future. Test Plan: listener_test compact_files_test example/compact_files_example Reviewers: kradhakrishnan, sdong, IslamAbdelRahman, rven, igor Reviewed By: rven, igor Subscribers: IslamAbdelRahman, rven, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39543	2015-06-05 12:28:51 -07:00
Yueh-Hsuan Chiang	7322c74012	Revert incorrect commit Summary: Revert incorrect commit Test Plan: db_test Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39651	2015-06-05 11:23:09 -07:00
Islam AbdelRahman	31e60e2a77	Unlock mutex in ReFitLevel Summary: I encountered an issue where the database hung; it looks like the mutex is not unlocked on return in the ReFitLevel function. Test Plan: make -j64 check Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D39609	2015-06-05 11:06:14 -07:00
Yueh-Hsuan Chiang	7647df8f9e	Fixed the tsan failure in util/compaction_job_stats_impl.cc Summary: The type of smallest_output_key_prefix and largest_output_key_prefix have been changed to std::string in https://reviews.facebook.net/D39537. As a result, we shouldn't do smallest_output_key_prefix[0] = 0 in the initialization. Test Plan: compile db_test with tsan enabled and repeat DBTest.CompactionDeletionTrigger test to verify the tsan issue is gone. Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39645	2015-06-05 11:05:35 -07:00
Islam AbdelRahman	3ce3bb3da2	Allowing L0 -> L1 trivial move on sorted data Summary: This diff updates the logic of how we do trivial move. Trivial move can now run on any number of files in the input level as long as they are not overlapping. The conditions for trivial move have been updated. Introduced conditions: - Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual) - Input level files cannot be overlapping Removed conditions: - Trivial move only runs when the compaction is not manual - Input level can contain only 1 file More context on what tests failed because of Trivial move ``` DBTest.CompactionsGenerateMultipleFiles This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1 ``` ``` DBTest.NoSpaceCompactRange This test expects compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation because trivial move does not need any extra space, and did not check for that ``` ``` DBTest.DropWrites Similar to DBTest.NoSpaceCompactRange ``` ``` DBTest.DeleteObsoleteFilesPendingOutputs This test expects that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3 ``` ``` CuckooTableDBTest.CompactionIntoMultipleFiles Same as DBTest.CompactionsGenerateMultipleFiles ``` This diff is based on a work by @sdong https://reviews.facebook.net/D34149 Test Plan: make -j64 check Reviewers: rven, sdong, igor Reviewed By: igor Subscribers: yhchiang, ott, march, dhruba, sdong Differential Revision: https://reviews.facebook.net/D34797	2015-06-04 16:51:25 -07:00
Yueh-Hsuan Chiang	0b3172d071	Add EventListener::OnTableFileDeletion() Summary: Add EventListener::OnTableFileDeletion(), which will be called when a table file is deleted. Test Plan: Extend three existing tests in db_test to verify the deleted files. Reviewers: rven, anthony, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38931	2015-06-03 19:57:01 -07:00
Yueh-Hsuan Chiang	8afafc2783	Fix compile warning in db/db_impl Summary: Fix the following compile warning in db/db_impl db/db_impl.cc:1603:19: error: implicit conversion loses integer precision: 'const uint64_t' (aka 'const unsigned long') to 'int' [-Werror,-Wshorten-64-to-32] info.job_id = job_id; ~ ^~~~~~ Test Plan: db_test Reviewers: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39423	2015-06-02 17:36:45 -07:00
Yueh-Hsuan Chiang	fe5c6321cb	Allow EventListener::OnCompactionCompleted to return CompactionJobStats. Summary: Allow EventListener::OnCompactionCompleted to return CompactionJobStats, which contains useful information about a compaction. Example CompactionJobStats returned by OnCompactionCompleted(): smallest_output_key_prefix 05000000 largest_output_key_prefix 06990000 elapsed_time 42419 num_input_records 300 num_input_files 3 num_input_files_at_output_level 2 num_output_records 200 num_output_files 1 actual_bytes_input 167200 actual_bytes_output 110688 total_input_raw_key_bytes 5400 total_input_raw_value_bytes 300000 num_records_replaced 100 is_manual_compaction 1 Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats. Reviewers: rven, igor, anthony, sdong Reviewed By: sdong Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38463	2015-06-02 17:07:16 -07:00
Yueh-Hsuan Chiang	fc83821270	Add EventListener::OnTableFileCreated() Summary: Add EventListener::OnTableFileCreated(), which will be called when a table file is created. This patch is part of the EventLogger and EventListener integration. Test Plan: Augment existing test in db/listener_test.cc Reviewers: anthony, kradhakrishnan, rven, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38865	2015-06-02 14:12:23 -07:00
Yueh-Hsuan Chiang	898e803fc5	Add a stats counter for DB_WRITE back which was mistakenly removed. Summary: Add a stats counter for DB_WRITE back which was mistakenly removed. Test Plan: augment GroupCommitTest Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39399	2015-06-02 12:35:12 -07:00
Mike Kolupaev	ec7a944360	more times in perf_context and iostats_context Summary: We occasionally get write stalls (>1s Write() calls) on HDD under read load. The following timers explain almost all of the stalls: - perf_context.db_mutex_lock_nanos - perf_context.db_condition_wait_nanos - iostats_context.open_time - iostats_context.allocate_time - iostats_context.write_time - iostats_context.range_sync_time - iostats_context.logger_time In my experiments each of these occasionally takes >1s on write path under some workload. There are rare cases when Write() takes long but none of these takes long. Test Plan: Added code to our application to write the listed timings to log for slow writes. They usually add up to almost exactly the time Write() call took. Reviewers: rven, yhchiang, sdong Reviewed By: sdong Subscribers: march, dhruba, tnovak Differential Revision: https://reviews.facebook.net/D39177	2015-06-02 02:07:58 -07:00
sdong	4266d4fd90	Allow users to migrate to options.level_compaction_dynamic_level_bytes=true using CompactRange() Summary: In DB::CompactRange(), change parameter "reduce_level" to "change_level". Users can compact all data to the last level if needed. By doing it, users can migrate the DB to options.level_compaction_dynamic_level_bytes=true. Test Plan: Add a unit test for it. Reviewers: yhchiang, anthony, kradhakrishnan, igor, rven Reviewed By: rven Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39099	2015-06-01 18:21:14 -07:00
Yueh-Hsuan Chiang	d333820bad	Removed DBImpl::notifying_events_ Summary: DBImpl::notifying_events_ is an internal counter in DBImpl which is used to prevent DB close when DB is notifying events. However, as the current events all rely on either compaction or flush which already have similar counters to prevent DB close, it is safe to remove notifying_events_. Test Plan: listener_test examples/compact_files_example Reviewers: igor, anthony, kradhakrishnan, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39315	2015-06-01 15:32:23 -07:00
agiardullo	bc7a7a400c	fix LITE build Summary: Broken by optimistic transaction diff. (I only built 'release' not 'static_lib' when testing). Test Plan: build Reviewers: yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39219	2015-05-29 15:22:00 -07:00
agiardullo	dc9d70de65	Optimistic Transactions Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of ensuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported. Test Plan: Added a new test, but still need to look into stress testing. Reviewers: yhchiang, igor, rven, sdong Reviewed By: sdong Subscribers: adamretter, MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D33435	2015-05-29 14:36:35 -07:00
agiardullo	c815351038	Support saving history in memtable_list Summary: For transactions, we are using the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to somehow keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit. After chatting with people, it seems like everyone prefers just using Memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decided to create a separate list in MemtableListVersion as using the same list complicated the flush/installalflushresults logic too much. This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters. So I am making the default size of flushed+not-flushed memtables be set to max_write_buffers. This should not change the maximum amount of memory used, but make it more likely we're using closer to the limit. (We are now postponing deleting flushed memtables until the max_write_buffer limit is reached). So while we might use more memory on average, we are still obeying the limit set (and you could argue it's better to go ahead and use up memory now instead of waiting for a write stall to happen to test this limit). However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions. Test Plan: Added a xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test and planning on adding more testing here. Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37443	2015-05-28 16:34:24 -07:00
Yueh-Hsuan Chiang	ec4ff4e99c	Rename EventLoggerHelpers EventHelpers Summary: Rename EventLoggerHelpers EventHelpers, as it's going to include all event-related helper functions instead of EventLogger only stuffs. Test Plan: make Reviewers: sdong, rven, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39093	2015-05-28 13:37:47 -07:00
Yueh-Hsuan Chiang	672dda9b3b	[API Change] Move listeners from ColumnFamilyOptions to DBOptions Summary: Move listeners from ColumnFamilyOptions to DBOptions Test Plan: listener_test compact_files_test Reviewers: rven, anthony, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D39087	2015-05-28 13:21:39 -07:00
Yueh-Hsuan Chiang	a0580205c8	Removed an unused private variable in db_impl.h Summary: Removed an unused private variable in db_impl.h Test Plan: make db_test Reviewers: sdong, anthony, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38925	2015-05-26 10:46:26 -07:00
Yueh-Hsuan Chiang	5c224d1b70	Fixed two bugs on logging file deletion. Summary: This patch fixes the following two bugs on logging file deletion. 1. Previously, file deletion failure was only logged in INFO_LEVEL. This patch changes it to ERROR_LEVEL and does some code clean. 2. EventLogger previously will always generate the same log on table file deletion even when file deletion is not successful. Now the resulting status of file deletion will also be logged. Test Plan: make all check Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38817	2015-05-22 12:10:51 -07:00
Yueh-Hsuan Chiang	dc81efe415	Change the log-level of DB summary and options from INFO_LEVEL to WARN_LEVEL Summary: Change the log-level of DB summary and options from INFO_LEVEL to WARN_LEVEL Test Plan: Use db_bench to verify the log level. Sample output: 2015/05/22-00:20:39.778064 7fff75b41300 [WARN] RocksDB version: 3.11.0 2015/05/22-00:20:39.778095 7fff75b41300 [WARN] Git sha rocksdb_build_git_sha:7fee8775a459134c4cb04baae5bd1687e268f2a0 2015/05/22-00:20:39.778099 7fff75b41300 [WARN] Compile date May 22 2015 2015/05/22-00:20:39.778101 7fff75b41300 [WARN] DB SUMMARY 2015/05/22-00:20:39.778145 7fff75b41300 [WARN] SST files in /tmp/rocksdbtest-691931916/dbbench dir, Total Num: 0, files: 2015/05/22-00:20:39.778148 7fff75b41300 [WARN] Write Ahead Log file in /tmp/rocksdbtest-691931916/dbbench: 2015/05/22-00:20:39.778150 7fff75b41300 [WARN] Options.error_if_exists: 0 2015/05/22-00:20:39.778152 7fff75b41300 [WARN] Options.create_if_missing: 1 2015/05/22-00:20:39.778153 7fff75b41300 [WARN] Options.paranoid_checks: 1 Reviewers: MarkCallaghan, igor, kradhakrishnan Reviewed By: igor Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38835	2015-05-22 11:54:59 -07:00
Yueh-Hsuan Chiang	2abb592688	Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery(). Summary: Avoid logging under mutex in DBImpl::WriteLevel0TableForRecovery(). Test Plan: make all check Reviewers: igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38823	2015-05-22 11:24:12 -07:00
Yueh-Hsuan Chiang	e2c1d4b57f	[Public API Change] Make DB::GetDbIdentity() be const function. Summary: Make DB::GetDbIdentity() be const function. Test Plan: make db_test Reviewers: igor, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38745	2015-05-21 11:01:48 -07:00
Yueh-Hsuan Chiang	812c461c96	Dump db stats in WARN level Summary: Dump db stats in WARN level Test Plan: run db_bench and verify the LOG Reviewers: igor, MarkCallaghan Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38691	2015-05-19 18:42:17 -07:00
Igor Canadi	4a855c0799	Add an option wal_bytes_per_sync to control sync_file_range for WAL files Summary: sync_file_range is not always asynchronous and thus can block writes if we do this for WAL in the foreground thread. See more here: http://yoshinorimatsunobu.blogspot.com/2014/03/how-syncfilerange-really-works.html Some users don't want us to call sync_file_range on WALs. Some others do. Thus, I'm adding a separate option wal_bytes_per_sync to control calling sync_file_range on WAL files. bytes_per_sync will apply only to table files now. Test Plan: no more sync_file_range for WAL as evidenced by strace Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38253	2015-05-18 17:03:59 -07:00
Igor Canadi	b0fdda4ff0	Allow flushes to run in parallel with manual compaction Summary: As title. I spent some time thinking about it and I don't think there should be any issue with running manual compaction and flushes in parallel Test Plan: make check works Reviewers: rven, yhchiang, sdong Reviewed By: yhchiang, sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38355	2015-05-18 15:34:33 -07:00
sdong	6fa7085121	CompactRange skips levels 1 to base_level -1 for dynamic level base size Summary: CompactRange() now is much more expensive for dynamic level base size as it goes through all the levels. Skip the unused levels between level 0 and the base level. Test Plan: Run all unit tests Reviewers: yhchiang, rven, anthony, kradhakrishnan, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37125	2015-05-18 10:54:11 -07:00
Igor Canadi	dbd95b7532	Add more table properties to EventLogger Summary: Example output: {"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}} Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter. Test Plan: make check + check out the output in the log Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38343	2015-05-12 15:53:55 -07:00
agiardullo	711465ccec	API to fetch from both a WriteBatchWithIndex and the db Summary: Added a couple functions to WriteBatchWithIndex to make it easier to query the value of a key including reading pending writes from a batch. (This is needed for transactions). I created write_batch_with_index_internal.h to use to store an internal-only helper function since there wasn't a good place in the existing class hierarchy to store this function (and it didn't seem right to stick this function inside WriteBatchInternal::Rep). Since I needed to access the WriteBatchEntryComparator, I moved some helper classes from write_batch_with_index.cc into write_batch_with_index_internal.h/.cc. WriteBatchIndexEntry, ReadableWriteBatch, and WriteBatchEntryComparator are all unchanged (just moved to a different file(s)). Test Plan: Added new unit tests. Reviewers: rven, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38037	2015-05-11 14:51:51 -07:00
Igor Canadi	65fe1cfbb3	Cleanup CompactionJob Summary: Couple changes: 1. instead of SnapshotList, just take a vector of snapshots 2. don't take a separate parameter is_snapshots_supported. If there are snapshots in the list, that means they are supported. I actually think we should get rid of this notion of snapshots not being supported. 3. don't pass in mutable_cf_options as a parameter. Lifetime of mutable_cf_options is a bit tricky to maintain, so it's better to not pass it in for the whole compaction job. We only really need it when we install the compaction results. Test Plan: make check Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36627	2015-05-05 19:01:12 -07:00
Laurent Demailly	df4130ad85	fix crashes in stats and compaction filter for db_ttl_impl Summary: fix crashes in stats and compaction filter for db_ttl_impl Test Plan: Ran build with lots of debugging https://reviews.facebook.net/differential/diff/194175/ Reviewers: yhchiang, igor, rven Reviewed By: igor Subscribers: rven, dhruba Differential Revision: https://reviews.facebook.net/D38001	2015-05-05 16:54:47 -07:00
Igor Canadi	36a7408896	Fix UNLIKELY parenthesis Summary: Ooops :) status.ok() is actually highly likely :) Test Plan: none Reviewers: rven, yhchiang, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38043	2015-05-05 08:57:34 -07:00
Venkatesh Radhakrishnan	d2346c2cf0	Fix hang with large write batches and column families. Summary: This diff fixes a hang reported by a Github user. https://www.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fissues%2F595%23issuecomment-96983273&h=9AQFYOWlo Multiple large write batches with column families cause a hang. The issue was caused by not doing flushes/compaction when the write controller was stopped. Test Plan: Create a DBTest from the user's test case Reviewers: igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37929	2015-05-01 15:41:50 -07:00
krad	d4540654e9	Optimize GetApproximateSizes() to use lesser CPU cycles. Summary: CPU profiling reveals GetApproximateSizes as a bottleneck for performance. The current implementation is sub-optimal, it scans every file in every level to compute the result. We can take advantage of the fact that all levels above 0 are sorted in the increasing order of key ranges and use binary search to locate the starting index. This can reduce the number of comparisons required to compute the result. Test Plan: We have good test coverage. Run the tests. Reviewers: sdong, igor, rven, dynamike Subscribers: dynamike, maykov, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37755	2015-04-30 10:55:03 -07:00
Igor Canadi	7f47ba0e26	Fix possible SIGSEGV in CompactRange (github issue #596 ) Summary: For very detailed explanation of what's happening read this: https://github.com/facebook/rocksdb/issues/596 Test Plan: make check + new unit test Reviewers: yhchiang, anthony, rven Reviewed By: rven Subscribers: adamretter, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37779	2015-04-29 10:52:31 -07:00
Igor Canadi	1bb4928da9	Include bunch of more events into EventLogger Summary: Added these events: * Recovery start, finish and also when recovery creates a file * Trivial move * Compaction start, finish and when compaction creates a file * Flush start, finish Also includes small fix to EventLogger Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files. Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff. TODO: * Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec * Write tools that parse and analyze LOGs. This can be in python or go. Good intern task. Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976 Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37521	2015-04-27 15:20:02 -07:00
sdong	d01bbb53ae	Fix CompactRange for universal compaction with num_levels > 1 Summary: CompactRange for universal compaction with num_levels > 1 seems to have a bug. The unit test also has a bug so it doesn't capture the problem. Fix it. Revert the compact range to the logic equivalent to num_levels=1. Always compact all files together. It should also fix DBTest.IncreaseUniversalCompactionNumLevels. The issue was that options.write_buffer_size = 100 << 10 and options.write_buffer_size = 100 << 10 are not used in later test scenarios. So write_buffer_size of 4MB was used. The compaction trigger condition is not anymore obvious as expected. Test Plan: Run the new test and all test suites Reviewers: yhchiang, rven, kradhakrishnan, anthony, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37551	2015-04-23 19:12:31 -07:00
Igor Canadi	e003d3864c	Abstract out SetMaxPossibleForUserKey() and SetMinPossibleForUserKey Summary: Based on feedback from D37083. Are all of these correct? In some spaces it seems like we're doing SetMaxPossibleForUserKey() although we want the smallest possible internal key for user key. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37341	2015-04-23 18:08:37 -07:00
sdong	debaf85ef5	Bug of trivial move of dynamic level Summary: D36669 introduces a bug that trivial moved data is not going to specific level but the next level, which will incorrectly be level 1 for level 0 compaction if base level is not level 1. Fixing it by appreciating the output level Test Plan: Run all tests Reviewers: MarkCallaghan, rven, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D37119	2015-04-14 21:42:08 -07:00
Igor Canadi	47b8743984	Make Compaction class easier to use Summary: The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong). Here are couple of things demonstrating that Compaction class is hard to use: 1. we have two constructors of Compaction class 2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles 3. it's easy to introduce a subtle and dangerous bug like this: D36225 4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: `afbafeaeae/db/compaction.cc (L236-L241)`. It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: `afbafeaeae/db/compaction_picker.cc (L204-L210)` The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup. My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object. This diff doesn't fully achieve that goal, but it comes pretty close. 
Here are some of the changes: * have one Compaction constructor instead of two. * inputs_ is constant after construction * MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction. * SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input. * CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need. Test Plan: make check make asan_check make valgrind_check Reviewers: rven, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: sdong, yhchiang, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36687	2015-04-10 15:01:54 -07:00
Yueh-Hsuan Chiang	9741dec0e5	Fix a compile error in ROCKSDB_LITE in db/db_impl.cc Summary: Fix a compile error in ROCKSDB_LITE in db/db_impl.cc related to internal_stats. Test Plan: make OPT=-DROCKSDB_LITE shared_lib Reviewers: sdong, igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36819	2015-04-09 17:07:29 -07:00
sdong	b1bbdd7919	Create EnvOptions using sanitized DB Options Summary: Now EnvOptions uses unsanitized DB options. bytes_per_sync is tuned off when rate_limiter is used, but this change doesn't take effect. Test Plan: See different I/O pattern in db_bench running fillseq. Reviewers: yhchiang, kradhakrishnan, rven, anthony, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D36723	2015-04-08 14:40:42 -07:00
Igor Canadi	5e067a7b19	Clean up compression logging Summary: Now we add warnings when user configures compression and the compression is not supported. Test Plan: Configured compression to non-supported values. Observed messages in my log: 2015/03/26-12:17:57.586341 7ffb8a496840 [WARN] Compression type chosen for level 2 is not supported: LZ4. RocksDB will not compress data on level 2. 2015/03/26-12:19:10.768045 7f36f15c5840 [WARN] Compression type chosen is not supported: LZ4. RocksDB will not compress data. Reviewers: rven, sdong, yhchiang Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35979	2015-04-06 12:50:44 -07:00
sdong	953a885ebf	A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge Summary: Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it. Also refactor the codes so that (1) make table property collector and internal table property collector two separate data structures with the later one now exposed (2) table builders only receive internal table properties Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector. Reviewers: yhchiang, igor.sugak, rven, igor Reviewed By: rven, igor Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D35373	2015-04-06 10:27:21 -07:00
Venkatesh Radhakrishnan	afbafeaeae	Disallow trivial move if compression level is different Summary: Check compression level of start_level with output_compression before allowing trivial move Test Plan: New DBTest CompressLevelCompactionThirdPath added Reviewers: igor, yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36213	2015-04-02 11:06:30 -07:00
sdong	b23bbaa82a	Universal Compactions with Small Files Summary: With this change, we use L1 and up to store compaction outputs in universal compaction. The compaction pick logic stays the same. Outputs are stored in the largest "level" as possible. If options.num_levels=1, it behaves all the same as now. Test Plan: 1) convert most of existing unit tests for universal compaction to include the option of one level and multiple levels. 2) add a unit test to cover parallel compaction in universal compaction and run it in one level and multiple levels 3) add unit test to migrate from multiple level setting back to one level setting 4) add a unit test to insert keys to trigger multiple rounds of compactions and verify results. Reviewers: rven, kradhakrishnan, yhchiang, igor Reviewed By: igor Subscribers: meyering, leveldb, MarkCallaghan, dhruba Differential Revision: https://reviews.facebook.net/D34539	2015-03-30 15:12:02 -07:00
Igor Canadi	fd3dbef22b	Clean up old log files in background threads Summary: Cleaning up log files can do heavy IO, since we call ftruncate() in the destructor. We don't want to call ftruncate() in user threads. This diff moves cleaning to background threads (flush and compaction) Test Plan: make check, will also run valgrind Reviewers: yhchiang, rven, MarkCallaghan, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36177	2015-03-30 15:04:10 -04:00
Herman Lee	e018892bb6	Formalize the DB properties string definitions. Summary: Assign the string properties to const string variables under the DB::Properties namespace. This helps catch typos during compilation and also consolidates the property definition in one place. Test Plan: Run rocksdb unit tests Reviewers: sdong, yoshinorim, igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D35991	2015-03-27 14:50:20 -07:00
Igor Canadi	030859eb5d	Dump compression info on startup Summary: It's useful to know if we have compression support or not Test Plan: Observed this in my LOG: 2015/03/26-10:34:35.460681 7f5b322b7840 Snappy supported 2015/03/26-10:34:35.460682 7f5b322b7840 Zlib supported 2015/03/26-10:34:35.460686 7f5b322b7840 Bzip supported 2015/03/26-10:34:35.460687 7f5b322b7840 LZ4 NOT supported Reviewers: sdong, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35955	2015-03-26 11:22:20 -07:00
Yueh-Hsuan Chiang	a057bb2a8e	Improve ThreadStatusSingleCompaction Summary: Improve ThreadStatusSingleCompaction in two ways: 1. Use SYNC_POINT to ensure compaction won't happen before the test finishes its "Put Phase" instead of using sleep. 2. In Put Phase, it continues until we have sufficient number of L0 files. Note that during the put phase, there won't be any compaction that consumes L0 files because of item 1. Test Plan: ./db_test --gtest_filter="ThreadStatusSingleCompaction" Reviewers: sdong, igor, rven Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D35727	2015-03-23 15:30:45 -07:00
Igor Canadi	b088c83e6e	Don't delete files when column family is dropped Summary: To understand the bug read t5943287 and check out the new test in column_family_test (ReadDroppedColumnFamily), iter 0. RocksDB contract allows you to read a dropped column family as long as there is a live reference. However, since our iteration ignores dropped column families, AddLiveFiles() didn't mark files of dropped column families as live. So we deleted them. In this patch I no longer ignore dropped column families in the iteration. I think this behavior was confusing and it also led to this bug. Now if an iterator client wants to ignore dropped column families, he needs to do it explicitly. Test Plan: Added a new unit test that is failing on master. Unit test succeeds now. Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32535	2015-03-19 17:04:29 -07:00
Igor Canadi	c88ff4ca76	Deprecate removeScanCountLimit in NewLRUCache Summary: It is no longer used by the implementation, so we should also remove it from the public API. Test Plan: make check Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34971	2015-03-17 15:04:37 -07:00
Venkatesh Radhakrishnan	98c37fda5d	Remove unused parameter in CancelAllBackgroundWork Summary: Some suggestions for cleanup from Igor. Test Plan: Regression tests. Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D35169	2015-03-16 21:07:54 -07:00
Venkatesh Radhakrishnan	b2b3086524	Speed up rocksDB close call. Summary: On RocksDB, when there are multiple instances doing flushes/compactions in the background, the close call takes a long time because the flushes/compactions need to complete before the database can shut down. If another instance is using the background threads and the compaction for this instance is in the queue since it has been scheduled, we still cannot shutdown. We now remove the scheduled background tasks which have not yet started running, so that shutdown is speeded up. Test Plan: DB Test added. Reviewers: yhchiang, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33741	2015-03-16 18:49:14 -07:00
Igor Canadi	52d8347a91	EventLogger Summary: Here's my proposal for making our LOGs easier to read by machines. The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize. I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc) Test Plan: Ran db_bench. Observed: 2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699} 2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12} Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31647	2015-03-13 10:15:54 -07:00
Yueh-Hsuan Chiang	2b785d76b8	Fixed a bug where CompactFiles won't delete obsolete files until flush. Summary: Fixed a bug where CompactFiles won't delete obsolete files until flush. Test Plan: ./compact_files_test export ROCKSDB_TESTS=CompactFiles ./db_test Reviewers: rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34671	2015-03-11 13:06:59 -07:00
Venkatesh Radhakrishnan	284be570c8	Provide a mechanism to inform Rocksdb that it is shutting down Summary: Provide an API which enables users to inform Rocksdb that it is shutting down. Test Plan: db_test Reviewers: sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34617	2015-03-11 10:31:02 -07:00
Igor Canadi	485ac0dbd0	Add rate_limiter to string options Summary: I want to be able to set this through mongo config. Test Plan: added unit test Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34599	2015-03-06 14:21:15 -08:00
Yueh-Hsuan Chiang	694988b627	Fix a bug in stall time counter. Improve its output format. Summary: Fix a bug in stall time counter. Improve its output format. Test Plan: export ROCKSDB_TESTS=Timeout ./db_test ./db_bench --benchmarks=fillrandom --stats_interval=10000 --statistics=true --stats_per_interval=1 --num=1000000 --threads=4 --level0_stop_writes_trigger=3 --level0_slowdown_writes_trigger=2 sample output: Uptime(secs): 35.8 total, 0.0 interval Cumulative writes: 359590 writes, 359589 keys, 183047 batches, 2.0 writes per batch, 0.04 GB user ingest, stall seconds: 1786.008 ms Cumulative WAL: 359591 writes, 183046 syncs, 1.96 writes per sync, 0.04 GB written Interval writes: 253 writes, 253 keys, 128 batches, 2.0 writes per batch, 0.0 MB user ingest, stall time: 0 us Interval WAL: 253 writes, 128 syncs, 1.96 writes per sync, 0.00 MB written Reviewers: MarkCallaghan, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34275	2015-03-03 12:48:12 -08:00
Igor Canadi	db03739340	options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically. Summary: When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases. In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier. Test Plan: New unit tests and pass tests suites including valgrind. Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo Reviewed By: ikabiljo Subscribers: yoshinorim, ikabiljo, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31437	2015-03-02 22:40:41 -08:00
Mark Callaghan	c4bd03a97e	Fix typo in log message Summary: fix typo Task ID: # Blame Rev: Test Plan: Revert Plan: Database Impact: Memcache Impact: Other Notes: EImportant: - begin PUBLIC platform impact section - Bugzilla: # - end platform impact - Reviewers: igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34251	2015-03-02 09:35:50 -08:00
Igor Sugak	62247ffa3b	rocksdb: Add missing override Summary: When using latest clang (3.6 or 3.7/trunk) rocksdb is failing with many errors. Almost all of them are missing override errors. This diff adds missing override keyword. No manual changes. Prerequisites: bear and clang 3.5 build with extra tools ```lang=bash % USE_CLANG=1 bear make all # generate a compilation database http://clang.llvm.org/docs/JSONCompilationDatabase.html % clang-modernize -p . -include . -add-override % make format ``` Test Plan: Make sure all tests are passing. ```lang=bash % #Use default fb code clang. % make check ``` Verify fewer errors and no missing override errors. ```lang=bash % # Have trunk clang present in path. % ROCKSDB_NO_FBCODE=1 CC=clang CXX=clang++ make ``` Reviewers: igor, kradhakrishnan, rven, meyering, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34077	2015-02-26 11:28:41 -08:00
Jinfu Leng	96d989f70d	catch config errors with L0 file count triggers Test Plan: Run "make clean && make all check" Reviewers: rven, igor, yhchiang, kradhakrishnan, MarkCallaghan, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D33627	2015-02-23 16:08:27 -08:00
Jim Meyering	a42324e370	build: do not relink every single binary just for a timestamp Summary: Prior to this change, "make check" would always waste a lot of time relinking 60+ binaries. With this change, it does that only when the generated file, util/build_version.cc, changes, and that happens only when the date changes or when the current git SHA changes. This change makes some other improvements: before, there was no rule to build a deleted util/build_version.cc. If it was somehow removed, any attempt to link a program would fail. There is no longer any need for the separate file, build_tools/build_detect_version. Its functionality is now in the Makefile. * Makefile (DEPFILES): Don't filter-out util/build_version.cc. No need, and besides, removing that dependency was wrong. (date, git_sha, gen_build_version): New helper variables. (util/build_version.cc): New rule, to create this file and update it only if it would contain new information. * build_tools/build_detect_platform: Remove file. * db/db_impl.cc: Now, print only date (not the time). * util/build_version.h (rocksdb_build_compile_time): Remove declaration. No longer used. Test Plan: - Run "make check" twice, and note that the second time no linking is performed. - Remove util/build_version.cc and ensure that any "make" command regenerates it before doing anything else. - Run this: strings librocksdb.a\|grep _build_. That prints output including the following: rocksdb_build_git_date:2015-02-19 rocksdb_build_git_sha:2.8.fb-1792-g3cb6cc0 Reviewers: ljin, sdong, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D33591	2015-02-19 13:11:10 -08:00
Venkatesh Radhakrishnan	7d817268b9	Managed iterator Summary: This is a diff for managed iterator. A managed iterator is a wrapper around an iterator which saves the options for that iterator as well as the current key/value so that the underlying iterator and its associated memory can be released when it is aged out automatically or on the request of the user. Will provide the automatic release as a follow-up diff. Test Plan: Managed* tests in db_test and XF tests for managed iterator Reviewers: igor, yhchiang, anthony, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31401	2015-02-18 11:49:31 -08:00
Yueh-Hsuan Chiang	e60bc99fe0	Allow GetThreadList to reflect flush activity. Summary: Allow GetThreadList to reflect flush activity. Test Plan: Developed ThreadStatusFlush test and updated ThreadStatusMultiCompaction test. ./db_test ./thread_list_test Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D32871	2015-02-17 10:13:52 -08:00
Igor Canadi	e7ea51a8e7	Introduce job_id for flush and compaction Summary: It would be good to assign background jobs their IDs. Two benefits: 1) makes LOGs more readable 2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize Test Plan: ran rocksdb, read the LOG Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31617	2015-02-12 09:54:48 -08:00
Igor Canadi	863009b5a5	Fix deleting obsolete files #2 Summary: For description of the bug, see comment in db_test. The fix is pretty straight forward. Test Plan: added unit test. eventually we need better testing of FOF/POF process. Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33081	2015-02-09 17:38:32 -08:00
sdong	91ac3b2067	Print DB pointer when opening a DB Summary: Having a pointer for DB will be helpful to debug when using GDB or working on a dump. If the client process doesn't have any thread actively working on RocksDB, it can be hard to find out. Test Plan: make all check Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33159	2015-02-09 12:52:58 -08:00
Igor Canadi	2a979822b6	Fix deleting obsolete files Summary: This diff basically reverts D30249 and also adds a unit test that was failing before this patch. I have no idea how I didn't catch this terrible bug when writing a diff, sorry about that :( I think we should redesign our system of keeping track of and deleting files. This is already a second bug in this critical piece of code. I'll think of few ideas. BTW this diff is also a regression when running lots of column families. I plan to revisit this separately. Test Plan: added a unit test Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33045	2015-02-06 08:44:30 -08:00
Igor Canadi	6f10130354	Fix DestroyDB Summary: When DestroyDB() finds a wal file in the DB directory, it assumes it is actually in WAL directory. This can lead to confusion, since it reports IO error when it tries to delete wal file from DB directory. For example: https://ci-builds.fb.com/job/rocksdb_clang_build/296/console This change will fix our unit tests. Test Plan: unit tests work Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32907	2015-02-05 20:09:42 -08:00
Yueh-Hsuan Chiang	181191a1e4	Add a counter for collecting the wait time on db mutex. Summary: Add a counter for collecting the wait time on db mutex. Also add MutexWrapper and CondVarWrapper for measuring wait time. Test Plan: ./db_test export ROCKSDB_TESTS=MutexWaitStats ./db_test verify stats output using db_bench make clean make release ./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10 Sample output: rocksdb.db.mutex.wait.micros COUNT : 7546866 Reviewers: MarkCallaghan, rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32787	2015-02-04 21:39:45 -08:00
Ori Bernstein	f9758e0129	Add compaction listener. Summary: This adds a listener for compactions, and gives some useful statistics on each compaction pass. Test Plan: Unit tests. Reviewers: sdong, igor, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D31641	2015-01-27 14:44:02 -08:00
sdong	be8f0b12ed	Rename DBImpl::log_dir_unsynced_ to log_dir_synced_ Summary: log_dir_unsynced_ is a confusing name. Rename it to log_dir_synced_ and flip the value. Test Plan: Run ./fault_injection_test Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32235	2015-01-26 16:01:36 -08:00
sdong	d888c95748	Sync WAL Directory and DB Path if different from DB directory Summary: 1. If WAL directory is different from db directory. Sync the directory after creating a log file under it. 2. After creating an SST file, sync its parent directory instead of DB directory. 3. change the check of kResetDeleteUnsyncedFiles in fault_injection_test. Since we changed the behavior to sync log files' parent directory after first WAL sync, instead of creating, kResetDeleteUnsyncedFiles will not guarantee to show post sync updates. Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32067	2015-01-26 14:17:45 -08:00
Igor Canadi	42189612c3	Fix data race #2 Summary: We should not be calling InternalStats methods outside of the mutex. Test Plan: COMPILE_WITH_TSAN=1 m db_test && ROCKSDB_TESTS=CompactionTrigger ./db_test failing before the diff, works now Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32127	2015-01-23 18:04:39 -08:00
sdong	4e48753b73	Sync manifest file when initializing it Summary: Now we don't sync manifest file when initializing it, so DB cannot be safely reopened before the first mem table flush. Fix it by syncing it. This fixes fault_injection_test. Test Plan: make all check Reviewers: rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32001	2015-01-22 14:32:03 -08:00
sdong	206237d121	DBImpl::CheckConsistency() shouldn't create path name with double "/" Summary: GetLiveFilesMetaData() already adds a leading "/" in file name. No need to add one extra "/" in DBImpl::CheckConsistency() Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D31779	2015-01-21 16:36:13 -08:00
Yueh-Hsuan Chiang	b229f970df	Remove Compaction::ReleaseInputs(). Summary: This patch removes the unnecessary Compaction::ReleaseInputs(). Compaction::ReleaseInputs() tries to unref its input_version and column_family. However, such unref is always done in ~Compaction(), and all current ReleaseInputs() calls are right before the destructor. Test Plan: ./db_test Reviewers: igor Reviewed By: igor Subscribers: igor, rven, dhruba, sdong Differential Revision: https://reviews.facebook.net/D31605	2015-01-15 12:44:19 -08:00
Yueh-Hsuan Chiang	c91cdd59c1	Allow GetThreadList() to indicate a thread is doing Compaction. Summary: Allow GetThreadList() to indicate a thread is doing Compaction. Test Plan: export ROCKSDB_TESTS=ThreadStatus ./db_test Reviewers: ljin, igor, sdong Reviewed By: sdong Subscribers: leveldb, dhruba, jonahcohen, rven Differential Revision: https://reviews.facebook.net/D30105	2015-01-13 00:04:08 -08:00
sdong	9132e52ea4	DB Stats Dump to print total stall time Summary: Add printing of stall time in DB Stats: Sample outputs: DB Stats Uptime(secs): 53.2 total, 1.7 interval Cumulative writes: 625940 writes, 625939 keys, 625940 batches, 1.0 writes per batch, 0.49 GB user ingest, stall micros: 50691070 Cumulative WAL: 625940 writes, 625939 syncs, 1.00 writes per sync, 0.49 GB written Interval writes: 10859 writes, 10859 keys, 10859 batches, 1.0 writes per batch, 8.7 MB user ingest, stall micros: 1692319 Interval WAL: 10859 writes, 10859 syncs, 1.00 writes per sync, 0.01 MB written Test Plan: make all check verify printing using db_bench Reviewers: igor, yhchiang, rven, MarkCallaghan Reviewed By: MarkCallaghan Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D31239	2015-01-09 11:44:19 -08:00
Igor Canadi	7731d51c82	Simplify column family concurrency Summary: This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_. Before: * When mutating: lock DB mutex and spin lock * When reading: lock DB mutex OR spin lock After: * When mutating: lock DB mutex and be in write thread * When reading: lock DB mutex or be in write thread That way, we eliminate the spin lock that protects these hash maps and simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps). With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread). Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free. This addresses task t5116919. Test Plan: make check Stress test with lots and lots of column family drop and create: time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/ Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30651	2015-01-06 12:44:21 -08:00
Igor Canadi	07aa4e0e35	Fix compaction summary log for trivial move Summary: When trivial move commit is done, we log the summary of the input version instead of current. This is inconsistent with other log messages and confusing. Test Plan: compiles Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30939	2015-01-05 17:32:49 -08:00
Igor Canadi	62ad0a9b19	Deprecating skip_log_error_on_recovery Summary: Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery. The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not only just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have dug up old task t2416297 which confirms my findings. Test Plan: There were actually no tests that verified correct behavior of skip_log_error_on_recovery. Reviewers: yhchiang, rven, dhruba, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30603	2015-01-05 13:35:56 -08:00
Igor Canadi	fa0b126c0c	Fix corruption_test -- if status is not OK, return status -- during recovery	2015-01-05 10:49:41 -08:00
Igor Canadi	d7b4bb62a7	Fail DB::Open() on WAL corruption Summary: This is a serious bug. If paranoid_checks == true and WAL is corrupted, we don't fail DB::Open(). I tried going into history and it seems we've been doing this for a long long time. I found this when investigating t5852041. Test Plan: Added unit test to verify correct behavior. Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30597	2015-01-05 10:26:34 -08:00
Yueh-Hsuan Chiang	a944afd356	Fixed a compile error in db/db_impl.cc on ROCKSDB_LITE	2014-12-23 16:19:40 -08:00
Yueh-Hsuan Chiang	45bab305f9	Move GetThreadList() feature under Env. Summary: GetThreadList() feature depends on the thread creation and destruction, which is currently handled under Env. This patch moves GetThreadList() feature under Env to better manage the dependency of GetThreadList() feature on thread creation and destruction. Renamed ThreadStatusImpl to ThreadStatusUpdater. Add ThreadStatusUtil, which is a static class that contains utility functions for ThreadStatusUpdater. Test Plan: run db_test, thread_list_test and db_bench and verify the life cycle of Env and ThreadStatusUpdater is properly managed. Reviewers: igor, sdong Reviewed By: sdong Subscribers: ljin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30057	2014-12-22 12:20:17 -08:00
Igor Canadi	4fd26f287c	Only execute flush from compaction if max_background_flushes = 0 Summary: As title. We shouldn't need to execute flush from compaction if there are dedicated threads doing flushes. Test Plan: make check Reviewers: rven, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30579	2014-12-22 12:05:14 +01:00
Igor Canadi	0acc738810	Speed up FindObsoleteFiles() Summary: There are two versions of FindObsoleteFiles(): * full scan, which is executed every 6 hours (and it's terribly slow) * no full scan, which is executed every time a background process finishes and iterator is deleted This diff is optimizing the second case (no full scan). Here's what we do before the diff: * Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live. * Get the list of live files to avoid deleting files that are live. * Delete files that are in obsolete_files and not in live_files. After this diff: * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files. * Get the list of obsolete files (which exclude moved files). * No need to get the list of live files, since all files in obsolete_files need to be deleted. I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123 This depends on D30123. P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff. Test Plan: One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version. make check Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30249	2014-12-22 12:04:45 +01:00

1000 Commits