Commit Graph

436 Commits

Author SHA1 Message Date
Andrew Kryczka
5358a80568 add VerifyChecksum to HISTORY.md
Summary:
it's a new feature that'll be released in 5.8, introduced by PR #2498.
Closes https://github.com/facebook/rocksdb/pull/2759

Differential Revision: D5661923

Pulled By: ajkr

fbshipit-source-id: 9ba9f0d146c453715358ef2dd298aa7765649d7c
2017-08-18 14:29:39 -07:00
Maysam Yabandeh
1efc600ddf Preload l0 index partitions
Summary:
This fixes the existing logic for pinning l0 index partitions. The patch preloads the partitions into block cache and pin them if they belong to level 0 and pin_l0 is set.

The drawback is that it does many small IOs when preloading all the partitions into the cache is direct io is enabled. Working for a solution for that.
Closes https://github.com/facebook/rocksdb/pull/2661

Differential Revision: D5554010

Pulled By: maysamyabandeh

fbshipit-source-id: 1e6f32a3524d71355c77d4138516dcfb601ca7b2
2017-08-18 10:56:20 -07:00
Sagar Vemuri
9a44b4c32c Allow merge operator to be called even with a single operand
Summary:
Added a function `MergeOperator::DoesAllowSingleMergeOperand()` to allow invoking a merge operator even with a single merge operand, if overriden.

This is needed for Cassandra-on-RocksDB work. All Cassandra writes are through merges and this will allow a single merge-value to be updated in the merge-operator invoked via a compaction, if needed, due to an expired TTL.
Closes https://github.com/facebook/rocksdb/pull/2721

Differential Revision: D5608706

Pulled By: sagar0

fbshipit-source-id: f299f9f91c4d1ac26e48bd5906e122c1c5e5f3fc
2017-08-16 23:42:00 -07:00
Andrew Kryczka
af012c0f83 fix deleterange with memtable prefix bloom
Summary:
the range delete tombstones in memtable should be added to the aggregator even when the memtable's prefix bloom filter tells us the lookup key's not there. This bug could cause data to temporarily reappear until the memtable containing range deletions is flushed.

Reported in #2743.
Closes https://github.com/facebook/rocksdb/pull/2745

Differential Revision: D5639007

Pulled By: ajkr

fbshipit-source-id: 04fc6facb6f978340a3f639536f4ca7c0d73dfc9
2017-08-16 19:13:01 -07:00
Andrew Kryczka
acf935e40f fix deletion dropping in intra-L0
Summary:
`KeyNotExistsBeyondOutputLevel` didn't consider L0 files' key-ranges. So if a key only was covered by older L0 files' key-ranges, we would incorrectly drop deletions of that key. This PR just skips the deletion-dropping optimization when output level is L0.
Closes https://github.com/facebook/rocksdb/pull/2726

Differential Revision: D5617286

Pulled By: ajkr

fbshipit-source-id: 4bff1396b06d49a828ba4542f249191052915bce
2017-08-11 18:12:38 -07:00
Andrew Kryczka
cc01985db0 Introduce bottom-pri thread pool for large universal compactions
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.

This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.

- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580

Differential Revision: D5422916

Pulled By: ajkr

fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
2017-08-03 15:43:29 -07:00
Andrew Kryczka
6a36b3a7b9 fix db get/write stats
Summary:
we were passing `record_read_stats` (a bool) as the `hist_type` argument, which meant we were updating either `rocksdb.db.get.micros` (`hist_type == 0`) or `rocksdb.db.write.micros` (`hist_type == 1`) with wrong data.
Closes https://github.com/facebook/rocksdb/pull/2666

Differential Revision: D5520384

Pulled By: ajkr

fbshipit-source-id: 2f7c956aec32f8b58c5c18845ac478e0230c9516
2017-07-31 12:12:03 -07:00
Siying Dong
21696ba502 Replace dynamic_cast<>
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645

Differential Revision: D5502723

Pulled By: siying

fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
2017-07-28 16:27:16 -07:00
Maysam Yabandeh
277f6f23d4 Release note for partitioned index/filters
Summary: Closes https://github.com/facebook/rocksdb/pull/2637

Differential Revision: D5489751

Pulled By: maysamyabandeh

fbshipit-source-id: 0298f8960d4f86ce67959616615beee4d802c2e4
2017-07-25 10:24:12 -07:00
Siying Dong
e67b35c076 Add Iterator::Refresh()
Summary:
Add and implement Iterator::Refresh(). When this function is called, if the super version doesn't change, update the sequence number of the iterator to the latest one and invalidate the iterator. If the super version changed, recreated the whole iterator. This can help users reuse the iterator more easily.
Closes https://github.com/facebook/rocksdb/pull/2621

Differential Revision: D5464500

Pulled By: siying

fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
2017-07-24 10:54:37 -07:00
Islam AbdelRahman
20a691d98f Update HISTORY to release 5.7
Summary:
Update HISTORY file to release 5.7
Closes https://github.com/facebook/rocksdb/pull/2576

Differential Revision: D5417716

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3af5e7dd533607162212cf5d63a0121a07d637cd
2017-07-13 13:36:48 -07:00
Andrew Kryczka
f4ae1bab02 update history for OnBackgroundError and DeleteRange fix
Summary:
Mentioned changes:
- #2477
- #2503
Closes https://github.com/facebook/rocksdb/pull/2528

Differential Revision: D5360185

Pulled By: ajkr

fbshipit-source-id: 59d6ae465bcb0aa0a739317581fa3fc7871c6de6
2017-06-30 16:57:57 -07:00
Sagar Vemuri
1cd45cd1b3 FIFO Compaction with TTL
Summary:
Introducing FIFO compactions with TTL.

FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.

To address that request:
- Added a new TTL option to FIFO compaction options.
- Updated FIFO compaction score to take TTL into consideration.
- Added a new table property, creation_time, to keep track of when the SST file is created.
- Creation_time is set as below:
  - On Flush: Set to the time of flush.
  - On Compaction: Set to the max creation_time of all the files involved in the compaction.
  - On Repair and Recovery: Set to the time of repair/recovery.
  - Old files created prior to this code change will have a creation_time of 0.
- FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
- FIFO compaction will fall back to the prior way of deleting files based on size if:
  - the creation_time of all files involved in compaction is 0.
  - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.

This feature is not supported if max_open_files != -1 or with table formats other than Block-based.

**Test Plan:**
Added tests.

**Benchmark results:**
Base: FIFO with max size: 100MB ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100

readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
```

With TTL (a low one for testing) ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20

readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
```
Example Log lines:
```
2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
...
2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
```

SST Files remaining in the dbbench dir, after db_bench execution completed:
```
svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
-rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
-rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
-rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
```
Closes https://github.com/facebook/rocksdb/pull/2480

Differential Revision: D5305116

Pulled By: sagar0

fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
2017-06-27 17:11:48 -07:00
Siying Dong
7061912c23 Trivial typo in HISTORY.md
Summary: Closes https://github.com/facebook/rocksdb/pull/2499

Differential Revision: D5324914

Pulled By: siying

fbshipit-source-id: c69827d4dddafa81e651180633943a3380cdd5bb
2017-06-26 15:57:08 -07:00
Siying Dong
d757355cbf Fix bug that flush doesn't respond to fsync result
Summary:
With a regression bug was introduced two years ago, by 6e9fbeb27c , we fail to check return status of fsync call. This can cause we miss the information from the file system and can potentially cause corrupted data which we could have been detected.
Closes https://github.com/facebook/rocksdb/pull/2495

Reviewed By: ajkr

Differential Revision: D5321949

Pulled By: siying

fbshipit-source-id: c68117914bb40700198fc37d0e4c63163a8a1031
2017-06-26 12:41:48 -07:00
Andrew Kryczka
c217e0b9c7 Call RateLimiter for compaction reads
Summary:
Allow users to rate limit background work based on read bytes, written bytes, or sum of read and written bytes. Support these by changing the RateLimiter API, so no additional options were needed.
Closes https://github.com/facebook/rocksdb/pull/2433

Differential Revision: D5216946

Pulled By: ajkr

fbshipit-source-id: aec57a8357dbb4bfde2003261094d786d94f724e
2017-06-13 14:56:46 -07:00
Andrew Kryczka
e97304c681 update history for 5.6
Summary:
- mention range deletion + file ingestion
- move post-5.6 stuff into new section
Closes https://github.com/facebook/rocksdb/pull/2440

Differential Revision: D5229910

Pulled By: ajkr

fbshipit-source-id: 1facfe41993fa1f3b1f6fa7dc77d2b11aa2b317a
2017-06-12 12:30:09 -07:00
Siying Dong
5582123dee Sample number of reads per SST file
Summary:
We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne.
Closes https://github.com/facebook/rocksdb/pull/2417

Differential Revision: D5193528

Pulled By: siying

fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df
2017-06-12 07:12:08 -07:00
Aaron Gao
a472c4ae4c update 5.5 change log
Summary:
update bug fixed.
Closes https://github.com/facebook/rocksdb/pull/2434

Differential Revision: D5218601

Pulled By: lightmark

fbshipit-source-id: 1f86b2c93345673612381081537d464e7d12e434
2017-06-09 11:12:10 -07:00
Siying Dong
52a7f38b19 WriteOptions.low_pri which can throttle low pri writes if needed
Summary:
If ReadOptions.low_pri=true and compaction is behind, the write will either return immediate or be slowed down based on ReadOptions.no_slowdown.
Closes https://github.com/facebook/rocksdb/pull/2369

Differential Revision: D5127619

Pulled By: siying

fbshipit-source-id: d30e1cff515890af0eff32dfb869d2e4c9545eb0
2017-06-05 15:02:35 -07:00
Aaron Gao
7f6c02dda1 using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public…
Summary:
… headers

https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB.

We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term.

make check -j64
Closes https://github.com/facebook/rocksdb/pull/2380

Differential Revision: D5177896

Pulled By: lightmark

fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
2017-06-02 17:26:19 -07:00
Siying Dong
95b0e89b5d Improve write buffer manager (and allow the size to be tracked in block cache)
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350

Differential Revision: D5110648

Pulled By: siying

fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
2017-06-02 14:26:56 -07:00
Andrew Kryczka
bb01c1880c Introduce max_background_jobs mutable option
Summary:
- `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
- `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
- `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
- The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
Closes https://github.com/facebook/rocksdb/pull/2205

Differential Revision: D4937461

Pulled By: ajkr

fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
2017-05-24 11:29:08 -07:00
Siying Dong
41cbb72749 options.delayed_write_rate use the rate of rate_limiter by default.
Summary:
It's hard for RocksDB to come up with a good default of delayed write rate. Use rate given by rate limiter if it is availalbe. This provides the I/O order of magnitude.
Closes https://github.com/facebook/rocksdb/pull/2357

Differential Revision: D5115324

Pulled By: siying

fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754
2017-05-24 09:58:24 -07:00
Andrew Kryczka
6cc9aef162 New API for background work in single thread pool
Summary:
Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.

Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
Closes https://github.com/facebook/rocksdb/pull/2204

Differential Revision: D4936256

Pulled By: ajkr

fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
2017-05-23 11:12:27 -07:00
Andrew Kryczka
ac39d6bec5 Core-local statistics
Summary:
This diff changes `StatisticsImpl` from a thread-local approach to a core-local one. The goal is to perform faster aggregations, particularly for applications that have many threads. There should be no behavior change.
Closes https://github.com/facebook/rocksdb/pull/2258

Differential Revision: D5016258

Pulled By: ajkr

fbshipit-source-id: 7d4d165b4a91d8110f0409d113d1be91f22d31a9
2017-05-23 10:42:59 -07:00
Yi Wu
07bdcb91fe New WriteImpl to pipeline WAL/memtable write
Summary:
PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allow to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queue is maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join memtable writer queue if needed. The benefit of this approach is that
1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit write thread once WAL write is finish. They don't need to wait for memtable writes in case of group commit.
2. Pending writers only need to wait for previous WAL writer finish to be able to join the write thread, instead of wait also for previous memtable writes.

Merging #2056 and #2058 into this PR.
Closes https://github.com/facebook/rocksdb/pull/2286

Differential Revision: D5054606

Pulled By: yiwu-arbug

fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d
2017-05-19 14:26:42 -07:00
yizhu.sun
f5ba131bf8 Fixed some spelling mistakes
Summary: Closes https://github.com/facebook/rocksdb/pull/2314

Differential Revision: D5079601

Pulled By: sagar0

fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349
2017-05-17 23:12:36 -07:00
Aaron Gao
362ba9b02e Release RocksDB 5.5.0
Summary:
change history.md and version
Closes https://github.com/facebook/rocksdb/pull/2317

Differential Revision: D5080484

Pulled By: lightmark

fbshipit-source-id: 8d70b3b52dc0d34fefc0d34f91d379c27ac13ed3
2017-05-17 12:42:20 -07:00
Andrew Kryczka
3fa9a39c68 Add GetAllKeyVersions API
Summary:
- Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
- Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type.
- Migrated the "ldb idump" subcommand to use this function
- The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
Closes https://github.com/facebook/rocksdb/pull/2232

Differential Revision: D4976007

Pulled By: ajkr

fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
2017-05-12 15:54:06 -07:00
Yi Wu
2cd00773c7 Add bulk create/drop column family API
Summary:
Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time.

The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO.
Closes https://github.com/facebook/rocksdb/pull/2248

Differential Revision: D5001578

Pulled By: yiwu-arbug

fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
2017-05-07 23:20:46 -07:00
Siying Dong
264d3f540c Allow IntraL0 compaction in FIFO Compaction
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163

Differential Revision: D4895953

Pulled By: siying

fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
2017-05-04 18:16:13 -07:00
Leonidas Galanis
a45e98a5b5 max_open_files dynamic set, follow up
Summary:
Followup to make 0x40000 a TableCache constant that indicates infinite capacity
Closes https://github.com/facebook/rocksdb/pull/2247

Differential Revision: D5001349

Pulled By: lgalanis

fbshipit-source-id: ce7bd2e54b0975bb9f8680fdaa0f8bb0e7ae81a2
2017-05-04 10:42:45 -07:00
Leonidas Galanis
e7ae4a3a02 Max open files mutable
Summary:
Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache.
Closes https://github.com/facebook/rocksdb/pull/2185

Differential Revision: D4979189

Pulled By: yiwu-arbug

fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda
2017-05-03 21:13:14 -07:00
siddontang
b551104e04 support PopSavePoint for WriteBatch
Summary:
Try to fix https://github.com/facebook/rocksdb/issues/1969
Closes https://github.com/facebook/rocksdb/pull/2170

Differential Revision: D4907333

Pulled By: yiwu-arbug

fbshipit-source-id: 417b420ff668e6c2fd0dad42a94c57385012edc5
2017-05-03 10:57:45 -07:00
Dmitri Smirnov
cdad04b051 Remove double buffering on RandomRead on Windows.
Summary:
Remove double buffering on RandomRead on Windows.
  With more logic appear in file reader/write Read no longer
  obeys forwarding calls to Windows implementation.
  Previously direct_io (unbuffered) was only available on Windows
  but now is supported as generic.
  We remove intermediate buffering on Windows.
  Remove random_access_max_buffer_size option which was windows specific.
  Non-zero values for that opton introduced unnecessary lock contention.
  Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are
  no longer necessary.
  Add aligned buffer reads for cases when requested reads exceed read ahead size.
Closes https://github.com/facebook/rocksdb/pull/2105

Differential Revision: D4847770

Pulled By: siying

fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f
2017-04-27 12:30:05 -07:00
Andrew Kryczka
efc361ef7d Add user stats Reset API
Summary:
It resets all the ticker and histogram stats to zero. Needed to change the locking a bit since Reset() is the only operation that manipulates multiple tickers/histograms together, and that operation should be seen as atomic by other operations that access tickers/histograms.
Closes https://github.com/facebook/rocksdb/pull/2213

Differential Revision: D4952232

Pulled By: ajkr

fbshipit-source-id: c0475c3e4c7b940120d53891b69c3091149a0679
2017-04-26 15:57:01 -07:00
Siying Dong
97005dbd5d tools/check_format_compatible.sh to cover option file loading too
Summary:
tools/check_format_compatible.sh will check a newer version of RocksDB can open option files generated by older version releases. In order to achieve that, a new parameter "--try_load_options" is added to ldb. With this parameter set, if option file exists, we load the option file and use it to open the DB. With this opiton set, we can validate option loading logic.
Closes https://github.com/facebook/rocksdb/pull/2178

Differential Revision: D4914989

Pulled By: siying

fbshipit-source-id: db114f7724fcb41e5e9483116d84d7c4b8389ca4
2017-04-20 10:26:37 -07:00
Siying Dong
c49d704656 Add DB:ResetStats()
Summary:
Add a function to allow users to reset internal stats without restarting the DB.
Closes https://github.com/facebook/rocksdb/pull/2167

Differential Revision: D4907939

Pulled By: siying

fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68
2017-04-18 16:56:48 -07:00
Yi Wu
0fcdccc33e Blob storage helper methods
Summary:
Split out interfaces needed for blob storage from #1560, including
* CompactionEventListener and OnFlushBegin listener interfaces.
* Blob filename support.
Closes https://github.com/facebook/rocksdb/pull/2169

Differential Revision: D4905463

Pulled By: yiwu-arbug

fbshipit-source-id: 564e73448f1b7a367e5e46216a521e57ea9011b5
2017-04-18 12:42:38 -07:00
Aaron Gao
44fa8ece9b change use_direct_writes to use_direct_io_for_flush_and_compaction
Summary:
Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction
Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get().
Closes https://github.com/facebook/rocksdb/pull/2117

Differential Revision: D4860912

Pulled By: lightmark

fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
2017-04-13 16:12:04 -07:00
Sagar Vemuri
415be221cb RocksDB Release 5.4 : Update HISTORY.md and build version.
Summary: Closes https://github.com/facebook/rocksdb/pull/2142

Reviewed By: siying

Differential Revision: D4874696

Pulled By: sagar0

fbshipit-source-id: 03e6e21735bb74e5a37cc913aabb2c250af558cc
2017-04-12 17:36:27 -07:00
Aaron Gao
02799ad77a Revert "delete fallocate with punch_hole"
Summary:
This reverts commit 0fd574926c.
It breaks tmpfs on kernel 4.0 or earlier. We will wait for the fix before remove this part
Closes https://github.com/facebook/rocksdb/pull/2096

Differential Revision: D4839661

Pulled By: lightmark

fbshipit-source-id: 574a51f
2017-04-05 16:10:09 -07:00
Maysam Yabandeh
e2a7b202c3 Release note for partition filters
Summary:
I tagged it experimental since the configuring the filter partitions by size will not be ready for this release.
Closes https://github.com/facebook/rocksdb/pull/2049

Differential Revision: D4831575

Pulled By: maysamyabandeh

fbshipit-source-id: fcc2d25
2017-04-05 12:24:22 -07:00
Andrew Kryczka
d659faad54 Level-based L0->L0 compaction
Summary:
Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach.

- L0->L0 is triggered when base level is unavailable due to pending compactions
- L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes.
- Subcompactions are disabled for L0->L0 since we want to output one file.
- Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0.
Closes https://github.com/facebook/rocksdb/pull/2027

Differential Revision: D4760318

Pulled By: ajkr

fbshipit-source-id: 9d07183
2017-04-04 18:09:11 -07:00
Maysam Yabandeh
a1c469d719 Add release notes for PinnableSlice
Summary: Closes https://github.com/facebook/rocksdb/pull/2037

Differential Revision: D4822085

Pulled By: maysamyabandeh

fbshipit-source-id: 9d5a986
2017-04-03 15:09:19 -07:00
Sagar Vemuri
c6d04f2ecf Option to fail a request as incomplete when skipping too many internal keys
Summary:
Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to be skipped. Adding an option, max_skippable_internal_keys -- which could be used to set a threshold for the maximum number of keys that can be skipped, will help to address these cases where it is much better to fail a request (as incomplete) than to wait for a considerable time for the request to complete.

This feature -- to fail an iterator seek request as incomplete, is disabled by default when max_skippable_internal_keys = 0. It is enabled only when max_skippable_internal_keys > 0.

This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084.
Closes https://github.com/facebook/rocksdb/pull/2000

Differential Revision: D4753223

Pulled By: sagar0

fbshipit-source-id: 1c973f7
2017-03-30 12:09:21 -07:00
Aaron Gao
0fd574926c delete fallocate with punch_hole
Summary:
As discuss in this thread:
https://www.facebook.com/groups/rocksdb.dev/permalink/1218043868294125/

We remove fallocate with FALLOC_FL_PUNCH_HOLE because the recent bug on xfs in kernel 4.x+ that align file size to page size even with FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.
Closes https://github.com/facebook/rocksdb/pull/2038

Differential Revision: D4779974

Pulled By: siying

fbshipit-source-id: 5f54625
2017-03-28 15:54:12 -07:00
Siying Dong
909028e214 HISTORY.md for log_size_for_flush in CreateCheckpoint()
Summary: Closes https://github.com/facebook/rocksdb/pull/2021

Differential Revision: D4755324

Pulled By: siying

fbshipit-source-id: c8d7955
2017-03-22 11:24:12 -07:00
Raza Hussain
6908e24b56 dynamic setting of stats_dump_period_sec through SetDBOption()
Summary:
Resolved the following issue: https://github.com/facebook/rocksdb/issues/1930
Closes https://github.com/facebook/rocksdb/pull/2004

Differential Revision: D4736764

Pulled By: yiwu-arbug

fbshipit-source-id: 64fe0b7
2017-03-20 22:54:13 -07:00
Aaron Gao
e5bd8def1e update history.md for fixing the bug that skips keys
Summary: Closes https://github.com/facebook/rocksdb/pull/1986

Differential Revision: D4699152

Pulled By: siying

fbshipit-source-id: b18c32c
2017-03-13 11:39:13 -07:00
Andrew Kryczka
fe1835617a release 5.3
Summary: Closes https://github.com/facebook/rocksdb/pull/1971

Differential Revision: D4683851

Pulled By: ajkr

fbshipit-source-id: 967116e
2017-03-09 12:39:10 -08:00
Aaron Gao
f7997f1341 add direct I/O to version notes 5.2.0
Summary:
let users know this feature is ready
Closes https://github.com/facebook/rocksdb/pull/1915

Differential Revision: D4610842

Pulled By: lightmark

fbshipit-source-id: d102772
2017-02-23 22:09:12 -08:00
Siying Dong
8efb5ffa2a [rocksdb][PR] Remove option min_partial_merge_operands and verify_checksums_in_comp…
Summary:
…action

 The two options, min_partial_merge_operands and verify_checksums_in_compaction, are not seldom used. Remove them to reduce the total number of options. Also remove them from Java and C interface.
Closes https://github.com/facebook/rocksdb/pull/1902

Differential Revision: D4601219

Pulled By: siying

fbshipit-source-id: aad4cb2
2017-02-23 15:09:12 -08:00
Islam AbdelRahman
1560b2f5f0 Temporarly return deprecated functions to fix MongoRocks build
Summary:
MongoRocks is still using some deprecated functions, return them temporarily
Closes https://github.com/facebook/rocksdb/pull/1892

Differential Revision: D4592451

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5e6be3e
2017-02-21 12:54:11 -08:00
Yi Wu
381fd32247 Remove timeout_hint_us from WriteOptions
Summary:
The option has been deprecated for two years and has no effect. Removing.
Closes https://github.com/facebook/rocksdb/pull/1866

Differential Revision: D4555203

Pulled By: yiwu-arbug

fbshipit-source-id: c48f627
2017-02-17 15:24:17 -08:00
Islam AbdelRahman
7ab0051835 Remove deprecated DB::AddFile and DB::CompactRange
Summary:
Remove functions that we deprecated long time ago in db.h
Closes https://github.com/facebook/rocksdb/pull/1878

Differential Revision: D4576521

Pulled By: IslamAbdelRahman

fbshipit-source-id: dfddad1
2017-02-17 10:54:13 -08:00
Sagar Vemuri
eb912a927e Remove disableDataSync option
Summary:
Remove disableDataSync, and another similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859

Differential Revision: D4541292

Pulled By: sagar0

fbshipit-source-id: 5b3a6ca
2017-02-13 11:09:13 -08:00
Siying Dong
9afa20cf2f Increase build version and HISTORY.md for releasing 5.2
Summary:
Also clean up HISTORY.md a little bit.
Closes https://github.com/facebook/rocksdb/pull/1854

Differential Revision: D4539556

Pulled By: siying

fbshipit-source-id: 567391e
2017-02-10 12:09:15 -08:00
Maysam Yabandeh
ac2a77a746 Announce the experimetnal two-level index feature in HISTORY.md
Summary:
Announce the experimetnal two-level index feature in HISTORY.md. Also updated the default for index_per_partition to 1024.
Closes https://github.com/facebook/rocksdb/pull/1855

Differential Revision: D4530102

Pulled By: maysamyabandeh

fbshipit-source-id: b0fc6ff
2017-02-08 14:24:10 -08:00
Vitaliy Liptchinsky
1aaa898cf1 Adding GetApproximateMemTableStats method
Summary:
Added method that returns approx num of entries as well as size for memtables.
Closes https://github.com/facebook/rocksdb/pull/1841

Differential Revision: D4511990

Pulled By: VitaliyLi

fbshipit-source-id: 9a4576e
2017-02-06 14:54:16 -08:00
Siying Dong
4a3e7d320c Change the default of delayed slowdown value to 16MB/s
Summary:
Change the default of delayed slowdown value to 16MB/s and further increase the L0 stop condition to 36 files.
Closes https://github.com/facebook/rocksdb/pull/1821

Differential Revision: D4489229

Pulled By: siying

fbshipit-source-id: 1003981
2017-02-01 20:39:17 -08:00
Siying Dong
2d75cd40d3 NewLRUCache() to pick number of shard bits based on capacity if not given
Summary:
If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity.
Closes https://github.com/facebook/rocksdb/pull/1584

Differential Revision: D4242517

Pulled By: siying

fbshipit-source-id: 86b0f18
2017-01-27 06:39:12 -08:00
Vitaliy Liptchinsky
e840213d6e Change DB::GetApproximateSizes for more flexibility needed for MyRocks
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787

Differential Revision: D4441111

Pulled By: IslamAbdelRahman

fbshipit-source-id: c11f4c3
2017-01-20 09:39:11 -08:00
Siying Dong
17a4b75cc3 Always fsync the file after file copying
Summary:
File copying happens when creating checkpoints and bulkloading files from different FS partition. We should fsync the files when copying them to guarantee durability. A side effect will be that the dirty pages in file system buffers won't grow too large.
Closes https://github.com/facebook/rocksdb/pull/1728

Differential Revision: D4371083

Pulled By: siying

fbshipit-source-id: 579e14c
2016-12-28 19:09:16 -08:00
Siying Dong
906523d98a Add description to the 2PC checkpooint bug in HISTORY.md
Summary: Closes https://github.com/facebook/rocksdb/pull/1729

Differential Revision: D4371674

Pulled By: siying

fbshipit-source-id: 907e373
2016-12-28 15:54:14 -08:00
Andrew Kryczka
243975d5da More accurate error status for BackupEngine::Open
Summary:
Some users are assuming NotFound means the backup does not
exist at the provided path, which is a reasonable assumption. We need to
stop returning NotFound for system errors.

Depends on #1644
Closes https://github.com/facebook/rocksdb/pull/1645

Differential Revision: D4312233

Pulled By: ajkr

fbshipit-source-id: 5343c10
2016-12-12 13:24:21 -08:00
Islam AbdelRahman
ed8fbdb560 Add EventListener::OnExternalFileIngested() event
Summary:
Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events
Closes https://github.com/facebook/rocksdb/pull/1623

Differential Revision: D4285844

Pulled By: IslamAbdelRahman

fbshipit-source-id: 0b95a88
2016-12-06 14:09:17 -08:00
Yi Wu
0b0f235724 Mention IngestExternalFile changes in HISTORY.md
Summary:
I hit the land button too fast and didn't include the line.
Closes https://github.com/facebook/rocksdb/pull/1622

Differential Revision: D4281316

Pulled By: yiwu-arbug

fbshipit-source-id: c7b38e0
2016-12-05 16:09:11 -08:00
Yi Wu
23db48e8d8 Update HISTORY.md for 5.0 branch
Summary:
These changes are included in the new branch-cut.
Closes https://github.com/facebook/rocksdb/pull/1621

Differential Revision: D4281015

Pulled By: yiwu-arbug

fbshipit-source-id: d88858b
2016-12-05 15:39:11 -08:00
Anton Safonov
9053fe2a5c Made delete_obsolete_files_period_micros option dynamic
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updating using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595

Differential Revision: D4246569

Pulled By: tonek

fbshipit-source-id: d23f560
2016-12-05 14:24:16 -08:00
Igor Canadi
3f407b065c Kill flashcache code in RocksDB
Summary:
Now that we have userspace persisted cache, we don't need flashcache anymore.
Closes https://github.com/facebook/rocksdb/pull/1588

Differential Revision: D4245114

Pulled By: igorcanadi

fbshipit-source-id: e2c1c72
2016-12-01 10:09:22 -08:00
Mike Kolupaev
247d0979aa Support for range skips in compaction filter
Summary:
This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API.

Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :)

The diff is not as big as it may seem, about half of the lines are a test.
Closes https://github.com/facebook/rocksdb/pull/1599

Differential Revision: D4252092

Pulled By: al13n321

fbshipit-source-id: 41e1e48
2016-12-01 07:09:15 -08:00
Siying Dong
cd7c4143d7 Improve Write Stalling System
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.

Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562

Differential Revision: D4218519

Pulled By: siying

fbshipit-source-id: 95e4088
2016-11-23 09:24:15 -08:00
Islam AbdelRahman
c1038d2837 Release RocksDB 5.0
Summary:
Update HISTORY.md and version.h
Closes https://github.com/facebook/rocksdb/pull/1536

Differential Revision: D4202987

Pulled By: IslamAbdelRahman

fbshipit-source-id: 94985e3
2016-11-17 18:39:15 -08:00
Yi Wu
36e4762ce0 Remove Ticker::SEQUENCE_NUMBER
Summary:
Remove the ticker count because:
* Having to reset the ticker count in WriteImpl is ineffiecent;
* It doesn't make sense to have it as a ticker count if multiple db
  instance share a statistics object.
Closes https://github.com/facebook/rocksdb/pull/1531

Differential Revision: D4194442

Pulled By: yiwu-arbug

fbshipit-source-id: e2110a9
2016-11-16 22:39:09 -08:00
Andrew Kryczka
0765babe15 Remove LATEST_BACKUP file
Summary:
This has been unused since D42069 but kept around for backward
compatibility. I think it is unlikely anyone will use a much older version of
RocksDB for restore than they use for backup, so I propose removing it. It is
also causing recurring confusion, e.g., https://www.facebook.com/groups/rocksdb.dev/permalink/980454015386446/

Ported from https://reviews.facebook.net/D60735
Closes https://github.com/facebook/rocksdb/pull/1529

Differential Revision: D4194199

Pulled By: ajkr

fbshipit-source-id: 82f9bf4
2016-11-16 17:24:15 -08:00
Yueh-Hsuan Chiang
647eafdc21 Introduce Lua Extension: RocksLuaCompactionFilter
Summary:
This diff includes an implementation of CompactionFilter that allows
users to write CompactionFilter in Lua.  With this ability, users can
dynamically change compaction filter logic without requiring building
the rocksdb binary and restarting the database.

To compile, WITH_LUA_PATH must be specified to the base directory
of lua.
Closes https://github.com/facebook/rocksdb/pull/1478

Differential Revision: D4150138

Pulled By: yhchiang

fbshipit-source-id: ed84222
2016-11-16 15:39:12 -08:00
Siying Dong
972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Andrew Kryczka
489d142808 DeleteRange interface
Summary:
Expose DeleteRange() interface since we think the implementation is functionally correct now.
Closes https://github.com/facebook/rocksdb/pull/1503

Differential Revision: D4171921

Pulled By: ajkr

fbshipit-source-id: 5e21c98
2016-11-15 15:24:16 -08:00
Artemiy Kolesnikov
91300d01f6 Dynamic max_total_wal_size option
Summary: Closes https://github.com/facebook/rocksdb/pull/1509

Differential Revision: D4176426

Pulled By: yiwu-arbug

fbshipit-source-id: b57689d
2016-11-14 22:54:17 -08:00
Yi Wu
1ea79a78c9 Optimize sequential insert into memtable - Part 1: Interface
Summary:
Currently our skip-list have an optimization to speedup sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering wihtin each stream.

This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow underlying
  implementation to return a hint of insert position, which can be later
  pass back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
  via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
Closes https://github.com/facebook/rocksdb/pull/1419

Differential Revision: D4079367

Pulled By: yiwu-arbug

fbshipit-source-id: 3555326
2016-11-13 19:09:18 -08:00
Lijun Tang
adb665e0bf Allowed delayed_write_rate option to be dynamically set.
Summary: Closes https://github.com/facebook/rocksdb/pull/1488

Differential Revision: D4157784

Pulled By: siying

fbshipit-source-id: f150081
2016-11-12 15:54:11 -08:00
Yi Wu
437942e481 Add avoid_flush_during_shutdown DB option
Summary:
Add avoid_flush_during_shutdown DB option.
Closes https://github.com/facebook/rocksdb/pull/1451

Differential Revision: D4108643

Pulled By: yiwu-arbug

fbshipit-source-id: abdaf4d
2016-11-02 15:39:18 -07:00
Benoit Girard
2b16d664cb Change max_bytes_for_level_multiplier to double
Summary: Closes https://github.com/facebook/rocksdb/pull/1427

Differential Revision: D4094732

Pulled By: yiwu-arbug

fbshipit-source-id: b9b79e9
2016-11-01 21:09:23 -07:00
Yueh-Hsuan Chiang
ab53998372 Bump RocksDB version to 4.13 (#1405)
Summary:
Bump RocksDB version to 4.13

Test Plan:
unit tests

Reviewers: sdong, IslamAbdelRahman, andrewkr, lightmark

Subscribers: leveldb
2016-10-18 15:39:10 -07:00
Aaron Gao
d88dff4ef2 add seeforprev in history
Summary: update new feature in history and avoid breaking mongorocks

Test Plan: make check

Reviewers: sdong, yiwu, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64611
2016-10-17 15:34:13 -07:00
Yi Wu
e29d3b67c2 Make max_background_compactions and base_background_compactions dynamic changeable
Summary:
Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions.
I'll add more dynamic changeable options soon.

Test Plan: unit test.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64749
2016-10-14 12:25:39 -07:00
Yi Wu
d6ae6dec69 Add Statistics::getAndResetTickerCount().
Summary: A convience method to atomically get and reset ticker count. I'm wanting to use it to have a thin wrapper to the statistics object to export ticker counts to ODS for LogDevice (since they don't even use fb303).

Test Plan:
test in LogDevice shadow cluster.
https://fburl.com/461868822

Reviewers: andrewkr, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64869
2016-10-11 10:54:11 -07:00
Yi Wu
17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Islam AbdelRahman
eb1d4d53c8 Release RocksDB 4.12
Summary: Release 4.12

Test Plan: none

Reviewers: andrewkr, yiwu, lightmark, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63885
2016-09-12 12:34:08 -07:00
Andrew Kryczka
de28a25533 Update HISTORY.md for thread-local stats
Summary: as titled

Test Plan: doitlive

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63777
2016-09-08 14:02:19 -07:00
sdong
607628d349 Support ZSTD with finalized format
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.

Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D63141
2016-09-06 12:22:16 -07:00
Yi Wu
a88677d2cf Remove ImmutableCFOptions from public API
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63177
2016-09-02 14:16:31 -07:00
sdong
32149059f9 Merge options source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.

Test Plan: Add two new unit tests. Run all existing tests, including jtest.

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59829
2016-09-01 14:33:24 -07:00
Justin T. Gibbs
23a057007c Document memtable flush behavior in CancelAllBackgroundWork()
Summary:
Update History.md to reflect recent change that ensures unpersisted data
is flushed even if clients call CancelAllBackgroundWork() directly.

Test Plan: Review rendering of markdown.

Reviewers: sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D62703
2016-08-26 10:14:40 -07:00
Yi Wu
4cc37f59e5 Introduce ClockCache
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.

Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.

Reviewers: kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61647
2016-08-19 12:28:19 -07:00
sdong
638c49f24f Change HISTORY.md for release 4.11
Summary:
Need to change HISTORY.md for 4.11.
4.10 was not updated either. Update it together.

Test Plan: Not needed.

Reviewers: kradhakrishnan, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61665
2016-08-10 10:44:24 -07:00
Wanning Jiang
c38b075e7e Update HISTORY.md 2016-08-09 21:12:45 -07:00
Wanning Jiang
8f399e3fee Update HISTORY.md 2016-08-09 20:57:40 -07:00
sdong
7c4615cf1f A utility function to help users migrate DB after options change
Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options.

Test Plan: Add unit tests for it.

Reviewers: andrewkr, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60783
2016-08-05 15:39:55 -07:00
sdong
e5b5f12b81 Change options memtable_prefix_bloom_huge_page_tlb_size => memtable_huge_page_size and cover huge page to memtable too
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60513
2016-07-26 18:15:11 -07:00
sdong
32df9733d1 Add options.write_buffer_manager: control total memtable size across DB instances
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.

Test Plan: Add a new unit test.

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59925
2016-07-05 18:11:25 -07:00
Lucas Qiu
38fae9e65d fix typos in HISTORY.md (#1192) 2016-07-04 22:58:35 -07:00
Islam AbdelRahman
9b5adea97f Add More Logging to track total_log_size
Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again.

Test Plan: Run the unit test and see the logging.

Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D60081
2016-07-02 01:56:57 -07:00
Andrew Kryczka
56ac686292 Detect column family from properties [CF + RepairDB part 2/3]
Summary:
This diff uses the CF ID and CF name properties in the SST file
to associate recovered data with the proper column family. Depends on D59775.

- In ScanTable(), create column families in VersionSet each time a new one is discovered (via reading SST file properties)
- In ConvertLogToTable(), dump an SST file for every column family with data in the WAL
- In AddTables(), make a VersionEdit per-column family that adds all of that CF's tables

Test Plan:
  $ ./repair_test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59781
2016-06-24 13:12:13 -07:00
omegaga
c4e19b77e8 Add a read option to enable background purge when cleaning up iterators
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.

Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.

Reviewers: IslamAbdelRahman, sdong

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59499
2016-06-21 18:41:23 -07:00
sdong
7b79238b65 Deprectate filter_deletes
Summary: filter_deltes is not a frequently used feature. Remove it.

Test Plan: Run all test suites.

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59427
2016-06-17 10:30:47 -07:00
sdong
0babce57f7 Move away from enum char value -1
Summary: char is not signed in some platforms. Having negative values confuse those compilers.

Test Plan: Run all existing tests.

Reviewers: andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59619
2016-06-14 17:07:34 -07:00
Yi Wu
bc8af90e8c add option to not flush memtable on open()
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.

Test Plan: See unit test.

Reviewers: dhruba, horuff, sdong

Reviewed By: sdong

Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57813
2016-06-13 11:34:16 -07:00
sdong
20699df843 memtable_prefix_bloom_bits -> memtable_prefix_bloom_bits_ratio and deprecate memtable_prefix_bloom_probes
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio

Test Plan: Run all existing tests

Reviewers: yhchiang, igor, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59199
2016-06-10 12:12:10 -07:00
Aaron Gao
53a4bd8a69 duplicate line
Summary: duplicated line in history.md

Test Plan: n/a

Reviewers: yiwu

Reviewed By: yiwu

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59433
2016-06-09 17:30:52 -07:00
Aaron Gao
3e86869616 release 4.9 update version and history
Summary: update version and history

Test Plan: N/A

Reviewers: yiwu, sdong, IslamAbdelRahman, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59421
2016-06-09 17:26:17 -07:00
sdong
b2973eaaeb Remove options builder
Summary: AFIK, options builder is not used by anyone. Remove it.

Test Plan: Run all existing tests.

Reviewers: IslamAbdelRahman, andrewkr, igor

Reviewed By: igor

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D59319
2016-06-09 16:56:17 -07:00
sdong
1d725ca51d Deprecate BlockBasedTableOptions.hash_index_allow_collision=false.
Summary: Deprecate this one option and delete code and tests that are now superfluous.

Test Plan: all tests pass

Reviewers: igor, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: msalib, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55317
2016-05-20 17:52:27 -07:00
Evan Shaw
7383b64b3a Fix formatting of HISTORY.md (#1126)
A couple "New Features" headers needed a blank line before them.
2016-05-19 10:56:12 -07:00
Islam AbdelRahman
4b31723433 Add bottommost_compression option
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.

I have also updated CompactionJobInfo to include the compression algorithm used in compaction

Test Plan:
added new unittest
existing unittests

Reviewers: andrewkr, yhchiang, sdong

Reviewed By: sdong

Subscribers: lightmark, andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D57669
2016-05-09 15:57:19 -07:00
Yi Wu
24a24f013d Enable configurable readahead for iterators
Summary:
Add an option `iterator_readahead_size` to `ReadOptions` to enable
configurable readahead for iterators similar to the corresponding
option for compaction.

Test Plan:
```
make commit_prereq
```

Reviewers: kumar.rangarajan, ott, igor, sdong

Reviewed By: sdong

Subscribers: yiwu, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55419
2016-05-04 15:25:58 -07:00
Yi Wu
1b166928c7 Release RocksDB 4.8.0
Summary: Release RocksDB 4.8.0

Test Plan: N/A

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D57525
2016-05-02 14:38:04 -07:00
Yi Wu
a92049e3e7 Added EventListener::OnTableFileCreationStarted() callback
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.

Test Plan: unit test.

Reviewers: dhruba, yhchiang, ott, sdong

Reviewed By: sdong

Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba

Differential Revision: https://reviews.facebook.net/D56337
2016-04-29 11:35:00 -07:00
Andrew Kryczka
843d2e3137 Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.

During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.

On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.

Test Plan: new unit test

Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong

Reviewed By: sdong

Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52287
2016-04-27 17:36:03 -07:00
Andrew Kryczka
7a6045a3c7 fix typo in HISTORY.md 2016-04-20 18:53:37 -07:00
Andrew Kryczka
73a847ef89 Add per-level compression ratio property
Summary:
This is needed so we can measure compression ratio improvements
achieved by D52287.

The property compares raw data size against the total file size for a given
level. If the level is empty it should return 0.0.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56967
2016-04-20 18:46:54 -07:00
Andrew Kryczka
40b840f294 Delete deprecated *BackupableDB interface for backups
Summary:
This interface is redundant and has been deprecated for a while.
It's also unused internally. Let's delete it.

I moved the comments to the corresponding functions in BackupEngine/
BackupEngineReadOnly. This caused the diff tool to not work cleanly.

Test Plan:
unit tests

  $ ./backupable_db_test

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56331
2016-04-18 09:04:14 -07:00
sdong
4b6833aec1 Rename options.compaction_measure_io_stats to options.report_bg_io_stats and include flush too.
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.

Test Plan: Try db_bench and see the stats are printed out.

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56769
2016-04-15 10:22:18 -07:00
Islam AbdelRahman
1aeca97337 Release RocksDB 4.7
Summary: Bump the version and update HISTORY.md

Test Plan: none

Reviewers: yhchiang, sdong, andrewkr

Reviewed By: andrewkr

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D56469
2016-04-13 15:03:08 -07:00
sdong
2feafa3db9 Change some RocksDB default options
Summary: Change some RocksDB default options to make it more friendly to server workloads.

Test Plan: Run all existing tests

Reviewers: yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55941
2016-03-31 17:12:18 -07:00
Andrew Kryczka
0267655dad Update change log for 4.6 release
Summary: as titled

Test Plan: N/A

Reviewers: sdong, kradhakrishnan, anthony, yhchiang, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55323
2016-03-12 13:51:57 -08:00
Islam AbdelRahman
580fede347 Aggregate hot Iterator counters in LocalStatistics (DBIter::Next perf regression)
Summary:
This patch bump the counters in the frequent code path DBIter::Next() / DBIter::Prev() in a local data members and send them to Statistics when the iterator is destroyed
A better solution will be to have thread_local implementation for Statistics

New performance
```
readseq      :       0.035 micros/op 28597881 ops/sec; 3163.7 MB/s
     1,851,568,819      stalled-cycles-frontend   #   31.29% frontend cycles idle    [49.86%]
       884,929,823      stalled-cycles-backend    #   14.95% backend  cycles idle    [50.21%]
readreverse  :       0.071 micros/op 14077393 ops/sec; 1557.3 MB/s
     3,239,575,993      stalled-cycles-frontend   #   27.36% frontend cycles idle    [49.96%]
     1,558,253,983      stalled-cycles-backend    #   13.16% backend  cycles idle    [50.14%]

```

Existing performance

```
readreverse  :       0.174 micros/op 5732342 ops/sec;  634.1 MB/s
    20,570,209,389      stalled-cycles-frontend   #   70.71% frontend cycles idle    [50.01%]
    18,422,816,837      stalled-cycles-backend    #   63.33% backend  cycles idle    [50.04%]

readseq      :       0.119 micros/op 8400537 ops/sec;  929.3 MB/s
    15,634,225,844      stalled-cycles-frontend   #   79.07% frontend cycles idle    [49.96%]
    14,227,427,453      stalled-cycles-backend    #   71.95% backend  cycles idle    [50.09%]
```

Test Plan: unit tests

Reviewers: yhchiang, sdong, igor

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D55107
2016-03-11 19:01:12 -08:00
Yi Wu
f71fc77b7c Cache to have an option to fail Cache::Insert() when full
Summary:
Cache to have an option to fail Cache::Insert() when full. Update call sites to check status and handle error.

I totally have no idea what's correct behavior of all the call sites when they encounter error. Please let me know if you see something wrong or more unit test is needed.

Test Plan: make check -j32, see tests pass.

Reviewers: anthony, yhchiang, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D54705
2016-03-10 17:35:19 -08:00
sdong
12fd9b1868 Change BlockBasedTableOptions.format_version default to 2
Summary: BlockBasedTableOptions.format_version = 2 uses better encoding format. Now it's the time to make it default.

Test Plan: Run all existing tests.

Reviewers: igor, yhchiang, anthony, kradhakrishnan, andrewkr, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54879
2016-03-01 14:28:52 -08:00
sdong
1f5954147b Introduce Iterator::GetProperty() and replace Iterator::IsKeyPinned()
Summary:
Add Iterator::GetProperty(), a way for users to communicate with iterator, and turn Iterator::IsKeyPinned() with it.
As a follow-up, I'll ask a property as the version number attached to the iterator

Test Plan: Rerun existing tests and add a negative test case.

Reviewers: yhchiang, andrewkr, kradhakrishnan, anthony, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54783
2016-02-29 14:01:31 -08:00
sdong
92a9ccf1a6 Add a new compaction priority that picks file whose overlapping ratio is smallest
Summary:
Add a new compaction priority as following:
For every file, we calculate total size of files overalapping with the file in the next level, over the file's size itself. The file with smallest ratio will be picked first.
My "db_bench --fillrandom" shows about 5% less compaction than kOldestSmallestSeqFirst if --hard_pending_compaction_bytes_limit value to keep LSM tree in shape. If not limiting hard_pending_compaction_bytes_limit, improvement is only 1% or 2%.

Test Plan: Add a unit test

Reviewers: andrewkr, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54075
2016-02-11 15:59:19 -08:00
Yueh-Hsuan Chiang
4a8cbf4e31 Allows Get and MultiGet to read directly from SST files.
Summary:
Add kSstFileTier to ReadTier, which allows Get and MultiGet to
read only directly from SST files and skip mem-tables.

    kSstFileTier = 0x2      // data in SST files.
                          // Note that this ReadTier currently only supports
                          // Get and MultiGet and does not support iterators.

Test Plan: add new test in db_test.

Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong

Reviewed By: sdong

Subscribers: igor, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53511
2016-02-09 11:20:22 -08:00
sdong
73a9b0f4ba Update version to 4.5
Summary: Time to cut branch for release 4.5. Change the versions.

Test Plan: Not needed

Reviewers: IslamAbdelRahman, yhchiang, kradhakrishnan, andrewkr, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53883
2016-02-05 13:05:54 -08:00
Islam AbdelRahman
d6c838f1e1 Add SstFileManager (component tracking all SST file in DBs and control the deletion rate)
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size

Test Plan: unit tests

Reviewers: rven, anthony, yhchiang, sdong

Reviewed By: sdong

Subscribers: igor, lovro, march, dhruba

Differential Revision: https://reviews.facebook.net/D50469
2016-01-28 18:35:01 -08:00
sdong
d20915d52a Disable stats about mutex duration by default
Summary: Measuring mutex duration will measure time inside DB mutex, which breaks our best practice. Add a stat level in Statistics class. By default, disable to measure the mutex operations.

Test Plan: Add a unit test to make sure it is off by default.

Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, andrewkr, yhchiang

Reviewed By: yhchiang

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53367
2016-01-26 14:56:55 -08:00
sdong
38e1d7fea3 ldb to support --column_family option
Summary:
Add an option --column_family option, so that users can query or update specific column family.
Also add an create column family parameter to make unit test easier.
Still need to add unit tests.

Test Plan: Will add a test case in ldb python test.

Reviewers: yhchiang, rven, andrewkr, IslamAbdelRahman, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53265
2016-01-25 14:58:18 -08:00
sdong
fb9811ee9b Add a perf context level that doesn't measure time for mutex operations
Summary: Timing mutex operations can impact scalability of the system. Add a new perf context level that can measure time counters except for mutex.

Test Plan: Add a new unit test case to make sure it is not set.

Reviewers: IslamAbdelRahman, rven, kradhakrishnan, yhchiang, anthony

Reviewed By: anthony

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D53199
2016-01-25 14:45:03 -08:00
agiardullo
f7ebc2f346 Update HISTORY.mc for 4.4.0
Summary: Prepare to cut release

Test Plan: no code change.

Reviewers: yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52827
2016-01-14 16:57:06 -08:00
Gunnar Kudrjavets
b1a3b4c0d0 Make ldb automagically determine the file type and use the correct dumping function
Summary:
This set of changes implements the following design: `ldb` will utilize `--path` parameter which can be used to specify a file name. Tool will then apply some heuristic to determine how to output the data properly. The design decision is not to probe the file content, but use file names to determine what dumping function to call.

Usage examples:

Understands that path points to a manifest file and dumps it.
`./ldb --path=/tmp/test_db/MANIFEST-000023 dump`

Understands that path points to a WAL file and dumps it.
`./ldb --path=/tmp/test_db/000024.log dump --header`

Understands that path points to a SST file and dumps it.
`./ldb --path=/tmp/test_db/000007.sst dump`

Figures out that none of the supported file types are applicable and outputs
an appropriate error message.
`./ldb --path=/tmp/cron.log dump`

Test Plan:
Basics:

git diff
make clean
make -j 32 commit-prereq
arc lint

More specific testing (done as part of commit-prereq, but can be iterated separately when making isolated changes):

make clean
make ldb
python tools/ldb_test.py
make rocksdb_dump
make rocksdb_undump
sh tools/rocksdb_dump_test.sh

Reviewers: rven, IslamAbdelRahman, yhchiang, kradhakrishnan, anthony, igor, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D52269
2016-01-06 14:19:08 -08:00
sdong
15b8902264 Change default options.delayed_write_rate
Summary: We now have a mechanism to further slowdown writes. Double default options.delayed_write_rate to try to keep the default behavior closer to it used to be.

Test Plan: Run all tests.

Reviewers: IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: yhchiang, kradhakrishnan, rven, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52281
2015-12-23 14:51:55 -08:00
sdong
b9f77ba12b When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.

Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000

and make sure without the commit, write stop will happen, but with the commit, it will not happen.

Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52131
2015-12-23 11:33:15 -08:00
Islam AbdelRahman
d005c66faf Report compaction reason in CompactionListener
Summary:
Add CompactionReason to CompactionJobInfo
This will allow users to understand why compaction started which will help options tuning

Test Plan:
added new tests
make check -j64

Reviewers: yhchiang, anthony, kradhakrishnan, sdong, rven

Reviewed By: rven

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D51975
2015-12-22 11:37:19 -08:00
sdong
d72b31774e Slowdown when writing to the last write buffer
Summary: Now if inserting to mem table is much faster than writing to files, there is no mechanism users can rely on to avoid stopping for reaching options.max_write_buffer_number. With the commit, if there are more than four maximum write buffers configured, we slow down to the rate of options.delayed_write_rate while we reach the last one.

Test Plan:
1. Add a new unit test.
2. Run db_bench with

./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics

based on hard drive and see stopping is avoided with the commit.

Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor

Reviewed By: igor

Subscribers: MarkCallaghan, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D52047
2015-12-17 10:49:08 -08:00
charsyam
c30b499541 fix typos in comments 2015-12-11 01:54:48 +09:00
sdong
56e77f0967 Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit
Summary: Deprecate options.soft_rate_limit, which is hard to tune, with options.soft_pending_compaction_bytes_limit, which would trigger the slowdown if estimated pending compaction bytes exceeds the threshold. The hope is to make it more striaght-forward to tune.

Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.

Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony

Reviewed By: anthony

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51117
2015-12-09 18:22:45 -08:00
sdong
d6e1035a1f A new compaction picking priority that optimizes for write amplification for random updates.
Summary: Introduce a compaction picking priority that picks files who contains the oldest rows to compact. This is a mode that slightly improves write amplification for random update cases.

Test Plan: Add a unit test and run it in valgrind too.

Reviewers: yhchiang, anthony, IslamAbdelRahman, rven, kradhakrishnan, MarkCallaghan, igor

Reviewed By: igor

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51459
2015-12-09 18:13:03 -08:00
krad
188170fb4e Updating HISTORY.md
Summary: Added 4.3.0 version

Test Plan:

Reviewers:

CC: leveldb@

Task ID: #9298965

Blame Rev:
2015-12-08 17:46:43 -08:00
sdong
db320b1b82 DB to only flush the column family with the largest memtable while option.db_write_buffer_size is hit
Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memt table instead. In this way, we can avoid too many small files in some cases.

Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior

Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor

Reviewed By: igor

Subscribers: march, leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D51291
2015-11-30 13:36:57 -08:00