Commit Graph

1484 Commits

Author SHA1 Message Date
Orgad Shaneh
6401a8b76b Fix build with MinGW
Summary:
There still are many warnings (most of them about invalid printf format
for long long), but it builds if FAIL_ON_WARNINGS is disabled.
Closes https://github.com/facebook/rocksdb/pull/2052

Differential Revision: D4807355

Pulled By: siying

fbshipit-source-id: ef03786
2017-03-30 16:54:52 -07:00
Andrew Kryczka
80fe5b3855 disable test: DeleteSchedulerTest.DynamicRateLimiting1
Summary:
temporarily disable since it isn't working on travis.
Closes https://github.com/facebook/rocksdb/pull/2064

Differential Revision: D4807373

Pulled By: ajkr

fbshipit-source-id: f2bb2b0
2017-03-30 16:54:52 -07:00
Sagar Vemuri
c6d04f2ecf Option to fail a request as incomplete when skipping too many internal keys
Summary:
Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to skip. This adds an option, max_skippable_internal_keys, which sets a threshold for the maximum number of internal keys that can be skipped. It addresses cases where it is much better to fail a request (as incomplete) than to wait a considerable time for it to complete.

This feature, which fails an iterator seek request as incomplete, is disabled by default (max_skippable_internal_keys = 0). It is enabled only when max_skippable_internal_keys > 0.

This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084.
Closes https://github.com/facebook/rocksdb/pull/2000

Differential Revision: D4753223

Pulled By: sagar0

fbshipit-source-id: 1c973f7
2017-03-30 12:09:21 -07:00
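The threshold behavior described in commit c6d04f2ecf above can be sketched roughly as follows. This is a minimal illustration, not RocksDB's iterator code: `SeekResult`, `InternalEntry`, and `FindNextLive` are hypothetical names, and only the parameter name `max_skippable_internal_keys` and its "0 means disabled" rule come from the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not RocksDB types.
enum class SeekResult { kFound, kNotFound, kIncomplete };

struct InternalEntry {
  std::string user_key;
  bool is_tombstone;  // hidden entries are skipped rather than returned
};

// Scans forward from `start` for the first visible entry. If more than
// `max_skippable_internal_keys` hidden entries are skipped, the scan gives up
// and reports kIncomplete. A limit of 0 disables the check.
SeekResult FindNextLive(const std::vector<InternalEntry>& entries,
                        size_t start, uint64_t max_skippable_internal_keys,
                        std::string* out_key) {
  uint64_t skipped = 0;
  for (size_t i = start; i < entries.size(); ++i) {
    if (!entries[i].is_tombstone) {
      *out_key = entries[i].user_key;
      return SeekResult::kFound;
    }
    ++skipped;
    if (max_skippable_internal_keys > 0 &&
        skipped > max_skippable_internal_keys) {
      return SeekResult::kIncomplete;
    }
  }
  return SeekResult::kNotFound;
}
```

A caller that receives the incomplete result can retry later or surface the error instead of blocking on a long scan.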
Herman Lee
58179ec4a6 Cleanup of ThreadStatusUtil structures should use the DB's reference
Summary:
instead of thread_local

The cleanup path for the rocksdb database might not have the
thread_updater_local_cache_ pointer initialized because the thread
executing the cleanup is likely not a rocksdb thread. This results in a
memory leak detected by Valgrind. The cleanup code path should use the
thread_status_updater pointer obtained from the DB object instead of a
thread local one.
Closes https://github.com/facebook/rocksdb/pull/2059

Differential Revision: D4801611

Pulled By: hermanlee

fbshipit-source-id: 407d7de
2017-03-30 10:39:13 -07:00
Min Wei
8a8c967460 Enable Fast CRC32 for Win64
Summary:
Currently the fast crc32 path is not enabled on Windows. I am trying to enable it here, hopefully, with the minimum impact to the existing code structure.
Closes https://github.com/facebook/rocksdb/pull/2033

Differential Revision: D4770635

Pulled By: siying

fbshipit-source-id: 676f8b8
2017-03-29 17:39:19 -07:00
Aaron Gao
0fd574926c delete fallocate with punch_hole
Summary:
As discussed in this thread:
https://www.facebook.com/groups/rocksdb.dev/permalink/1218043868294125/

We remove fallocate with FALLOC_FL_PUNCH_HOLE because of a recent bug on XFS in kernel 4.x+ that aligns the file size to the page size even with FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.
Closes https://github.com/facebook/rocksdb/pull/2038

Differential Revision: D4779974

Pulled By: siying

fbshipit-source-id: 5f54625
2017-03-28 15:54:12 -07:00
Maysam Yabandeh
e7731d119a Configure index partition size
Summary:
Allow the users to specify the target index partition size.

With this patch an index partition is cut before its estimated in-memory size goes above the configured value for metadata_block_size. The filter partitions are still cut right after an index partition is cut.
Closes https://github.com/facebook/rocksdb/pull/2041

Differential Revision: D4780216

Pulled By: maysamyabandeh

fbshipit-source-id: 95a0831
2017-03-28 12:09:12 -07:00
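The cutting rule from commit e7731d119a above (close a partition before its estimated size would exceed the configured target) can be sketched as below. This is a simplified model under stated assumptions: `CutPartitions` and `target_size` are hypothetical names, sizes are plain byte counts, and the real builder's size estimation is not reproduced.

```cpp
#include <cstddef>
#include <vector>

// Groups entry sizes into partitions so that a partition is closed ("cut")
// before adding the next entry would push it past `target_size`.
// An entry larger than the target still gets a partition of its own.
std::vector<std::vector<size_t>> CutPartitions(
    const std::vector<size_t>& entry_sizes, size_t target_size) {
  std::vector<std::vector<size_t>> partitions;
  std::vector<size_t> current;
  size_t current_size = 0;
  for (size_t size : entry_sizes) {
    if (!current.empty() && current_size + size > target_size) {
      partitions.push_back(current);
      current.clear();
      current_size = 0;
    }
    current.push_back(size);
    current_size += size;
  }
  if (!current.empty()) {
    partitions.push_back(current);
  }
  return partitions;
}
```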
Shu Zhang
8dee8cad9e Enable fifo compaction benchmark to db_bench
Summary:
Added fifo benchmark to db_bench.
One thing I am not sure about is that I am using CompactRange() instead of CompactFiles(). (This may skew performance because CompactRange() does not run in the current thread?) With CompactFiles(), for some reason FIFO compaction doesn't work as expected. More insight is welcome. I guess FIFO compaction doesn't work with file names? igorcanadi

test cmd:
./db_bench --compaction_style=2 --benchmarks=fillseqdeterministic --disable_auto_compactions --num_levels=1 --fifo_compaction_max_table_files_size_mb=10

---------------------- DB 0 LSM ---------------------
Level[0]: /000014.sst(size: 4211014 bytes)
fillseqdeterministic :       4.731 micros/op 211381 ops/sec;   23.4 MB/s
Closes https://github.com/facebook/rocksdb/pull/1734

Differential Revision: D4774964

Pulled By: siying

fbshipit-source-id: 9d08df6
2017-03-24 17:09:15 -07:00
Raza Hussain
6908e24b56 dynamic setting of stats_dump_period_sec through SetDBOption()
Summary:
Resolved the following issue: https://github.com/facebook/rocksdb/issues/1930
Closes https://github.com/facebook/rocksdb/pull/2004

Differential Revision: D4736764

Pulled By: yiwu-arbug

fbshipit-source-id: 64fe0b7
2017-03-20 22:54:13 -07:00
Aaron Gao
9272e12f19 avoid ftruncate twice in buffered io
Summary:
in buffered io, the filesize_ is the real size.
Closes https://github.com/facebook/rocksdb/pull/1991

Differential Revision: D4711433

Pulled By: lightmark

fbshipit-source-id: ad604b9
2017-03-17 11:39:13 -07:00
Islam AbdelRahman
d52f334cbd Break stalls when no bg work is happening
Summary:
Currently a stall keeps sleeping even if there is no flush or compaction to wait for. I changed the logic to break the stall if we are not flushing or compacting.

db_bench command used
```
# fillrandom
# memtable size = 10MB
# value size = 1 MB
# num = 1000
# use /dev/shm
./db_bench --benchmarks="fillrandom,stats" --value_size=1048576 --write_buffer_size=10485760 --num=1000 --delayed_write_rate=XXXXX  --db="/dev/shm/new_stall" | grep "Cumulative stall"
```

```
Current results

# delayed_write_rate = 1000 Kb/sec
Cumulative stall: 00:00:9.031 H:M:S

# delayed_write_rate = 200 Kb/sec
Cumulative stall: 00:00:22.314 H:M:S

# delayed_write_rate = 100 Kb/sec
Cumulative stall: 00:00:42.784 H:M:S

# delayed_write_rate = 50 Kb/sec
Cumulative stall: 00:01:23.785 H:M:S

# delayed_write_rate = 25 Kb/sec
Cumulative stall: 00:02:45.702 H:M:S
```

```
New results

# delayed_write_rate = 1000 Kb/sec
Cumulative stall: 00:00:9.017 H:M:S

# delayed_write_rate = 200 Kb/sec
Cumulative stall: 00
```
Closes https://github.com/facebook/rocksdb/pull/1884

Differential Revision: D4585439

Pulled By: IslamAbdelRahman

fbshipit-source-id: aed2198
2017-03-16 18:24:17 -07:00
Islam AbdelRahman
995618a821 Support SstFileManager::SetDeleteRateBytesPerSecond()
Summary:
Update DeleteScheduler component to support changing delete rate in runtime by introducing
SstFileManager::SetDeleteRateBytesPerSecond()
Closes https://github.com/facebook/rocksdb/pull/1994

Differential Revision: D4719906

Pulled By: IslamAbdelRahman

fbshipit-source-id: e6b8d9e
2017-03-16 12:09:15 -07:00
Islam AbdelRahman
e19163688b Add macros to include file name and line number during Logging
Summary:
current logging
```
2017/03/14-14:20:30.393432 7fedde9f5700 (Original Log Time 2017/03/14-14:20:30.393414) [default] Level summary: base level 1 max bytes base 268435456 files[1 0 0 0 0 0 0] max score 0.25
2017/03/14-14:20:30.393438 7fedde9f5700 [JOB 2] Try to delete WAL files size 61417909, prev total WAL file size 73820858, number of live WAL files 2.
2017/03/14-14:20:30.393464 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//MANIFEST-000001 type=3 #1 -- OK
2017/03/14-14:20:30.393472 7fedde9f5700 [DEBUG] [JOB 2] Delete /dev/shm/old_logging//000003.log type=0 #3 -- OK
2017/03/14-14:20:31.427103 7fedd49f1700 [default] New memtable created with log file: #9. Immutable memtables: 0.
2017/03/14-14:20:31.427179 7fedde9f5700 [JOB 3] Syncing log #6
2017/03/14-14:20:31.427190 7fedde9f5700 (Original Log Time 2017/03/14-14:20:31.427170) Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2017/03/14-14:20:31.
```
Closes https://github.com/facebook/rocksdb/pull/1990

Differential Revision: D4708695

Pulled By: IslamAbdelRahman

fbshipit-source-id: cb8968f
2017-03-15 19:39:12 -07:00
Aaron Gao
d525718a93 cleanup direct io flag in WritableFileWriter
Summary:
remove unnecessary field `direct_io_`, use `use_direct_io()` instead.
Closes https://github.com/facebook/rocksdb/pull/1992

Differential Revision: D4712195

Pulled By: lightmark

fbshipit-source-id: 57d34f9
2017-03-14 22:39:09 -07:00
Maysam Yabandeh
11526252cc Pinnableslice (2nd attempt)
Summary:
PinnableSlice

    Summary:
    Currently the point lookup values are copied to a string provided by the
    user. This incurs an extra memcpy cost. This patch allows doing point lookup
    via a PinnableSlice which pins the source memory location (instead of
    copying their content) and releases them after the content is consumed
    by the user. The old API of Get(string) is translated to the new API
    underneath.

    Here is the summary for improvements:

    value 100 byte: 1.8% regular, 1.2% merge values
    value 1k byte: 11.5% regular, 7.5% merge values
    value 10k byte: 26% regular, 29.9% merge values
    The improvement for merge could be more if we extend this approach to
    pin the merge output and delay the full merge operation until the user
    actually needs it. We have put that for future work.

    PS:
    Sometimes we observe a small decrease in performance when switching from
    t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756

Differential Revision: D4391738

Pulled By: maysamyabandeh

fbshipit-source-id: 6f3edd3
2017-03-13 11:54:10 -07:00
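The idea behind commit 11526252cc above, pinning the source memory instead of copying the value out, can be illustrated with a small sketch. Note the swapped-in technique: this version keeps the source alive with a `std::shared_ptr`, which is an assumption made for brevity and not how PinnableSlice itself is implemented. `PinnedValue` and its methods are hypothetical names.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// A view into a shared block that keeps the block alive while it is in use.
class PinnedValue {
 public:
  // Pins `block` and points at `size` bytes starting at `offset`; no copy.
  void Pin(std::shared_ptr<const std::string> block, size_t offset,
           size_t size) {
    data_ = block->data() + offset;
    size_ = size;
    block_ = std::move(block);
  }

  // Releases the pinned block once the caller has consumed the value.
  void Reset() {
    block_.reset();
    data_ = nullptr;
    size_ = 0;
  }

  const char* data() const { return data_; }
  size_t size() const { return size_; }

 private:
  std::shared_ptr<const std::string> block_;
  const char* data_ = nullptr;
  size_t size_ = 0;
};
```

The saving comes from handing the caller a pointer into memory that already exists; the cost is that the source block cannot be freed until the caller releases the pin.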
Maysam Yabandeh
e6725e8c8d Fix some bugs in MockEnv
Summary:
Fixing some bugs in MockEnv so it can actually be used.
Closes https://github.com/facebook/rocksdb/pull/1914

Differential Revision: D4609923

Pulled By: maysamyabandeh

fbshipit-source-id: ca25735
2017-03-13 09:54:11 -07:00
Min Wei
900c62be61 fix compile for VS2015
Summary:
Without the cast, the build will break on Windows.
Closes https://github.com/facebook/rocksdb/pull/1982

Differential Revision: D4690462

Pulled By: ajkr

fbshipit-source-id: c493b6c
2017-03-10 11:24:09 -08:00
Andrew Kryczka
5b11124e39 add max to histogram stats
Summary:
Domas enlightened me about p100 (i.e., max) stats. Let's add them to our histograms.
Closes https://github.com/facebook/rocksdb/pull/1968

Differential Revision: D4678716

Pulled By: ajkr

fbshipit-source-id: 65e7118
2017-03-08 22:24:15 -08:00
Maysam Yabandeh
54b434110e Builders for partition filter
Summary:
This is the second split of this pull request: https://github.com/facebook/rocksdb/pull/1891 which includes only the builder part. The testing will be included in the third split, where the reader is also included.
Closes https://github.com/facebook/rocksdb/pull/1952

Differential Revision: D4660272

Pulled By: maysamyabandeh

fbshipit-source-id: 36b3cf0
2017-03-07 13:54:12 -08:00
Andrew Kryczka
7c80a6d7d1 Statistic for how often rate limiter is drained
Summary:
This is the metric I plan to use for adaptive rate limiting. The statistics are updated only if the rate limiter is drained by flush or compaction. I believe (but am not certain) that this is the normal case.

The Statistics object is passed in RateLimiter::Request() to avoid requiring changes to client code, which would've been necessary if we passed it in the RateLimiter constructor.
Closes https://github.com/facebook/rocksdb/pull/1946

Differential Revision: D4646489

Pulled By: ajkr

fbshipit-source-id: d8e0161
2017-03-02 17:54:15 -08:00
Islam AbdelRahman
f89b3893c0 Remove skip_table_builder_flush and default it to true
Summary:
This option must be enabled for direct IO,
and I cannot think of a case where we would need to disable it.

Remove it and default it to true.
Closes https://github.com/facebook/rocksdb/pull/1944

Differential Revision: D4641088

Pulled By: IslamAbdelRahman

fbshipit-source-id: d7085b9
2017-03-02 16:54:10 -08:00
Siying Dong
8432bcf555 Make compaction_pri settable through option string
Summary: Closes https://github.com/facebook/rocksdb/pull/1941

Differential Revision: D4637253

Pulled By: siying

fbshipit-source-id: a59dcdb
2017-03-02 10:24:12 -08:00
Aaron Gao
e877afa08b Remove bulk loading and auto_roll_logger in rocksdb_lite
Summary:
shrink lite size
Closes https://github.com/facebook/rocksdb/pull/1929

Differential Revision: D4622059

Pulled By: siying

fbshipit-source-id: 050b796
2017-02-28 11:09:11 -08:00
xiusir
90d8355075 Fix the wrong address for PREFETCH in DynamicBloom::Prefetch
Summary:
- Change data_[b] to data_[b / 8] in DynamicBloom::Prefetch, as b means the b-th bit in data_ and data_[b / 8] is the proper byte in data_.
Closes https://github.com/facebook/rocksdb/pull/1935

Differential Revision: D4628696

Pulled By: siying

fbshipit-source-id: bc5a0c6
2017-02-28 10:39:11 -08:00
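The fix in commit 90d8355075 above rests on the difference between a bit index and a byte index. A minimal sketch, assuming a byte array as the backing store: the helper names are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin that is compiled out elsewhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// `bit` counts individual bits, so the byte holding it is data[bit / 8].
// Using data[bit] instead would address a location 8x too far into the array.
inline const uint8_t* ByteForBit(const std::vector<uint8_t>& data,
                                 uint64_t bit) {
  return &data[static_cast<size_t>(bit / 8)];
}

// Hints the CPU to load the cache line containing the byte for `bit`.
inline void PrefetchBit(const std::vector<uint8_t>& data, uint64_t bit) {
#if defined(__GNUC__) || defined(__clang__)
  __builtin_prefetch(ByteForBit(data, bit));
#else
  (void)data;
  (void)bit;
#endif
}

inline void SetBit(std::vector<uint8_t>* data, uint64_t bit) {
  (*data)[static_cast<size_t>(bit / 8)] |=
      static_cast<uint8_t>(1u << (bit % 8));
}

inline bool TestBit(const std::vector<uint8_t>& data, uint64_t bit) {
  return ((*ByteForBit(data, bit) >> (bit % 8)) & 1u) != 0;
}
```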
Islam AbdelRahman
08864df212 Move advanced column family options to advanced_options.h
Summary:
For the sake of making our options simpler, we should keep options.h as simple as possible and move more advanced/less common options to advanced_options.h

I started with ColumnFamilyOptions and also did some re-ordering

I have moved all ColumnFamilyOptions to advanced_options.h and only left these options in options.h

```
const Comparator* comparator = BytewiseComparator();
std::shared_ptr<MergeOperator> merge_operator = nullptr;
const CompactionFilter* compaction_filter = nullptr;
std::shared_ptr<CompactionFilterFactory> compaction_filter_factory = nullptr;
size_t write_buffer_size = 64 << 20;
CompressionType compression;
int level0_file_num_compaction_trigger = 4;
bool disable_auto_compactions = false;
```
Please feel free to comment on specific options if you think they should be advanced or should not be
Closes https://github.com/facebook/rocksdb/pull/1847

Differential Revision: D4519996

Pulled By: IslamAbdelRahman

fbshipit-source-id: abebd9a
2017-02-27 17:54:14 -08:00
Tamir Duberstein
253799c06d Add missing include for abort()
Summary:
Fixes #1233 (again).
Closes https://github.com/facebook/rocksdb/pull/1931

Differential Revision: D4625289

Pulled By: ajkr

fbshipit-source-id: 70e774e
2017-02-27 17:24:13 -08:00
Siying Dong
8efb5ffa2a [rocksdb][PR] Remove option min_partial_merge_operands and verify_checksums_in_compaction
Summary:
The two options, min_partial_merge_operands and verify_checksums_in_compaction, are seldom used. Remove them to reduce the total number of options. Also remove them from the Java and C interfaces.
Closes https://github.com/facebook/rocksdb/pull/1902

Differential Revision: D4601219

Pulled By: siying

fbshipit-source-id: aad4cb2
2017-02-23 15:09:12 -08:00
Siying Dong
1ba2804b7f Remove XFunc tests
Summary:
Xfunc is hardly used. Remove it to keep the code simple.
Closes https://github.com/facebook/rocksdb/pull/1905

Differential Revision: D4603220

Pulled By: siying

fbshipit-source-id: 731f96d
2017-02-23 12:09:11 -08:00
Aaron Gao
1ef5f50e84 detect logical sector size
Summary:
querying logical sector size from the device instead of hardcoding it for linux platform.
Closes https://github.com/facebook/rocksdb/pull/1875

Differential Revision: D4591946

Pulled By: ajkr

fbshipit-source-id: 4e9805c
2017-02-23 11:25:36 -08:00
Aaron Gao
f206af56fc add use_direct_io() to ReadaheadRandomAccessFile
Summary:
Without this function, RandomAccessFileReader does not align reads in direct IO mode, which results in an IOError: invalid argument.
Closes https://github.com/facebook/rocksdb/pull/1900

Differential Revision: D4601261

Pulled By: lightmark

fbshipit-source-id: c3eadf1
2017-02-22 14:54:11 -08:00
Aaron Gao
286a36db7f posix writablefile truncate
Summary:
We occasionally miss this call, so the file size will be wrong.
Closes https://github.com/facebook/rocksdb/pull/1894

Differential Revision: D4598446

Pulled By: lightmark

fbshipit-source-id: 42b6ef5
2017-02-22 10:09:14 -08:00
Daniel Black
f0879e4c39 Page size isn't always 4k on linux
Summary:
Some places autodetected. These are the two places that didn't.

closes #1498

Still unsure if the following instances of 4 * 1024 need fixing in:
util/io_posix.h
include/rocksdb/table.h (appears to be blocksize and different)
utilities/persistent_cache/block_cache_tier.cc
utilities/persistent_cache/persistent_cache_test.h
include/rocksdb/env.h
util/env_posix.cc
db/column_family.cc
Closes https://github.com/facebook/rocksdb/pull/1499

Differential Revision: D4593640

Pulled By: yiwu-arbug

fbshipit-source-id: efc48de
2017-02-21 16:39:14 -08:00
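For the page-size point in commit f0879e4c39 above, a minimal POSIX-only sketch of detecting the page size at runtime instead of assuming 4 KB. `DetectPageSize` and `RoundUpToPage` are hypothetical helper names; `sysconf(_SC_PAGESIZE)` is the standard POSIX query, and the 4 KB fallback is an assumption used only if the query fails.

```cpp
#include <cstddef>
#include <unistd.h>  // POSIX only: sysconf, _SC_PAGESIZE

// Asks the OS for the page size; falls back to 4 KB if the query fails.
size_t DetectPageSize() {
  long size = sysconf(_SC_PAGESIZE);
  return size > 0 ? static_cast<size_t>(size) : static_cast<size_t>(4 * 1024);
}

// Rounds `n` up to the next multiple of `page_size`.
size_t RoundUpToPage(size_t n, size_t page_size) {
  return (n + page_size - 1) / page_size * page_size;
}
```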
Yulia Kartseva
ebc8a79980 alignment is on in ReadaheadRandomAccessFile::Read()
Summary: Closes https://github.com/facebook/rocksdb/pull/1857

Differential Revision: D4534518

Pulled By: wat-ze-hex

fbshipit-source-id: b456946
2017-02-18 12:09:12 -08:00
Marcin Dlugajczyk
a618a16f44 New subcode for IOError to detect the ESTALE errno
Summary:
I'd like to propose a patch that exposes a new IOError type with subcode kStaleFile, making it possible to detect when an ESTALE error is returned. This allows rocksdb consumers to handle this error separately from other IOErrors.

I've also added a missing string representation for the kDeadlock subcode. I believe calling ToString() on a Status object with that subcode would otherwise result in an out-of-bounds access in the msgs array.

Please let me know if you have any questions or would like me to make any changes to this pull request.
Closes https://github.com/facebook/rocksdb/pull/1748

Differential Revision: D4387675

Pulled By: IslamAbdelRahman

fbshipit-source-id: 67feb13
2017-02-17 10:54:13 -08:00
Aaron Gao
db2b4eb50e avoid direct io in rocksdb_lite
Summary:
fix lite bugs
disable direct io in lite mode
Closes https://github.com/facebook/rocksdb/pull/1870

Differential Revision: D4559866

Pulled By: yiwu-arbug

fbshipit-source-id: 3761c51
2017-02-16 10:39:13 -08:00
Xiaofei Du
7106a994fe Use monotonic time points in write_controller.cc and rate_limiter.cc
Summary:
NowMicros() provides non-monotonic time. When the wall clock is
synchronized or changed, the non-monotonic time points affect the write rate
controllers. This patch changes write_controller.cc and rate_limiter.cc to use
monotonic time points.
Closes https://github.com/facebook/rocksdb/pull/1865

Differential Revision: D4561732

Pulled By: siying

fbshipit-source-id: 95ece62
2017-02-14 18:24:24 -08:00
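The distinction in commit 7106a994fe above maps onto the standard library: `std::chrono::system_clock` follows the wall clock and can jump, while `std::chrono::steady_clock` never goes backwards. A minimal sketch of a rate calculation built on the steady clock; `MonotonicNowMicros` and `BytesAllowed` are hypothetical helpers, not the functions the patch changed.

```cpp
#include <chrono>
#include <cstdint>

// Microseconds from a monotonic clock. The absolute value is meaningless;
// only differences between two readings are useful.
uint64_t MonotonicNowMicros() {
  using namespace std::chrono;
  return static_cast<uint64_t>(
      duration_cast<microseconds>(steady_clock::now().time_since_epoch())
          .count());
}

// Bytes a writer may send for the elapsed interval at `bytes_per_sec`.
// With a monotonic source, now_micros >= start_micros always holds; the guard
// only protects against misuse.
uint64_t BytesAllowed(uint64_t start_micros, uint64_t now_micros,
                      uint64_t bytes_per_sec) {
  if (now_micros <= start_micros) {
    return 0;
  }
  return (now_micros - start_micros) * bytes_per_sec / 1000000;
}
```

With a wall-clock source, a backwards clock adjustment could make the elapsed interval negative or huge, which is the failure mode the commit describes.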
Sagar Vemuri
eb912a927e Remove disableDataSync option
Summary:
Remove disableDataSync and the other similarly named disable_data_sync options.
This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods.
Closes https://github.com/facebook/rocksdb/pull/1859

Differential Revision: D4541292

Pulled By: sagar0

fbshipit-source-id: 5b3a6ca
2017-02-13 11:09:13 -08:00
James Sun
53bb01516d [rocksdb][PR] compaction_style and compaction_pri should output their value as a string
Summary:
Replace the numerical output for compaction_style and compaction_pri
with strings
Closes https://github.com/facebook/rocksdb/pull/1817

Differential Revision: D4482796

Pulled By: highker

fbshipit-source-id: 5785768
2017-02-07 10:39:12 -08:00
Maysam Yabandeh
69d5262c81 Two-level Indexes
Summary:
Partition Index blocks and use a Partition-index as a 2nd level index.

The two-level index can be used by setting
BlockBasedTableOptions::kTwoLevelIndexSearch as the index type and
configuring BlockBasedTableOptions::index_per_partition

t15539501
Closes https://github.com/facebook/rocksdb/pull/1814

Differential Revision: D4473535

Pulled By: maysamyabandeh

fbshipit-source-id: bffb87e
2017-02-06 16:39:12 -08:00
Dmitri Smirnov
0a4cdde50a Windows thread
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
  an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread.  On POSIX, Thread defaults to std::thread as before.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823

Differential Revision: D4492902

Pulled By: siying

fbshipit-source-id: c74cb11
2017-02-06 14:54:18 -08:00
Dmitri Smirnov
add8b50cc9 Move ThreadLocal implementation into .cc
Summary: Closes https://github.com/facebook/rocksdb/pull/1829

Differential Revision: D4502314

Pulled By: siying

fbshipit-source-id: f46fac1
2017-02-02 14:09:12 -08:00
Siying Dong
f289d9f4ac Fix OSX build break after the fallocate change
Summary:
The recent fallocate update broke the OSX build. Fix it.
Closes https://github.com/facebook/rocksdb/pull/1830

Differential Revision: D4500235

Pulled By: siying

fbshipit-source-id: a5f2b40
2017-02-02 10:39:11 -08:00
Siying Dong
4a3e7d320c Change the default of delayed slowdown value to 16MB/s
Summary:
Change the default of delayed slowdown value to 16MB/s and further increase the L0 stop condition to 36 files.
Closes https://github.com/facebook/rocksdb/pull/1821

Differential Revision: D4489229

Pulled By: siying

fbshipit-source-id: 1003981
2017-02-01 20:39:17 -08:00
Siying Dong
0513e21f9b RangeSync() should work with ROCKSDB_FALLOCATE_PRESENT not set
Summary: Closes https://github.com/facebook/rocksdb/pull/1824

Differential Revision: D4493862

Pulled By: siying

fbshipit-source-id: c168446
2017-02-01 10:24:20 -08:00
Islam AbdelRahman
8b369ae5bd Cleaner default options using C++11 in-class init
Summary:
C++11 in-class initialization is cleaner and makes the defaults more explicit and more visible to our users.
Use it for ColumnFamilyOptions and DBOptions
Closes https://github.com/facebook/rocksdb/pull/1822

Differential Revision: D4490473

Pulled By: IslamAbdelRahman

fbshipit-source-id: c493a87
2017-01-31 18:09:15 -08:00
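As a small illustration of the C++11 in-class initialization mentioned in commit 8b369ae5bd above: the struct name below is hypothetical, while the three fields and their defaults are taken from the options listed elsewhere in this log.

```cpp
#include <cstddef>

// Defaults sit next to the declarations, so a reader of the header sees them
// without hunting through a constructor's initializer list.
struct ExampleColumnFamilyOptions {
  size_t write_buffer_size = 64 << 20;  // 64 MB
  int level0_file_num_compaction_trigger = 4;
  bool disable_auto_compactions = false;
};
```

A default-constructed object picks up these values, and callers override only the fields they care about.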
Islam AbdelRahman
ec79a7b53c Dedup code in option.cc and db_options.cc
Summary:
The code in DBOptions::Dump is simply a duplicate of the code in ImmutableDBOptions::Dump and MutableDBOptions::Dump

consolidate duplicate code.

tested visually
Closes https://github.com/facebook/rocksdb/pull/1818

Differential Revision: D4486710

Pulled By: IslamAbdelRahman

fbshipit-source-id: 7085189
2017-01-31 17:39:12 -08:00
Siying Dong
2d75cd40d3 NewLRUCache() to pick number of shard bits based on capacity if not given
Summary:
If the users use the NewLRUCache() without passing in the number of shard bits, instead of using hard-coded 6, we'll determine it based on capacity.
Closes https://github.com/facebook/rocksdb/pull/1584

Differential Revision: D4242517

Pulled By: siying

fbshipit-source-id: 86b0f18
2017-01-27 06:39:12 -08:00
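One way to derive a shard-bit count from capacity, as commit 2d75cd40d3 above describes, is sketched below. The function name, the 512 KB minimum shard size, and the cap of 6 bits (echoing the previously hard-coded value) are assumptions for illustration; the commit message does not state the actual heuristic.

```cpp
#include <cstddef>

// Picks the largest shard-bit count such that each shard holds at least
// `kMinShardSize` bytes, capped at `kMaxShardBits`.
int ShardBitsForCapacity(size_t capacity) {
  const size_t kMinShardSize = 512 * 1024;  // assumed minimum per shard
  const int kMaxShardBits = 6;              // assumed cap
  int bits = 0;
  size_t num_shards = capacity / kMinShardSize;
  while (num_shards >>= 1) {
    if (++bits >= kMaxShardBits) {
      return bits;
    }
  }
  return bits;
}
```

Small caches end up with few shards, so each shard stays large enough to be useful, while large caches still get the full degree of sharding.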
Andrew Kryczka
94a0c32e73 Fix LRU Ref() for handles with external references only
Summary:
For case !handle->InCache() && handle->refs >= 1 (the third case mentioned in lru_cache.h), the key was overwritten by Insert(). In this case, the refcount can still be incremented, and the cache handle will never enter LRU list. Fix Ref() logic for this case.
Closes https://github.com/facebook/rocksdb/pull/1808

Differential Revision: D4467656

Pulled By: ajkr

fbshipit-source-id: c0784d8
2017-01-26 10:54:15 -08:00
Andrew Kryczka
17c1180603 Generalize Env registration framework
Summary:
The Env registration framework supports registering client Envs and selecting which one to instantiate according to a text field. This enabled things like adding the -env_uri argument to db_bench, so the same binary could be reused with different Envs just by changing CLI config.

Now this problem has come up again in a non-Env context, as I want to instantiate a client Statistics implementation from db_bench, which is configured entirely via text parameters. Also, in the future we may wish to use it for deserializing client objects when loading OPTIONS file.

This diff generalizes the Env registration logic to work with arbitrary types.

- Generalized registration and instantiation code by templating them
- The entire implementation is in a header file as that's Google style guide's recommendation for template definitions
- Pattern match with std::regex_match rather than checking prefix, which was the previous behavior
- Rename functions/files to be non-Env-specific
Closes https://github.com/facebook/rocksdb/pull/1776

Differential Revision: D4421933

Pulled By: ajkr

fbshipit-source-id: 34647d1
2017-01-25 16:09:14 -08:00
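The generalized registration described in commit 17c1180603 above (templated on the object type, selected by matching a text field with `std::regex_match`) can be sketched as follows. This is a simplified, instance-based version under stated assumptions: `Registry`, `Register`, `Create`, and `ExampleStats` are hypothetical names and do not mirror the actual header.

```cpp
#include <functional>
#include <memory>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Maps regex patterns to factories for objects of type T.
template <typename T>
class Registry {
 public:
  using Factory = std::function<std::unique_ptr<T>(const std::string&)>;

  void Register(const std::string& pattern, Factory factory) {
    entries_.emplace_back(std::regex(pattern), std::move(factory));
  }

  // Returns an object from the first factory whose pattern matches the whole
  // of `target`, or nullptr if none matches.
  std::unique_ptr<T> Create(const std::string& target) const {
    for (const auto& entry : entries_) {
      if (std::regex_match(target, entry.first)) {
        return entry.second(target);
      }
    }
    return nullptr;
  }

 private:
  std::vector<std::pair<std::regex, Factory>> entries_;
};

// Hypothetical client type, standing in for an Env or Statistics object.
struct ExampleStats {
  std::string uri;
};
```

Because `std::regex_match` requires the entire string to match, a pattern such as `mem://.*` accepts `mem://a` but rejects `xmem://a`, which differs from a plain prefix check only in that the pattern can constrain the rest of the string.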
sdong
07dddd5f7e EnvPosixTestWithParam should wait for all threads to finish
Summary:
If we don't wait for the threads to finish after each run, the thread queue may not be empty while the next test starts to run, which can cause unexpected behaviors.

Also make some of the relaxed reads/writes stricter.
Closes https://github.com/facebook/rocksdb/pull/1590

Reviewed By: AsyncDBConnMarkedDownDBException

Differential Revision: D4245922

Pulled By: AsyncDBConnMarkedDownDBException

fbshipit-source-id: f83b74b
2017-01-25 15:54:13 -08:00
Hyeonseok Oh
f2b4939da4 fixed typo
Summary:
I fixed exisit -> exist
Closes https://github.com/facebook/rocksdb/pull/1799

Differential Revision: D4451466

Pulled By: yiwu-arbug

fbshipit-source-id: b447c3a
2017-01-23 12:54:13 -08:00
Siying Dong
0e8dfd6062 Fix OptimizeForPointLookup()
Summary:
If users directly call OptimizeForPointLookup(), it is broken because the option isn't compatible with parallel memtable insert. Fix it by using the memtable bloom filter instead.
Closes https://github.com/facebook/rocksdb/pull/1791

Differential Revision: D4442836

Pulled By: siying

fbshipit-source-id: bf6c9cd
2017-01-20 10:54:12 -08:00
Yi Wu
9239103cd4 Flush job should release reference current version if sync log failed
Summary:
Fix the bug where, if syncing the log fails, FlushJob::Run() is not executed and
the reference to cfd->current() is not released.
Closes https://github.com/facebook/rocksdb/pull/1792

Differential Revision: D4441316

Pulled By: yiwu-arbug

fbshipit-source-id: 5523e28
2017-01-19 23:09:15 -08:00
Yi Wu
602c13a964 Remove fadvise with direct IO read
Summary:
Remove the logic since we don't use buffer cache with direct IO. Resolve
read regression we currently have.
Closes https://github.com/facebook/rocksdb/pull/1782

Differential Revision: D4430408

Pulled By: yiwu-arbug

fbshipit-source-id: 5557bba
2017-01-18 12:09:10 -08:00
Kefu Chai
e8a096000b util/thread_local.h: silence a clang-build warning
Summary:
otherwise clang complains with

/home/jenkins/workspace/ceph-master/src/rocksdb/util/thread_local.h:205:5:
error: macro expansion producing 'defined' has undefined behavior
[-Werror,-Wexpansion-to-defined]
^
/home/jenkins/workspace/ceph-master/src/rocksdb/util/thread_local.h:22:4:
note: expanded from macro 'ROCKSDB_SUPPORT_THREAD_LOCAL'
!defined(OS_WIN) && !defined(OS_MACOSX) && !defined(IOS_CROSS_COMPILE)
^

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/1757

Differential Revision: D4394140

Pulled By: siying

fbshipit-source-id: f0beda0
2017-01-15 13:24:16 -08:00
Aaron Gao
3e6899d116 change UseDirectIO() to use_direct_io()
Summary:
also change variable name `direct_io_` to `use_direct_io_` in WritableFile to make it consistent with read path.
Closes https://github.com/facebook/rocksdb/pull/1770

Differential Revision: D4416435

Pulled By: lightmark

fbshipit-source-id: 4143c53
2017-01-13 12:09:15 -08:00
Aaron Gao
d4e07a8459 fix warning of unused direct io helper functions
Summary:
add build guard
Closes https://github.com/facebook/rocksdb/pull/1771

Differential Revision: D4410779

Pulled By: siying

fbshipit-source-id: 3796c30
2017-01-12 12:39:14 -08:00
Aaron Gao
dc2584eea0 direct reads refactor
Summary:
direct IO reads refactoring
remove unnecessary classes and unified interfaces
tested with db_bench

need more change for options and ON/OFF for different files.
Since disabled is default, it should be fine now
Closes https://github.com/facebook/rocksdb/pull/1636

Differential Revision: D4307189

Pulled By: lightmark

fbshipit-source-id: 6991e22
2017-01-11 16:54:12 -08:00
Anirban Rahut
62384ebe9c Guarding extra fallocate call with TRAVIS because it's not working properly on travis
Summary:
There is some old code in PosixWritableFile::Close(), which
truncates the file to the measured size and then does an extra fallocate
with KEEP_SIZE. This is commented as a failsafe because in some
cases ftruncate doesn't do the right job (I don't know of an instance of
this btw). Doing an fallocate with KEEP_SIZE should not increase
the file size. However, on the Travis worker, which is Docker (likely AUFS),
it's not working. There are comments on the web showing that the AUFS
author had initially not implemented fallocate, and then did it later.
So I am not sure what the quality of the implementation is.
Closes https://github.com/facebook/rocksdb/pull/1765

Differential Revision: D4401340

Pulled By: anirbanr-fb

fbshipit-source-id: e2d8100
2017-01-11 14:24:13 -08:00
Andrew Kryczka
fe395fb63d Allow incrementing refcount on cache handles
Summary:
Previously the only way to increment a handle's refcount was to invoke Lookup(), which (1) did hash table lookup to get cache handle, (2) incremented that handle's refcount. For a future DeleteRange optimization, I added a function, Ref(), for when the caller already has a cache handle and only needs to do (2).
Closes https://github.com/facebook/rocksdb/pull/1761

Differential Revision: D4397114

Pulled By: ajkr

fbshipit-source-id: 9addbe5
2017-01-10 16:54:20 -08:00
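The split that commit fe395fb63d above describes, where Lookup() does a hash table lookup plus a refcount increment and Ref() does only the increment, can be sketched with a toy cache. `TinyCache` and its `Handle` are hypothetical and leave out eviction, locking, and the LRU list entirely.

```cpp
#include <string>
#include <unordered_map>

class TinyCache {
 public:
  struct Handle {
    std::string value;
    int refs;
  };

  void Insert(const std::string& key, const std::string& value) {
    table_[key] = Handle{value, 0};
  }

  // (1) hash table lookup, then (2) refcount increment.
  Handle* Lookup(const std::string& key) {
    auto it = table_.find(key);
    if (it == table_.end()) {
      return nullptr;
    }
    ++it->second.refs;
    return &it->second;
  }

  // Only step (2): for callers that already hold a handle.
  void Ref(Handle* handle) { ++handle->refs; }

  void Release(Handle* handle) { --handle->refs; }

 private:
  std::unordered_map<std::string, Handle> table_;
};
```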
Dmitri Smirnov
3c233ca4ea Fix Windows environment issues
Summary:
Enable directIO on WritableFileImpl::Append
     with offset being current length of the file.
     Enable UniqueID tests on Windows, disable others but
     keep letting them compile. Unique tests are valuable to
     detect failures on different filesystems and the upcoming
     ReFS.
     Clear output in WinEnv GetChildren. This is different from
     the previous strategy: do not touch output on failure.
     Make sure DBTest.OpenWhenOpen works with the Windows error message.
Closes https://github.com/facebook/rocksdb/pull/1746

Differential Revision: D4385681

Pulled By: IslamAbdelRahman

fbshipit-source-id: c07b702
2017-01-09 15:54:12 -08:00
Maysam Yabandeh
d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2c.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh
54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incurs an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Islam AbdelRahman
ac73d7558b Add GetSupportedCompressions() convenience function
Summary:
This function returns a list of the compression types supported by RocksDB.
This is needed for MyRocks: https://github.com/facebook/mysql-5.6/pull/446
Closes https://github.com/facebook/rocksdb/pull/1747

Differential Revision: D4385921

Pulled By: IslamAbdelRahman

fbshipit-source-id: 2f5b59f
2017-01-06 11:24:14 -08:00
Adam Retter
85ac1a320a Fix rocksdb::Status::getState
Summary:
This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState.
Closes https://github.com/facebook/rocksdb/issues/1688
Closes https://github.com/facebook/rocksdb/pull/1714

Differential Revision: D4364181

Pulled By: yiwu-arbug

fbshipit-source-id: 8e073b4
2017-01-03 18:39:14 -08:00
Siying Dong
17a4b75cc3 Always fsync the file after file copying
Summary:
File copying happens when creating checkpoints and when bulk-loading files from a different FS partition. We should fsync the files when copying them to guarantee durability. A side effect is that the dirty pages in file system buffers won't grow too large.
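
A minimal POSIX sketch of copy-then-fsync follows. It is not RocksDB's copy routine: the function name `CopyFileWithSync`, the 4 KB buffer, and the errno-style return value are assumptions made for illustration.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical helper: copies src to dst and fsyncs dst before closing it.
// Returns 0 on success, or the errno value of the first failure.
int CopyFileWithSync(const std::string& src, const std::string& dst) {
  assert(!src.empty() && !dst.empty());
  int in = open(src.c_str(), O_RDONLY);
  if (in < 0) return errno;
  int out = open(dst.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (out < 0) {
    int e = errno;
    close(in);
    return e;
  }

  char buf[4096];
  int err = 0;
  for (;;) {
    ssize_t n = read(in, buf, sizeof(buf));
    if (n == 0) break;  // end of file
    if (n < 0) {
      if (errno == EINTR) continue;
      err = errno;
      break;
    }
    ssize_t off = 0;
    while (off < n) {
      ssize_t w = write(out, buf + off, static_cast<size_t>(n - off));
      if (w < 0) {
        if (errno == EINTR) continue;
        err = errno;
        break;
      }
      off += w;
    }
    if (err != 0) break;
  }

  // Flush the copied data to stable storage before reporting success.
  if (err == 0 && fsync(out) != 0) err = errno;
  close(in);
  if (close(out) != 0 && err == 0) err = errno;
  return err;
}
```

Syncing once at the end keeps the sketch short; a real implementation might also sync periodically during the copy to bound the amount of dirty data.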
Closes https://github.com/facebook/rocksdb/pull/1728

Differential Revision: D4371083

Pulled By: siying

fbshipit-source-id: 579e14c
2016-12-28 19:09:16 -08:00
Yi Wu
ab48c165a9 Print cache options to info log
Summary:
Improve cache options logging to info log.
Also print the value of
cache_index_and_filter_blocks_with_high_priority.
Closes https://github.com/facebook/rocksdb/pull/1709

Differential Revision: D4358776

Pulled By: yiwu-arbug

fbshipit-source-id: 8f030a0
2016-12-22 14:54:19 -08:00
Aaron Gao
972f96b3fb direct io write support
Summary:
rocksdb direct io support

```
[gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 5.0
Date:       Wed Nov 23 13:17:43 2016
CPU:        40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
CPUCache:   25600 KB
Keys:       16 bytes each
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Compression: Snappy
Memtablerep: skip_list
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [/tmp/rocksdbtest-112628/dbbench]
fillseq      :       4.393 micros/op 227639 ops/sec;   25.2 MB/s

[gzh@dev11575.prn2 ~/roc
```
Closes https://github.com/facebook/rocksdb/pull/1564

Differential Revision: D4241093

Pulled By: lightmark

fbshipit-source-id: 98c29e3
2016-12-22 13:09:19 -08:00
Islam AbdelRahman
989e644ed8 Remove sst_file_manager option from LITE
Summary:
Remove sst_file_manager option from LITE
Closes https://github.com/facebook/rocksdb/pull/1690

Differential Revision: D4341331

Pulled By: IslamAbdelRahman

fbshipit-source-id: 9f9328d
2016-12-21 17:54:21 -08:00
Jianpeng Ma
bd6cf7b51d WritableFileWriter: default buffer size equal min(64k,options.writable_file_max_buffer_size)
Summary:

If we override WritableFile with an implementation that has its own
buffer serving the same purpose as buf_, we would like to remove the
caching done by WritableFileWriter. So use
options.writable_file_max_buffer_size = 0 to disable the caching.

Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
Closes https://github.com/facebook/rocksdb/pull/1628

Differential Revision: D4307219

Pulled By: yiwu-arbug

fbshipit-source-id: 77a6e26
2016-12-16 13:09:14 -08:00
Daniel Black
816c1e30ca gcc-7 requires include <functional> for std::function
Summary:
Fixes compile error:

In file included from ./util/statistics.h:17:0,
                 from ./util/stop_watch.h:8,
                 from ./util/perf_step_timer.h:9,
                 from ./util/iostats_context_imp.h:8,
                 from ./util/posix_logger.h:27,
                 from ./port/util_logger.h:18,
                 from ./db/auto_roll_logger.h:15,
                 from db/auto_roll_logger.cc:6:
./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type
   typedef std::function<void(void*, void*)> FoldFunc;
Closes https://github.com/facebook/rocksdb/pull/1656

Differential Revision: D4318702

Pulled By: yiwu-arbug

fbshipit-source-id: 8c5d17a
2016-12-16 11:24:18 -08:00
Daniel Black
0ab6fc167f Gcc-7 buffer size insufficient
Summary:
Bunch of commits related to insufficient buffer size. Errors in individual commits.
Closes https://github.com/facebook/rocksdb/pull/1673

Differential Revision: D4332127

Pulled By: IslamAbdelRahman

fbshipit-source-id: 878f73c
2016-12-14 19:24:26 -08:00
Daniel Black
b7239bf7e0 Gcc 7 fallthrough
Summary:
hopefully the last of the gcc-7 compile errors
Closes https://github.com/facebook/rocksdb/pull/1675

Differential Revision: D4332106

Pulled By: IslamAbdelRahman

fbshipit-source-id: 139448c
2016-12-14 19:24:25 -08:00
Daniel Black
477b6ea578 std::remove_if requires <algorithm>
Summary:
fixes error (that occurred on gcc-7):

error:

util/env_basic_test.cc: In member function 'virtual rocksdb::Status rocksdb::NormalizingEnvWrapper::GetChildren(const string&, std::vector<std::__cxx11::basic_string<char> >*)':
util/env_basic_test.cc:27:21: error: 'remove_if' is not a member of 'std'
       result->erase(std::remove_if(result->begin(), result->end(),
                     ^~~
Closes https://github.com/facebook/rocksdb/pull/1674

Differential Revision: D4331221

Pulled By: ajkr

fbshipit-source-id: 9bbdc78
2016-12-14 17:09:14 -08:00
Daniel Black
e097222e64 util/logging.cc: buffer of insufficient size (gcc-7 -Werror=format-length)
Summary:
util/logging.cc:100:13: error: output may be truncated before the last format character [-Werror=format-length=]
 std::string NumberToHumanString(int64_t num) {
             ^~~~~~~~~~~~~~~~~~~
util/logging.cc:106:59: note: format output between 3 and 19 bytes into a destination of size 16
     snprintf(buf, sizeof(buf), "%" PRIi64 "K", num / 1000);
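
The "between 3 and 19 bytes" figure in the note can be reproduced with a small sketch. This is not the patched RocksDB function: `ThousandsToString` is a hypothetical name, and the buffer size of 19 is derived here from the worst case, not taken from the actual fix.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper mirroring the flagged snprintf call.
std::string ThousandsToString(std::int64_t num) {
  // Worst case: INT64_MIN / 1000 = -9223372036854775, which is 17
  // characters, plus 'K' and the terminating NUL: 19 bytes in total.
  char buf[19];
  int n = std::snprintf(buf, sizeof(buf), "%" PRIi64 "K", num / 1000);
  // snprintf returns the untruncated length, so n < sizeof(buf) means
  // the whole string fit.
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof(buf));
  return std::string(buf, static_cast<std::size_t>(n));
}
```

The shortest output, "0K" plus the NUL, accounts for the lower bound of 3 bytes.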
Closes https://github.com/facebook/rocksdb/pull/1653

Differential Revision: D4318687

Pulled By: yiwu-arbug

fbshipit-source-id: 3a5c931
2016-12-13 18:39:14 -08:00
Daniel Black
bfbcec2339 Gcc 7 error expansion to defined
Summary:
sorry if these gcc-7/clang-4 cleanups are getting tedious.
Closes https://github.com/facebook/rocksdb/pull/1658

Differential Revision: D4318792

Pulled By: yiwu-arbug

fbshipit-source-id: 8e85891
2016-12-13 18:39:14 -08:00
Daniel Black
c3e5ee7154 util/histogram.cc: HistogramStat::toString buffer insufficient
Summary:
Increased buffer size to 1650.

util/histogram.cc: In member function 'std::__cxx11::string rocksdb::HistogramStat::ToString() const':
util/histogram.cc:189:13: error: '%.2f' directive output truncated writing between 4 and 313 bytes into a region of size 0 [-Werror=format-length=]
 std::string HistogramStat::ToString() const {
             ^~~~~~~~~~~~~
util/histogram.cc:205:30: note: format output between 69 and 1614 bytes into a destination of size 200
            Percentile(99.99));
                              ^
cc1plus: all warnings being treated as errors
Makefile:1521: recipe for target 'util/histogram.o' failed
Closes https://github.com/facebook/rocksdb/pull/1660

Differential Revision: D4318820

Pulled By: yiwu-arbug

fbshipit-source-id: 45ae6ea
2016-12-13 14:09:12 -08:00
Andrew Kryczka
f0c509e2c8 Return finer-granularity status from Env::GetChildren*
Summary:
It'd be nice to use the error status type to distinguish
between user error and system error. For example, GetChildren can fail
listing a backup directory's contents either because a bad path was provided
(user error) or because an operation failed, e.g., a remote storage service
call failed (system error). In the former case, we want to continue and treat
the backup directory as empty; in the latter case, we want to immediately
propagate the error to the caller.

This diff uses NotFound to indicate user error and IOError to indicate
system error. Previously IOError indicated both.
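
The split between user error and system error can be sketched with plain POSIX directory calls. This is an illustration only: `ListStatus`, `ListChildren`, and `LoadBackupNames` are hypothetical names, not the `rocksdb::Status` or `Env` API, and the choice of which errno values count as "not found" is an assumption.

```cpp
#include <dirent.h>

#include <cassert>
#include <cerrno>
#include <string>
#include <vector>

// Hypothetical status type standing in for a richer status object.
enum class ListStatus { kOk, kNotFound, kIOError };

// Lists the entries of `dir`, distinguishing a bad path from other failures.
ListStatus ListChildren(const std::string& dir,
                        std::vector<std::string>* result) {
  assert(result != nullptr);
  result->clear();
  DIR* d = opendir(dir.c_str());
  if (d == nullptr) {
    // Assumption: a missing path or a non-directory is a user error.
    if (errno == ENOENT || errno == ENOTDIR) return ListStatus::kNotFound;
    return ListStatus::kIOError;
  }
  int read_errno = 0;
  for (;;) {
    errno = 0;  // readdir signals errors only through errno
    struct dirent* entry = readdir(d);
    if (entry == nullptr) {
      read_errno = errno;
      break;
    }
    result->push_back(entry->d_name);
  }
  closedir(d);
  return read_errno == 0 ? ListStatus::kOk : ListStatus::kIOError;
}

// Caller-side policy described above: a missing backup directory is
// treated as empty, any other failure is propagated.
bool LoadBackupNames(const std::string& dir, std::vector<std::string>* names) {
  ListStatus s = ListChildren(dir, names);
  if (s == ListStatus::kNotFound) {
    names->clear();
    return true;
  }
  return s == ListStatus::kOk;
}
```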
Closes https://github.com/facebook/rocksdb/pull/1644

Differential Revision: D4312157

Pulled By: ajkr

fbshipit-source-id: 51b4f24
2016-12-12 12:54:13 -08:00
Manuel Ung
2005c88a75 Implement non-exclusive locks
Summary:
This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (i.e., it's possible that a request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold a set of transaction ids, instead of just one, and adding a flag specifying whether this lock is currently held with exclusive access or not.

Some implementation notes:
- Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`.
- Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit.
- Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now.
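
The "set of transaction ids plus an exclusive flag" layout can be sketched as follows. This is a single-threaded illustration, not the actual `KeyLockInfo`: the names `KeyLockSketch`, `TryLock`, and `Unlock` are hypothetical, and letting a sole holder upgrade to exclusive is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

using TxnId = std::uint64_t;

// Hypothetical per-key lock record: several holders when shared,
// exactly one holder when exclusive.
struct KeyLockSketch {
  bool exclusive = false;
  std::unordered_set<TxnId> holders;
};

// Returns true if `txn` now holds the lock in the requested mode.
bool TryLock(KeyLockSketch* lock, TxnId txn, bool want_exclusive) {
  assert(lock != nullptr);
  if (lock->holders.empty()) {
    lock->exclusive = want_exclusive;
    lock->holders.insert(txn);
    return true;
  }
  if (lock->holders.size() == 1 && lock->holders.count(txn) == 1) {
    // Assumption: the sole holder may re-acquire or upgrade in place.
    lock->exclusive = lock->exclusive || want_exclusive;
    return true;
  }
  if (!lock->exclusive && !want_exclusive) {
    lock->holders.insert(txn);  // another shared holder joins
    return true;
  }
  return false;  // conflicting request; the caller would have to wait
}

void Unlock(KeyLockSketch* lock, TxnId txn) {
  assert(lock != nullptr);
  lock->holders.erase(txn);
  if (lock->holders.empty()) lock->exclusive = false;
}
```

Because a shared request succeeds whenever the lock is in shared mode, an exclusive request keeps failing for as long as at least one shared holder remains, which is the starvation case mentioned in the summary.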
Closes https://github.com/facebook/rocksdb/pull/1573

Differential Revision: D4239097

Pulled By: lth

fbshipit-source-id: da7c074
2016-12-05 17:39:17 -08:00
Anton Safonov
9053fe2a5c Made delete_obsolete_files_period_micros option dynamic
Summary:
Made delete_obsolete_files_period_micros option dynamic. It can be updated using DB::SetDBOptions().
Closes https://github.com/facebook/rocksdb/pull/1595

Differential Revision: D4246569

Pulled By: tonek

fbshipit-source-id: d23f560
2016-12-05 14:24:16 -08:00
Islam AbdelRahman
4a21b1402c Cache heap::downheap() root comparison (optimize heap cmp call)
Summary:
Reduce the number of comparisons in the heap by caching which child node in the first level is smaller (left_child or right_child),
so that next time we can compare directly against the smaller child.

I see that the total number of calls to comparator drops significantly when using this optimization

Before caching (~2mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000  --perf_level=2
readseq      :       0.338 micros/op 2959201 ops/sec;  327.4 MB/s user_key_comparison_count = 2000008
```
After caching (~1mil key comparison for iterating the DB)
```
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq" --db="/dev/shm/heap_opt" --use_existing_db --disable_auto_compactions --cache_size=1000000000 --perf_level=2
readseq      :       0.309 micros/op 3236801 ops/sec;  358.1 MB/s user_key_comparison_count = 1000011
```

It also improves
Closes https://github.com/facebook/rocksdb/pull/1600

Differential Revision: D4256027

Pulled By: IslamAbdelRahman

fbshipit-source-id: 76fcc66
2016-12-01 13:39:14 -08:00
Islam AbdelRahman
e39d080871 Fix travis (compile for clang < 3.9)
Summary:
Travis fails because it uses clang 3.6, which doesn't recognize
`__attribute__((__no_sanitize__("undefined")))`
Closes https://github.com/facebook/rocksdb/pull/1601

Differential Revision: D4257175

Pulled By: IslamAbdelRahman

fbshipit-source-id: fb4d1ab
2016-12-01 10:09:22 -08:00
Islam AbdelRahman
52fd1ff2c2 disable UBSAN for functions with intentional -ve shift / overflow
Summary:
Disable UBSAN for functions with an intentional left shift on a negative number or an intentional overflow.

These functions are
rocksdb::Hash
FixedLengthColBufEncoder::Append
FaultInjectionTest::Key
Closes https://github.com/facebook/rocksdb/pull/1577

Differential Revision: D4240801

Pulled By: IslamAbdelRahman

fbshipit-source-id: 3e1caf6
2016-11-28 17:54:12 -08:00
Islam AbdelRahman
63c30de80d fix options_test ubsan
Summary:
Having a negative value for max_write_buffer_number does not make sense and causes us to do a left shift on a negative number.
Closes https://github.com/facebook/rocksdb/pull/1579

Differential Revision: D4240798

Pulled By: IslamAbdelRahman

fbshipit-source-id: bd6267e
2016-11-28 16:39:14 -08:00
Mike Kolupaev
236d4c67e9 Less linear search in DBIter::Seek() when keys are overwritten a lot
Summary:
In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack:

```
  rocksdb::MergingIterator::Next
  rocksdb::DBIter::FindNextUserEntryInternal
  rocksdb::DBIter::Seek
```

I think what's happening is:
1. we create a snapshot iterator,
2. we do lots of Put()s for the same key x; this creates lots of entries in memtable,
3. we seek the iterator to a key slightly smaller than x,
4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers.

CC IslamAbdelRahman
Closes https://github.com/facebook/rocksdb/pull/1413

Differential Revision: D4083879

Pulled By: IslamAbdelRahman

fbshipit-source-id: a83ddae
2016-11-28 10:24:11 -08:00
Siying Dong
cd7c4143d7 Improve Write Stalling System
Summary:
The current write stalling system lacks positive feedback when the restricted rate is already too low, so users sometimes get stuck at a very low slowdown value. With this diff, we add positive feedback (increasing the slowdown value) when we recover from the slowdown state back to normal. To keep the positive feedback from pushing the slowdown value too high, we issue negative feedback every time we are close to the stop condition. Experiments show that it is easier to reach a relative balance than before.

Also increase the level0_stop_writes_trigger default from 24 to 32. Since the level0_slowdown_writes_trigger default is 20, a stop trigger of 24 leaves only four files as the buffer in which to slow down writes. To avoid stopping within four files when 20 files have already accumulated, the slowdown value must be very low, which is almost the same as a stop. It also doesn't give the slowdown value enough time to converge. Increasing it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562

Differential Revision: D4218519

Pulled By: siying

fbshipit-source-id: 95e4088
2016-11-23 09:24:15 -08:00
Nick Terrell
4444256ab7 Remove use of deprecated LZ4 function
Summary:
LZ4 1.7.3 emits warnings when calling the deprecated function `LZ4_compress_limitedOutput_continue()`.  Starting in r129, LZ4 introduces `LZ4_compress_fast_continue()` as a replacement, and the two function calls are [exactly equivalent](https://github.com/lz4/lz4/blob/dev/lib/lz4.c#L1408).
Closes https://github.com/facebook/rocksdb/pull/1532

Differential Revision: D4199240

Pulled By: siying

fbshipit-source-id: 138c2bc
2016-11-21 12:24:14 -08:00
Changli Gao
548d7fb261 Fix fd leak when using direct IOs
Summary:
We should close the fd before overwriting it. This bug was
introduced by f89caa127b
Closes https://github.com/facebook/rocksdb/pull/1553

Differential Revision: D4214101

Pulled By: siying

fbshipit-source-id: 0d65de0
2016-11-21 12:24:13 -08:00
Andrew Kryczka
fd43ee09da Range deletion microoptimizations
Summary:
- Made RangeDelAggregator's InternalKeyComparator member a reference-to-const so we don't need to copy-construct it. Also added InternalKeyComparator to ImmutableCFOptions so we don't need to construct one for each DBIter.
- Made MemTable::NewRangeTombstoneIterator and the table readers' NewRangeTombstoneIterator() functions return nullptr instead of NewEmptyInternalIterator to avoid the allocation. Updated callers accordingly.
Closes https://github.com/facebook/rocksdb/pull/1548

Differential Revision: D4208169

Pulled By: ajkr

fbshipit-source-id: 2fd65cf
2016-11-21 12:24:13 -08:00
Changli Gao
a0deec960f Fix deadlock when calling getMergedHistogram
Summary:
When calling StatisticsImpl::HistogramInfo::getMergedHistogram(), if
there is a dying thread, which is calling
ThreadLocalPtr::StaticMeta::OnThreadExit() to merge its thread values to
HistogramInfo, a deadlock will occur, because the former tries to hold
merge_lock then ThreadMeta::mutex_, but the latter tries to hold
ThreadMeta::mutex_ then merge_lock. In short, the locking order isn't
the same.

This patch addressed this issue by releasing merge_lock before folding
thread values.
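
The lock-ordering fix can be sketched with two plain mutexes. This is a simplified model, not the RocksDB code: `HistogramSketch`, `ThreadMetaSketch`, `OnThreadExit`, and `GetMerged` are hypothetical stand-ins, and the per-thread values are reduced to a vector of integers.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

struct HistogramSketch {
  std::mutex merge_lock;  // guards `merged`
  long merged = 0;        // values already folded in by exited threads
};

struct ThreadMetaSketch {
  std::mutex mutex;              // guards `per_thread`
  std::vector<long> per_thread;  // live per-thread values
};

// Thread-exit path: takes the meta mutex first, then merge_lock.
void OnThreadExit(ThreadMetaSketch* meta, HistogramSketch* h, std::size_t idx) {
  std::lock_guard<std::mutex> meta_guard(meta->mutex);
  assert(idx < meta->per_thread.size());
  std::lock_guard<std::mutex> merge_guard(h->merge_lock);
  h->merged += meta->per_thread[idx];
  meta->per_thread[idx] = 0;
}

// Reader path: never holds both locks at once, so it cannot take part
// in a lock-order cycle with OnThreadExit.
long GetMerged(ThreadMetaSketch* meta, HistogramSketch* h) {
  long total = 0;
  {
    std::lock_guard<std::mutex> merge_guard(h->merge_lock);
    total = h->merged;
  }  // merge_lock is released here, before folding thread values
  std::lock_guard<std::mutex> meta_guard(meta->mutex);
  for (long v : meta->per_thread) total += v;
  return total;
}
```

One consequence in this sketch is that a thread exiting between the two critical sections of `GetMerged` is not counted in that call, so the result is a slightly stale snapshot. Whether the real patch accepts or avoids that gap is not stated in the summary.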
Closes https://github.com/facebook/rocksdb/pull/1552

Differential Revision: D4211942

Pulled By: ajkr

fbshipit-source-id: ef89bcb
2016-11-20 18:24:12 -08:00
Manuel Ung
e63350e726 Use more efficient hash map for deadlock detection
Summary:
Currently, deadlock cycles are held in std::unordered_map. The problem with it is that it allocates/deallocates memory on every insertion/deletion. This limits throughput since we're doing this expensive operation while holding a global mutex. Fix this by using a vector which caches memory instead.

Running the deadlock stress test, this change increased throughput from 39k txns/s -> 49k txns/s. The effect is more noticeable in MyRocks.
Closes https://github.com/facebook/rocksdb/pull/1545

Differential Revision: D4205662

Pulled By: lth

fbshipit-source-id: ff990e4
2016-11-19 11:39:15 -08:00
Siying Dong
73843aa636 Direct I/O Reads Handle the last sector correctly.
Summary:
Currently, in Direct I/O read mode, the last sector of the file, if not full, is not handled correctly. If the return value of pread is not a multiple of kSectorSize, we still go ahead and continue reading, even if the buffer is not aligned. With this commit, if the return value is not a multiple of kSectorSize and all but the last sector has been read, we simply return.
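
The read loop described above can be sketched with the `pread` call abstracted behind a callback so that it can be exercised without a real O_DIRECT file. The names `ReadAligned` and `PreadFn` are hypothetical, and the sector size of 512 is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

constexpr std::size_t kSectorSize = 512;  // assumed sector size

// Hypothetical pread-like callback: returns bytes read, 0 at end of
// file, or a negative value on error.
using PreadFn =
    std::function<long(char* buf, std::size_t n, std::uint64_t offset)>;

// Reads up to n bytes starting at `offset`; both must be sector-aligned.
// Returns the number of bytes actually read.
std::size_t ReadAligned(const PreadFn& pread_fn, char* buf, std::size_t n,
                        std::uint64_t offset) {
  assert(n % kSectorSize == 0 && offset % kSectorSize == 0);
  std::size_t done = 0;
  while (done < n) {
    long r = pread_fn(buf + done, n - done, offset + done);
    if (r <= 0) break;  // end of file or error
    done += static_cast<std::size_t>(r);
    // A read that does not fill whole sectors can only be the tail of
    // the file. Continuing would issue an unaligned read, so stop here.
    if (static_cast<std::size_t>(r) % kSectorSize != 0) break;
  }
  return done;
}
```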
Closes https://github.com/facebook/rocksdb/pull/1550

Differential Revision: D4209609

Pulled By: lightmark

fbshipit-source-id: cb0b439
2016-11-18 19:24:13 -08:00
Maysam Yabandeh
9d60151b04 Implement PositionedAppend for PosixWritableFile
Summary:
This patch clarifies the contract of PositionedAppend with some unit
tests and also implements it for PosixWritableFile. (Tasks: 14524071)
Closes https://github.com/facebook/rocksdb/pull/1514

Differential Revision: D4204907

Pulled By: maysamyabandeh

fbshipit-source-id: 06eabd2
2016-11-18 17:24:13 -08:00
Siying Dong
972e3ff295 Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default
Summary: Closes https://github.com/facebook/rocksdb/pull/1496

Differential Revision: D4168080

Pulled By: siying

fbshipit-source-id: 056ae62
2016-11-16 09:39:09 -08:00
Yi Wu
1543d5d92e Report memory usage by memtable insert hints map.
Summary:
It is hard to measure actual memory usage by std containers. Even
providing a custom allocator will miscount some of the usage. Here we
only make a rough guess at its memory usage.
Closes https://github.com/facebook/rocksdb/pull/1511

Differential Revision: D4179945

Pulled By: yiwu-arbug

fbshipit-source-id: 32ab929
2016-11-15 20:24:13 -08:00
Artemiy Kolesnikov
91300d01f6 Dynamic max_total_wal_size option
Summary: Closes https://github.com/facebook/rocksdb/pull/1509

Differential Revision: D4176426

Pulled By: yiwu-arbug

fbshipit-source-id: b57689d
2016-11-14 22:54:17 -08:00
Yi Wu
1ea79a78c9 Optimize sequential insert into memtable - Part 1: Interface
Summary:
Currently our skip-list has an optimization to speed up sequential
inserts from a single stream, by remembering the last insert position.
We extend the idea to support sequential inserts from multiple streams,
and even tolerate small reordering within each stream.

This PR is the interface part adding the following:
- Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key.
- Add `InsertWithHint()` interface to memtable, to allow the underlying
  implementation to return a hint of the insert position, which can later
  be passed back to optimize inserts.
- Memtable will maintain a map from prefix to hints and pass the hint
  via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null.
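
The prefix-to-hint map can be illustrated with standard containers. This is only an analogy, not the memtable or skip-list code: `HintedTable` is a hypothetical class, `std::map::emplace_hint` stands in for `InsertWithHint()`, and a fixed-length prefix stands in for `memtable_insert_prefix_extractor`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical ordered table that remembers, per key prefix, where the
// last insert landed and reuses that position as a hint.
class HintedTable {
 public:
  using Map = std::map<std::string, std::string>;

  explicit HintedTable(std::size_t prefix_len) : prefix_len_(prefix_len) {}

  // Inserts key/value; an existing key keeps its first value.
  void Insert(const std::string& key, const std::string& value) {
    const std::string prefix = key.substr(0, prefix_len_);
    auto hint = hints_.find(prefix);
    Map::iterator pos;
    if (hint == hints_.end()) {
      pos = data_.emplace(key, value).first;  // no hint yet: normal insert
    } else {
      // For an ascending stream the new key belongs right after the
      // previous one. A wrong hint costs time but not correctness.
      pos = data_.emplace_hint(std::next(hint->second), key, value);
    }
    assert(pos != data_.end());
    hints_[prefix] = pos;  // std::map iterators stay valid across inserts
  }

  const Map& data() const { return data_; }

 private:
  std::size_t prefix_len_;
  Map data_;
  std::unordered_map<std::string, Map::iterator> hints_;
};
```

Interleaved streams such as `a:1, b:1, a:2, b:2` each keep their own hint, so every insert that continues a stream in ascending order starts at the right position, while an out-of-order key simply falls back to a regular search.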
Closes https://github.com/facebook/rocksdb/pull/1419

Differential Revision: D4079367

Pulled By: yiwu-arbug

fbshipit-source-id: 3555326
2016-11-13 19:09:18 -08:00
Lijun Tang
adb665e0bf Allowed delayed_write_rate option to be dynamically set.
Summary: Closes https://github.com/facebook/rocksdb/pull/1488

Differential Revision: D4157784

Pulled By: siying

fbshipit-source-id: f150081
2016-11-12 15:54:11 -08:00
Andrew Kryczka
4e20c5da20 Store internal keys in TombstoneMap
Summary:
This fixes a correctness issue where ranges with same begin key would overwrite each other.

This diff uses InternalKey as TombstoneMap's key such that all tombstones have unique keys even when their start keys overlap. We also update TombstoneMap to use an internal key comparator.
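
The difference between keying on the start key and keying on an internal key can be sketched with `std::map`. The types below are hypothetical simplifications: an internal key is modeled as a (user key, sequence number) pair ordered by user key ascending and sequence number descending, and `IsCovered` is a linear scan used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical range tombstone covering [start, end) at sequence `seq`.
struct RangeTombstoneSketch {
  std::string start;
  std::string end;
  std::uint64_t seq;
};

// Simplified internal key: (user key, sequence number).
using InternalKeySketch = std::pair<std::string, std::uint64_t>;

// User key ascending, then sequence number descending (newest first).
struct InternalKeyLess {
  bool operator()(const InternalKeySketch& a,
                  const InternalKeySketch& b) const {
    if (a.first != b.first) return a.first < b.first;
    return a.second > b.second;
  }
};

using TombstoneMapSketch =
    std::map<InternalKeySketch, RangeTombstoneSketch, InternalKeyLess>;

// Tombstones sharing a start key get distinct map keys via `seq`.
void AddTombstone(TombstoneMapSketch* map, const RangeTombstoneSketch& t) {
  assert(map != nullptr && t.start < t.end);
  map->emplace(InternalKeySketch(t.start, t.seq), t);
}

// A key written at `key_seq` is deleted if a newer tombstone spans it.
bool IsCovered(const TombstoneMapSketch& map, const std::string& key,
               std::uint64_t key_seq) {
  for (const auto& entry : map) {
    const RangeTombstoneSketch& t = entry.second;
    if (t.start <= key && key < t.end && t.seq > key_seq) return true;
  }
  return false;
}
```

With a map keyed on the start key alone, adding `["a", "c")` and then `["a", "z")` would leave a single entry, which is the overwrite problem described above.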

End-to-end tests pass and are here (https://gist.github.com/ajkr/851ffe4c1b8a15a68d33025be190a7d9) but cannot be included yet since the DeleteRange() API is yet to be checked in. Note both tests failed before this fix.
Closes https://github.com/facebook/rocksdb/pull/1484

Differential Revision: D4155248

Pulled By: ajkr

fbshipit-source-id: 304b4b9
2016-11-09 15:09:18 -08:00
Andrew Kryczka
f998c9790f DeleteRange Get support
Summary:
During Get()/MultiGet(), build up a RangeDelAggregator with range
tombstones as we search through live memtable, immutable memtables, and
SST files. This aggregator is then used by memtable.cc's SaveValue() and
GetContext::SaveValue() to check whether keys are covered.

Added tests for Get on memtables/files; end-to-end tests are mainly in https://reviews.facebook.net/D64761
Closes https://github.com/facebook/rocksdb/pull/1456

Differential Revision: D4111271

Pulled By: ajkr

fbshipit-source-id: 6e388d4
2016-11-03 18:54:20 -07:00