Commit Graph

32 Commits

Author SHA1 Message Date
Mike Kolupaev
e85f2c64cb Prevent empty memtables from using a lot of memory
Summary:
This fixes OOMs that we (logdevice) are currently having in production.

SkipListRep constructor does a couple small allocations from ConcurrentArena (see InlineSkipList constructor). ConcurrentArena would sometimes allocate an entire block for that, which is a few megabytes (we use Options::arena_block_size = 4 MB). So an empty memtable can take take 4 MB of memory. We have ~40k column families (spread across 15 DB instances), so 4 MB per empty memtable easily OOMs a machine for us.

This PR makes ConcurrentArena always allocate from Arena's inline block when possible. So as long as InlineSkipList's initial allocations are below 2 KB there would be no blocks allocated for empty memtables.
Closes https://github.com/facebook/rocksdb/pull/2569

Differential Revision: D5404029

Pulled By: al13n321

fbshipit-source-id: 568ec22a3fd1a485c06123f6b2dfc5e9ef67cd23
2017-07-28 15:58:43 -07:00
Siying Dong
3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Siying Dong
95b0e89b5d Improve write buffer manager (and allow the size to be tracked in block cache)
Summary:
Improve write buffer manager in several ways:
1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
2. We start to trigger memtable flush when 7/8 of the memory cap hits, instead of 100%, and make 100% much harder to hit.
3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
Closes https://github.com/facebook/rocksdb/pull/2350

Differential Revision: D5110648

Pulled By: siying

fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
2017-06-02 14:26:56 -07:00
Siying Dong
d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Nathan Bronson
7d87f02799 support for concurrent adds to memtable
Summary:
This diff adds support for concurrent adds to the skiplist memtable
implementations.  Memory allocation is made thread-safe by the addition of
a spinlock, with small per-core buffers to avoid contention.  Concurrent
memtable writes are made via an additional method and don't impose a
performance overhead on the non-concurrent case, so parallelism can be
selected on a per-batch basis.

Write thread synchronization is an increasing bottleneck for higher levels
of concurrency, so this diff adds --enable_write_thread_adaptive_yield
(default off).  This feature causes threads joining a write batch
group to spin for a short time (default 100 usec) using sched_yield,
rather than going to sleep on a mutex.  If the timing of the yield calls
indicates that another thread has actually run during the yield then
spinning is avoided.  This option improves performance for concurrent
situations even without parallel adds, although it has the potential to
increase CPU usage (and the heuristic adaptation is not yet mature).

Parallel writes are not currently compatible with
inplace updates, update callbacks, or delete filtering.
Enable it with --allow_concurrent_memtable_write (and
--enable_write_thread_adaptive_yield).  Parallel memtable writes
are performance neutral when there is no actual parallelism, and in
my experiments (SSD server-class Linux and varying contention and key
sizes for fillrandom) they are always a performance win when there is
more than one thread.

Statistics are updated earlier in the write path, dropping the number
of DB mutex acquisitions from 2 to 1 for almost all cases.

This diff was motivated and inspired by Yahoo's cLSM work.  It is more
conservative than cLSM: RocksDB's write batch group leader role is
preserved (along with all of the existing flush and write throttling
logic) and concurrent writers are blocked until all memtable insertions
have completed and the sequence number has been advanced, to preserve
linearizability.

My test config is "db_bench -benchmarks=fillrandom -threads=$T
-batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000
--block_size=16384 --allow_concurrent_memtable_write" on a two-socket
Xeon E5-2660 @ 2.2Ghz with lots of memory and an SSD hard drive.  With 1
thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
-N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
across both sockets happens at 30 threads, and is ~900Kops/sec, although
with fewer threads there is less performance loss when the system has
background work.

Test Plan:
1. concurrent stress tests for InlineSkipList and DynamicBloom
2. make clean; make check
3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
6. make clean; OPT=-DROCKSDB_LITE make check
7. verify no perf regressions when disabled

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba

Differential Revision: https://reviews.facebook.net/D50589
2015-12-25 11:03:40 -08:00
Yueh-Hsuan Chiang
e4861e7d68 Fixed a compile error in util/arena.h
Summary:
Fixed the compile error in util/arena.h caused by not
including TLB related header.

Test Plan: make db_stress

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47697
2015-09-28 11:51:32 -07:00
Yueh-Hsuan Chiang
0fdb4f1688 Fixed a compile warning in util/arena.cc when hugetlb is not supported.
Summary:
Fixed the following compile warning when hugetlb is not supported.

    ./util/arena.h:102:10: error: private field 'hugetlb_size_' is not used [-Werror,-Wunused-private-field]
      size_t hugetlb_size_ = 0;
             ^
    1 error generated.

Test Plan: make db_stress

Reviewers: igor, sdong, anthony, IslamAbdelRahman

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D47691
2015-09-28 11:43:04 -07:00
stash93
53996149d4 Removing unnecessary kInlineSize
Summary: Remove unnecessary rocksdb::kInlineSize, since it's not used and there is rocksdb::Arena::kInlineSize.

Test Plan: make all check

Reviewers: igor

Reviewed By: igor

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D34905
2015-03-12 21:13:53 +03:00
Jonah Cohen
a14b7873ee Enforce write buffer memory limit across column families
Summary:
Introduces a new class for managing write buffer memory across column
families.  We supplement ColumnFamilyOptions::write_buffer_size with
ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
instance that enforces memory limits before flushing out to disk.

Test Plan: Added SharedWriteBuffer unit test to db_test.cc

Reviewers: sdong, rven, ljin, igor

Reviewed By: igor

Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim

Differential Revision: https://reviews.facebook.net/D22581
2014-12-02 12:09:20 -08:00
Shaohua Li
1410180167 Make arena use hugepage if possible
Summary:
arena doesn't use huge page by default. This change will make it happen
if possible. A new paramerter is added for Arena(). If it's set, Arena
will use huge page always. If huge page allocation fails, Arena
allocation will fallback to malloc().

Test Plan:
Change util/arena_test to support huge page allocation.
Run below tests:
1. normal regression test:
  make check
2. Check if huge page allocation works
  echo 50 > /proc/sys/vm/nr_hugepages
  make check

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba

Differential Revision: https://reviews.facebook.net/D28647
2014-11-21 14:11:40 -08:00
Igor Canadi
c7ee9c3ab7 Fix -Wnon-virtual-dtor errors
Summary: This breaks mongo+rocks build https://mci.10gen.com/task_log_raw/rocksdb_ubuntu1404_rocksdb_c6e8e3d868660dc66b3bbd438cdc135df6356c5a_14_11_10_21_36_10_compile_ubuntu1404_rocksdb/0?type=T

Test Plan: m check + -Wnon-virtual-dtor

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28653
2014-11-10 17:39:38 -05:00
sdong
df9069d23f In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.

Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.

Test Plan: make all check

Reviewers: ljin, haobo

Reviewed By: haobo

Subscribers: leveldb, dhruba, yhchiang, igor

Differential Revision: https://reviews.facebook.net/D18513
2014-06-02 17:44:57 -07:00
Chilledheart
fa16ef394c Fix errors while building with clang 2014-05-15 12:34:53 +08:00
Igor Canadi
a490d2d3a0 Revert "Define kInlineSize in .cc file instead of header"
This reverts commit 35a8873aa6.
2014-05-14 12:36:21 -07:00
Igor Canadi
35a8873aa6 Define kInlineSize in .cc file instead of header 2014-05-14 12:29:46 -07:00
sdong
3e4a9ec241 Arena to inline 2KB of data in it.
Summary:
In order to use arena to a use case that the total allocation size might be small (LogBuffer is already such a case), inline 1KB of data in it, so that it can be mostly in stack or inline in another class.

If always inlining 2KB is a concern, I could make it a template to determine what to inline. However, dependents need to changes. Doesn't go with it for now

Test Plan: make all check.

Reviewers: haobo, igor, yhchiang, dhruba

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18609
2014-05-14 11:49:01 -07:00
sdong
3a171dcb51 Pass logger to memtable rep and TLB page allocation error logged to info logs
Summary:
TLB page allocation errors are now logged to info logs, instead of stderr.
In order to do that, mem table rep's factory functions take a info logger now.

Test Plan: make all check

Reviewers: haobo, igor, yhchiang

Reviewed By: yhchiang

CC: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D18471
2014-05-05 16:43:37 -07:00
sdong
4a7c747064 Revert "Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB""
And make the default 0 for hash linked list memtable

This reverts commit d69dc64be7.
2014-05-04 13:56:29 -07:00
Igor Canadi
d69dc64be7 Revert "Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB"
This reverts commit 7dafa3a1d7.
2014-05-04 08:37:09 -07:00
sdong
7dafa3a1d7 Allow allocating dynamic bloom, plain table indexes and hash linked list from huge page TLB
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.

Test Plan: make all check

Reviewers: haobo, ljin

Reviewed By: haobo

CC: nkg-, dhruba, leveldb, igor, yhchiang

Differential Revision: https://reviews.facebook.net/D18357
2014-04-30 11:02:26 -07:00
Igor Canadi
3c75cc15a9 Fix HashSkipList and HashLinkedList SIGSEGV
Summary:
Original Summary:
Yesterday, @ljin and I were debugging various db_stress issues. We suspected one of them happens when we concurrently call NewIterator without prefix_seek on HashSkipList. This test demonstrates it.

Update:
Arena is not thread-safe!! When creating a new full iterator, we *have* to create a new arena, otherwise we're doomed.

Test Plan: SIGSEGV and assertion-throwing test now works!

Reviewers: ljin, haobo, sdong

Reviewed By: sdong

CC: leveldb, ljin

Differential Revision: https://reviews.facebook.net/D16857
2014-03-14 10:02:04 -07:00
Kai Liu
11da8bc5df A heuristic way to check if a memtable is full
Summary:
This is is based on https://reviews.facebook.net/D15027. It's not finished but I would like to give a prototype to avoid arena over-allocation while making better use of the already allocated memory blocks.

Instead of check approximate memtable size, we will take a deeper look at the arena, which incorporate essential idea that @sdong suggests: flush when arena has allocated its last and the last is "almost full"

Test Plan: N/A

Reviewers: haobo, sdong

Reviewed By: sdong

CC: leveldb, sdong

Differential Revision: https://reviews.facebook.net/D15051
2014-03-12 16:40:14 -07:00
kailiu
906f3dca72 Add a hash-index component for block
Summary:
this is the key component extracted from diff: https://reviews.facebook.net/D14271
I separate it to a dedicated patch to make the review easier.

Test Plan: added a unit test and passed it.

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16245
2014-03-03 21:11:49 -08:00
kailiu
4e0298f23c Clean up arena API
Summary:
Easy thing goes first. This patch moves arena to internal dir; based
on which, the coming patch will deal with memtable_rep.

Test Plan: make check

Reviewers: haobo, sdong, dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15615
2014-01-30 22:10:10 -08:00
Xing Jin
0f0a24e298 Make arena block size configurable
Summary:
Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size.

I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer.

Test Plan:
new unit test, I am running db_test.

For passing paramter from configured option to Arena, I tried tests like:

  TEST(DBTest, Arena_Option) {
  std::string dbname = test::TmpDir() + "/db_arena_option_test";
  DestroyDB(dbname, Options());

  DB* db = nullptr;
  Options opts;
  opts.create_if_missing = true;
  opts.arena_block_size = 1000000; // tested 99, 999999
  Status s = DB::Open(opts, dbname, &db);
  db->Put(WriteOptions(), "a", "123");
  }

and printed some debug info. The results look good. Any suggestion for such a unit-test?

Reviewers: haobo, dhruba, emayanke, jpaton

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D11799
2013-07-31 12:42:23 -07:00
Hans Wennborg
36a5f8ed7f A number of fixes:
- Replace raw slice comparison with a call to user comparator.
  Added test for custom comparators.

- Fix end of namespace comments.

- Fixed bug in picking inputs for a level-0 compaction.

  When finding overlapping files, the covered range may expand
  as files are added to the input set.  We now correctly expand
  the range when this happens instead of continuing to use the
  old range.  For example, suppose L0 contains files with the
  following ranges:

      F1: a .. d
      F2:    c .. g
      F3:       f .. j

  and the initial compaction target is F3.  We used to search
  for range f..j which yielded {F2,F3}.  However we now expand
  the range as soon as another file is added.  In this case,
  when F2 is added, we expand the range to c..j and restart the
  search.  That picks up file F1 as well.

  This change fixes a bug related to deleted keys showing up
  incorrectly after a compaction as described in Issue 44.

(Sync with upstream @25072954)
2011-10-31 17:22:06 +00:00
dgrogan@chromium.org
69c6d38342 reverting disastrous MOE commit, returning to r21
git-svn-id: https://leveldb.googlecode.com/svn/trunk@23 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-19 23:11:15 +00:00
dgrogan@chromium.org
b743906eea Revision created by MOE tool push_codebase.
MOE_MIGRATION=


git-svn-id: https://leveldb.googlecode.com/svn/trunk@22 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-19 23:01:25 +00:00
dgrogan@chromium.org
b409afe968 chmod a-x
git-svn-id: https://leveldb.googlecode.com/svn/trunk@21 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-18 23:15:58 +00:00
dgrogan@chromium.org
f779e7a5d8 @20602303. Default file permission is now 755.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@20 62dab493-f737-651d-591e-8d6aee1b9529
2011-04-12 19:38:58 +00:00
jorlow@chromium.org
f67e15e50f Initial checkin.
git-svn-id: https://leveldb.googlecode.com/svn/trunk@2 62dab493-f737-651d-591e-8d6aee1b9529
2011-03-18 22:37:00 +00:00