Commit Graph

167 Commits

Author SHA1 Message Date
Yueh-Hsuan Chiang
bf287b76e0 Add structures for exposing thread events and operations.
Summary:
Add structures for exposing events and operations.  An event describes
a high-level action of a thread, such as doing compaction or
doing flush, while an operation describes a lower-level action
of a thread, such as reading / writing an SST table or waiting for
a mutex.  Events and operations are designed to be independent.
One thread would typically be involved in one event and one operation.

Code instrumentation will be in a separate diff.

Test Plan:
Add unit-tests in thread_list_test
make dbg -j32
./thread_list_test
export ROCKSDB_TESTS=ThreadList
./db_test

Reviewers: ljin, igor, sdong

Reviewed By: sdong

Subscribers: rven, jonahcohen, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D29781
2014-12-30 10:39:13 -08:00
Yueh-Hsuan Chiang
45bab305f9 Move GetThreadList() feature under Env.
Summary:
GetThreadList() feature depends on the thread creation and destruction, which is currently handled under Env.
This patch moves GetThreadList() feature under Env to better manage the dependency of GetThreadList() feature
on thread creation and destruction.

Renamed ThreadStatusImpl to ThreadStatusUpdater.  Add ThreadStatusUtil, which is a static class that contains
utility functions for ThreadStatusUpdater.

Test Plan: run db_test, thread_list_test and db_bench and verify the life cycle of Env and ThreadStatusUpdater is properly managed.

Reviewers: igor, sdong

Reviewed By: sdong

Subscribers: ljin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30057
2014-12-22 12:20:17 -08:00
Venkatesh Radhakrishnan
7198ed5a2e Handle errors during pthread calls
Summary: Release locks before calling exit.

Test Plan: Force errors in debugger and verify correctness

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30423
2014-12-17 16:25:09 -08:00
Igor Canadi
cb82d7b081 Fix #434
Summary: Why do we assert here? This doesn't seem like user friendly thing to do :)

Test Plan: none

Reviewers: sdong, yhchiang, rven

Reviewed By: rven

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D30027
2014-12-09 10:22:07 -08:00
Yueh-Hsuan Chiang
a97314219e Fix compile error in ROCKSDB_LITE 2014-11-25 23:39:52 -08:00
Yueh-Hsuan Chiang
d0c5f28a5c Introduce GetThreadList API
Summary:
Add GetThreadList API, which allows developer to track the
status of each process.  Currently, calling GetThreadList will
only get the list of background threads in RocksDB with their
thread-id and thread-type (priority) set.  Will add more support
on this in the later diffs.

ThreadStatus currently has the following properties:

  // A unique ID for the thread.
  const uint64_t thread_id;

  // The type of the thread, it could be ROCKSDB_HIGH_PRIORITY,
  // ROCKSDB_LOW_PRIORITY, and USER_THREAD
  const ThreadType thread_type;

  // The name of the DB instance where the thread is currently
  // involved with.  It would be set to empty string if the thread
  // does not involve in any DB operation.
  const std::string db_name;

  // The name of the column family where the thread is currently
  // involved with.  It would be set to empty string if the thread
  // does not involve in any column family.
  const std::string cf_name;

  // The event that the current thread is involved.
  // It would be set to empty string if the information about event
  // is not currently available.

Test Plan:
./thread_list_test
export ROCKSDB_TESTS=GetThreadList
./db_test

Reviewers: rven, igor, sdong, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D25047
2014-11-20 10:49:32 -08:00
Venkatesh Radhakrishnan
6c1b040cc9 Provide openable snapshots
Summary: Store links to live files in directory on same disk

Test Plan:
Take snapshot and open it. Added a test GetSnapshotLink in
db_test.

Reviewers: sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28713
2014-11-14 11:38:26 -08:00
Igor Canadi
25f273027b Fix iOS compile with -Wshorten-64-to-32
Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :(

Test Plan: TARGET_OS=IOS make static_lib

Reviewers: dhruba, ljin, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28743
2014-11-13 14:39:30 -05:00
Igor Canadi
767777c2bd Turn on -Wshorten-64-to-32 and fix all the errors
Summary:
We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details.

This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables.
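
A minimal sketch of what a careful 64-bit to 32-bit conversion can look like. The helper name `NarrowToU32` is hypothetical and is not taken from the RocksDB source; it only illustrates checking the range before the explicit cast that the warning asks for.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper, for illustration only: narrows a 64-bit value to
// 32 bits and reports whether the value fit without truncation.
inline bool NarrowToU32(uint64_t value, uint32_t* out) {
  if (value > std::numeric_limits<uint32_t>::max()) {
    return false;  // the cast below would silently drop the high bits
  }
  *out = static_cast<uint32_t>(value);  // explicit cast, so no warning
  return true;
}
```

A bare `static_cast` silences the warning as well, but it hides any truncation; the range check makes the failure case visible to the caller.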

Test Plan: compiles

Reviewers: ljin, rven, yhchiang, sdong

Reviewed By: yhchiang

Subscribers: bobbaldwin, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D28689
2014-11-11 16:47:22 -05:00
sdong
09899f0b51 DB::Open() to automatically increase thread pool size if it is smaller than max number of parallel compactions or flushes
Summary:
With the patch, thread pool size will be automatically increased if DB's options ask for more parallelism of compactions or flushes.

Too many users have been confused by the API. Change it to make it harder for users to make mistakes.

Test Plan: Add two unit tests to cover the function.

Reviewers: yhchiang, rven, igor, MarkCallaghan, ljin

Reviewed By: ljin

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27555
2014-11-03 17:22:34 -08:00
Lei Jin
17be187ff9 dummy var to suppress compiler warning/error
Summary: Removed this in D25641, causing the compiler to complain. Put it back.

Test Plan: make release

Reviewers: igor, yhchiang, rven, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D27891
2014-10-29 15:38:34 -07:00
Lei Jin
44f0ff31c2 use fallocate(FALLOC_FL_PUNCH_HOLE) to release unused blocks at the end of file
Summary:
ftruncate does not always free preallocated unused space at the end of file.
In some cases, we pin more disk space than we should.

Test Plan: env_test

Reviewers: sdong, rven, yhchiang, igor

Reviewed By: igor

Subscribers: nkg-, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D25641
2014-10-29 12:24:49 -07:00
Igor Canadi
965d9d50b8 Fix timing 2014-10-24 11:58:15 -07:00
Igor Canadi
001ce64dc7 Use chrono for timing
Summary: Since we depend on C++11, we might as well use it for timing, instead of this platform-dependent code.
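
As a rough illustration of portable timing with `<chrono>`, here is a minimal sketch. The helper names `NowMicros` and `TimeMicros` are assumptions made for this example and are not claimed to match the code in the commit.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: microseconds elapsed on a monotonic clock.
inline uint64_t NowMicros() {
  const auto now = std::chrono::steady_clock::now().time_since_epoch();
  return static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::microseconds>(now).count());
}

// Runs a callable once and returns how long it took, in microseconds.
template <typename Fn>
uint64_t TimeMicros(Fn&& fn) {
  const uint64_t start = NowMicros();
  fn();
  return NowMicros() - start;
}
```

`steady_clock` is used here because it never goes backwards, which matters when subtracting two readings.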

Test Plan: Ran autovector_test, which reports time and confirmed that output is similar to master

Reviewers: ljin, sdong, yhchiang, rven, dhruba

Reviewed By: dhruba

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D25587
2014-10-24 10:11:57 -07:00
Lei Jin
7e9f28cb23 limit max bytes that can be read/written per pread/write syscall
Summary:
BlockBasedTable sst file size can grow to a large size when universal
compaction is used. When the index block exceeds 2G, pread seems to fail and
return truncated data, which causes a "truncated block" error. I tried to use
```
  #define _FILE_OFFSET_BITS 64
```
But the problem still persists. Splitting a big write/read into smaller
batches seems to solve the problem.
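
The batching idea can be sketched as follows. This is an illustration under assumptions, not the code from the diff: `ReadInChunks` and its `read_fn` parameter are hypothetical, with `read_fn` standing in for a `pread`-like call so the loop can be shown without a file descriptor.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch. read_fn has the shape
//   long read_fn(char* dst, size_t n, uint64_t offset)
// and returns the bytes read, 0 at end of file, or -1 on error.
// max_chunk must be greater than zero.
template <typename ReadFn>
long ReadInChunks(ReadFn read_fn, char* dst, size_t n, uint64_t offset,
                  size_t max_chunk) {
  size_t done = 0;
  while (done < n) {
    // Never ask the underlying call for more than max_chunk bytes.
    const size_t want = std::min(n - done, max_chunk);
    const long got = read_fn(dst + done, want, offset + done);
    if (got < 0) return -1;  // propagate the error
    if (got == 0) break;     // end of file
    done += static_cast<size_t>(got);
  }
  return static_cast<long>(done);
}
```

The loop also handles short reads, since it advances by the number of bytes actually returned rather than the number requested.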

Test Plan:
successfully compacted a case with resulting sst file at ~90G (2.1G
index block size)

Reviewers: yhchiang, igor, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22569
2014-08-29 21:21:49 -07:00
Igor Canadi
d9c0785812 Fix assertion in PosixRandomAccessFile
Summary:
See https://github.com/facebook/rocksdb/issues/244#issuecomment-53372297
Also see this: https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1075

Test Plan: compiles

Reviewers: yhchiang, ljin, sdong

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D22419
2014-08-26 15:28:36 -04:00
ZHANG Biao
8dfe2fdd51 fix compile error under Mac OS X 2014-08-14 20:01:01 +08:00
Lei Jin
58c49466d2 Allow env_posix to lower background thread IO priority
Summary: This is a linux-specific system call.

Test Plan: ran db_bench

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: haobo, leveldb

Differential Revision: https://reviews.facebook.net/D21183
2014-08-13 20:49:58 -07:00
Lei Jin
534357ca3a integrate rate limiter into rocksdb
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.

Test Plan: db_test

Reviewers: igor, yhchiang, sdong

Reviewed By: sdong

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19425
2014-07-08 12:31:49 -07:00
Yueh-Hsuan Chiang
90a6aca48e Finer report I/O stats about Flush and Compaction.
Summary:
This diff allows the I/O stats about Flush and Compaction to be reported
in a more accurate way.  Instead of measuring the size of a file, it
measures I/O cost on a per read / write basis.

Test Plan: make all check

Reviewers: sdong, igor, ljin

Reviewed By: ljin

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19383
2014-07-03 16:28:03 -07:00
Igor Canadi
d3f63f03ad Fix 32-bit errors
Summary: https://www.facebook.com/groups/rocksdb.dev/permalink/590438347721350/

Test Plan: compiles

Reviewers: sdong, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: leveldb

Differential Revision: https://reviews.facebook.net/D19197
2014-07-02 11:40:16 +02:00
sdong
9899b12780 ThreadID printed when Thread terminating in the same format as posix_logger
Summary: 220132b65e correctly fixed the issue of thread ID printing when terminating a thread. Nothing wrong with it. This diff prints the ID in the same way as in PosixLogger::logv() so that users can correlate them more easily.

Test Plan: run env_test and make sure it prints correctly.

Reviewers: igor, haobo, ljin, yhchiang

Reviewed By: yhchiang

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D18819
2014-05-29 11:11:08 -07:00
Chilledheart
81b498bc15 Print pthread_t in a more safe way 2014-05-22 01:24:42 +08:00
sdong
bd1105aa5a Print out thread ID while thread terminates for decreased pool size.
Summary: Per request from @nkg-, temporarily print thread ID when a thread terminates. It is a temp solution as we try to minimize stderr messages.

Test Plan: env_test

Reviewers: haobo, igor, dhruba

Reviewed By: igor

CC: nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D18753
2014-05-19 15:18:02 -07:00
sdong
3df07d1703 ThreadPool to allow decrease number of threads and increase of number of threads is to be instantly scheduled
Summary:
Add a feature to decrease the number of threads in thread pool.
Also instantly schedule more threads if number of threads is increased.

Here is the way it is implemented: each background thread needs its thread ID. After decreasing number of threads, all threads are woken up. The thread with the largest thread ID will terminate. If there are more threads to terminate, the thread will wake up all threads again.

Another change is made so that when number of threads is increased, more threads are created and all previous excessive threads are woken up to do the work.
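
The termination rule described above can be modelled in a few lines. This is a simplified sketch with hypothetical names (`ShouldTerminate`, `ShrinkPool`); it leaves out the mutex and condition variable that a real pool would need and only shows which thread decides to exit after each wake-up.

```cpp
#include <cstddef>

// Hypothetical model. Threads are numbered 0..num_threads-1 in creation
// order; limit is the requested pool size.
inline bool ShouldTerminate(size_t thread_id, size_t num_threads,
                            size_t limit) {
  const bool has_excess = num_threads > limit;
  // Only the thread with the largest ID exits in a given round.
  return has_excess && thread_id + 1 == num_threads;
}

// Simulates repeated wake-ups: each round the highest-numbered thread
// exits, until the pool is down to the limit. Returns the number of exits.
inline size_t ShrinkPool(size_t num_threads, size_t limit) {
  size_t exits = 0;
  while (num_threads > 0 &&
         ShouldTerminate(num_threads - 1, num_threads, limit)) {
    --num_threads;
    ++exits;
  }
  return exits;
}
```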

Test Plan: Add a unit test.

Reviewers: haobo, dhruba

Reviewed By: haobo

CC: yhchiang, igor, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D18675
2014-05-19 11:52:12 -07:00
Igor Canadi
72ff275e3c Fix TransactionLogIterator EOF caching
Summary:
When TransactionLogIterator comes to EOF, it calls UnmarkEOF and continues reading. However, if glibc cached the EOF status of the file, it will get EOF again, even though the new data might have been written to it.

This has been causing errors in Mac OS.

Test Plan: test passes, was failing before

Reviewers: dhruba, haobo, sdong

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18381
2014-04-28 23:30:27 -04:00
Igor Canadi
c2da9e5997 Flush before Fsync()/Sync()
Summary: Calling Fsync()/Sync() on a file should give the guarantee that whatever you have written to the file is now persisted. This is currently not the case, since we might have some data left in application cache as we do Fsync()/Sync(). For example, BuildTable() calls Fsync() without the flush, assuming all sst data is now persisted, but it's actually not. This may result in big inconsistencies.
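
To make the ordering concrete, here is a small in-memory model. The class `BufferedFileModel` is purely hypothetical: `os_` stands in for bytes handed to the operating system and `disk_` for bytes made durable, so no real file I/O takes place.

```cpp
#include <string>

// Hypothetical in-memory model of a buffered writable file.
class BufferedFileModel {
 public:
  // Appends land in the application-level buffer first.
  void Append(const std::string& data) { buffer_ += data; }

  // Moves application-buffered bytes to the OS (the write() step).
  void Flush() {
    os_ += buffer_;
    buffer_.clear();
  }

  // Flushes first, so everything appended so far is covered by the sync.
  void Sync() {
    Flush();
    disk_ = os_;
  }

  const std::string& persisted() const { return disk_; }

 private:
  std::string buffer_;  // application cache
  std::string os_;      // handed to the OS, not yet durable
  std::string disk_;    // durable
};
```

Without the `Flush()` call inside `Sync()`, bytes still sitting in `buffer_` would be missing from `disk_` even though the caller asked for a sync.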

Test Plan: no test

Reviewers: sdong, dhruba, haobo, ljin, yhchiang

Reviewed By: sdong

CC: leveldb

Differential Revision: https://reviews.facebook.net/D18159
2014-04-21 17:45:04 -07:00
Yueh-Hsuan Chiang
fa84eb1f7b Fixed a compile error which tries to check whether a size_t < 0 in env_posix.cc
Summary:
Fixed a compile error which tries to check whether a size_t < 0 in env_posix.cc

util/env_posix.cc:180:16: error: comparison of unsigned expression < 0 is always false [-Werror,-Wtautological-compare]
    } while (r < 0 && errno == EINTR);
             ~ ^ ~
1 error generated.

Test Plan: make check all

Reviewers: igor, haobo

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17379
2014-04-01 11:09:06 -07:00
Igor Canadi
726c8084cd Retry FS system calls on EINTR
Summary: EINTR means 'please retry'. We don't do that currently. We should.
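
A minimal sketch of such a retry loop, assuming a POSIX environment. The wrapper name `RetryOnEintr` is hypothetical; the callable stands in for a system call such as `read` or `write` that returns a negative value and sets `errno` on failure.

```cpp
#include <cerrno>
#include <sys/types.h>  // ssize_t (POSIX)

// Hypothetical wrapper: repeats the call while it fails with EINTR.
template <typename Call>
ssize_t RetryOnEintr(Call call) {
  ssize_t r;
  do {
    r = call();
  } while (r < 0 && errno == EINTR);  // any other error ends the loop
  return r;
}
```

The result is kept in a signed `ssize_t`. With an unsigned type such as `size_t`, the `r < 0` test could never be true and the loop would never retry.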

Test Plan: make check, although it doesn't really test the new code. we'll just have to believe in the code!

Reviewers: haobo, ljin

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17349
2014-03-31 14:45:26 -07:00
Igor Canadi
64ae6e9eb9 Don't preallocate log files 2014-03-28 15:04:43 -07:00
Igor Canadi
5c44a8db61 fallocate_with_keep_size is false for LogWrites 2014-03-25 12:53:23 -07:00
Igor Canadi
22507aff6c Fix compile issue in Mac OS
Summary:
Compile issues are:
* Unused variable env_
* Unused fallocate_with_keep_size_

Test Plan: compiles

Reviewers: dhruba, haobo, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D17043
2014-03-19 15:40:12 -07:00
Igor Canadi
f26cb0f093 Optimize fallocation
Summary:
Based on my recent findings (posted in our internal group), if we use fallocate without KEEP_SIZE flag, we get superior performance of fdatasync() in append-only workloads.

This diff provides an option for user to not use KEEP_SIZE flag, thus optimizing his sync performance by up to 2x-3x.

At one point we also just called posix_fallocate instead of fallocate, which isn't very fast: http://code.woboq.org/userspace/glibc/sysdeps/posix/posix_fallocate.c.html (tl;dr it manually writes out zero bytes to allocate storage). This diff also fixes that, by first calling fallocate and then posix_fallocate if fallocate is not supported.

Test Plan: make check

Reviewers: dhruba, sdong, haobo, ljin

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16761
2014-03-17 21:52:14 -07:00
sdong
01dcef114b Env to add a function to allow users to query waiting queue length
Summary: Add a function to Env so that users can query the waiting queue length of each thread pool

Test Plan: add a test in env_test

Reviewers: haobo

Reviewed By: haobo

CC: dhruba, igor, yhchiang, ljin, nkg-, leveldb

Differential Revision: https://reviews.facebook.net/D16755
2014-03-11 10:19:02 -07:00
Yumikiyo Osanai
056a0286d2 Modify the compile error about ftruncate()
Summary:
Change to store the return value from ftruncate().
The reason is that ftruncate() has "warn_unused_result" attribute in some environment.

Signed-off-by: Yumikiyo Osanai <yumios.art@gmail.com>
2014-03-08 02:14:34 +09:00
Igor Canadi
26ac5603f4 Truncate unused space on PosixWritableFile::Close()
Summary:
Blocks allocated with fallocate will take extra space on disk even if they are unused and the file is closed.

Now we remove the extra blocks at the end of the file by calling `ftruncate`.

Test Plan: added a test to env_test

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D16647
2014-03-06 15:59:27 -08:00
Lei Jin
b2795b799e thread local pointer storage
Summary:
This is not a generic thread local implementation in the sense that it
only takes pointer. But it does support multiple instances per thread
and lets user plugin function to perform cleanup when thread exits or an
instance gets destroyed.

Test Plan: unit test for now

Reviewers: haobo, igor, sdong, dhruba

Reviewed By: igor

CC: leveldb, kailiu

Differential Revision: https://reviews.facebook.net/D16131
2014-02-25 17:47:37 -08:00
Igor Canadi
d53b188228 Fix some errors detected by coverity scan
Summary: Nothing major, just an extra return line and possibility of leaking fb in NewRandomRWFile

Test Plan: make check

Reviewers: kailiu, dhruba

Reviewed By: kailiu

CC: leveldb

Differential Revision: https://reviews.facebook.net/D15993
2014-02-06 21:59:44 -08:00
kailiu
84f8185fc0 Merge branch 'master' into performance
Conflicts:
	HISTORY.md
	db/db_impl.cc
	db/memtable.cc
2014-02-05 21:21:00 -08:00
Kai Liu
87bda51d77 Merge pull request #58 from mlin/no-stdout
Eliminate stdout message when launching a posix thread.
2014-02-03 00:38:11 -08:00
kailiu
a5e220f5ef Merge branch 'master' into performance
Conflicts:
	Makefile
	db/db_impl.cc
	db/db_test.cc
	db/memtable_list.cc
	db/memtable_list.h
	table/block_based_table_reader.cc
	table/table_test.cc
	util/cache.cc
	util/coding.cc
2014-01-28 10:35:55 -08:00
Igor Canadi
832158e7f7 Fsync directory after we create a new file
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.

Should I also add FsyncDir() to new files that get created by a compaction?

Test Plan: Confirmed that FsyncDir is returning Status::OK()

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D14751
2014-01-27 11:02:21 -08:00
Mike Lin
4c75e21c20 Eliminate stdout message when launching a posix thread.
This seems out of place as it's the only time RocksDB prints to stdout in the
normal course of operations. Thread IDs can still be retrieved from the LOG
file: cut -d ' ' -f2 LOG | sort | uniq | egrep -x '[0-9a-f]+'
2014-01-07 10:44:02 -08:00
kailiu
f1cec73a76 Merge branch 'master' into performance
Conflicts:
	db/db_impl.cc
	db/db_test.cc
	db/memtable.cc
	db/version_set.cc
	include/rocksdb/statistics.h
2013-12-27 12:23:17 -08:00
James Golick
c28dd2a891 oops - missed a spot 2013-12-11 11:18:00 -08:00
James Golick
43c386b72e only try to use fallocate if it's actually present on the system 2013-12-10 22:34:19 -08:00
kailiu
e1d92dfd2e Fix a bunch of mac compilation issues in performance branch 2013-12-04 23:00:33 -08:00
lovro
45a2f2d8d3 Fix build without glibc
Summary: The preprocessor does not follow normal rules of && evaluation, tries to evaluate __GLIBC_PREREQ(2, 12) even though the defined() check fails.  This breaks the build if __GLIBC_PREREQ is absent.
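
One way to guard such a check is to nest the conditionals, sketched below. The macro `EXAMPLE_HAS_GLIBC_2_12` and the constant `kHasGlibc212` are hypothetical names used only for this illustration.

```cpp
#include <cstdlib>  // on glibc this pulls in <features.h>

// An undefined function-like macro cannot be invoked in an #if expression,
// even behind `defined(...) &&`, because the whole expression is parsed
// after macro replacement. Nesting keeps the invocation out of sight
// unless the macro really exists.
#if defined(__GLIBC_PREREQ)
#if __GLIBC_PREREQ(2, 12)
#define EXAMPLE_HAS_GLIBC_2_12 1
#endif
#endif

#ifndef EXAMPLE_HAS_GLIBC_2_12
#define EXAMPLE_HAS_GLIBC_2_12 0
#endif

// True only when building against glibc 2.12 or newer.
constexpr bool kHasGlibc212 = EXAMPLE_HAS_GLIBC_2_12 != 0;
```

When the outer `#if` is false, the preprocessor skips the inner group without evaluating its condition, so the build works on systems without glibc as well.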

Test Plan: Try adding #undef __GLIBC_PREREQ above the offending line, build no longer breaks

Reviewed By: igor

Blame Rev: 4c81383628
2013-12-01 11:32:54 -08:00
Kai Liu
1966b63137 Merge branch 'master' into perf 2013-11-27 11:47:40 -08:00
lovro
4c81383628 Set background thread name with pthread_setname_np()
Summary: Makes it easier to monitor performance with top

Test Plan: ./manual_compaction_test with `top -H` running.  Previously was two `manual_compacti`, now one shows `rocksdb:bg0`.

Reviewers: igor, dhruba

Reviewed By: igor

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14367
2013-11-27 11:28:06 -08:00
Siying Dong
8aac46d686 [RocksDB Performance Branch] Fix a regression bug of munmap
Summary:
Fix a stupid bug I just introduced in b59d4d5a50, which I didn't even mean to include.
GCC might remove the munmap.

Test Plan: Run it and make sure munmap succeeds

Reviewers: haobo, kailiu

Reviewed By: kailiu

CC: dhruba, reconnect.grayhat, leveldb

Differential Revision: https://reviews.facebook.net/D14361
2013-11-26 14:05:37 -08:00
Siying Dong
b59d4d5a50 A Simple Plain Table
Summary:
A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes.

Test Plan:Add unit test

Reviewers:haobo,dhruba,kailiu

CC:

Task ID: #

Blame Rev:
2013-11-20 18:44:22 -08:00
kailiu
97d8e573a6 make util/env_posix.cc work under mac
Summary: This diff involves some more complicated issues in the posix environment.

Test Plan: works under mac os. will need to verify dev box.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D14061
2013-11-16 23:44:39 -08:00
Dhruba Borthakur
b4ad5e89ae Implement a compressed block cache.
Summary:
Rocksdb can now support an uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.

It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.
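
The lookup order described above can be sketched with two maps. This is a toy model under stated assumptions: `TwoTierCacheModel`, `FakeCompress` and `FakeUncompress` are hypothetical, the "codec" is a placeholder rather than a real compression library, and LRU eviction is left out entirely.

```cpp
#include <string>
#include <unordered_map>

// Placeholder codec for illustration only; real code would call a
// compression library here.
inline std::string FakeCompress(const std::string& s) { return "z:" + s; }
inline std::string FakeUncompress(const std::string& s) { return s.substr(2); }

class TwoTierCacheModel {
 public:
  void InsertCompressed(const std::string& key, const std::string& block) {
    compressed_[key] = FakeCompress(block);
  }

  // Returns true and fills *block on a hit in either tier.
  bool Lookup(const std::string& key, std::string* block) {
    auto it = uncompressed_.find(key);
    if (it != uncompressed_.end()) {  // 1. uncompressed cache first
      *block = it->second;
      return true;
    }
    auto cit = compressed_.find(key);
    if (cit == compressed_.end()) return false;  // miss in both tiers
    *block = FakeUncompress(cit->second);        // 2. compressed cache
    uncompressed_[key] = *block;  // promote; the compressed copy stays too
    return true;
  }

  bool InUncompressed(const std::string& key) const {
    return uncompressed_.count(key) != 0;
  }

 private:
  std::unordered_map<std::string, std::string> uncompressed_;
  std::unordered_map<std::string, std::string> compressed_;
};
```

After a hit in the compressed tier the block lives in both maps, which mirrors the note that the same block may reside in both caches at the same time.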

Test Plan: Unit test case attached.

Reviewers: kailiu, sdong, haobo, leveldb

Reviewed By: haobo

CC: xjin, haobo

Differential Revision: https://reviews.facebook.net/D12675
2013-11-01 14:31:35 -07:00
Igor Canadi
b572e81f94 Flush Log every 5 seconds
Summary: This might help with p99 performance, but does not solve the real problem. More discussion on #2947135

Test Plan: make check

Reviewers: dhruba, haobo

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13809
2013-10-31 15:36:40 -07:00
Mayank Agarwal
9b50106f9a Dbid feature
Summary:
Create a new type of file on startup if it doesn't already exist called DBID.
This will store a unique number generated from boost library's uuid header file.
The use-case is to identify the case of a db losing all its data and coming back up either empty or from an image(backup/live replica's recovery)
the key point to note is that DBID is not stored in a backup or db snapshot
It's preferable to use Boost for uuid because:
1) A non-standard way of generating uuid is not good
2) /proc/sys/kernel/random/uuid generates a uuid but only on linux environments and the solution would not be clean
3) c++ doesn't have any direct way to get a uuid
4) Boost is a very good library that was already having linkage in rocksdb from third-party
Note: I had to update the TOOLCHAIN_REV in build files to get the latest version of boost from third-party as the older version had a bug.
I had to put Wno-uninitialized in Makefile because boost-1.51 has an uninitialized variable and rocksdb would not compile otherwise. The latest open-source boost is 1.54 but it is not there in third-party. I have notified the concerned people in fbcode about it.
@kailiu : While releasing to third-party, an additional dependency will need to be created for boost in TARGETS file. I can help identify.

Test Plan:
Expand db_test to test 2 cases
1) Restarting db with Id file present - verify that no change to Id
2)Restarting db with Id file deleted - verify that a different Id is there after reopen
Also run make all check

Reviewers: dhruba, haobo, kailiu, sdong

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13587
2013-10-22 12:23:34 -07:00
Dhruba Borthakur
9cd221094c Add appropriate LICENSE and Copyright message.
Summary:
Add appropriate LICENSE and Copyright message.

Test Plan:
make check

Reviewers:

CC:

Task ID: #

Blame Rev:
2013-10-16 17:48:41 -07:00
Igor Canadi
d0beadd456 Env class that can randomly read and write
Summary: I have implemented basic simple use case that I need for External Value Store I'm working on. There is a potential for making this prettier by refactoring/combining WritableFile and RandomAccessFile, avoiding some copypasta. However, I decided to implement just the basic functionality, so I can continue working on the other diff.

Test Plan: Added a unittest

Reviewers: dhruba, haobo, kailiu

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13365
2013-10-10 00:03:08 -07:00
Dhruba Borthakur
4463b11cad Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.
Summary: Migrate names of properties from 'leveldb' prefix to 'rocksdb' prefix.

Test Plan: make check

Reviewers: emayanke, haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13311
2013-10-06 00:14:26 -07:00
Dhruba Borthakur
a143ef9b38 Change namespace from leveldb to rocksdb
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.

Test Plan: compile rocksdb

Reviewers: emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13287
2013-10-04 11:59:26 -07:00
Dhruba Borthakur
87d6eb2f6b Implement apis in the Environment to clear out pages in the OS cache.
Summary:
Added a new api to the Environment that allows clearing out not-needed
pages from the OS cache. This will be helpful when the compressed
block cache replaces the OS cache.

Test Plan: EnvPosixTest.InvalidateCache

Reviewers: haobo

Reviewed By: haobo

CC: leveldb

Differential Revision: https://reviews.facebook.net/D13041
2013-09-23 22:05:03 -07:00
Rajat Goel
11c65021fb Revert "Minor fixes found while trying to compile it using clang on Mac OS X"
This reverts commit 5f2c136c32.
2013-09-15 23:01:26 -07:00
Rajat Goel
5f2c136c32 Minor fixes found while trying to compile it using clang on Mac OS X 2013-09-15 22:06:14 -07:00
Haobo Xu
1565dab809 [RocksDB] Enhance Env to support two thread pools LOW and HIGH
Summary:
this is the ground work for separating memtable flush jobs to their own thread pool.
Both SetBackgroundThreads and Schedule take a third parameter Priority to indicate which thread pool they are working on. The names LOW and HIGH are just identifiers for two different thread pools, and do not indicate a real difference in 'priority'. We can set the number of threads in the pools independently.
The thread pool implementation is refactored.

Test Plan: make check

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D12885
2013-09-12 16:15:36 -07:00
Haobo Xu
f2f4c8072f [RocksDB] Added nano second stopwatch and new perf counters to track block read cost
Summary: The purpose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or something else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transferring the bytes from the OS, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nanosecond stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call.

Test Plan: perf_context_test, also testing with viewstate shadow traffic.

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, xjin

Differential Revision: https://reviews.facebook.net/D12351
2013-09-07 21:14:54 -07:00
Dhruba Borthakur
1186192ed1 Replace include/leveldb with include/rocksdb.
Summary: Replace include/leveldb with include/rocksdb.

Test Plan:
make clean; make check
make clean; make release

Differential Revision: https://reviews.facebook.net/D12489
2013-08-23 10:51:00 -07:00
Haobo Xu
96be2c4ee0 [RocksDB] Add mmap_read option for db_stress
Summary: as title, also removed an incorrect assertion

Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0

Reviewers: dhruba, emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11367
2013-06-19 10:28:32 -07:00
Haobo Xu
3cc1af2062 [RocksDB] Option for incremental sync
Summary: This diff added an option to control the incremental sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise.

Test Plan: make check; db_bench

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11295
2013-06-18 15:00:32 -07:00
Haobo Xu
778e179046 [RocksDB] Sync file to disk incrementally
Summary:
During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic.
This diff simply asks the OS to sync data incrementally as it is written, in the background. The hope is that, at the final sync, most of the data is already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer compaction threads to saturate IO.
In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too.
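
The bookkeeping behind incremental syncing can be sketched as below. The class `IncrementalSyncTracker` is hypothetical and only decides when a background sync should be requested; the actual request to the OS is deliberately left out, and the fixed byte interval is an assumption made for this example.

```cpp
#include <cstdint>

// Hypothetical bookkeeping: decides when a range sync should be requested.
class IncrementalSyncTracker {
 public:
  explicit IncrementalSyncTracker(uint64_t bytes_per_sync)
      : bytes_per_sync_(bytes_per_sync) {}

  // Records an append of n bytes. Returns true when at least
  // bytes_per_sync bytes have accumulated since the last requested sync.
  // An interval of 0 disables incremental syncing.
  bool OnAppend(uint64_t n) {
    written_ += n;
    if (bytes_per_sync_ == 0 || written_ - last_sync_ < bytes_per_sync_) {
      return false;
    }
    last_sync_ = written_;
    return true;
  }

  uint64_t last_sync_offset() const { return last_sync_; }

 private:
  uint64_t bytes_per_sync_;
  uint64_t written_ = 0;    // total bytes appended so far
  uint64_t last_sync_ = 0;  // offset covered by the last requested sync
};
```

A writer would call `OnAppend` after each write and, when it returns true, ask the OS to start writing back the range up to `last_sync_offset()`.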

Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark.
What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller.
For the 800-byte write-only test:
Before the change: await oscillates between 10ms and 3ms
After the change: await ranges 1-3ms

Will test against read-modify-write workload too, see if high read latency P99 could be resolved.

Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions.

Test Plan: make check; db_bench; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11115
2013-06-12 12:53:59 -07:00
Haobo Xu
bdf1085944 [RocksDB] cleanup EnvOptions
Summary:
This diff simplifies EnvOptions by treating it as POD, similar to Options.
- virtual functions are removed and member fields are accessed directly.
- StorageOptions is removed.
- Options.allow_readahead and Options.allow_readahead_compactions are deprecated.
- Unused global variables are removed: useOsBuffer, useFsReadAhead, useMmapRead, useMmapWrite

Test Plan: make check; db_stress

Reviewers: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D11175
2013-06-12 11:17:19 -07:00
Haobo Xu
ab8d2f6ab2 [RocksDB] [Performance] Allow different posix advice to be applied to the same table file
Summary:
Current posix advice implementation ties up the access pattern hint with the creation of a file.
It is not possible to apply different advice for different access (random get vs compaction read),
without keeping two open files for the same table. This patch extended the RandomAccessFile interface
to accept new access hint at anytime. Particularly, we are able to set different access hint on the same
table file based on when/how the file is used.
Two options are added to set the access hint, after the file is first opened and after the file is being
compacted.

Test Plan: make check; db_stress; db_bench

Reviewers: dhruba

Reviewed By: dhruba

CC: MarkCallaghan, leveldb

Differential Revision: https://reviews.facebook.net/D10905
2013-05-30 19:08:44 -07:00
Vamsi Ponnekanti
760dd4750f [Kill randomly at various points in source code for testing]
Summary:
This is initial version. A few ways in which this could
be extended in the future are:
(a) Killing from more places in source code
(b) Hashing stack and using that hash in determining whether to crash.
    This is to avoid crashing more often at source lines that are executed
    more often.
(c) Raising exceptions or returning errors instead of killing
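
Illustration of the basic mechanism (a sketch only, with hypothetical names): each instrumented point calls a helper that kills the process with probability 1/odds. The choice of SIGTERM is an assumption; it is consistent with the "retncode -15" in the output below, since Python reports death by signal N as -N.

```cpp
#include <cassert>
#include <random>
#include <signal.h>
#include <unistd.h>

// Returns true roughly once per `odds` calls; odds <= 0 disables killing.
// Hypothetical helper, not the function added by this patch.
bool ShouldKill(int odds, std::mt19937* rng) {
  if (odds <= 0) return false;
  std::uniform_int_distribution<int> dist(0, odds - 1);
  return dist(*rng) == 0;
}

// Call this at the source locations where a crash should be simulated.
void MaybeKill(int odds, std::mt19937* rng) {
  if (ShouldKill(odds, rng)) {
    // Assumption: terminate with SIGTERM so the parent sees status -15.
    kill(getpid(), SIGTERM);
  }
}
```

Keeping the decision in ShouldKill() separate from the kill itself makes extension (c) straightforward: the caller can return an error instead of terminating.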

Test Plan:
This whole thing is for testing.

Here is part of output:

python2.7 tools/db_crashtest2.py -d 600
Running db_stress

db_stress retncode -15 output LevelDB version     : 1.5
Number of threads   : 32
Ops per thread      : 10000000
Read percentage     : 50
Write-buffer-size   : 4194304
Delete percentage   : 30
Max key             : 1000
Ratio #ops/#keys    : 320000
Num times DB reopens: 0
Batches/snapshots   : 1
Purge redundant %   : 50
Num keys per lock   : 4
Compression         : snappy
------------------------------------------------
No lock creation because test_batches_snapshots set
2013/04/26-17:55:17  Starting database operations
Created bg thread 0x7fc1f07ff700
... finished 60000 ops
Running db_stress

db_stress retncode -15 output LevelDB version     : 1.5
Number of threads   : 32
Ops per thread      : 10000000
Read percentage     : 50
Write-buffer-size   : 4194304
Delete percentage   : 30
Max key             : 1000
Ratio #ops/#keys    : 320000
Num times DB reopens: 0
Batches/snapshots   : 1
Purge redundant %   : 50
Num keys per lock   : 4
Compression         : snappy
------------------------------------------------
Created bg thread 0x7ff0137ff700
No lock creation because test_batches_snapshots set
2013/04/26-17:56:15  Starting database operations
... finished 90000 ops

Revert Plan: OK

Task ID: #2252691

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D10581
2013-05-21 18:21:49 -07:00
Kai Liu
958b9c80e1 Avoid global static initialization in Env::Default()
Summary:
Mark's task description from #2316777

Env::Default() comes from util/env_posix.cc

This is a static global.

static PosixEnv default_env;

Env* Env::Default() {
  return &default_env;
}

-----

These globals assume default_env was initialized first. I don't think that is safe or correct to do (http://stackoverflow.com/questions/1005685/c-static-initialization-order)

const string AutoRollLoggerTest::kTestDir(
test::TmpDir() + "/db_log_test");
const string AutoRollLoggerTest::kLogFile(
test::TmpDir() + "/db_log_test/LOG");
Env* AutoRollLoggerTest::env = Env::Default();
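
Illustration: one common way to sidestep the initialization-order problem is a function-local static, which is constructed the first time the function is called. This is a sketch with stand-in classes; the patch itself may use a different mechanism.

```cpp
#include <cassert>

// Stand-ins for the real classes, for illustration only.
class Env {
 public:
  virtual ~Env() {}
  static Env* Default();
};

class PosixEnv : public Env {};

// A function-local static is constructed the first time control reaches it,
// so a caller that runs during static initialization of another translation
// unit still gets a fully constructed object. C++11 also guarantees that
// this initialization is thread safe.
Env* Env::Default() {
  static PosixEnv default_env;
  return &default_env;
}

// A global like AutoRollLoggerTest::env can now be initialized safely.
Env* g_test_env = Env::Default();
```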

Test Plan:
run make clean && make && make check
But how can I know if it works in Ubuntu?

Reviewers: MarkCallaghan, chip

Reviewed By: chip

CC: leveldb, dhruba, haobo

Differential Revision: https://reviews.facebook.net/D10491
2013-04-22 18:10:28 -07:00
Dhruba Borthakur
3cb7bf8170 Initialize parameters in the constructor.
Summary:
RocksDB doesn't build on Ubuntu VM .. should be fixed with this patch.

g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

util/env_posix.cc:68:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:68:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer’
util/env_posix.cc:113:24: sorry, unimplemented: non-static data member initializers
util/env_posix.cc:113:24: error: ISO C++ forbids in-class initialization of non-const static member ‘use_os_buffer’

Test Plan: make check

Reviewers: sheki, leveldb

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D10461
2013-04-22 14:41:45 -07:00
Mayank Agarwal
6594fef7ef Exit and Join the background compaction threads while running rocksdb tests
Summary:
The background compaction threads were never exited and therefore caused
memory leaks while running rocksdb tests. Changed the PosixEnv destructor to exit and join them, and changed the tests likewise.
The memory leaked has been reduced from 320 bytes to 64 bytes in all the tests. The 64
bytes are related to
pthread_exit, but we still have to figure out why. The stack trace right now with
table_test.cc = 64 bytes in 1 blocks are possibly lost in loss record 4 of 5
   at 0x475D8C: malloc (jemalloc.c:914)
   by 0x400D69E: _dl_map_object_deps (dl-deps.c:505)
   by 0x4013393: dl_open_worker (dl-open.c:263)
   by 0x400F015: _dl_catch_error (dl-error.c:178)
   by 0x4013B2B: _dl_open (dl-open.c:569)
   by 0x5D3E913: do_dlopen (dl-libc.c:86)
   by 0x400F015: _dl_catch_error (dl-error.c:178)
   by 0x5D3E9D6: __libc_dlopen_mode (dl-libc.c:47)
   by 0x5048BF3: pthread_cancel_init (unwind-forcedunwind.c:53)
   by 0x5048DC9: _Unwind_ForcedUnwind (unwind-forcedunwind.c:126)
   by 0x5046D9F: __pthread_unwind (unwind.c:130)
   by 0x50413A4: pthread_exit (pthreadP.h:289)

Test Plan: make all check

Reviewers: dhruba, sheki, haobo

Reviewed By: dhruba

CC: leveldb, chip

Differential Revision: https://reviews.facebook.net/D9573
2013-04-10 14:50:25 -07:00
heyongqiang
e21ba94a69 Set FD_CLOEXEC after each file open
Summary: as subject. This is causing problems in adsconv. Ideally, this flag should be set in open, but that is only supported in Linux kernel ≥2.6.23 and glibc ≥2.7.
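
Illustration: a minimal sketch of setting the flag after the descriptor exists, using fcntl(). SetCloseOnExec() is a hypothetical helper name, not necessarily what the patch adds.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Sets FD_CLOEXEC on fd while preserving any other descriptor flags.
// Returns true on success, false if fd is invalid or fcntl() fails.
bool SetCloseOnExec(int fd) {
  int flags = fcntl(fd, F_GETFD);
  if (flags < 0) return false;
  return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == 0;
}
```

Note that a separate fcntl() call leaves a short window between open and the flag being set, during which another thread could fork and exec. Passing O_CLOEXEC to open closes that window, which is why it is the better choice where the kernel and libc support it.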

Test Plan:
db_test

run db_test

Reviewers: dhruba, MarkCallaghan, haobo

Reviewed By: dhruba

CC: leveldb, chip

Differential Revision: https://reviews.facebook.net/D10089
2013-04-10 14:44:06 -07:00
Mayank Agarwal
adb4e4509b Fixing delete in env_posix.cc
Summary: Was deleting incorrectly. Should delete the whole array.
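
Illustration of the class of bug (a generic sketch, not the code from env_posix.cc): memory obtained with new[] must be released with delete[].

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies a C string into a heap array, measures it, and frees the array.
std::size_t CopyAndMeasure(const char* src) {
  const std::size_t n = std::strlen(src);
  char* buf = new char[n + 1];  // array form of new
  std::memcpy(buf, src, n + 1);
  const std::size_t len = std::strlen(buf);
  // Must be delete[] to match new[]. Plain "delete buf" is undefined
  // behavior and is what valgrind reports as "Mismatched free/delete".
  delete[] buf;
  return len;
}
```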

Test Plan: make;valgrind stops complaining about Mismatched free/delete

Reviewers: dhruba, sheki

Reviewed By: sheki

CC: leveldb, haobo

Differential Revision: https://reviews.facebook.net/D10059
2013-04-09 11:49:35 -07:00
Haobo Xu
d815082159 [RocksDB] env_posix cleanup
Summary:
1. SetBackgroundThreads was not thread safe
2. queue_size_ does not seem necessary
3. moved condition signal after shared state change. Even though the original
   order is in practice ok (because the mutex is still held), it looks fishy
   and non-intuitive.
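
Illustration of point 3 (a generic pthread sketch with hypothetical names, not the env_posix.cc code): update the shared state first, then signal, all while holding the mutex.

```cpp
#include <cassert>
#include <pthread.h>

// Shared state protected by mu; the condition variable guards "open".
struct Gate {
  pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
  bool open = false;
};

// Change the shared state, then signal. Both happen under the mutex, so a
// waiter can never observe the signal without also seeing the new state.
void OpenGate(Gate* g) {
  pthread_mutex_lock(&g->mu);
  g->open = true;
  pthread_cond_signal(&g->cv);
  pthread_mutex_unlock(&g->mu);
}

// Blocks until the gate is open.
void WaitForGate(Gate* g) {
  pthread_mutex_lock(&g->mu);
  while (!g->open) {  // the loop guards against spurious wakeups
    pthread_cond_wait(&g->cv, &g->mu);
  }
  pthread_mutex_unlock(&g->mu);
}
```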

Test Plan: make check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb, zshao

Differential Revision: https://reviews.facebook.net/D9825
2013-04-02 11:36:51 -07:00
Mayank Agarwal
38d54832f7 Initialize variable in constructor for PosixEnv::checkedDiskForMmap_
Summary: This caused compilation problems on some gcc platforms during the third-party release

Test Plan: make

Reviewers: sheki

Reviewed By: sheki

Differential Revision: https://reviews.facebook.net/D9627
2013-03-21 11:26:50 -07:00
Dhruba Borthakur
ad96563b79 Ability to configure bufferedio-reads, filesystem-readaheads and mmap-read-write per database.
Summary:
This patch allows an application to specify whether to use bufferedio,
reads-via-mmaps and writes-via-mmaps per database. Earlier, there
was a global static variable that was used to configure this functionality.

The default setting remains the same (and is backward compatible):
 1. use bufferedio
 2. do not use mmaps for reads
 3. use mmap for writes
 4. use readaheads for reads needed for compaction

I also added a parameter to db_bench to be able to explicitly specify
whether to do readaheads for compactions or not.

Test Plan: make check

Reviewers: sheki, heyongqiang, MarkCallaghan

Reviewed By: sheki

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9429
2013-03-20 23:14:03 -07:00
Abhishek Kona
1ba5abca97 Use posix_fallocate as default.
Summary:
Ftruncate does not throw an error on disk-full. This causes Sig-bus in
the case where the database tries to issue a Put call on a full-disk.

Use posix_fallocate for allocation instead of truncate.
Add a check to use MMaped files only on ext4, xfs and tmpfs, as
posix_fallocate is very slow on ext3 and older.
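
Illustration: a minimal sketch of the allocation call, with a hypothetical helper name. ftruncate() only sets the file length and may leave the file sparse, so a later store through a mapping can fault when no block can be allocated. posix_fallocate() reserves the blocks up front and reports a full disk as an error code.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Reserves real disk space for the first `len` bytes of fd.
// Returns 0 on success or an errno value such as ENOSPC.
int ReserveSpace(int fd, off_t len) {
  return posix_fallocate(fd, 0, len);
}
```

A caller would check the return value before mapping the region and surface a non-zero result as an IO error instead of crashing later.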

Test Plan: make all check

Reviewers: dhruba, chip

Reviewed By: dhruba

CC: adsharma, leveldb

Differential Revision: https://reviews.facebook.net/D9291
2013-03-13 13:50:26 -07:00
Abhishek Kona
c41f1e995c Codemod NULL to nullptr
Summary:
scripted NULL to nullptr in
* include/leveldb/
* db/
* table/
* util/

Test Plan: make all check

Reviewers: dhruba, emayanke

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D9003
2013-02-28 18:04:58 -08:00
Kosie van der Merwe
4dcc0c89f4 Fixed cache key for block cache
Summary:
Added a function to `RandomAccessFile` to generate a unique ID for that file. Currently only `PosixRandomAccessFile` implements this behaviour, and only on Linux.

Changed how key is generated in `Table::BlockReader`.

Added tests to check whether the unique ID is stable, unique and not a prefix of another unique ID. Added tests to see that `Table` uses the cache more efficiently.

Test Plan: make check

Reviewers: chip, vamsi, dhruba

Reviewed By: chip

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8145
2013-01-31 15:20:24 -08:00
Chip Turner
2c3565285e Add OS_LINUX ifdef protections around fallocate parts
Summary: fallocate is linux only, so let's protect it with ifdef's

Test Plan: make

Reviewers: sheki, dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D8223
2013-01-28 12:03:35 -08:00
Chip Turner
3dafdfb2c4 Use fallocate to prevent excessive allocation of sst files and logs
Summary:
On some filesystems, pre-allocation can be a considerable
amount of space.  xfs in our production environment pre-allocates by
1GB, for instance.  By using fallocate to inform the kernel of our
expected file sizes, we eliminate this wastage (that isn't recovered
until the file is closed which, in the case of LOG files, can be a
considerable amount of time).
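
Illustration: a sketch of the kind of call involved, assuming FALLOC_FL_KEEP_SIZE is used so that the visible file length does not change. PreallocateKeepSize() is a hypothetical name, and the __linux__ guard stands in for whatever macro the build uses.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/falloc.h>  // FALLOC_FL_KEEP_SIZE
#endif

// Tells the kernel how large the file is expected to grow without changing
// its reported size. Returns 0 on success, -1 with errno set on failure.
int PreallocateKeepSize(int fd, off_t expected_bytes) {
#ifdef __linux__
  return fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, expected_bytes);
#else
  (void)fd;
  (void)expected_bytes;
  errno = EOPNOTSUPP;  // fallocate() is Linux specific
  return -1;
#endif
}
```

Some filesystems do not implement fallocate() and fail with EOPNOTSUPP, so a caller would likely treat failure as non-fatal and carry on without the hint.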

Test Plan:
created an xfs loopback filesystem, mounted with
allocsize=4M, and ran db_stress.  LOG file without this change was 4M,
and with it it was 128k then grew to normal size.

Reviewers: dhruba

Reviewed By: dhruba

CC: adsharma, leveldb

Differential Revision: https://reviews.facebook.net/D7953
2013-01-24 12:25:13 -08:00
Chip Turner
2fdf91a4f8 Fix a number of object lifetime/ownership issues
Summary:
Replace manual memory management with std::unique_ptr in a
number of places; not exhaustive, but this fixes a few leaks with file
handles as well as clarifies semantics of the ownership of file handles
with log classes.
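
Illustration of the ownership pattern (a generic sketch using a C FILE handle, not the classes touched by this diff): the handle is closed exactly once, when its owner goes out of scope, and ownership transfers are explicit.

```cpp
#include <cassert>
#include <cstdio>
#include <memory>
#include <utility>

// Deleter that closes the handle when the owning unique_ptr is destroyed.
struct FileCloser {
  void operator()(std::FILE* f) const {
    if (f != nullptr) std::fclose(f);
  }
};

using UniqueFile = std::unique_ptr<std::FILE, FileCloser>;

// Returns an owning handle, or an empty pointer if the open fails.
UniqueFile OpenFile(const char* path, const char* mode) {
  return UniqueFile(std::fopen(path, mode));
}
```

A class that takes a UniqueFile by value (via std::move) documents in its signature that it now owns the handle, which is the kind of clarity the summary refers to.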

Test Plan: db_stress, make check

Reviewers: dhruba

Reviewed By: dhruba

CC: zshao, leveldb, heyongqiang

Differential Revision: https://reviews.facebook.net/D8043
2013-01-23 16:54:11 -08:00
Chip Turner
a2dcd79c1e Add optional clang compile mode
Summary:
clang is an alternate compiler based on llvm.  It produces
nicer error messages and finds some bugs that gcc doesn't, such as the
size_t change in this file (which caused some write return values to be
misinterpreted!)
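
Illustration: the diff itself is not shown here, but a typical instance of this class of bug is storing the result of write() in a size_t. write() returns ssize_t and uses -1 for errors, and -1 converted to size_t becomes a huge positive value that looks like a successful write. A sketch that keeps the signed type:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <unistd.h>

// Writes all n bytes to fd. Returns false on error.
bool WriteAll(int fd, const char* data, std::size_t n) {
  while (n > 0) {
    // Keep the signed type: with size_t here, the error value -1 would
    // wrap around and the check below could never fire.
    ssize_t r = write(fd, data, n);
    if (r < 0) {
      if (errno == EINTR) continue;  // interrupted, retry
      return false;
    }
    data += r;
    n -= static_cast<std::size_t>(r);
  }
  return true;
}
```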

Clang isn't the default; to try it, do "USE_CLANG=1 make" or "export
USE_CLANG=1" then make as normal

Test Plan: "make check" and "USE_CLANG=1 make check"

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D7899
2013-01-15 18:48:37 -08:00
Kosie van der Merwe
d6e873f22f Added clearer error message for failure to create db directory in DBImpl::Recover()
Summary:
Changed CreateDir() to CreateDirIfMissing() so a directory that already exists now causes an error.

Fixed CreateDirIfMissing() and added Env.DirExists()

Test Plan:
make check to test for regressions

Ran the following to test if the error message is not about lock files not existing
./db_bench --db=dir/testdb

After creating a file "testdb", ran the following to see if it failed with sane error message:
./db_bench --db=testdb

Reviewers: dhruba, emayanke, vamsi, sheki

Reviewed By: emayanke

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7707
2013-01-07 10:11:18 -08:00
Dhruba Borthakur
38671c4d54 Fix a race condition while processing tasks by background threads.
Summary:
Suppose you submit 100 background tasks one after another. The first
enqueued task finds that the queue is empty and wakes up one worker thread.
Now suppose that the remaining 99 work items are enqueued; they do not
wake up any worker threads because the queue is already non-empty.
This causes a situation when there are 99 tasks in the task queue but
only one worker thread is processing a task while the remaining
worker threads are waiting.
The fix is to always wakeup one worker thread while enqueuing a task.
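
Illustration of the fix (a sketch using std::thread with hypothetical names; the real code in env_posix.cc uses pthreads): Schedule() signals on every enqueue, not only when the queue goes from empty to non-empty.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
 public:
  explicit ThreadPool(int num_threads) {
    for (int i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  // Lets the workers drain the queue, then joins every one of them.
  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      exiting_ = true;
    }
    cv_.notify_all();
    for (std::thread& t : workers_) t.join();
  }

  void Schedule(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mu_);
    queue_.push_back(std::move(task));
    // Always wake one worker, even if the queue was already non-empty.
    // Signalling only on the empty -> non-empty transition leaves the
    // other workers asleep while tasks pile up.
    cv_.notify_one();
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return exiting_ || !queue_.empty(); });
        if (queue_.empty()) return;  // exiting and nothing left to run
        task = std::move(queue_.front());
        queue_.pop_front();
      }
      task();  // run outside the lock so other workers can dequeue
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool exiting_ = false;
  std::vector<std::thread> workers_;
};
```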

I also added a check to count the number of elements in the queue
to help in debugging.

Test Plan: make clean check.

Reviewers: chip

Reviewed By: chip

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7203
2012-12-09 17:15:27 -08:00
sheki
d4627e6de4 Move WAL files to archive directory, instead of deleting.
Summary:
Create a directory "archive" in the DB directory.
During DeleteObsoleteFiles, move the WAL files (*.log) to the archive directory,
instead of deleting.
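
Illustration: a minimal sketch of archiving by rename, assuming the archive directory lives directly under the DB directory as described above. ArchiveLogFile() is a hypothetical helper, not the function added by this diff.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

// Moves db_dir/file_name to db_dir/archive/file_name instead of deleting it.
// Creates the archive directory on first use. Returns true on success.
bool ArchiveLogFile(const std::string& db_dir, const std::string& file_name) {
  const std::string archive_dir = db_dir + "/archive";
  if (mkdir(archive_dir.c_str(), 0755) != 0 && errno != EEXIST) {
    return false;
  }
  const std::string from = db_dir + "/" + file_name;
  const std::string to = archive_dir + "/" + file_name;
  // rename() within one filesystem is atomic and does not copy any data.
  return std::rename(from.c_str(), to.c_str()) == 0;
}
```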

Test Plan: Created a DB using DB_Bench. Reopened it. Checked if files move.

Reviewers: dhruba

Reviewed By: dhruba

Differential Revision: https://reviews.facebook.net/D6975
2012-11-28 17:28:08 -08:00
Abhishek Kona
d29f181923 Fix all the lint errors.
Summary:
Scripted and removed all trailing spaces and converted all tabs to
spaces.

Also fixed other lint errors.
All lint errors from this point of time should be taken seriously.

Test Plan: make all check

Reviewers: dhruba

Reviewed By: dhruba

CC: leveldb

Differential Revision: https://reviews.facebook.net/D7059
2012-11-28 17:18:41 -08:00
Dhruba Borthakur
1ca0584345 This is the mega-patch multi-threaded compaction
published in https://reviews.facebook.net/D5997.

Summary:
This patch allows compaction to occur in multiple background threads
concurrently.

If a manual compaction is issued, the system falls back to a
single-compaction-thread model. This is done to ensure correctness
and simplicity of code. When the manual compaction is finished,
the system resumes its concurrent-compaction mode automatically.

The updates to the manifest are done via group-commit approach.

Test Plan: run db_bench
2012-10-19 14:00:53 -07:00
Dhruba Borthakur
c1006d4276 A configurable option to write data using write instead of mmap.
Summary:
We have seen that reading data via the pread call (instead of
mmap) is much faster on Linux 2.6.x kernels. This patch makes
an equivalent option to switch off mmaps for the write path
as well.

db_bench --mmap_write=0 will use write() instead of mmap() to
write data to a file.

This change is backward compatible: the default
option is to continue using mmap for writing to a file.

Test Plan: "make check all"

Differential Revision: https://reviews.facebook.net/D5781
2012-10-03 17:08:13 -07:00
Dhruba Borthakur
ae36e509f8 The BackupAPI should also list the length of the manifest file.
Summary:
The GetLiveFiles() api lists the set of sst files and the current
MANIFEST file. But the database continues to append new data to the
MANIFEST file even when the application is backing it up to the
backup location. This means that the database-version that is
stored in the MANIFEST FILE in the backup location
does not correspond to the sst files returned by GetLiveFiles.

This API adds a new parameter to GetLiveFiles. This new parameter
returns the current size of the MANIFEST file.
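
Illustration of how a backup tool might use the returned size (a sketch under the assumption that the tool copies the MANIFEST itself; CopyFilePrefix() is a hypothetical helper): copy only the first N bytes, so records appended after the GetLiveFiles call are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Copies exactly the first `length` bytes of src to dst.
// Returns false if src is shorter than `length` or an IO error occurs.
bool CopyFilePrefix(const std::string& src, const std::string& dst,
                    uint64_t length) {
  std::ifstream in(src, std::ios::binary);
  std::ofstream out(dst, std::ios::binary | std::ios::trunc);
  if (!in || !out) return false;
  char buf[4096];
  uint64_t remaining = length;
  while (remaining > 0) {
    const uint64_t chunk = remaining < sizeof(buf) ? remaining : sizeof(buf);
    in.read(buf, static_cast<std::streamsize>(chunk));
    const std::streamsize got = in.gcount();
    if (got <= 0) return false;  // source ended before `length` bytes
    out.write(buf, got);
    remaining -= static_cast<uint64_t>(got);
  }
  out.flush();
  return static_cast<bool>(out);
}
```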

Test Plan: Unit test attached.

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5631
2012-09-25 03:13:25 -07:00
Dhruba Borthakur
9e84834eb4 Allow a configurable number of background threads.
Summary:
The background threads are necessary for compaction.
For slower storage, it might be necessary to have more than
one compaction thread per DB. This patch allows creating
a configurable number of worker threads.
The default remains at 1 (to maintain backward compatibility).

Test Plan:
run all unit tests. changes to db-bench coming in
a separate patch.

Reviewers: heyongqiang

Reviewed By: heyongqiang

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5559
2012-09-19 15:51:08 -07:00
heyongqiang
b85cdca690 add a global var leveldb::useMmapRead to enable mmap
Summary:
as subject. this can be used for benchmarking.
If we want it for some cases, we can do more changes to make this part of the option.

Test Plan: db_test

Reviewers: dhruba

CC: MarkCallaghan

Differential Revision: https://reviews.facebook.net/D5451
2012-09-16 22:07:35 -07:00
Mark Callaghan
33323f2111 Remove use of mmap for random reads
Summary:
Reads via mmap on concurrent workloads are much slower than pread.
For example on a 24-core server with storage that can do 100k IOPS or more
I can get no more than 10k IOPS with mmap reads and 32+ threads.
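
Illustration of the pread path (a sketch with a hypothetical helper, not the code in this diff): pread() takes the offset as an argument and does not move the descriptor's file position, so many threads can read from one shared fd without locking.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Reads up to n bytes starting at `offset` into *out.
// Returns false on error; a short result means end of file was reached.
bool ReadAt(int fd, off_t offset, std::size_t n, std::string* out) {
  out->assign(n, '\0');
  ssize_t r = pread(fd, &(*out)[0], n, offset);
  if (r < 0) {
    out->clear();
    return false;
  }
  out->resize(static_cast<std::size_t>(r));
  return true;
}
```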

Test Plan: db_bench benchmarks

Reviewers: dhruba, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5433
2012-09-14 16:43:50 -07:00
Dhruba Borthakur
93f4952089 Ability to switch off filesystem read-aheads
Summary:
Ability to switch off filesystem read-aheads. This change is
backward-compatible: the default setting is to allow file
system read-aheads.

Test Plan: run benchmarks

Reviewers: heyongqiang, adsharma

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5391
2012-09-13 12:09:56 -07:00
Dhruba Borthakur
4028ae7d31 Do not cache readahead-pages in the OS cache.
Summary:
When posix_fadvise(offset, offset) is used, it frees up only those
pages in that specified range. But the filesystem could have done some
read-aheads and those get cached in the OS cache.

Do not cache readahead-pages in the OS cache.

Test Plan: run db_bench benchmark.

Reviewers: vamsi, heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5379
2012-09-13 10:56:02 -07:00
Dhruba Borthakur
407727b75f Fix compiler warnings. Use uint64_t instead of uint.
Summary: Fix compiler warnings. Use uint64_t instead of uint.

Test Plan: build using -Wall

Reviewers: heyongqiang

Reviewed By: heyongqiang

Differential Revision: https://reviews.facebook.net/D5355
2012-09-12 14:42:36 -07:00