162 Commits

Author SHA1 Message Date
Mikhail Antonov
7fe3b32896 Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).

This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.

From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".

This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.

For now, what's done here according to initial discussions:

Preserving deletes:
 - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
 - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
 - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.

Iterator changes:
 - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.

TableCache changes:
 - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.

What's left:

 - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999

Differential Revision: D6175602

Pulled By: mikhail-antonov

fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:56:43 -07:00
Yi Wu
66a2c44ef4 Add DB::Properties::kEstimateOldestKeyTime
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.

My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842

Differential Revision: D5770600

Pulled By: yiwu-arbug

fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
2017-10-23 15:27:27 -07:00
Artem Danilov
8a6708f5f2 Extend property map with compaction stats
Summary:
This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch.
Closes https://github.com/facebook/rocksdb/pull/2794

Differential Revision: D5717676

Pulled By: Tema

fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
2017-08-30 15:26:55 -07:00
Aaron G
7848f0b24c add VerifyChecksum() to db.h
Summary:
We need a tool to check any sst file corruption in the db.
It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status.
Closes https://github.com/facebook/rocksdb/pull/2498

Differential Revision: D5324269

Pulled By: lightmark

fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
2017-08-09 15:58:13 -07:00
Siying Dong
21696ba502 Replace dynamic_cast<>
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645

Differential Revision: D5502723

Pulled By: siying

fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
2017-07-28 16:27:16 -07:00
Sagar Vemuri
72502cf227 Revert "comment out unused parameters"
Summary:
This reverts the previous commit 1d7048c5985e60be8e356663ec3cb6d020adb44d, which broke the build.

Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627

Differential Revision: D5476473

Pulled By: sagar0

fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
2017-07-21 18:26:26 -07:00
Victor Gao
1d7048c598 comment out unused parameters
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.

Reviewed By: igorsugak

Differential Revision: D5454343

fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
2017-07-21 14:57:44 -07:00
Siying Dong
3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Maysam Yabandeh
499ebb3ab5 Optimize for serial commits in 2PC
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)

The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.

Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.

Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345

Differential Revision: D5210732

Pulled By: maysamyabandeh

fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
2017-06-24 14:11:29 -07:00
yizhu.sun
f5ba131bf8 Fixed some spelling mistakes
Summary: Closes https://github.com/facebook/rocksdb/pull/2314

Differential Revision: D5079601

Pulled By: sagar0

fbshipit-source-id: ae5696fd735718f544435c64c3179c49b8c04349
2017-05-17 23:12:36 -07:00
Mikhail Antonov
ba685a472a Support ingest_behind for IngestExternalFile
Summary:
First cut for early review; there are few conceptual points to answer and some code structure issues.

For conceptual points -

 - restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params
 - we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise?
 - Modifying AssignLevelForIngestedFile seems the place we we'd handle that.

On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites.
Closes https://github.com/facebook/rocksdb/pull/2144

Differential Revision: D4873732

Pulled By: lightmark

fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
2017-05-17 11:42:42 -07:00
Yi Wu
2cd00773c7 Add bulk create/drop column family API
Summary:
Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time.

The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO.
Closes https://github.com/facebook/rocksdb/pull/2248

Differential Revision: D5001578

Pulled By: yiwu-arbug

fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
2017-05-07 23:20:46 -07:00
Siying Dong
c49d704656 Add DB:ResetStats()
Summary:
Add a function to allow users to reset internal stats without restarting the DB.
Closes https://github.com/facebook/rocksdb/pull/2167

Differential Revision: D4907939

Pulled By: siying

fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68
2017-04-18 16:56:48 -07:00
Siying Dong
8f47a97512 File level histogram should be printed per CF, not per DB
Summary:
Currently level histogram is only printed out for DB stats and for default CF. This is confusing. Change to print for every CF instead.
Closes https://github.com/facebook/rocksdb/pull/2126

Differential Revision: D4865373

Pulled By: siying

fbshipit-source-id: 1c853e0ac66e00120ee931cabc9daf69ccc2d577
2017-04-11 08:42:03 -07:00
Andrew Kryczka
4e0065015d make all DB::Get overloads virtual
Summary:
some fbcode services override it, we need to keep it virtual.

original change: #1756
Closes https://github.com/facebook/rocksdb/pull/2065

Differential Revision: D4808123

Pulled By: ajkr

fbshipit-source-id: 5eaeea7
2017-03-30 23:39:14 -07:00
Siying Dong
67d7623794 Expose the stalling information through DB::GetProperty()
Summary:
Add two DB properties: rocksdb.actual_delayed_write_rate and rocksdb.is_write_stooped, for people to know whether current writes are being throttled.
Closes https://github.com/facebook/rocksdb/pull/2043

Differential Revision: D4782975

Pulled By: siying

fbshipit-source-id: 6b2f5cf
2017-03-29 11:54:20 -07:00
Maysam Yabandeh
21d8c31796 remove LIKELY from public headers
Summary:
Fixes #2008
Closes https://github.com/facebook/rocksdb/pull/2009

Differential Revision: D4739885

Pulled By: maysamyabandeh

fbshipit-source-id: 36fba5d
2017-03-20 12:54:14 -07:00
Maysam Yabandeh
11526252cc Pinnableslice (2nd attempt)
Summary:
PinnableSlice

    Summary:
    Currently the point lookup values are copied to a string provided by the
    user. This incures an extra memcpy cost. This patch allows doing point lookup
    via a PinnableSlice which pins the source memory location (instead of
    copying their content) and releases them after the content is consumed
    by the user. The old API of Get(string) is translated to the new API
    underneath.

    Here is the summary for improvements:

    value 100 byte: 1.8% regular, 1.2% merge values
    value 1k byte: 11.5% regular, 7.5% merge values
    value 10k byte: 26% regular, 29.9% merge values
    The improvement for merge could be more if we extend this approach to
    pin the merge output and delay the full merge operation until the user
    actually needs it. We have put that for future work.

    PS:
    Sometimes we observe a small decrease in performance when switching from
    t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756

Differential Revision: D4391738

Pulled By: maysamyabandeh

fbshipit-source-id: 6f3edd3
2017-03-13 11:54:10 -07:00
Islam AbdelRahman
1560b2f5f0 Temporarly return deprecated functions to fix MongoRocks build
Summary:
MongoRocks is still using some deprecated functions, return them temporarily
Closes https://github.com/facebook/rocksdb/pull/1892

Differential Revision: D4592451

Pulled By: IslamAbdelRahman

fbshipit-source-id: 5e6be3e
2017-02-21 12:54:11 -08:00
Islam AbdelRahman
7ab0051835 Remove deprecated DB::AddFile and DB::CompactRange
Summary:
Remove functions that we deprecated long time ago in db.h
Closes https://github.com/facebook/rocksdb/pull/1878

Differential Revision: D4576521

Pulled By: IslamAbdelRahman

fbshipit-source-id: dfddad1
2017-02-17 10:54:13 -08:00
Vitaliy Liptchinsky
1aaa898cf1 Adding GetApproximateMemTableStats method
Summary:
Added method that returns approx num of entries as well as size for memtables.
Closes https://github.com/facebook/rocksdb/pull/1841

Differential Revision: D4511990

Pulled By: VitaliyLi

fbshipit-source-id: 9a4576e
2017-02-06 14:54:16 -08:00
oranagra
b96372dead improving the C wrapper
Summary:
- rocksdb_property_int (so that we don't have to parse strings)
- and rocksdb_set_options (to allow controlling options via strings)
- a few other missing options exposed
- a documentation comment fix
Closes https://github.com/facebook/rocksdb/pull/1793

Differential Revision: D4456569

Pulled By: yiwu-arbug

fbshipit-source-id: 9f1fac1
2017-01-27 17:39:16 -08:00
Vitaliy Liptchinsky
e840213d6e Change DB::GetApproximateSizes for more flexibility needed for MyRocks
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787

Differential Revision: D4441111

Pulled By: IslamAbdelRahman

fbshipit-source-id: c11f4c3
2017-01-20 09:39:11 -08:00
Maysam Yabandeh
d0ba8ec8f9 Revert "PinnableSlice"
Summary:
This reverts commit 54d94e9c2cc0bf6eeb2a165ada33fa9c174f0b16.

The pull request was landed by mistake.
Closes https://github.com/facebook/rocksdb/pull/1755

Differential Revision: D4391678

Pulled By: maysamyabandeh

fbshipit-source-id: 36d5149
2017-01-08 14:24:12 -08:00
Maysam Yabandeh
54d94e9c2c PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.

 Here is the summary for improvements:
 1. value 100 byte: 1.8%  regular, 1.2% merge values
 2. value 1k   byte: 11.5% regular, 7.5% merge values
 3. value 10k byte: 26% regular,    29.9% merge values

 The improvement for merge could be more if we extend this approach to
 pin the merge output and delay the full merge operation until the user
 actually needs it. We have put that for future work.

PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732

Differential Revision: D4374613

Pulled By: maysamyabandeh

fbshipit-source-id: a077f1a
2017-01-08 13:54:13 -08:00
Siying Dong
438f22bc56 Fix bug of Checkpoint loses recent transactions with 2PC
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724

Differential Revision: D4368319

Pulled By: siying

fbshipit-source-id: cc2c746
2016-12-28 12:24:16 -08:00
Andrew Kryczka
489d142808 DeleteRange interface
Summary:
Expose DeleteRange() interface since we think the implementation is functionally correct now.
Closes https://github.com/facebook/rocksdb/pull/1503

Differential Revision: D4171921

Pulled By: ajkr

fbshipit-source-id: 5e21c98
2016-11-15 15:24:16 -08:00
Maysam Yabandeh
361010d447 Exporting compaction stats in the form of a map
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477

Differential Revision: D4149836

Pulled By: maysamyabandeh

fbshipit-source-id: b3df19f
2016-11-11 20:54:14 -08:00
Islam AbdelRahman
869ae5d786 Support IngestExternalFile (remove AddFile restrictions)
Summary:
Changes in the diff

API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API

Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers

Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob

Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)

Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong

Reviewed By: sdong

Subscribers: jkedgar, hcz, andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D65061
2016-10-20 17:05:32 -07:00
Yi Wu
e29d3b67c2 Make max_background_compactions and base_background_compactions dynamic changeable
Summary:
Add DB::SetDBOptions to dynamic change max_background_compactions and base_background_compactions.
I'll add more dynamic changeable options soon.

Test Plan: unit test.

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64749
2016-10-14 12:25:39 -07:00
Yi Wu
9ed928e7a9 Split DBOptions into ImmutableDBOptions and MutableDBOptions
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.

Test Plan:
  make all check

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D64065
2016-09-23 16:34:04 -07:00
Yi Wu
17f76fc564 DB::GetOptions() reflect dynamic changed options
Summary: DB::GetOptions() reflect dynamic changed options.

Test Plan: See the new unit test.

Reviewers: yhchiang, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63903
2016-09-14 22:10:28 -07:00
Yi Wu
a88677d2cf Remove ImmutableCFOptions from public API
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.

Test Plan:
  make all check

Reviewers: IslamAbdelRahman, sdong, yhchiang

Reviewed By: yhchiang

Subscribers: yhchiang, andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63177
2016-09-02 14:16:31 -07:00
Zongzhi Chen
98d0b78eac Added check_snapshot option in the DB's AddFile function (#1261)
* Added check_snapshot option in the DB's AddFile function

* change check_snapshot to skip_snapshot_check

* add unit test for skip_snapshot_check

* Add skip_snapshot_check comment
2016-08-09 18:14:13 -07:00
Aaron Gao
343304e1d3 Use StopWatch to do statistic job in db_impl_add_file.cc
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.

Test Plan: make all check -j64

Reviewers: sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D61239
2016-08-02 14:53:29 -07:00
Jay Edgar
cdc4eb6892 Add a GetComparator() function to the ColumnFamilyHandle base class so that the user's comparator can be retrieved.
Summary: MyRocks is adding support for the user of the SstFileWriter which needs a comparator.  It would be more convenient to get the comparator from the column family (which already has to have it) than to have caller keep track of it.

Test Plan: Standard tests (adding one for the new method)

Reviewers: IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba

Differential Revision: https://reviews.facebook.net/D61155
2016-08-02 14:34:57 -07:00
Aaron Gao
dda6c72ac8 Add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to db.h
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.

Test Plan: make all check -j64

Reviewers: andrewkr, yiwu, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D60765
2016-07-13 17:59:25 -07:00
Aaron Gao
8e6b38d895 update DB::AddFile to ingest list of sst files
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).

Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held

Reviewers: andrewkr, sdong, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D58587
2016-07-11 10:43:12 -07:00
Andrew Kryczka
3b7ed677de ColumnFamilyOptions API [CF + RepairDB part 3/3]
Summary:
Overload RepairDB to take vector-of-ColumnFamilyDescriptor, which tells
us CF name + options. Also takes a ColumnFamilyOptions for unspecified column
families encountered during the repair.

One potentially confusing thing is that we store options in the constructor and
don't invoke AddColumnFamily() until discovering the CF in ScanTable. This is
because we don't know the CF ID until we find a table belonging to that CF.

Depends on D59781.

Test Plan:
  $ ./repair_test

Reviewers: yhchiang, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D59853
2016-06-24 16:29:43 -07:00
Islam AbdelRahman
fa813f7478 Update DB::AddFile() to ingest the file to the lowest possible level
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level

Test Plan: unit tests

Reviewers: jkedgar, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, yoshinorim

Differential Revision: https://reviews.facebook.net/D59637
2016-06-21 17:57:59 -07:00
Andrew Kryczka
73a847ef89 Add per-level compression ratio property
Summary:
This is needed so we can measure compression ratio improvements
achieved by D52287.

The property compares raw data size against the total file size for a given
level. If the level is empty it should return 0.0.

Test Plan: new unit test

Reviewers: IslamAbdelRahman, yhchiang, sdong

Reviewed By: sdong

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D56967
2016-04-20 18:46:54 -07:00
sdong
294bdf9ee2 Change Property name from "rocksdb.current_version_number" to "rocksdb.current-super-version-number"
Summary: I realized I again is wrong about the naming convention. Let me change it to the correct one.

Test Plan: Run unit tests.

Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr

Reviewed By: andrewkr

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D55041
2016-03-04 18:15:29 -08:00
sdong
432f3adf2c Add DB Property "rocksdb.current_version_number"
Summary: Add a DB Property "rocksdb.current_version_number" for users to monitor version changes and stale iterators.

Test Plan: Add a unit test.

Reviewers: andrewkr, yhchiang, kradhakrishnan, IslamAbdelRahman

Reviewed By: IslamAbdelRahman

Subscribers: leveldb, dhruba

Differential Revision: https://reviews.facebook.net/D54927
2016-03-01 10:55:40 -08:00
Baraa Hamodi
21e95811d1 Updated all copyright headers to the new format. 2016-02-09 15:12:00 -08:00
Andrew Kryczka
284aa613a7 Eliminate duplicated property constants
Summary:
Before this diff, there were duplicated constants to refer to properties (user-
facing API had strings and InternalStats had an enum). I noticed these were
inconsistent in terms of which constants are provided, names of constants, and
documentation of constants. Overall it seemed annoying/error-prone to maintain
these duplicated constants.

So, this diff gets rid of InternalStats's constants and replaces them with a map
keyed on the user-facing constant. The value in that map contains a function
pointer to get the property value, so we don't need to do string matching while
holding db->mutex_. This approach has a side benefit of making many small
handler functions rather than a giant switch-statement.

Test Plan: db_properties_test passes, running "make commit-prereq -j32"

Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53253
2016-02-02 19:14:56 -08:00
Andrew Kryczka
3e9209a078 Updated GetProperty documentation
Summary: As titled. Also added the kBaseLevel string, which was missing earlier.

Test Plan: built

Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53271
2016-01-25 14:04:55 -08:00
Andrew Kryczka
29289333d0 Add named constants for remaining properties
Summary:
There were just these two properties that didn't have any named
constant.

Test Plan:
build and below test

  $ ./db_properties_test --gtest_filter=DBPropertiesTest.NumImmutableMemTable

Reviewers: yhchiang, sdong

Reviewed By: sdong

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D53103
2016-01-21 10:59:36 -08:00
Yueh-Hsuan Chiang
6935eb24e0 Add ColumnFamilyHandle::GetDescriptor()
Summary:
This patch addes ColumnFamilyHandle::GetDescriptor(), which allows
developers to obtain the CF options and names of the associated column
family given its handle.

  // Returns the up-to-date descriptor used by the current handle.  Since it
  // returns the up-to-date information, this call might internally locks
  // and releases DB mutex to access the up-to-date CF options.
  virtual ColumnFamilyDescriptor GetDescriptor() = 0;

Test Plan: augment column_family_test

Reviewers: sdong, yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony

Reviewed By: anthony

Subscribers: dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D51543
2016-01-06 18:14:01 -08:00
Alex Yang
33e09c0e19 add call to install superversion and schedule work in enableautocompactions
Summary:
This patch fixes https://github.com/facebook/mysql-5.6/issues/121

There is a recent change in rocksdb to disable auto compactions on startup: https://reviews.facebook.net/D51147. However, there is a small timing window where a column family needs to be compacted and schedules a compaction, but the scheduled compaction fails when it checks the disable_auto_compactions setting. The expectation is once the application is ready, it will call EnableAutoCompactions() to allow new compactions to go through. However, if the Column family is stalled because L0 is full, and no writes can go through, it is possible the column family may never have a new compaction request get scheduled. EnableAutoCompaction() should probably schedule an new flush and compaction event when it resets disable_auto_compaction.

Using InstallSuperVersionAndScheduleWork, we call SchedulePendingFlush,
SchedulePendingCompaction, as well as MaybeScheduleFlushOrcompaction on all the
column families to avoid the situation above.

This is still a first pass for feedback.
Could also just call SchedePendingFlush and SchedulePendingCompaction directly.

Test Plan:
Run on Asan build
cd _build-5.6-ASan/ && ./mysql-test/mtr --mem --big --testcase-timeout=36000 --suite-timeout=12000 --parallel=16 --suite=rocksdb,rocksdb_rpl,rocksdb_sys_vars --mysqld=--default-storage-engine=rocksdb --mysqld=--skip-innodb --mysqld=--default-tmp-storage-engine=MyISAM --mysqld=--rocksdb rocksdb_rpl.rpl_rocksdb_stress_crash --repeat=1000

Ensure that it no longer hangs during the test.

Reviewers: hermanlee4, yhchiang, anthony

Reviewed By: anthony

Subscribers: leveldb, yhchiang, dhruba

Differential Revision: https://reviews.facebook.net/D51747
2015-12-21 10:06:49 -08:00
Alex Yang
e8180f9901 added public api to schedule flush/compaction, code to prevent race with db::open
Summary:
Fixes T8781168.

Added a new function EnableAutoCompactions in db.h to be publicly
avialable.  This allows compaction to be re-enabled after disabling it via
SetOptions

Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open
Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to
prevent race condition.

Test Plan:
Ran make all check

verified fix on myrocks side:
was able to reproduce the seg fault with
../tools/mysqltest.sh --mem --force rocksdb.drop_table

method was to manually sleep the thread after DB::Open but before TransactionDB ptr was
assigned in transaction_db_impl.cc:
  DB::Open(db_options, dbname, column_families_copy, handles, &db);
  clock_t goal = (60000 * 10) + clock();
  while (goal > clock());
  ...dbptr(aka rdb) gets assigned below

verified my changes fixed the issue.

Also added unit test 'ToggleAutoCompaction' in transaction_test.cc

Reviewers: hermanlee4, anthony

Reviewed By: anthony

Subscribers: alex, dhruba

Differential Revision: https://reviews.facebook.net/D51147
2015-12-03 22:59:44 -08:00