Summary:
First cut for early review; there are few conceptual points to answer and some code structure issues.
For conceptual points -
- restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params
- we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise?
- Modifying AssignLevelForIngestedFile seems the place we we'd handle that.
On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites.
Closes https://github.com/facebook/rocksdb/pull/2144
Differential Revision: D4873732
Pulled By: lightmark
fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
Summary:
Windows build in AppVeyor is broken, I believe due to https://github.com/facebook/rocksdb/pull/2254.
Error messages:
```
c_test.obj : error LNK2019: unresolved external symbol rocksdb_get_pinned referenced in function CheckPinGet [C:\projects\rocksdb\build\c_test.vcxproj]
c_test.obj : error LNK2019: unresolved external symbol rocksdb_get_pinned_cf referenced in function CheckPinGetCF [C:\projects\rocksdb\build\c_test.vcxproj]
c_test.obj : error LNK2019: unresolved external symbol rocksdb_pinnableslice_destroy referenced in function CheckPinGet [C:\projects\rocksdb\build\c_test.vcxproj]
c_test.obj : error LNK2019: unresolved external symbol rocksdb_pinnableslice_value referenced in function CheckPinGet [C:\projects\rocksdb\build\c_test.vcxproj]
C:\projects\rocksdb\build\Debug\c_test.exe : fatal error LNK1120: 4 unresolved externals [C:\projects\rocksdb\build\c_test.vcxproj]
```
See, for example: https://ci.appveyor.com/project/Facebook/rocksdb/build/1.0.4420
Closes https://github.com/facebook/rocksdb/pull/2309
Differential Revision: D5076992
Pulled By: sagar0
fbshipit-source-id: bf4ca063a53b5a9042ba9f655f7c60c268ea5748
Summary:
I've added functions to the C API to support Transactions as requested in #1637 and to support Checkpoint.
I have also added the corresponding tests to c_test.c
For now, the following is omitted:
1. Optimistic Transactions
2. The column family variation of functions
Closes https://github.com/facebook/rocksdb/pull/2236
Differential Revision: D4989510
Pulled By: yiwu-arbug
fbshipit-source-id: 518cb39f76d5e9ec9690d633fcdc014b98958071
Summary:
Looks like std::snprintf is not available on all platforms (e.g. MSVC 2010). Change it back to snprintf, where we have a macro in port.h to workaround compatibility.
Closes https://github.com/facebook/rocksdb/pull/2308
Differential Revision: D5070988
Pulled By: yiwu-arbug
fbshipit-source-id: bedfc1660bab0431c583ad434b7e68265e1211b1
Summary:
This brings CMake builds further in line with builds that go through
the normal Makefile.
Closes https://github.com/facebook/rocksdb/pull/2300
Differential Revision: D5064631
Pulled By: yiwu-arbug
fbshipit-source-id: 7b2b2d5299f575f87badcf590cc95e040f14d52d
Summary:
We've had a couple CockroachDB users fail to build RocksDB on exotic platforms, so I figured I'd try my hand at solving these issues upstream. The problems stem from a) `USE_SSE=1` being too aggressive about turning on SSE4.2, even on toolchains that don't support SSE4.2 and b) RocksDB attempting to detect support for thread-local storage based on OS, even though it can vary by compiler on the same OS.
See the individual commit messages for details. Regarding SSE support, this PR should change virtually nothing for non-CMake based builds. `make`, `PORTABLE=1 make`, `USE_SSE=1 make`, and `PORTABLE=1 USE_SSE=1 make` function exactly as before, except that SSE support will be automatically disabled when a simple SSE4.2-using test program fails to compile, as it does on OpenBSD. (OpenBSD's ports GCC supports SSE4.2, but its binutils do not, so `__SSE_4_2__` is defined but an SSE4.2-using program will fail to assemble.) A warning is emitted in this case. The CMake build is modified to support the same set of options, except that `USE_SSE` is spelled `FORCE_SSE42` because `USE_SSE` is rather useless now that we can automatically detect SSE support, and I figure changing options in the CMake build is less disruptive than changing the non-CMake build.
I've tested these changes on all the platforms I can get my hands on (macOS, Windows MSVC, Windows MinGW, and OpenBSD) and it all works splendidly. Let me know if there's anything you object to—I obviously don't mean to break any of your build pipelines in the process of fixing ours downstream.
Closes https://github.com/facebook/rocksdb/pull/2199
Differential Revision: D5054042
Pulled By: yiwu-arbug
fbshipit-source-id: 938e1fc665c049c02ae15698e1409155b8e72171
Summary:
Currently, the RPM package will install the lib and header files into `/usr/package/lib` and `/usr/package/include` which is not in the default search paths. It is reasonable to install them under `/usr/lib` and `/usr/include` so that no extra configuration is required.
Closes https://github.com/facebook/rocksdb/pull/2221
Differential Revision: D5054030
Pulled By: yiwu-arbug
fbshipit-source-id: 1d23de5ff21f07e6738c9dfa04429acd7a839143
Summary:
snprintf is in <stdio.h> and not in namespace std.
Closes https://github.com/facebook/rocksdb/pull/2287
Reviewed By: anirbanr-fb
Differential Revision: D5054752
Pulled By: yiwu-arbug
fbshipit-source-id: 356807ec38f3c7d95951cdb41f31a3d3ae0714d4
Summary:
`java/**.asc` is not a correct gitignore pattern
See https://git-scm.com/docs/gitignore for the list of allowed `**` patterns
It seems reasonable to assume that intention is `java/**/*.asc`
The reason why it bothers me is the fact that ripgrep parses .gitignore files
and complains about invalid pattern
https://github.com/BurntSushi/ripgrep
Closes https://github.com/facebook/rocksdb/pull/2214
Differential Revision: D5063030
Pulled By: yiwu-arbug
fbshipit-source-id: ddd6682b81f03134be15f20fd596130776b69695
Summary:
- Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
- Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type.
- Migrated the "ldb idump" subcommand to use this function
- The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
Closes https://github.com/facebook/rocksdb/pull/2232
Differential Revision: D4976007
Pulled By: ajkr
fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
Summary:
As an alternative to Vagrant, we can now also use Docker to cross-build RocksDB. The advantages are:
1. The Docker images are fixed; they include all the latest updates and build tools.
2. The Vagrant image, required scripts that ran for every build that would update CentOS and install the buildtools. This lead to slow repeatable builds, we don't need to do this with Docker as they are already in the provided images.
The Docker images I have used have their Docker build files here: https://github.com/evolvedbinary/docker-rocksjava and the images themselves are available from Docker hub: https://hub.docker.com/r/evolvedbinary/rocksjava/
I have added the following targets to the `Makefile`:
1. `rocksdbjavastaticreleasedocker` this uses Docker to perform the cross-builds. It is basically the Docker version of the existing Vagrant `rocksdbjavastaticrelease` target.
2. `rocksdbjavastaticpublishdocker` delegates to `rocksdbjavastaticreleasedocker` and then `rocksdbjavastaticpublishcentral` to upload the artiacts to Maven Central. Equivalent to the existing Vagrant target: `rocksdbjavastaticpublish`
Closes https://github.com/facebook/rocksdb/pull/2278
Differential Revision: D5048206
Pulled By: yiwu-arbug
fbshipit-source-id: 78fa96ef9d966fe09638ed01de282cd4e31961a9
Summary:
try to clean up the type conversions and hope it passes on windows.
one interesting thing I learned is that bitshift operations are special: in `x << y`, the result type depends only on the type of `x`, unlike most arithmetic operations where the result type depends on both operands' types.
Closes https://github.com/facebook/rocksdb/pull/2277
Differential Revision: D5050145
Pulled By: ajkr
fbshipit-source-id: f3309e77526ac9612c632bf93a62d99757af9a29
Summary:
Some of the file from #2269 didn't add to CMake file. Adding them to fix window build.
Closes https://github.com/facebook/rocksdb/pull/2276
Differential Revision: D5043487
Pulled By: yiwu-arbug
fbshipit-source-id: 4eba853e9d92574353abce21d77d30e47ce43d3d
Summary:
Moved the logic for core-local array out of ConcurrentArena and into a separate class because I want to reuse it for core-local stats.
Closes https://github.com/facebook/rocksdb/pull/2256
Differential Revision: D5011518
Pulled By: ajkr
fbshipit-source-id: a75a7b8f7b7a42fd6273489ada405f14c6be196a
Summary:
fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose.
test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check
Closes https://github.com/facebook/rocksdb/pull/2271
Differential Revision: D5038133
Pulled By: lightmark
fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8
Summary:
- added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available.
- refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred
Closes https://github.com/facebook/rocksdb/pull/2272
Differential Revision: D5038093
Pulled By: ajkr
fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7
Summary:
When building rocksdb in fbcode using `make`, util/build_version.cc is always updated (gitignore/hgignore doesn't apply because the file is already checked into fbcode). To use the rocksdb makefile from our own makefile, I would like an option to prevent the metadata update, which is of no value for us.
Closes https://github.com/facebook/rocksdb/pull/2264
Differential Revision: D5037846
Pulled By: siying
fbshipit-source-id: 9fa005725c5ecb31d9cbe2e738cbee209591f08a
Summary:
Updates to CentOS 5 have been archived as CentOS 5 is EOL. We now pull the updates from the vault. This is a stop gap solution, I will send a PR in a couple days which uses fixed Docker containers (with the updates pre-installed) instead.
sagar0 Here you go :-)
Closes https://github.com/facebook/rocksdb/pull/2270
Differential Revision: D5033637
Pulled By: sagar0
fbshipit-source-id: a9312dd1bc18bfb8653f06ffa0a1512b4415720d
Summary:
Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a)
and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31.
An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block.
It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32).
This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte.
*But there is one exception: the last bit will always set with the old way.*
(*) - assuming read_amp_bytes_per_bit = 32.
Closes https://github.com/facebook/rocksdb/pull/2259
Differential Revision: D5035652
Pulled By: lightmark
fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356
Summary:
Updated PhysicalCoreID() to use sched_getcpu() on x86_64 for glibc >= 2.22. Added a new
function named GetCPUID() that calls sched_getcpu(), to avoid repeated code. This change is done as per the comments of PR: https://github.com/facebook/rocksdb/pull/2230
Signed-off-by: Jos Collin <jcollin@redhat.com>
Closes https://github.com/facebook/rocksdb/pull/2260
Differential Revision: D5025734
Pulled By: ajkr
fbshipit-source-id: f4cca68c12573cafcf8531e7411a1e733bbf8eef
Summary:
Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek
Actually I am not sure why this behavior existed in the first place
Closes https://github.com/facebook/rocksdb/pull/2255
Differential Revision: D5010721
Pulled By: maysamyabandeh
fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4
Summary:
RocksDB is compiled as part of MyRocks (MySQL storage engine) build.
MySQL already defines `CACHE_LINE_SIZE` and therefore we're getting a
conflict. Change RocksDB definition to be more cognizant of this.
Closes https://github.com/facebook/rocksdb/pull/2257
Differential Revision: D5013188
Pulled By: gunnarku
fbshipit-source-id: cfa76fe99f90dcd82aa09204e2f1f35e07a82b41
Summary:
Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time.
The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO.
Closes https://github.com/facebook/rocksdb/pull/2248
Differential Revision: D5001578
Pulled By: yiwu-arbug
fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
Summary:
Any non-raw-data dependent object must be destructed before the table
closes. There was a bug of not doing that for filter object. This patch
fixes the bug and adds a unit test to prevent such bugs in future.
Closes https://github.com/facebook/rocksdb/pull/2246
Differential Revision: D5001318
Pulled By: maysamyabandeh
fbshipit-source-id: 6d8772e58765485868094b92964da82ef9730b6d
Summary:
- downcase includes for case-sensitive filesystems
- give targets the same name (librocksdb) on all platforms
With this patch it is possible to cross-compile RocksDB for Windows
from a Linux host using mingw.
cc yuslepukhin orgads
Closes https://github.com/facebook/rocksdb/pull/2107
Differential Revision: D4849784
Pulled By: siying
fbshipit-source-id: ad26ed6b4d393851aa6551e6aa4201faba82ef60
Summary:
Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks:
[a b c d][f g h]
Then the index key for the first block will be 'e'.
then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already.
This diff can reduce RA in most cases.
Closes https://github.com/facebook/rocksdb/pull/2239
Differential Revision: D4990693
Pulled By: lightmark
fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163
Differential Revision: D4895953
Pulled By: siying
fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
Summary:
Changed dynamic leveling to stop setting the base level's size bound below `max_bytes_for_level_base`.
Behavior for config where `max_bytes_for_level_base == level0_file_num_compaction_trigger * write_buffer_size` and same amount of data in L0 and base-level:
- Before #2027, compaction scoring would favor base-level due to dividing by size smaller than `max_bytes_for_level_base`.
- After #2027, L0 and Lbase get equal scores. The disadvantage is L0 is often compacted before reaching the num files trigger since `write_buffer_size` can be bigger than the dynamically chosen base-level size. This increases write-amp.
- After this diff, L0 and Lbase still get equal scores. Now it takes `level0_file_num_compaction_trigger` files of size `write_buffer_size` to trigger L0 compaction by size, fixing the write-amp problem above.
Closes https://github.com/facebook/rocksdb/pull/2123
Differential Revision: D4861570
Pulled By: ajkr
fbshipit-source-id: 467ddef56ed1f647c14d86bb018bcb044c39b964
Summary:
When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249.
Test setup:
- command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize`
- filesystem: xfs
before this diff:
`fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)`
after this diff:
`fallocate(10, 01, 0, 1977) = 0`
Closes https://github.com/facebook/rocksdb/pull/2252
Differential Revision: D5007275
Pulled By: ajkr
fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880