Summary:
While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows.
The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files **again** to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them.
Test plan (on dev server)
```
$make clean && make -j32 all
$./db_secondary_test
```
All existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472
Differential Revision: D15866771
Pulled By: riversand963
fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed
Summary:
`DBImplSecondary` calls `CheckConsistency()` during open. In the past, `DBImplSecondary` did not override this function thus `DBImpl::CheckConsistency()` is called.
The following can happen. The secondary instance is performing consistency check which calls `GetFileSize(file_path)` but the file at `file_path` is deleted by the primary instance. `DBImpl::CheckConsistency` does not account for this and fails the consistency check. This is undesirable. The solution is that, we call `DBImpl::CheckConsistency()` first. If it passes, then we are good. If not, we give it a second chance and handles the case of file(s) being deleted.
Test plan (on dev server):
```
$make clean && make -j20 all
$./db_secondary_test
```
All other existing unit tests must pass as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5469
Differential Revision: D15861845
Pulled By: riversand963
fbshipit-source-id: 507d72392508caed3cd003bb2e2aa43f993dd597
Summary:
In regular RocksDB instance, `MemTable::earliest_seqno_` is "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for secondary instance, i.e. `DBImplSecondary` due to the logic of MANIFEST and WAL replay.
When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families.
Assume the db has column family "default" and "cf1". At a certain in time, both "default" and "cf1" have data in memtables.
1. Primary triggers a flush and flushes "cf1". "default" is **not** flushed.
2. Secondary replays the MANIFEST updates its db sequence number to the latest value learned from the MANIFEST.
3. Secondary starts to replay WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to reader until "default" is later flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413
Differential Revision: D15637407
Pulled By: riversand963
fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580
Summary:
With this commit, RocksDB secondary instance respects atomic groups in version edits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5411
Differential Revision: D15617512
Pulled By: HaoyuHuang
fbshipit-source-id: 913f4ede391d772dcaf5649e3cd2099fa292d120