martinsumner/leveled

Leveled - Safety Features within Design/Implementation

This issue is a walk-through of the aspects of design/implementation which underpin the protection from corruption within leveled.

Concurrency Control

The first aspect of the safety of the system is the strict control over concurrency. It is assumed that parallelisation of activity is a function of the parent application (e.g. Riak) - leveled can only do one thing at a time, with the exception of snapshot-based query activity and compaction. The ability of the store to provide a consistent view is a function of the Actor model, with parallelism of activity being strictly controlled by that model. The orchestration between processes avoids clever tricks to improve performance at a single-store level.

Concurrency is managed by having dedicated Erlang processes (using gen_server for the primary workers, and gen_statem for the managing process for each file). There is a single Bookie leveled_bookie, which has a single Inker leveled_inker (to control the Journal - a K/V store in order of receipt), and a single Penciller leveled_penciller (to control the Ledger - a K/MD store in key order). The Inker and Penciller each have a single clerk (leveled_iclerk, leveled_pclerk), which manages background compaction tasks - but these clerks only prepare compaction changes; they must send a message to the Inker/Penciller to enact the change.
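
The pattern can be sketched with a toy gen_server (hypothetical module and function names, not the real leveled API): the clerk prepares a change and hands it over as a message, but only the owning process mutates its own state.

```erlang
-module(owner_clerk_sketch).
-behaviour(gen_server).
-export([start_link/0, propose_change/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% Called by a clerk once it has prepared a change in the background.
propose_change(Owner, PreparedChange) ->
    gen_server:cast(Owner, {propose, PreparedChange}).

init([]) ->
    {ok, #{manifest_sqn => 0, changes => []}}.

handle_call(current_sqn, _From, State = #{manifest_sqn := SQN}) ->
    {reply, SQN, State}.

%% Only the owning process (the "Inker"/"Penciller" of this sketch) commits
%% the prepared change into its own state.
handle_cast({propose, Change}, State = #{manifest_sqn := SQN, changes := Changes}) ->
    {noreply, State#{manifest_sqn => SQN + 1, changes => [Change | Changes]}}.
```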

All files that make up the Journal or the Ledger are immutable - each is controlled by a singleton process which can only read the file, not mutate it. File processes (leveled_cdb for the Journal, and leveled_sst for the Ledger) are gen_statem processes - they exist first as a writer (i.e. when being created by a clerk), then as a reader (when an active part of the Journal/Ledger), and finally as delete_pending (when a compaction has removed them from the active Journal/Ledger, but they are awaiting the final signal to terminate and clear). The only exception to this rule is that the Journal has a single active journal file which is an append-only file, with no support for in-place modifications - it appends new Key/Values as they arrive, as well as supporting reads.
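
A minimal gen_statem sketch of this writer -> reader -> delete_pending lifecycle is below. The module and function names are illustrative only, and the real leveled_cdb/leveled_sst processes carry far more state; the sketch only shows the one-way state transitions.

```erlang
-module(file_lifecycle_sketch).
-behaviour(gen_statem).
-export([start_link/1, complete_write/1, remove_from_manifest/1]).
-export([init/1, callback_mode/0, writer/3, reader/3, delete_pending/3]).

start_link(Filename) ->
    gen_statem:start_link(?MODULE, Filename, []).

complete_write(Pid) -> gen_statem:call(Pid, complete).
remove_from_manifest(Pid) -> gen_statem:call(Pid, remove_from_manifest).

callback_mode() -> state_functions.

init(Filename) ->
    {ok, writer, #{filename => Filename}}.

%% writer: the file is being built (e.g. by a clerk); no reads are served.
writer({call, From}, complete, Data) ->
    {next_state, reader, Data, [{reply, From, ok}]}.

%% reader: an immutable, active part of the Journal/Ledger.
reader({call, From}, remove_from_manifest, Data) ->
    %% Enter delete_pending and begin polling the owner (sketched here as a
    %% simple state timeout) to learn when all snapshots have been released.
    {next_state, delete_pending, Data,
        [{reply, From, ok}, {state_timeout, 10000, poll}]};
reader({call, From}, {get, _Key}, Data) ->
    %% Reads are served here in the real modules; the sketch returns not_found.
    {keep_state, Data, [{reply, From, not_found}]}.

%% delete_pending: still readable by snapshots, awaiting confirmation.
delete_pending(state_timeout, poll, Data) ->
    %% The real process asks the Inker/Penciller whether deletion is
    %% confirmed, and resets the timeout if snapshots remain; the sketch
    %% simply stops and clears.
    {stop, normal, Data};
delete_pending({call, From}, {get, _Key}, Data) ->
    {keep_state, Data, [{reply, From, not_found}]}.
```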

A snapshot to support a query or a fold is a fresh Inker/Penciller process (in a read-only mode) - with a copy of the in-memory cache, and a reference to the manifest of file processes that represent the non-cached state of the store at the time the snapshot was taken. The delete_pending state on file processes is used to handle the situation where a compaction event has rendered a file process redundant, but a snapshot is still active which may read that file. The active Inker/Penciller does not need to inform the snapshot of any changes of state (i.e. due to compaction); there is no coordination between active actors and snapshot actors - other than the fact that the active actors keep track of which snapshots were opened on which manifest versions.

When a snapshot is finished, it will inform the active Inker/Penciller so that it can be removed from the tracking. Any file in a delete_pending state will periodically poll the Inker/Penciller to see if its use has been completed - and the Inker/Penciller will confirm deletion if and only if all snapshots associated with the file's last active manifest version have ended. Periodically the Inker/Penciller will check for snapshots exceeding their timeout, and assume they have silently closed. Snapshots that exceed their timeouts may therefore crash with noproc errors, but such a crash has no impact on the active store.
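
The deletion decision can be sketched as a pure function, assuming the tracking state is simply a list of {SnapshotPid, ManifestSQN} pairs (the real representation differs):

```erlang
-module(snapshot_tracking_sketch).
-export([confirm_delete/2, release_snapshot/2]).

%% FileLastActiveSQN: the last manifest SQN in which the file appeared.
%% Snapshots: the currently registered [{SnapshotPid, ManifestSQN}] pairs.
%% Deletion is only confirmed once no snapshot is using a manifest version
%% at or below the version in which the file was last active.
confirm_delete(FileLastActiveSQN, Snapshots) ->
    not lists:any(
        fun({_Pid, SnapSQN}) -> SnapSQN =< FileLastActiveSQN end,
        Snapshots).

%% When a snapshot finishes it informs the active Inker/Penciller, which
%% removes it from the tracking list.
release_snapshot(SnapshotPid, Snapshots) ->
    lists:keydelete(SnapshotPid, 1, Snapshots).
```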

File Transition

When changing versions of the Inker or Penciller Manifest, and when transitioning state between writer/reader in the leveled_sst files, the leveled_util:safe_rename/4 function is used to avoid issues with partially complete file-system transitions. safe_rename isn’t used for leveled_cdb; in this case the transition is split into two halves - one process is triggered to complete the write and perform the rename, and a second process reads the resultant file (see TODO - there is no sync in this process). The safe_rename function is based on Scott Fritchie’s riak_core_util:replace_file/2.
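
A minimal sketch of the write-sync-rename pattern that safe_rename relies on is below (this is not the leveled_util code itself, and the real function takes additional arguments):

```erlang
-module(safe_rename_sketch).
-export([safe_rename/3]).

%% Write to a temporary file, sync it to disk, and only then rename it into
%% place - so a crash never leaves a partially written file under the final
%% name.
safe_rename(TempFN, RealFN, Content) ->
    {ok, Handle} = file:open(TempFN, [binary, raw, write]),
    ok = file:write(Handle, Content),
    ok = file:sync(Handle),
    ok = file:close(Handle),
    %% rename/2 is atomic on POSIX filesystems when both paths are on the
    %% same filesystem, so readers see either the old file or the new one.
    ok = file:rename(TempFN, RealFN),
    {ok, RealFN}.
```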

The manifest for the Journal is strictly controlled by the single Inker process - the clerk proposes changes to the manifest, but the Inker must make them. The roles are reversed in the Ledger, where only the Penciller’s clerk can save the manifest; it then exchanges a message with the Penciller for the Penciller to make that manifest active (and at this point there is a need to merge some state from the in-memory manifest - as snapshots may have closed).

The Penciller Manifest is checksummed, but not the Inker’s Manifest (see TODO). A history of the last 5 manifests is kept on disk, should there be an issue which requires investigation.

Journal and Checksums

Each new Key/Value is passed from the Bookie to the Inker, compressed and extended to include a checksum, and then appended to the active Journal. The checksum is a CRC32 of both the Key and the Value, and is appended to the value.
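
A sketch of this checksum scheme, assuming arbitrary Erlang terms and omitting the compression step, might look as follows (the real on-disk layout differs):

```erlang
-module(journal_checksum_sketch).
-export([add_checksum/2, check/2]).

%% The CRC32 covers both the Key and the Value, and is appended to the value.
add_checksum(Key, Value) ->
    KeyBin = term_to_binary(Key),
    ValueBin = term_to_binary(Value),
    CRC = erlang:crc32(<<KeyBin/binary, ValueBin/binary>>),
    <<ValueBin/binary, CRC:32/integer>>.

%% On read, a mismatch returns not_found - never corrupted data.
check(Key, StoredBin) when byte_size(StoredBin) > 4 ->
    KeyBin = term_to_binary(Key),
    ValueSize = byte_size(StoredBin) - 4,
    <<ValueBin:ValueSize/binary, CRC:32/integer>> = StoredBin,
    case erlang:crc32(<<KeyBin/binary, ValueBin/binary>>) of
        CRC -> {ok, binary_to_term(ValueBin)};
        _ -> not_found
    end;
check(_Key, _StoredBin) ->
    not_found.
```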

Each journal file, when complete (i.e. every file other than the active Journal), is a DJ Bernstein Constant Database file, except that the controlling process will check the Key/Value checksums on each operation. The purpose of the checksums is to return not_found on corruption, so that a higher-level anti-entropy mechanism (i.e. read repair or AAE) should eventually resolve this state. It is considered better to return no data than corrupted data.

There is an additional state for leveled_cdb between writer and reader - rolling. In the rolling state the file is read-only from the perspective of Key/Values, but is still in the process of completing the hash table which will be used in the long term to look up keys (in this state an in-memory hash table compiled in the writer state is still used to manage reads). The hash table is calculated in a spawned process to allow reads to occur concurrently.

On startup the active journal is read until the end of the file is reached, or there is a value which fails to be read (i.e. a checksum failure) - and the file is truncated at that point. All values beyond a corrupted value are considered absent. This startup process creates the in-memory hash table to be used for reading values.
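
The scan-and-truncate behaviour can be sketched as below, assuming a simplified length-prefixed record format of <<Size:32, CRC:32, KVBin>> rather than the real CDB-style append format:

```erlang
-module(journal_startup_sketch).
-export([scan_and_truncate/1]).

%% Read records until end of file or the first record that fails its
%% checksum, then truncate so everything beyond the corruption is discarded.
scan_and_truncate(Filename) ->
    {ok, Handle} = file:open(Filename, [binary, raw, read, write]),
    LastGood = scan(Handle, 0),
    {ok, _} = file:position(Handle, LastGood),
    ok = file:truncate(Handle),
    ok = file:close(Handle),
    {ok, LastGood}.

scan(Handle, Position) ->
    case file:pread(Handle, Position, 4) of
        {ok, <<Size:32/integer>>} ->
            case file:pread(Handle, Position + 4, Size) of
                {ok, <<CRC:32/integer, KVBin/binary>>}
                        when byte_size(KVBin) =:= Size - 4 ->
                    case erlang:crc32(KVBin) of
                        CRC -> scan(Handle, Position + 4 + Size);
                        _ -> Position    % checksum failure - truncate here
                    end;
                _ ->
                    Position             % short/failed read - truncate here
            end;
        _ ->
            Position                     % end of file
    end.
```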

The hash table at the end of each leveled_cdb file is not checksummed. Any corruption in the hash table will naturally result in not_found (i.e. either a hash will not match, or a pointer will point to something which isn’t the start of a checksummed Key/Value).

Journal Compaction

When compaction occurs (after a compaction run passes the score threshold), the following steps are taken in order (a sketch of this ordering follows the list):

  • All new files are written
  • All new files are saved, renamed, and then successfully opened and read
  • A new version of the manifest is saved, renamed, and then successfully opened and read
  • The new manifest is now active, and removed files will enter into delete_pending state - the file is deleted from delete_pending state if either a shutdown signal is received, or a regular poll indicates that all snapshots have been released
  • Any failure or halt during this process prior to the manifest renaming will revert to the old pre-compaction state (and leave uncollected garbage files)
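
A sketch of this ordering, using hypothetical helper functions rather than the real leveled internals, is:

```erlang
-module(journal_compaction_sketch).
-export([compact/2]).

compact(FilesToReplace, Manifest) ->
    %% 1. All new files are written (to temporary names).
    NewFiles = write_new_files(FilesToReplace),
    %% 2. Each new file is saved, renamed, and then re-opened and read to
    %%    confirm it is usable.
    ok = verify_by_reopening(NewFiles),
    %% 3. A new version of the manifest is saved, renamed, and re-read;
    %%    this rename is the commit point for the whole compaction.
    NewManifest = write_manifest(Manifest, NewFiles, FilesToReplace),
    %% 4. Only now do the replaced files enter delete_pending; they are
    %%    removed once a shutdown signal arrives or polling shows that all
    %%    snapshots have been released.
    ok = mark_delete_pending(FilesToReplace),
    {ok, NewManifest}.
    %% A crash at any point before step 3 completes leaves the previous
    %% manifest (and the files it references) untouched - the only cost is
    %% uncollected garbage files.

%% Hypothetical helpers - stubs only, so that the sketch compiles.
write_new_files(Files) -> [F ++ ".compacted" || F <- Files].
verify_by_reopening(_NewFiles) -> ok.
write_manifest(Manifest, NewFiles, Removed) -> (Manifest -- Removed) ++ NewFiles.
mark_delete_pending(_Files) -> ok.
```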

The last 5 copies of the manifest are retained to assist in any recovery.

It is possible to configure a “waste file path” - in this case journal files will, on deletion, be renamed into the waste file path rather than being deleted. With a waste file path configured, it would be possible to manually recover the state of a previous Journal manifest, should the manifest be corrupted. However, garbage collection from the waste file path is not automated, and the default is to rely on higher-level anti-entropy mechanisms rather than this feature.

Key present in Ledger but missing from Journal

If a key is not present in the Journal (i.e. a file has been deleted, or a value has become corrupted), then on a GET it will be not_found (and so read-repair will be triggered). The key will still be “loosely present” in the Journal, in that its hash will still be in the leveled_cdb hash lookup. Loose presence is checked on random HEAD requests (1:100), and during AAE-related fold activity. When a journal file is compacted, the object will no longer be loosely present, and so absence will be detected by AAE, not just read-repair. The frequency of presence checks is increased should a presence check ever fail - i.e. if a cdb file is corrupted/removed, HEADs as well as GETs will increasingly return not_found to accelerate read repairs.
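
The sampling idea can be sketched as follows, with the presence check itself abstracted into a caller-supplied fun and the exact frequencies being illustrative only:

```erlang
-module(presence_check_sketch).
-export([maybe_check_presence/2]).

%% Frequency is the denominator of the check probability (e.g. 100 -> 1:100).
%% Most HEADs are answered from the Ledger alone; roughly 1 in Frequency also
%% confirms the value is (at least loosely) present in the Journal.
maybe_check_presence(CheckFun, Frequency) ->
    case rand:uniform(Frequency) of
        1 ->
            case CheckFun() of
                true ->
                    {checked, Frequency};
                false ->
                    %% Failed check: tighten the sampling rate (never below
                    %% 1:1) so subsequent HEADs are more likely to report
                    %% not_found and accelerate read repair.
                    {missing, max(1, Frequency div 2)}
            end;
        _ ->
            {not_checked, Frequency}
    end.
```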

If error logging indicates frequent CRC failures within a Journal file, it may be preferable that the Journal file be deleted to force loose-presence check failures for all values in the file. TODO - confirm the process here in test: should the file just be deleted, or replaced with an empty file?

Ledger on Startup

The Ledger is not designed to be complete on shutdown (i.e. not everything must be persisted on shutdown - though it will try to write an up-to-date L0 file on shutdown if one is not already present). On startup the highest SQN is found, and the Journal is instructed to reload from that SQN. If there is a significant and unexpected issue, the Ledger can simply be removed, and the next startup will reload from SQN 0. It is only safe to delete individual files to allow for startup if the individual file is a “Level 0 file”. It is generally quicker to rebuild a ledger on startup (of the order of 1 hour), rather than rely on higher-level anti-entropy to resolve.
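
A sketch of the reload step, assuming the Journal entries can be presented as simple {SQN, Key, Value} tuples (the real load uses folds over the Journal, not an in-memory list), is:

```erlang
-module(ledger_startup_sketch).
-export([reload/2]).

%% Replay only the Journal entries with SQN greater than anything the Ledger
%% has already persisted; if the Ledger was deleted this is simply a full
%% reload from SQN 0.
reload(LedgerHighestSQN, JournalEntries) ->
    ToReplay = [{SQN, Key, Value} || {SQN, Key, Value} <- JournalEntries,
                                     SQN > LedgerHighestSQN],
    lists:foldl(fun apply_to_ledger/2, #{}, ToReplay).

apply_to_ledger({SQN, Key, Value}, Ledger) ->
    Ledger#{Key => {SQN, Value}}.
```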

On reloading from a SQN (potentially 0), a strategy is required to recover index entries. It is possible to run leveled in retain mode, where Key/Values are never fully removed from the Journal: on replacement, a KeyDelta object is retained with all the Key Changes (i.e. index changes) made as part of that transaction, so that they can be replayed. This leaves forever-uncollected garbage, and so is not recommended. The recommended setting is to use recalc. In this mode the leveled store uses the same diff index specs function as Riak to read the object in the Journal, compare it with the metadata in the Ledger, and re-calculate what key/index changes are required. Note that this recommendation is not the default setting. TODO - make the recalc method the default in Riak.
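
The recalc idea can be sketched as a diff of index entries, here represented as plain {IndexName, IndexValue} pairs; this mirrors the approach, not the actual Riak/leveled function:

```erlang
-module(recalc_sketch).
-export([diff_index_specs/2]).

%% Compare the index entries derivable from the object in the Journal with
%% those recorded in the Ledger metadata, and emit only the additions and
%% removals needed to bring the Ledger back in line.
diff_index_specs(NewIndexes, OldIndexes) ->
    ToAdd = [{add, IdxName, IdxValue}
                || {IdxName, IdxValue} <- NewIndexes,
                   not lists:member({IdxName, IdxValue}, OldIndexes)],
    ToRemove = [{remove, IdxName, IdxValue}
                || {IdxName, IdxValue} <- OldIndexes,
                   not lists:member({IdxName, IdxValue}, NewIndexes)],
    ToAdd ++ ToRemove.
```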

Ledger and Checksums

The Keys/Metadata within an SST file are split into slots, and then into blocks. Each block is checksummed (CRC32), and a checksum failure (or any failure to read, e.g. because of bit-flips in the block lengths) results in a not_found (with the assumption that this should prompt read-repair/AAE recovery).
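
A sketch of the per-block check, assuming a block is serialised as <<CRC:32, BlockBin>> (the real slot/block layout is more involved), is:

```erlang
-module(sst_block_sketch).
-export([serialise_block/1, read_block/1]).

serialise_block(KVList) ->
    BlockBin = term_to_binary(KVList),
    CRC = erlang:crc32(BlockBin),
    <<CRC:32/integer, BlockBin/binary>>.

%% Any checksum mismatch (or failure to even read the block) is mapped to
%% not_found rather than returning possibly-corrupt keys/metadata.
read_block(<<CRC:32/integer, BlockBin/binary>>) ->
    case erlang:crc32(BlockBin) of
        CRC -> {ok, binary_to_term(BlockBin)};
        _ -> not_found
    end;
read_block(_BadlyFormedBlock) ->
    not_found.
```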

All reads include a checksum check - so on compaction any Keys/Metadata pairs within a corrupted block will be removed from the Ledger, and in any AAE-related fold the Keys/Clocks in a corrupted block will not show up. Missing data due to corrupted blocks is assumed to be resolved by re-adding the object (after a higher-level anti-entropy operation, i.e. read-repair / kv_index_tictactree). Without higher-level anti-entropy, the store needs to be stopped, the ledger deleted, and the store restarted - in order to rebuild the Ledger from the Journal.

Each leveled_sst file has a table_summary that must be read at startup (and is only read from disk at startup; after startup it resides in the process loop state). This summary has a CRC32 checksum. A failure of the checksum will cause a crash that will ripple up the system. The safe way to resolve this issue is to delete the whole ledger and restart the store so that the entire ledger can be rebuilt.

Ledger Compaction Events

All ledger compaction events follow the same routine as with the Journal. The Penciller’s clerk polls the Penciller to discover if there is a need for compaction at a level (looking down from the top of the LSM tree), and then selects a file at random from that level to compact into the levels below. A new state requires all components in the new state to be saved, renamed, opened and read correctly. Only once this is complete will the Penciller be informed of the manifest change.
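
The selection step can be sketched as below, assuming the manifest is simply a map of level number to a list of files, and the per-level capacity is a map of level number to a file-count limit (neither is the real manifest structure):

```erlang
-module(ledger_compaction_sketch).
-export([select_work/2]).

%% Find the highest (closest to the top of the LSM tree) level that is over
%% capacity, then pick a file at random from that level to merge downwards.
select_work(Manifest, MaxFilesPerLevel) ->
    Levels = lists:sort(maps:keys(Manifest)),
    case [L || L <- Levels,
               length(maps:get(L, Manifest)) > maps:get(L, MaxFilesPerLevel)] of
        [] ->
            no_work;
        [Level | _] ->
            Files = maps:get(Level, Manifest),
            File = lists:nth(rand:uniform(length(Files)), Files),
            {compact, Level, File}
    end.
```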

The ledger is made up of a tree of SST files, with the current tree defined by the persisted ledger manifest. The last 5 copies of the manifest are kept to potentially assist in any recovery (although any unexpected event should be handled by wiping the ledger and restarting).

Tests

Eunit tests related to bit-flips/corruption:

  • leveled_sst:indexed_list_mixedkeys_bitflip_test/0
  • leveled_sst:corrupted_block_fetch_test/0
  • leveled_pmanifest:ext_keylookup_manifest_test/0
  • leveled_cdb:safe_read_test/0
  • leveled_cdb:get_positions_corruption_test/0
  • leveled_cdb:badly_written_test/0
  • leveled_cdb:corrupt_file_test/0
  • leveled_cdb:crc_corrupt_writer_test/0

CT tests related to bit-flips/corruption:

  • recovery_SUITE:aae_bustedjournal/1
  • recovery_SUITE:journal_compaction_bustedjournal/1
  • basic_SUITE:safereaderror_startup/1
  • basic_SUITE:remove_journal_test/1

The property-based tests also do aggressive stop/starts.

Issues

TODO

  • The Journal manifest does not contain a CRC check - so there is no check on startup that it has not been corrupted (whereas there is for the Ledger manifest).
  • The leveled_cdb process should sync when calling cdb_complete before closing, to be consistent with leveled_util:safe_rename/4.
  • There is versioning on leveled_sst files, to allow for changes in layout, and also on object serialisation for storage in the Journal. However, versioning would be helpful elsewhere to allow for changes in layout (e.g. block-versions within leveled_sst files). Altering the layout of a file between releases has been achieved, but requires bespoke handling for each case.
  • recalc should be the default recovery method in Riak (currently recommended as the default, but actually set to retain).