occlum/ngo

[RFC] Introduce SwornDisk in NGO

lucassong-mh opened this issue · 0 comments

  • Feature Name: Introduce SwornDisk
  • Start Date: 2022-11-17

In a nutshell

This RFC consists of two parts. The first is a high-level design overview of SwornDisk, which explains its "why, how, what". The second is an implementation review of SwornDisk-Occlum, which explains the code structure and details of this SGX version.

Design Overview

SwornDisk: A Log-Structured Secure Block Device for TEEs

Objectives: confidentiality, integrity, freshness, anonymity, consistency, and (flush) atomicity


Motivation

  • Existing solutions for protecting on-disk data for TEEs (eCryptfs, fscrypt, dm-crypt, SGX-PFS) are far from satisfactory in terms of both security and performance
  • SGX-PFS in particular has both a performance issue (slow random writes due to 2 × H write amplification) and a security vulnerability (unanticipated snapshot attacks, CVE-2022-27499; see our website)
  • Unanticipated snapshot attack: the adversary can capture and replay transient on-disk states (caused by cache eviction in the TEE) that users are unaware of


Background knowledge

  1. In-place-update, MHT-based approach vs. out-of-place-update, log-structured approach


  • Random writes are slower than sequential writes in HDD/SSD

  • Write amplification: 2 × H for the MHT-based approach (H being the height of the MHT) vs. 1 + ϵ (ϵ ≪ 1) for the log-structured approach

  2. Log-Structured Merge tree (LSM-tree)
  • A leveled, ordered, disk-oriented index structure for KV stores. The core idea is to use append-only (sequential) writes to suit write-intensive workloads, avoiding the fragmented writes of B-trees.
  • Data is held in memory in a MemTable and persisted on disk in SST (Sorted String Table) files.
  • Read performance is degraded as a result; LSM-trees use Bloom filters and compaction strategies to limit the degradation.
  • Use cases: BigTable, HBase, LevelDB, RocksDB


  • Workflow: KV pair → MemTable → Sorted String Table → Minor compaction to L0 → Major compaction to Li
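The workflow above can be illustrated with a minimal in-memory sketch. This is not SwornDisk's index, only a toy LSM-tree: the class name `TinyLSM`, the two-level layout, and the thresholds `memtable_limit` and `l0_limit` are assumptions made for the example.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: a mutable MemTable plus immutable sorted runs (SSTs)."""

    def __init__(self, memtable_limit=4, l0_limit=2):
        self.memtable = {}                    # in-memory, mutable
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.l0_limit = l0_limit              # hypothetical L0 run limit
        self.levels = [[], []]                # levels[0]: newest-first runs; levels[1]: one merged run

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._minor_compaction()

    def _minor_compaction(self):
        # Freeze the MemTable into a sorted, immutable run and add it to L0.
        self.levels[0].insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.levels[0]) > self.l0_limit:
            self._major_compaction()

    def _major_compaction(self):
        # Merge L1 with every L0 run, oldest first, so newer values win.
        merged = {}
        for sst in self.levels[1] + list(reversed(self.levels[0])):
            merged.update(sst)
        self.levels[0] = []
        self.levels[1] = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for level in self.levels:
            for sst in level:  # newest run first within L0
                i = bisect.bisect_left(sst, (key,))
                if i < len(sst) and sst[i][0] == key:
                    return sst[i][1]
        return None
```

Every run is written once and never modified, so all writes are sequential; the price is that a read may have to consult several runs, which is the read degradation noted above.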

Architecture


SwornDisk performs out-of-place data updates. Inside the TEE, it keeps the mapping between the block address that users query (LBA) and the block address where the data is eventually persisted (HBA).

It introduces a tailor-made LSM-tree to index confidential data and uses an MHT only to protect the index itself, which is much smaller than the data. Cascading MHT updates are avoided because all on-disk index content is immutable.

There is also a journal subsystem that summarizes on-disk updates to ensure crash consistency and atomicity.

This technique minimizes write amplification, where each write generates one data block, one or more index records (due to compaction), and one journal record.

Block I/O operations

read()

params: start address LBA, a number of block buffers

  1. Retrieve the HBAs, encryption keys, and MACs of these blocks from the secure index (LSM-tree)
  2. Read the encrypted data blocks from the HBAs and decrypt them
  3. Return the plaintext data to the user after verification
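The read steps can be sketched as follows. This is a simplified illustration, not the SwornDisk-Occlum code: the index is a plain dict standing in for the LSM-tree, `xor_stream` is a placeholder cipher built from SHA-256 (a real implementation would use authenticated encryption), HMAC-SHA256 stands in for the per-block MAC, and all function names are hypothetical.

```python
import hashlib
import hmac


def xor_stream(key, data):
    # Placeholder cipher (SHA-256 keystream XOR), NOT a real AEAD.
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "little")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def block_mac(key, ciphertext):
    # Stand-in for the per-block MAC stored in the index.
    return hmac.new(key, ciphertext, hashlib.sha256).digest()


def read_blocks(index, disk, start_lba, count):
    """index: LBA -> (HBA, key, MAC), trusted; disk: HBA -> ciphertext, untrusted."""
    plaintexts = []
    for lba in range(start_lba, start_lba + count):
        hba, key, expected_mac = index[lba]          # step 1: query the secure index
        ciphertext = disk[hba]                       # step 2: read the encrypted block
        if not hmac.compare_digest(block_mac(key, ciphertext), expected_mac):
            raise ValueError(f"integrity check failed for LBA {lba}")
        plaintexts.append(xor_stream(key, ciphertext))  # decrypt after verification
    return plaintexts
```

Because the key and MAC come from the trusted index rather than from the disk, a block that was tampered with or replayed from an older location fails the check before any plaintext is returned.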


write()

params: start address LBA, a number of block buffers

  1. Save the data in the segment buffer and notify the user of completion immediately
  2. When the segment buffer becomes full or a flush request is received, encrypt each block with a random key, calculate its MAC, and persist the segment to the allocated disk location
  3. Insert the newly generated index records into the LSM-tree (persisted to the index region) and persist the new journal records to the journal region
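A minimal sketch of this write path, under several assumptions: `ToyWriter`, `SEGMENT_BLOCKS`, and the journal record layout are invented for the example, a dict replaces the LSM-tree, and `xor_stream` is a placeholder cipher rather than real authenticated encryption.

```python
import hashlib
import hmac
import os

SEGMENT_BLOCKS = 4  # assumed number of blocks per segment


def xor_stream(key, data):
    # Placeholder cipher (SHA-256 keystream XOR), NOT a real AEAD.
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "little")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class ToyWriter:
    def __init__(self):
        self.segment_buffer = []  # [(LBA, plaintext)] not yet on disk
        self.disk = {}            # HBA -> ciphertext (untrusted storage)
        self.index = {}           # LBA -> (HBA, key, MAC); stands in for the LSM-tree
        self.journal = []         # journal records, simplified to tuples
        self.next_hba = 0         # append-only allocation (out-of-place updates)

    def write(self, start_lba, buffers):
        # Buffer the data and return at once; persist only when a segment fills up.
        for offset, buf in enumerate(buffers):
            self.segment_buffer.append((start_lba + offset, buf))
            if len(self.segment_buffer) == SEGMENT_BLOCKS:
                self.flush_segment()

    def flush_segment(self):
        if not self.segment_buffer:
            return
        first_hba = self.next_hba
        for lba, plaintext in self.segment_buffer:
            key = os.urandom(16)  # fresh random key per block
            ciphertext = xor_stream(key, plaintext)
            mac = hmac.new(key, ciphertext, hashlib.sha256).digest()
            hba, self.next_hba = self.next_hba, self.next_hba + 1
            self.disk[hba] = ciphertext
            self.index[lba] = (hba, key, mac)  # new index record
        self.journal.append(("data_log", first_hba, len(self.segment_buffer)))
        self.segment_buffer = []
```

Overwriting an LBA never touches the old HBA: the new version lands at a fresh location and only the index entry changes, which is what leaves stale blocks behind for garbage collection.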


flush()

params: none

  1. Trigger flushing of the new data in the temporary segment buffers to the physical disk
  2. Write journal records to ensure consistency and atomicity

trim()

params: start address LBA, end address LBA

  1. Similar to write, except that no new data is written; only the index is updated to discard the specified data blocks

Garbage Collection (segment cleaning)

SwornDisk's log-structured design lets newer data and older data coexist on disk. When new data is written, the older data must therefore be invalidated so that a later GC pass can reclaim it.

Before every write, SwornDisk retrieves the older index records and invalidates the corresponding HBAs (in the DST).

A periodic GC worker chooses a victim segment, migrates its still-valid blocks, and frees the segment.
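The interplay between invalidation and segment cleaning can be sketched as below. The names `dst`, `svt`, and `rit` follow the tables described in this RFC, but the class itself, the greedy "fewest valid blocks" victim policy, and the allocator are assumptions made for illustration. Block contents are omitted; only addresses are tracked.

```python
class ToySegments:
    def __init__(self, num_segments=4, blocks_per_segment=4):
        self.bps = blocks_per_segment
        self.dst = [[False] * blocks_per_segment for _ in range(num_segments)]  # valid-block bitmaps
        self.svt = [False] * num_segments  # True if the segment is allocated
        self.index = {}                    # LBA -> HBA
        self.rit = {}                      # HBA -> LBA (reverse index)
        self.cursor = None                 # (open segment, next free slot)

    def _alloc_hba(self):
        if self.cursor is None or self.cursor[1] == self.bps:
            # Open a fresh segment; raises StopIteration if the disk is full.
            seg = next(s for s, used in enumerate(self.svt) if not used)
            self.svt[seg] = True
            self.cursor = (seg, 0)
        seg, slot = self.cursor
        self.cursor = (seg, slot + 1)
        return seg * self.bps + slot

    def write(self, lba):
        old = self.index.get(lba)
        if old is not None:
            # Invalidate the older version before writing the new one.
            self.dst[old // self.bps][old % self.bps] = False
            del self.rit[old]
        hba = self._alloc_hba()
        self.dst[hba // self.bps][hba % self.bps] = True
        self.index[lba] = hba
        self.rit[hba] = lba
        return hba

    def gc_once(self):
        open_seg = self.cursor[0] if self.cursor else None
        candidates = [s for s, used in enumerate(self.svt) if used and s != open_seg]
        if not candidates:
            return None
        victim = min(candidates, key=lambda s: sum(self.dst[s]))  # fewest valid blocks
        for slot in range(self.bps):
            if self.dst[victim][slot]:
                self.write(self.rit[victim * self.bps + slot])    # migrate a valid block
        self.svt[victim] = False                                  # free the segment
        return victim
```

The reverse index is what lets the GC worker find the LBA of each surviving block, so that the migration can go through the ordinary write path and update the index.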

Index region

  • Disk-oriented secure LSM-tree (dsLSM-tree): Organizes the disk content directly on a raw disk without the help of a file system.
  • Block Index Table (BIT): Replaces the traditional SST. A BIT integrates an MHT with a B+ tree. Each node is fixed-size and protected with authenticated encryption.
    • Root node and internal nodes: manage child nodes [ LBA range, HBA, Key, MAC ]
    • Leaf nodes: Array of data records [ LBA → (HBA, Key, MAC) ]
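To make the "MHT plus B+ tree" idea concrete, here is a deliberately small sketch with one root and one layer of leaves. It is an illustration only: node encryption is omitted, leaf records are reduced to LBA → HBA, nodes are serialized as JSON rather than fixed-size blocks, and `build_bit` and `lookup` are hypothetical names.

```python
import hashlib
import hmac
import json


def node_mac(key, payload):
    return hmac.new(key, payload, hashlib.sha256).digest()


def build_bit(records, fanout, disk, key=b"k" * 16):
    """records: list of (LBA, HBA) sorted by LBA. Appends leaves to `disk`, returns the root."""
    root = []
    for i in range(0, len(records), fanout):
        chunk = records[i:i + fanout]
        payload = json.dumps(chunk).encode()
        disk.append(payload)  # leaf node written once, never updated
        root.append({
            "lba_range": (chunk[0][0], chunk[-1][0]),
            "addr": len(disk) - 1,          # where the child node lives
            "key": key,                     # per-child key (shared here for brevity)
            "mac": node_mac(key, payload),  # authenticates the child
        })
    return root


def lookup(root, disk, lba):
    for child in root:
        low, high = child["lba_range"]
        if low <= lba <= high:
            payload = disk[child["addr"]]
            # Verify the child against the MAC held by its parent before trusting it.
            if not hmac.compare_digest(node_mac(child["key"], payload), child["mac"]):
                raise ValueError("BIT node failed verification")
            for record_lba, hba in json.loads(payload):
                if record_lba == lba:
                    return hba
    return None
```

Since each parent stores the MAC of its children, trusting the root is enough to authenticate the whole table, and because a BIT is immutable once built, no MAC ever has to be recomputed up the tree.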

Journal region

Journal contains a series of records that summarize the information of each on-disk update of the secure data log and the secure index.

  • Each record contains cryptographic information about the corresponding on-disk update;
  • Journal blocks (each composed of multiple records) are chained together: each block embeds the MAC of the previous one;
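The chaining of journal blocks can be sketched in a few lines. The block layout (a dict with `prev_mac`, `records`, and `mac`), the all-zero MAC for the first block, and the use of HMAC-SHA256 are assumptions for this example, not the on-disk format.

```python
import hashlib
import hmac

GENESIS_MAC = b"\x00" * 32  # assumed placeholder "previous MAC" for the first block


def _block_mac(key, prev_mac, records):
    return hmac.new(key, prev_mac + b"|".join(records), hashlib.sha256).digest()


def append_block(journal, key, records):
    # Each new block embeds the MAC of the block before it.
    prev_mac = journal[-1]["mac"] if journal else GENESIS_MAC
    journal.append({
        "prev_mac": prev_mac,
        "records": records,
        "mac": _block_mac(key, prev_mac, records),
    })


def verify_chain(journal, key):
    prev_mac = GENESIS_MAC
    for block in journal:
        expected = _block_mac(key, block["prev_mac"], block["records"])
        if block["prev_mac"] != prev_mac or not hmac.compare_digest(expected, block["mac"]):
            return False  # a block was modified, removed, or reordered
        prev_mac = block["mac"]
    return True
```

Dropping or reordering a block breaks the link between neighbours, so an attacker cannot silently remove an update from the middle of the journal.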

SwornDisk realizes consistency based on three internal journal operations: journaling, checkpointing, and recovery.

Journaling

Each on-disk update of the secure data log and the secure index is followed by writing a corresponding journal record for the durability and security of the update.

| Record type | Description |
| --- | --- |
| Data log | Summarizes the update to a data segment (data region) |
| BIT node | Summarizes a new BIT node (index region) |
| BIT compaction | Saves the progress of a BIT compaction |
| Checkpoint pack | Summarizes a new checkpoint pack (checkpoint region) |
| Commit | Marks prior data/index as committed |

Checkpointing

To reclaim the disk space consumed by outdated journal records and speed up the recovery process, SwornDisk periodically transforms journal records into a more compact format called checkpoint packs.

  • The checkpoint region preserves backups of the BITC, SVT, DST, and RIT;
  • A checkpoint pack consists of the creation timestamp, the head and tail positions of the secure journal, and the bitmaps used to choose valid backups for recovery;

Recovery

During recovery, SwornDisk selects the most recent checkpoint pack, from which it initializes its in-memory data structures. Then, it continues reading the rest of the journal, one record at a time, deciding whether it should be accepted to restore SwornDisk to a consistent state.
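A minimal sketch of this recovery loop follows. The acceptance rule used here (a data record takes effect only once a later commit record is seen) is an assumption based on the Commit record type above, and the field names `timestamp`, `index`, and `replay_from` are hypothetical simplifications of a checkpoint pack.

```python
def recover(checkpoint_packs, journal):
    """Rebuild the in-memory LBA -> HBA map after a crash (simplified)."""
    # Start from the most recent checkpoint pack.
    pack = max(checkpoint_packs, key=lambda p: p["timestamp"])
    state = dict(pack["index"])

    # Replay the rest of the journal, one record at a time.
    pending = {}
    for record in journal[pack["replay_from"]:]:
        if record["type"] == "data_log":
            pending[record["lba"]] = record["hba"]  # not yet accepted
        elif record["type"] == "commit":
            state.update(pending)                   # accept everything before the commit
            pending = {}

    # Records after the last commit are dropped, restoring a consistent state.
    return state
```

In this sketch, an update that reached the disk but was never followed by a commit record is simply ignored, so the recovered state always corresponds to a completed flush.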


Checkpoint region

Consists of auxiliary data structures for index queries and segment management:

  • Block Index Table Catalog (BITC): Records the metadata of each BIT [ BIT ID, level, key range, root node ]
    • Used to manage the LSM-tree's BITs
  • Segment Validity Table (SVT): A bitmap where each bit indicates whether a segment is valid
    • Used for allocation/deallocation of data/index segments
  • Data Segment Table (DST): Contains per-segment metadata of the data segments (valid block bitmap)
    • Used to manage the invalidation of blocks in each segment, and for GC
  • Reverse Index Table (RIT): Mapping from HBAs to LBAs
    • Used for GC

Further discussion

Other important points are worth discussing but are omitted here for lack of space:

Compaction-based, delayed block reclamation; flush atomicity based on commitment; key acquisition and protection flow; space clipping; performance tuning.

Implementation Review

[WIP]
[Figure: NGO I/O stack]