Common Index File Format to support interoperability between open-source IR engines

Common Index File Format

The Common Index File Format (CIFF) represents an attempt to build a binary data exchange format for open-source search engines to interoperate by sharing index structures. For more details, check out:

All data are contained in a single file, with the extension .ciff. The file comprises a sequence of delimited protobuf messages defined here, exactly as follows:

  • A Header
  • Exactly the number of PostingsList messages specified in the num_postings_lists field of the Header
  • Exactly the number of DocRecord messages specified in the num_docs field of the Header
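The layout above can be sketched without any protobuf dependency. This is a minimal illustration, not the repository's own reader: it assumes that "delimited" means what protobuf's Java writeDelimitedTo() produces, namely each message preceded by its byte length as a base-128 varint. The class name DelimitedRecords and its helper methods are hypothetical, and the zero-filled byte arrays merely stand in for serialized Header and PostingsList messages.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: splits a stream into length-prefixed records.
final class DelimitedRecords {

    // Decodes a base-128 varint whose first byte has already been read.
    static int readVarint32(int firstByte, InputStream in) throws IOException {
        int result = firstByte & 0x7F;
        if ((firstByte & 0x80) == 0) {
            return result;
        }
        for (int shift = 7; shift < 35; shift += 7) {
            int b = in.read();
            if (b == -1) {
                throw new EOFException("truncated varint");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("malformed varint");
    }

    // Encodes a value as a base-128 varint, low-order 7 bits first.
    static void writeVarint32(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Returns the next record's raw bytes, or null at a clean end of stream.
    static byte[] readRecord(InputStream in) throws IOException {
        int first = in.read();
        if (first == -1) {
            return null;
        }
        int size = readVarint32(first, in);
        if (size < 0) {
            throw new IOException("negative record size");
        }
        byte[] payload = new byte[size];
        int offset = 0;
        while (offset < size) {
            int n = in.read(payload, offset, size - offset);
            if (n < 0) {
                throw new EOFException("truncated record");
            }
            offset += n;
        }
        return payload;
    }

    // Writes one record: varint length prefix, then the payload bytes.
    static void writeRecord(OutputStream out, byte[] payload) throws IOException {
        writeVarint32(out, payload.length);
        out.write(payload, 0, payload.length);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        writeRecord(file, new byte[6]);   // stand-in for a serialized Header
        writeRecord(file, new byte[300]); // stand-in for a serialized PostingsList
        InputStream in = new ByteArrayInputStream(file.toByteArray());
        byte[] record;
        while ((record = readRecord(in)) != null) {
            System.out.println(record.length + " bytes");
        }
    }
}
```

In a real importer, the first record would presumably be parsed as the Header, and its num_postings_lists and num_docs fields would then tell the reader how many PostingsList and DocRecord payloads to expect. A gzipped export would likely need to be wrapped in a java.util.zip.GZIPInputStream first.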

See our design rationale for additional discussion.

Explained in terms of xkcd, we're trying to avoid this. Instead, CIFF aims to be this.

Getting Started

After cloning this repo, build CIFF with Maven:

mvn clean package appassembler:assemble

Reference Lucene Indexes

Currently, this repo provides a utility to export CIFF from Lucene, via Anserini. For reference, we provide exports from the Robust04 and ClueWeb12-B13 collections:

Collection     Configuration              Size  MD5                               Download
Robust04       CIFF export, complete      162M  01ce3b9ebfd664b48ffad072fbcae076  [Dropbox]
Robust04       CIFF export, queries only  16M   0a8ea07b6a262639e44ec959c4f53d44  [Dropbox]
Robust04       Source Lucene index        135M  b993045adb24bcbe292d6ed73d5d47b6  [Dropbox]
ClueWeb12-B13  CIFF export, complete      25G   8fff3a57b9625eca94a286a61062ac82  [Dropbox]
ClueWeb12-B13  CIFF export, queries only  1.2G  45063400bd5823b7f7fec2bc5cbb2d36  [Dropbox]
ClueWeb12-B13  Source Lucene index        21G   6ad327c9c837787f7d9508462e5aa822  [Dropbox]

The following invocation can be used to examine an export:

target/appassembler/bin/ReadCIFF -input robust04-complete-20200306.ciff.gz

We provide a full guide on how to replicate the above results here.

CIFF Importers

A CIFF export can be ingested into a number of different search systems.

Tips for writing your own CIFF Importer / Exporter

The systems above all provide concrete examples of taking an existing CIFF structure and converting it into a different (internal) index format. Most of the data/structures within the CIFF are quite straightforward and self-documenting. However, there are a few important details which should be noted.

  1. The default CIFF exports come from Anserini. Those exports are engineered to encode document identifiers as deltas (d-gaps). Hence, when decoding a CIFF structure, care needs to be taken to recover the original identifiers by computing a prefix sum across each postings list. See the discussion here.

  2. Since Anserini is based on Lucene, it is important to note that document lengths are encoded in a lossy manner. This means that the document lengths recorded in the DocRecord structure are approximate - see the discussion here.

  3. Multiple records are stored in a single file using Java protobuf's parseDelimitedFrom() and writeDelimitedTo() methods. Unfortunately, these methods are not available in the bindings for other languages. They can be trivially reimplemented by reading/writing the byte size of each record as a varint - see the discussion here.
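
As an illustration of the first tip, recovering document identifiers from d-gaps amounts to a running sum over each postings list. The sketch below is not taken from any of the importers; the class name DGapDecoder is hypothetical, and it assumes the first entry of each list is stored as the gap from zero (that is, the identifier itself), which is what a plain prefix sum implies.

```java
import java.util.Arrays;

// Hypothetical helper: turns gap-encoded docids back into absolute docids.
final class DGapDecoder {

    // Returns the prefix sums of the gaps; the input array is left unchanged.
    static int[] decode(int[] gaps) {
        int[] docids = new int[gaps.length];
        int current = 0; // assumption: the first gap is relative to zero
        for (int i = 0; i < gaps.length; i++) {
            current += gaps[i];
            docids[i] = current;
        }
        return docids;
    }

    public static void main(String[] args) {
        int[] gaps = {3, 2, 7, 1};
        // Prints [3, 5, 12, 13]
        System.out.println(Arrays.toString(decode(gaps)));
    }
}
```

The sum restarts for every postings list, so the running total must be reset to zero before decoding the next term.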