The Common Index File Format (CIFF) was introduced as a binary data exchange format that lets open-source search engines interoperate by sharing index structures.
CIFF has been adopted by the OpenWebSearch.EU project to distribute (partitions of) Web indexes.
This repository provides the code necessary to load a CIFF file through Arrow into DuckDB.
The goal is to load the CIFF data and transform it into an index that the DuckDB Full Text Search extension can use. (The current version does not yet fully achieve that goal.)
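To give a feel for the transformation involved, here is a minimal sketch of flattening one postings list into relational rows. It assumes that CIFF stores document ids within a postings list as gaps (the first id as-is, each later id as the difference from the previous one). The `PostingRow` layout and the `flatten_postings` helper are hypothetical illustrations, not the schema or code used in this repository.

```python
from itertools import accumulate
from typing import NamedTuple


class PostingRow(NamedTuple):
    """One row of a flat postings table (hypothetical layout)."""
    term: str
    docid: int
    tf: int


def flatten_postings(term, gaps_and_tfs):
    """Turn one gap-encoded postings list into (term, docid, tf) rows."""
    gaps = [gap for gap, _ in gaps_and_tfs]
    tfs = [tf for _, tf in gaps_and_tfs]
    # A running sum over the gaps restores the absolute document ids.
    docids = accumulate(gaps)
    return [PostingRow(term, docid, tf) for docid, tf in zip(docids, tfs)]


# Made-up postings list: docids 3, 7, 8 stored as gaps 3, 4, 1.
rows = flatten_postings("duck", [(3, 2), (4, 1), (1, 5)])
for row in rows:
    print(row)
```

Rows in this shape map naturally onto columns, which is the form a columnar layer such as Arrow and a relational engine such as DuckDB work with.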
Install the DuckDB CLI for testing:
wget https://artifacts.duckdb.org/latest/duckdb-binaries-linux.zip
unzip -p duckdb-binaries-linux.zip duckdb_cli-linux-amd64.zip | funzip > ./duckdb ; chmod a+rx ./duckdb
Then run the script:
python ciff-arrow.py