You can find the pregenerated dataset on Hugging Face (March 1, 2023):
If you want to regenerate the dataset with fresh Wikipedia/Wikidata dumps, you can build `wikianc` from source by running the following command:
```bash
cargo build --release
```
NOTE: The program uses language-specific filtering (i.e., on the word "file"), which only supports Croatian and English out of the box. Replace the relevant part of the `parse_links` function to properly support your language.
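For orientation, a minimal sketch of what such a language-specific check might look like, assuming the filter simply skips links in the localized file namespace. The helper name `is_file_link` and the prefix list are illustrative, not the actual `parse_links` implementation:

```rust
/// Hypothetical helper: returns true if a wiki link targets the file
/// namespace. The prefixes below cover English ("File:") and Croatian
/// ("Datoteka:"); extend the list for other languages.
fn is_file_link(target: &str) -> bool {
    const FILE_PREFIXES: &[&str] = &["File:", "Datoteka:"];
    FILE_PREFIXES
        .iter()
        .any(|prefix| target.trim_start().starts_with(prefix))
}

fn main() {
    assert!(is_file_link("File:Example.png"));
    assert!(is_file_link("Datoteka:Primjer.png"));
    assert!(!is_file_link("Zagreb"));
}
```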
`wikianc` uses the mappings between Wikipedia titles and Wikidata QIDs generated by wiki2qid. Follow the instructions there to first generate the Apache Avro file containing the mappings.
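If you want to sanity-check the mappings before running the full pipeline, a short sketch using the `apache-avro` crate can print the first few records. The file name below is an assumption; use whatever path wiki2qid produced:

```rust
use std::fs::File;

use apache_avro::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed path to the Avro mappings produced by wiki2qid.
    let file = File::open("wiki2qid.avro")?;
    let reader = Reader::new(file)?;

    // Print the first few title -> QID records to verify the file is readable.
    for record in reader.take(5) {
        println!("{:?}", record?);
    }
    Ok(())
}
```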
`wikianc` also uses a Wikipedia dump in ndjson format, which can be generated by following the instructions here.
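ndjson simply means one JSON object per line. A hedged sketch of iterating over such a dump with `serde_json` is shown below; it avoids assuming the dump's schema and only prints the top-level keys of the first records (the file name is a placeholder):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use serde_json::Value;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed path to the ndjson Wikipedia dump: one JSON object per line.
    let file = File::open("enwiki.ndjson")?;
    let reader = BufReader::new(file);

    for (i, line) in reader.lines().take(3).enumerate() {
        let article: Value = serde_json::from_str(&line?)?;
        // Print the top-level keys of each record to inspect its layout.
        if let Value::Object(map) = &article {
            let keys: Vec<&String> = map.keys().collect();
            println!("record {i}: keys = {keys:?}");
        }
    }
    Ok(())
}
```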
Once you have the necessary data, you can generate the dataset with the following command:
```bash
cargo run --release -- \
    --input-wiki "${WIKIPEDIA_NDJSON_FILE}" \
    --input-wiki2qid "${MAPPINGS_FILE}" \
    --output-dir "${OUTPUT_DIR}"
```
This will create three files named `train.parquet`, `validation.parquet`, and `test.parquet` in the directory specified by `${OUTPUT_DIR}`.
The outputs are written as zstd-compressed Apache Parquet files. You can see the details of the schema on Hugging Face.
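To quickly verify the output locally, here is a minimal sketch using the `parquet` crate to read one split's metadata. The path assumes `${OUTPUT_DIR}` was `output`; the authoritative schema is the one documented on Hugging Face:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed output location; adjust to wherever --output-dir pointed.
    let file = File::open("output/train.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata().file_metadata();

    // Row count and column count are read from the Parquet footer metadata.
    println!("rows: {}", meta.num_rows());
    println!("columns: {}", meta.schema_descr().num_columns());
    Ok(())
}
```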
WikiAnc uses as many threads as there are logical CPU cores. On the English dump from March 2023, containing ~6,600,000 articles, it takes ~11 minutes to complete with a peak memory usage of ~52 GB on an AMD Ryzen Threadripper 3970X CPU and an SSD.