This repository contains the code to filter the OSCAR dataset. It reads OSCAR's compressed jsonl files, filters them, and converts the result into parquet.
➜ oscar-filter -h
Filters OSCAR's compressed jsonl files and converts them into parquet.

Usage: oscar-filter [OPTIONS] <INPUT FOLDER> <DESTINATION FOLDER>

Arguments:
  <INPUT FOLDER>        Folder containing the OSCAR compressed jsonl files
  <DESTINATION FOLDER>  Destination folder for the parquet files

Options:
  -t, --threads <NUMBER OF THREADS>  Number of threads to use [default: 10]
  -h, --help                         Print help
  -V, --version                      Print version
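
As a usage sketch based only on the help text above, the following Python wrapper builds the command line and runs it. The helper names `build_command` and `run_filter` and the folder names `oscar-jsonl` and `oscar-parquet` are illustrative assumptions, not part of this repository. It also assumes the `oscar-filter` binary is installed and on `PATH`.

```python
import shutil
import subprocess


def build_command(input_folder, destination_folder, threads=10):
    """Build the oscar-filter argument list from the documented options."""
    return [
        "oscar-filter",
        "--threads", str(threads),  # documented default is 10
        input_folder,
        destination_folder,
    ]


def run_filter(input_folder, destination_folder, threads=10):
    """Run oscar-filter; assumes the binary is installed and on PATH."""
    if shutil.which("oscar-filter") is None:
        raise FileNotFoundError("oscar-filter not found on PATH")
    cmd = build_command(input_folder, destination_folder, threads)
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical folder names, for illustration only.
    print(" ".join(build_command("oscar-jsonl", "oscar-parquet", threads=16)))
```

Keeping the argument construction separate from the process call makes it possible to inspect the command before running it on a large dump.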