Types and IO (Reader/Writer) for OSCAR Corpus processing and generation.
The crate provides basic abstractions around Corpus items and generic readers/writers usable with OSCAR Corpus files. Eventually, it should replace the reader implementations in both Ungoliant and oscar-tools.
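To illustrate what a "generic reader" can look like, here is a minimal sketch using only the standard library. It is not the crate's actual implementation: `LineReader` is a hypothetical name, and it yields plain lines rather than parsed Corpus items. The point is that a reader generic over `BufRead` does not care whether its source is a plain file, an in-memory buffer, or a decompressing stream.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical reader (not part of oscar-io), generic over any `BufRead` source.
pub struct LineReader<R: BufRead> {
    source: R,
}

impl<R: BufRead> LineReader<R> {
    pub fn new(source: R) -> Self {
        Self { source }
    }
}

impl<R: BufRead> Iterator for LineReader<R> {
    type Item = io::Result<String>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut line = String::new();
        match self.source.read_line(&mut line) {
            Ok(0) => None, // end of input
            Ok(_) => {
                // Strip the trailing line terminator, if any.
                let trimmed = line
                    .trim_end_matches(|c: char| c == '\n' || c == '\r')
                    .to_string();
                Some(Ok(trimmed))
            }
            Err(e) => Some(Err(e)),
        }
    }
}

fn main() -> io::Result<()> {
    // An in-memory source stands in for a file or a decompressed stream.
    let data = Cursor::new("first record\nsecond record\n");
    for record in LineReader::new(data) {
        println!("{}", record?);
    }
    Ok(())
}
```

Because the only bound is `BufRead`, the same type could presumably be fed a buffered file or a buffered decompressor without any change to the reader itself.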
`oscar-io` aims to provide readers/writers for numerous types of OSCAR Corpora.
- Reader
  - Uncompressed [oscar_doc::Reader::new]
  - GZipped [oscar_doc::Reader::from_gzip]
  - Parquet
- Writer
  - Uncompressed [oscar_doc::Writer::new]
  - GZipped [oscar_doc::Writer::new] (by wrapping the output in a [GzEncoder]; `from_gzip` not yet implemented)
  - Parquet
- SplitReader (should be unified with Reader via `split_size: Option<u64>`)
  - Uncompressed
  - GZipped
- SplitWriter (same)
  - Uncompressed
  - GZipped
- Reader
- Writer
- SplitReader (should be unified with Reader via `split_size: Option<u64>`)
- SplitWriter (same)
- Reader
- Writer
- SplitReader
- SplitWriter
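
The note about unifying the split and non-split variants through `split_size: Option<u64>` can be sketched as follows. This is an illustration under stated assumptions, not the crate's API: `PartWriter`, `write_record` and `into_parts` are hypothetical names, and parts are kept in memory instead of being written to files. With `None` the writer never rotates, so one type covers both behaviours.

```rust
use std::io::{self, Write};

/// Hypothetical writer (not part of oscar-io). It starts a new part once the
/// current one would exceed `split_size` bytes; with `None` it never splits.
pub struct PartWriter {
    split_size: Option<u64>,
    parts: Vec<Vec<u8>>,
}

impl PartWriter {
    pub fn new(split_size: Option<u64>) -> Self {
        Self { split_size, parts: vec![Vec::new()] }
    }

    /// Writes one record followed by a newline, never splitting a record
    /// across two parts.
    pub fn write_record(&mut self, record: &str) -> io::Result<()> {
        let needed = record.len() as u64 + 1; // +1 for the trailing newline
        let current_len = self.parts.last().map_or(0, |p| p.len()) as u64;
        if let Some(limit) = self.split_size {
            // Rotate only if the current part already holds data, so an
            // oversized record still gets a part of its own.
            if current_len > 0 && current_len + needed > limit {
                self.parts.push(Vec::new());
            }
        }
        let part = self.parts.last_mut().expect("at least one part exists");
        part.write_all(record.as_bytes())?;
        part.write_all(b"\n")
    }

    pub fn into_parts(self) -> Vec<Vec<u8>> {
        self.parts
    }
}

fn main() -> io::Result<()> {
    let records = ["aaaa", "bbbb", "cccc"];

    let mut split = PartWriter::new(Some(10)); // behaves like a split writer
    let mut whole = PartWriter::new(None); // behaves like a plain writer
    for r in records {
        split.write_record(r)?;
        whole.write_record(r)?;
    }
    println!("split parts: {}", split.into_parts().len());
    println!("unsplit parts: {}", whole.into_parts().len());
    Ok(())
}
```

A reader could follow the same pattern in reverse, treating `None` as "read a single source" and `Some(_)` as "move on to the next part when the current one is exhausted".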