/humongous-rs

A Rust pipeline for extracting HUMONGOUS, a dataset of web text drawn from Common Crawl and prepared for multilingual language modeling.

Primary Language: Rust
