Pipeline for pre-processing WARC files from Common Crawl.
Initial inspiration came from "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only" (https://arxiv.org/pdf/2306.01116.pdf).
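
To give a rough idea of what the first stage of such a pipeline deals with, here is a minimal sketch of reading records out of a gzipped WARC file using only the Python standard library. It is not this project's actual code: the function names `iter_warc_records` and `iter_responses` are hypothetical, and the parser assumes well-formed records with a `Content-Length` header, which a dedicated WARC library would handle more robustly.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Optional, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, body) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; assumes every record carries a
    valid Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")

        # Read "Name: value" header lines until the empty line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()

        # The record body is exactly Content-Length bytes long.
        length = int(headers.get("Content-Length", "0"))
        body = stream.read(length)
        yield headers, body


def iter_responses(path: str) -> Iterator[Tuple[Optional[str], bytes]]:
    """Yield (target URI, raw HTTP response bytes) for each 'response' record."""
    # gzip.open reads multi-member gzip files transparently, so a file made of
    # individually compressed records can be read as one continuous stream.
    with gzip.open(path, "rb") as f:
        for headers, body in iter_warc_records(f):
            if headers.get("WARC-Type") == "response":
                yield headers.get("WARC-Target-URI"), body
```

A later stage would typically take each `(uri, body)` pair, split the HTTP headers from the HTML payload, and pass the payload on to text extraction and filtering.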