warc_processing

A pipeline for pre-processing WARC files from Common Crawl.

MIT License

The initial inspiration came from the paper "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only" (https://arxiv.org/pdf/2306.01116.pdf).
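
To give a feel for the input such a pipeline works with, here is a minimal, standard-library-only sketch of iterating over the records of an uncompressed WARC stream and keeping only the HTTP responses. It is not taken from this repository: the names `iter_warc_records`, `response_records`, and `make_record` are hypothetical, and the parser only handles the basic record layout (version line, headers, blank line, `Content-Length` bytes of payload).

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) per record. Hypothetical helper, not the repo's code."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            # Header names are stored lowercased so lookups are case-insensitive.
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def response_records(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (target URI, payload) for records whose WARC-Type is 'response'."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") == "response":
            yield headers.get("warc-target-uri", ""), payload


def make_record(warc_type: str, uri: str, body: bytes) -> bytes:
    """Build a tiny synthetic record for demonstration purposes only."""
    head = (
        f"WARC/1.0\r\nWARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\nContent-Length: {len(body)}\r\n\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


if __name__ == "__main__":
    sample = make_record("request", "http://example.com/", b"GET / HTTP/1.1\r\n\r\n")
    sample += make_record("response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>")
    for uri, payload in response_records(io.BytesIO(sample)):
        print(uri, len(payload))
```

Common Crawl distributes its WARC files gzip-compressed, so in practice the stream would likely come from something like `gzip.open(path, "rb")` rather than an in-memory buffer. A real pipeline would probably rely on a dedicated WARC library instead of a hand-written parser; the sketch only illustrates the record structure.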
