
Source code to extract content from the Common Crawl news corpus and upload it to S3

Primary Language: Python

Commoncrawl News Specific WARC File Parser

Aim of the project
Extracts documents from Common Crawl news-specific WARC files.



How to run

python main.py --month_id 01 --year_id 2020 --month_half first
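The command above takes three flags. As a rough illustration of how such flags could be parsed and used to pick WARC files, here is a minimal sketch. It is not the project's actual `main.py`: the helper names `build_parser` and `in_requested_half` are hypothetical, the value `second` for `--month_half` is assumed as the counterpart of `first`, the split at day 15 is an assumption, and the file name pattern `CC-NEWS-<timestamp>-<serial>.warc.gz` is assumed from the public CC-NEWS listings.

```python
import argparse
import re

# Assumed CC-NEWS file name shape: CC-NEWS-<YYYYMMDDhhmmss>-<serial>.warc.gz
_NAME_RE = re.compile(r"CC-NEWS-(\d{4})(\d{2})(\d{2})\d{6}-\d+\.warc\.gz$")


def build_parser():
    """Build a parser for the three flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Parse CC-NEWS WARC files")
    parser.add_argument("--month_id", required=True, help="two-digit month, e.g. 01")
    parser.add_argument("--year_id", required=True, help="four-digit year, e.g. 2020")
    # "second" is an assumed counterpart to the documented value "first".
    parser.add_argument("--month_half", required=True, choices=["first", "second"])
    return parser


def in_requested_half(path, year_id, month_id, month_half):
    """Return True if a WARC path falls in the requested year, month and half."""
    match = _NAME_RE.search(path)
    if match is None:
        return False
    year, month, day = match.group(1), match.group(2), int(match.group(3))
    if (year, month) != (year_id, month_id):
        return False
    # Assumption: "first" covers days 1-15, "second" covers the rest of the month.
    return day <= 15 if month_half == "first" else day > 15


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.year_id, args.month_id, args.month_half)
```

Keeping the month and year as strings preserves the leading zero in values such as `01`, which matters when they are compared against file names.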


  • Improve the overall WARC file parsing workflow
    • The workflow should be more robust
  • Remove the command-line parameters so that parsing runs without arguments
    • The month and year could be stored somewhere else
  • Run on AWS spot instances
  • Add autoscaling so that misbehaving instances are terminated and new instances are spawned
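One way the parameterless idea from the list above might look is sketched below. This is only an illustration under stated assumptions: the function `resolve_params` and the environment variable names `YEAR_ID`, `MONTH_ID` and `MONTH_HALF` are hypothetical, and the fallback rule (process the most recently completed half-month, with the split at day 15) is an assumption, not something the project specifies.

```python
import os
from datetime import date, timedelta


def resolve_params(env=None, today=None):
    """Resolve year, month and half without command-line arguments.

    Values come from (hypothetical) environment variables when present,
    otherwise from the most recently completed half-month.
    """
    env = os.environ if env is None else env
    today = date.today() if today is None else today

    if today.day > 15:
        # The first half of the current month is already complete.
        default = (today.year, today.month, "first")
    else:
        # Fall back to the second half of the previous month.
        prev = today.replace(day=1) - timedelta(days=1)
        default = (prev.year, prev.month, "second")

    return {
        "year_id": env.get("YEAR_ID", f"{default[0]:04d}"),
        "month_id": env.get("MONTH_ID", f"{default[1]:02d}"),
        "month_half": env.get("MONTH_HALF", default[2]),
    }
```

On a spot instance, the same values could instead come from whatever store the job scheduler writes to; the environment is used here only because it needs no extra dependencies.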