/ct_warc_to_doc

Source code to extract content from the Common Crawl news corpus and upload it to S3

Primary language: Python

Common Crawl News-Specific WARC File Parser

Aim of the project
Extracts documents from the news-specific WARC files published by Common Crawl.
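
As a rough illustration of what extracting documents from a WARC file involves, the sketch below walks the records of a WARC stream using only the Python standard library and keeps the HTTP response bodies. It is not the parser used in this project: the function names `iter_warc_records`, `extract_documents` and `extract_from_file` are hypothetical, the header handling is simplified, and a real pipeline would more likely rely on a dedicated WARC library.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream.

    Simplified sketch: header names are lower-cased and continuation
    lines are not supported.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def extract_documents(stream):
    """Yield a dict with the URL and raw body of every 'response' record."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # skip warcinfo, request, metadata records
        # The payload is a full HTTP response: status line and headers,
        # a blank line, then the body.
        _, _, body = payload.partition(b"\r\n\r\n")
        yield {"url": headers.get("warc-target-uri"), "body": body}


def extract_from_file(path):
    """Read a gzip-compressed WARC file (hypothetical local path)."""
    with gzip.open(path, "rb") as stream:
        return list(extract_documents(stream))
```

Each yielded dict holds the target URL and the undecoded response body; turning that body into clean article text is a separate step that this sketch does not attempt.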

Requirements

AWS
EC2


How to run

python main.py --month_id 01 --year_id 2020 --month_half first
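
The three flags in the command above could be handled roughly as in the following sketch. It is an assumption about how `main.py` might be organised, not a copy of it: `build_parser`, `s3_prefix` and `in_month_half` are hypothetical names, and the reading of `first` as days 1 to 15 and `second` as the remaining days is a guess. The key layout `crawl-data/CC-NEWS/<year>/<month>/` follows the way Common Crawl publishes its news WARC files.

```python
import argparse
import re

# Assumption: file names look like CC-NEWS-YYYYMMDDhhmmss-NNNNN.warc.gz
_NAME_RE = re.compile(r"CC-NEWS-(\d{4})(\d{2})(\d{2})\d{6}-\d+\.warc\.gz$")


def build_parser():
    """Command-line interface mirroring the flags shown in 'How to run'."""
    parser = argparse.ArgumentParser(description="Parse CC-NEWS WARC files")
    parser.add_argument("--month_id", required=True, help="two-digit month, e.g. 01")
    parser.add_argument("--year_id", required=True, help="four-digit year, e.g. 2020")
    parser.add_argument("--month_half", required=True, choices=["first", "second"])
    return parser


def s3_prefix(year_id, month_id):
    """Key prefix under which the WARC files of one month are stored."""
    return f"crawl-data/CC-NEWS/{year_id}/{month_id}/"


def in_month_half(key, month_half):
    """Return True if the WARC file at `key` falls in the requested half.

    Assumption: 'first' means days 1-15, 'second' means day 16 onwards.
    """
    match = _NAME_RE.search(key)
    if match is None:
        return False  # not a CC-NEWS WARC file name
    day = int(match.group(3))
    return day <= 15 if month_half == "first" else day > 15


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(s3_prefix(args.year_id, args.month_id), args.month_half)
```

Keeping the month and year as strings preserves the leading zero in values such as `01`, which the key prefix needs.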

ToDo

  • Improve the overall WARC file parsing workflow
    • the workflow should be more robust
  • Remove the command-line parameters so that parsing runs without arguments
    • the month and year could be stored somewhere else
  • Run on AWS spot instances
  • Add autoscaling so that misbehaving instances are killed and new instances are spawned