nlngh/ct_warc_to_doc

Source code to extract content from the Common Crawl news corpus and upload it to S3

Python

Common Crawl News-Specific WARC File Parser

Aim of the project
Extracts documents from Common Crawl news-specific WARC files.

Requirements

AWS
EC2

How to run

python main.py --month_id 01 --year_id 2020 --month_half first
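
The three flags in the command above could be parsed roughly as follows. This is a minimal sketch under stated assumptions, not the repository's actual main.py: the helper names build_parser and day_range are hypothetical, the value "second" for --month_half is assumed by analogy with "first", and splitting the month after day 15 is only a guess at what a "half" means here.

```python
import argparse
import calendar


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the run command (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Extract documents from Common Crawl news WARC files."
    )
    parser.add_argument("--month_id", required=True, help="two-digit month, e.g. 01")
    parser.add_argument("--year_id", required=True, help="four-digit year, e.g. 2020")
    # Assumption: the only valid values are "first" and "second".
    parser.add_argument("--month_half", required=True, choices=["first", "second"])
    return parser


def day_range(year: int, month: int, half: str) -> range:
    """Return the days covered by a month half (assumed split: 1-15, 16-end)."""
    last_day = calendar.monthrange(year, month)[1]
    if half == "first":
        return range(1, 16)
    return range(16, last_day + 1)


# Parse the same arguments as the example command instead of reading sys.argv.
args = build_parser().parse_args(
    ["--month_id", "01", "--year_id", "2020", "--month_half", "first"]
)
days = day_range(int(args.year_id), int(args.month_id), args.month_half)
print(args.year_id, args.month_id, days[0], days[-1])
```

Keeping the day-range logic in its own function means the half-month split can be changed or tested without touching the argument parsing.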

ToDo

improve the overall WARC file parsing workflow
- the workflow should be more robust
remove the command-line parameters so that parsing runs without arguments
- the month and year could perhaps be stored somewhere else
run on AWS spot instances
add autoscaling so that misbehaving instances are killed and new instances are spawned
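
One possible way to approach the parameterless item is to derive the year, month, and month half from the current date rather than from the command line. The sketch below is only an illustration: the function name latest_complete_half is hypothetical, and it assumes that a month is split after day 15 and that the job should process the most recent half-month that has fully ended.

```python
from datetime import date
from typing import Tuple


def latest_complete_half(today: date) -> Tuple[str, str, str]:
    """Return (year_id, month_id, month_half) for the last finished half-month.

    Assumption: the first half is days 1-15, the second half is day 16 onward.
    """
    if today.day > 15:
        # The first half of the current month has already ended.
        return (f"{today.year}", f"{today.month:02d}", "first")
    # Otherwise fall back to the second half of the previous month.
    if today.month == 1:
        year, month = today.year - 1, 12
    else:
        year, month = today.year, today.month - 1
    return (f"{year}", f"{month:02d}", "second")


# Example: values in the same format as the existing command-line flags.
year_id, month_id, month_half = latest_complete_half(date.today())
print(year_id, month_id, month_half)
```

The returned strings use the same format as the existing flags, so a function like this could feed the current workflow without other changes.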