/cdf-scraper

Scraper for College de France audio pages

Primary LanguagePythonMIT LicenseMIT

College de France audio scraper

Build Status

This is a sraper for pages containing audio material from the College de France website.

Purpose

It doesn't download any audio file but instead stores metadata about them (lesson title, lecturer, date, etc.) in Google's Datastore to allow statistics and further extraction of data to be done.

How to run

For devs, follow the gist.

For an actual prod run, a docker file is provided you can run it with:

docker build -t scraper .
docker run scraper

You'll have to create your own project in Google Compute Engine and pass in your project ID and proper json service account credentials via environment variables.

Output

Once you run it, you'll see a few thousands entities in your dashboard.