Datafable/epu-index

Calculate daily epu index


Start a job after the scrapers have run that calculates and persists the EPU index for the previous day.

@niconoe is it ok if I forward this to you to make it a Django command? I think we will schedule the scraper commands with cron, and when all of them are finished, this job should run.

Yep!

For more flexibility, and to be able to recover from cron misconfiguration issues, shouldn't this command allow one of the following:

  • running for a specific day given as a parameter
  • running for the special "yesterday" option (used by default)
  • reprocessing everything?
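To make the three options concrete, here is a rough sketch of how the argument could be turned into a list of days to process. It is plain Python rather than the actual Django command; the function name `resolve_days`, the literal option strings "yesterday" and "all", and the `first_day` parameter are assumptions for illustration only.

```python
from datetime import date, datetime, timedelta


def resolve_days(option, today, first_day=None):
    """Return the list of days to process for a given option (hypothetical helper).

    option    -- "yesterday", "all", or a date string such as "2015-08-17"
    today     -- the current date, passed in so the function is easy to test
    first_day -- earliest day to reprocess when option is "all"
    """
    if option == "yesterday":
        return [today - timedelta(days=1)]
    if option == "all":
        if first_day is None:
            raise ValueError("'all' requires a known first day")
        # Every day from first_day up to and including yesterday.
        n_days = (today - first_day).days
        return [first_day + timedelta(days=i) for i in range(n_days)]
    # Anything else is parsed as an explicit YYYY-MM-DD date.
    return [datetime.strptime(option, "%Y-%m-%d").date()]
```

Passing `today` explicitly keeps the default "yesterday" behaviour testable and makes a missed cron run easy to replay by hand.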

Indeed. However, "everything" will be difficult, since for days before 2014 we don't know the number of journals scraped. (In fact, maybe we need a place to store this for new data too?)

@niconoe apparently the cutoff for the epu index is not 0 but -0.15.

So the EPU index is the number of articles with an EPU score higher than -0.15, divided by the number of journals scraped.
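As a minimal sketch of that definition (not the project's actual code), assuming the article scores for one day are available as a plain list of floats; the names `EPU_CUTOFF` and `daily_epu_index` are made up for this example:

```python
EPU_CUTOFF = -0.15  # cutoff mentioned above; articles must score strictly higher


def daily_epu_index(scores, journals_scraped, cutoff=EPU_CUTOFF):
    """Number of articles scoring above the cutoff, divided by journals scraped."""
    if journals_scraped <= 0:
        # Without a journal count the index is undefined for that day.
        raise ValueError("number of journals scraped must be positive")
    articles_above_cutoff = sum(1 for score in scores if score > cutoff)
    return articles_above_cutoff / journals_scraped
```

For example, with scores `[-0.2, -0.15, -0.1, 0.3]` and 2 journals scraped, only the last two articles count (a score of exactly -0.15 is not "higher than" the cutoff), so the index is 2 / 2.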

So again, maybe we need a place where we can store the number of journals scraped. Some place where every scraper can write "I succeeded for this day". Maybe a table "journals scraped" with two columns "date" and "spider/journal name"?
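A possible shape for such a table, sketched with the standard-library `sqlite3` module instead of a Django model so it runs on its own. The table name `journals_scraped`, the column names, and the helper functions are all assumptions, not something that exists in the repository.

```python
import sqlite3


def create_table(conn):
    """Create the hypothetical "journals scraped" table with two columns."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS journals_scraped ("
        " date TEXT NOT NULL,"
        " journal_name TEXT NOT NULL,"
        " PRIMARY KEY (date, journal_name))"
    )


def mark_scraped(conn, day, journal_name):
    """Record "I succeeded for this day"; repeated calls are harmless."""
    conn.execute(
        "INSERT OR IGNORE INTO journals_scraped (date, journal_name) VALUES (?, ?)",
        (day, journal_name),
    )


def count_scraped(conn, day):
    """Return how many journals reported success for the given day."""
    row = conn.execute(
        "SELECT COUNT(*) FROM journals_scraped WHERE date = ?", (day,)
    ).fetchone()
    return row[0]
```

The composite primary key means a spider that is rerun for the same day does not inflate the count, which matters because that count is the denominator of the index.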

It's now implemented, use as:

$ python manage.py calculate_daily_epu 2015-08-17

It reports what it does on stdout and stores its result in EpuIndexScore. Please review and test!

Tested. Works perfectly.