/covid-etl-aut

A basic ETL pipeline for covid stats in Austria

A very basic ETL (extract, transform, load) pipeline for COVID stats in Austria.

Functions in the extract namespace handle loading and parsing of data sources. Currently implemented are HTML scrapers (pages are loaded into enlive), a simple fetcher and parser for JSONP sources, and a JSON fetcher and parser.
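To illustrate what handling a JSONP source involves, here is a minimal sketch that strips the callback wrapper from a JSONP response and returns the JSON text inside it. The function name unwrap-jsonp and the sample payload are made up for this example and are not taken from the project; the actual extract namespace may work differently.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of the project's extract namespace.
(defn unwrap-jsonp
  "Returns the JSON text inside a JSONP body such as `callback({...});`,
  or nil when the body does not look like JSONP."
  [body]
  ;; (?s) lets `.` match newlines, so multi-line payloads are handled.
  ;; The greedy `.*` runs up to the last closing parenthesis in the body.
  (when-let [[_ payload] (re-matches #"(?s)\s*[\w$.]+\s*\((.*)\)\s*;?\s*" body)]
    (str/trim payload)))

;; Usage with a made-up response body:
(unwrap-jsonp "callback({\"cases\": 42});")
;; => "{\"cases\": 42}"
```

The returned string can then be handed to whichever JSON parser is on the classpath, for example clojure.data.json/read-str, assuming that library is available.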

The transform ns contains functions that expect output from extract-scrapers and turn it into a common stats data format: a hashmap with location identifiers (:at for Austria, :wien for Vienna, etc.) as keys and a hashmap of statistics as values. Here is a quick peek at what that map looks like:

  {:at {:cases xxx
        :tests xxx
        :tdouble xxx}
   :wien {:cases xxx
          :tdouble xxx}
     
     ...
  }
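Because every transform yields a map of this shape, partial results from different sources can be merged location by location. The following is a rough sketch of that idea, not the project's actual code: the name combine-stats and the numbers are invented for illustration.

```clojure
;; Hypothetical helper showing one way to combine stats maps.
(defn combine-stats
  "Merges several stats maps location by location. When two maps carry
  the same statistic for a location, the value from the later map wins."
  [& stats-maps]
  (apply merge-with merge stats-maps))

;; Usage with made-up numbers:
(combine-stats {:at {:cases 100} :wien {:cases 40}}
               {:at {:tests 2000}})
;; => {:at {:cases 100, :tests 2000}, :wien {:cases 40}}
```

Using merge-with merge keeps the statistics of a location from one source intact when another source only adds new statistics for that location.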

All calculated stats are combined and, together with an extracted timestamp, later sent to the export targets. Currently supported exports:

  • Google Sheets
  • JSON file dump
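One way to picture the load step is as a list of target functions that each receive the same payload. The sketch below assumes a payload of the form {:timestamp ... :stats ...}; that shape, the name export-all!, and the stand-in target are assumptions made for this example, and the project's real exporters may be organised differently.

```clojure
;; Hypothetical load step: every target is a function of one payload map.
(defn export-all!
  "Calls each target function with a payload holding the combined stats
  and the extracted timestamp. Returns the payload."
  [targets stats timestamp]
  (let [payload {:timestamp timestamp :stats stats}]
    (doseq [target targets]
      (target payload))
    payload))

;; Stand-in target that prints instead of writing to Google Sheets or a file.
(defn print-target [payload]
  (println "exporting" (count (:stats payload)) "locations at" (:timestamp payload)))

;; Usage with made-up data:
(export-all! [print-target] {:at {:cases 100}} "2020-04-01T12:00:00")
```

Keeping targets as plain functions means a new export only has to accept the payload map, without changes to the extract or transform steps.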

Setup

If you want to use the Google Sheets export, you'll need a Google Service Account. Here are instructions on how to create one. To grant your service account write access, add its "email address" as a collaborator directly within the sheet.

This project uses deps.edn. Run the scraper with clj -m core, or build an uberjar with clj -Auberjar.