A fully-featured multi-source data pipeline for continuously extracting knowledge from COVID-19 data.
- Contamination figures
- Vaccination figures
- Death figures
- COVID-19-related news (Google News, Twitter)
Screenshots: Live contaminations map + Latest news, Last 7 days news, France 3-weeks live map (Kibana Canvas), Live vaccinations map.
This project was built over 4 days as part of an MSc hackathon at ETNA, a French computer science school.
The goals were both to experiment with and prototype a big data pipeline, and to contribute to an open-source project.
Below, you'll find the procedure to process COVID-related files and news into the Pandemic Knowledge database (Elasticsearch).
The process is scheduled to run every 24 hours so the files are refreshed and the latest news is retrieved.
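Since the scheduling is handled by Prefect (1.x), a minimal flow with a 24-hour schedule looks roughly like the sketch below; the flow and task names are placeholders, not the project's actual flow code.

```python
from datetime import timedelta

from prefect import Flow, task
from prefect.schedules import IntervalSchedule


@task
def refresh_data():
    # Placeholder task: in the real flows this would delete the old index
    # and re-inject the latest figures/news.
    print("Refreshing COVID-19 data...")


# Trigger a run every 24 hours.
schedule = IntervalSchedule(interval=timedelta(hours=24))

with Flow("pandemic-knowledge-refresh-example", schedule=schedule) as flow:
    refresh_data()

if __name__ == "__main__":
    # Runs locally on the schedule; registration with the Prefect server
    # is shown further down in this README.
    flow.run()
```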
Running this project on your local computer? Just copy the .env.example file:
cp .env.example .env
Open this .env file and edit the password-related variables.
Raise your host's ulimits for Elasticsearch to handle high I/O:
sudo sysctl -w vm.max_map_count=500000
Then:
docker-compose -f create-certs.yml run --rm create_certs
docker-compose up -d es01 es02 es03 kibana
Create a ~/.prefect/config.toml file with the following content:
# debug mode
debug = true
# base configuration directory (typically you won't change this!)
home_dir = "~/.prefect"
backend = "server"
[server]
host = "http://172.17.0.1"
port = "4200"
host_port = "4200"
endpoint = "${server.host}:${server.port}"
Run Prefect:
docker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui
We need to create a tenant. Execute the following on your host:
pip3 install prefect
prefect backend server
prefect server create-tenant --name default --slug default
Access the web UI at localhost:8081
Agents are services that run your scheduled flows.
- Open and optionally edit the agent/config.toml file.
- Let's instantiate 3 workers:
docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent
ℹ️ You can run the agent on a machine other than the one hosting the Prefect server. Edit the agent/config.toml file accordingly.
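Agents only execute flows that have been registered with the Prefect server. Below is a minimal registration sketch for Prefect 1.x; the project name pandemic-knowledge is an assumption for this example, not necessarily the name used by the project's flows.

```python
from prefect import Client, Flow, task


@task
def say_hello():
    print("This run was dispatched to an agent")


# Create the project once; the name is illustrative.
Client().create_project(project_name="pandemic-knowledge")

with Flow("hello-agent") as flow:
    say_hello()

# Register the flow so scheduled runs are handed to the agents.
flow.register(project_name="pandemic-knowledge")
```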
Injection scripts are scheduled in Prefect so they automatically re-inject the latest data and news (delete + inject), as sketched below.
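For reference, here is a minimal sketch of this delete + inject pattern with the 7.x elasticsearch Python client; the index name, credentials and sample document are placeholders, not the project's actual flow code.

```python
from elasticsearch import Elasticsearch, helpers

# Placeholder connection settings: match them to your .env values.
es = Elasticsearch(
    "https://localhost:9200",
    http_auth=("elastic", "changeme"),
    verify_certs=False,
)

INDEX = "contamination_owid_example"  # illustrative index name

# Delete the previous index (ignore the 404 if it does not exist yet)...
es.indices.delete(index=INDEX, ignore=[404])

# ...then bulk-inject the fresh documents.
docs = [
    {"_index": INDEX, "_source": {"location": "France", "new_cases": 1234}},
]
helpers.bulk(es, docs)
```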
There are several data sources supported by Pandemic Knowledge:
- Our World In Data (used by Google)
  - docker-compose slug: insert_owid
  - MinIO bucket: contamination-owid
  - Format: CSV
- OpenCovid19-Fr
  - docker-compose slug: insert_france
  - Format: CSV (downloaded from the Internet)
- Public Health France - Virological test results (official source)
  - docker-compose slug: insert_france_virtests
  - Format: CSV (downloaded from the Internet)
- Start MinIO and import your files according to the buckets mentioned above. For Our World In Data, create the contamination-owid bucket and import the CSV file into it (a MinIO upload sketch follows this list).

  docker-compose up -d minio

  MinIO is available at localhost:9000

- Download the dependencies and start the injection service of your choice. For instance:

  pip3 install -r ./flow/requirements.txt
  docker-compose -f insert.docker-compose.yml up --build insert_owid

- In Kibana, create an index pattern contamination_owid_*

- Once injected, we recommend adjusting the number of replicas in the Dev Tools console:

  PUT /contamination_owid_*/_settings
  { "index" : { "number_of_replicas" : "2" } }

- Start making your dashboards in Kibana!
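As mentioned in the first step, here is a sketch of creating the contamination-owid bucket and uploading the OWID CSV with the MinIO Python SDK; the access keys and the file name are placeholders, so use the values from your .env.

```python
from minio import Minio

# Placeholder credentials: use the MinIO keys defined in your .env file.
client = Minio(
    "localhost:9000",
    access_key="minio_access_key",
    secret_key="minio_secret_key",
    secure=False,
)

bucket = "contamination-owid"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Upload the Our World In Data CSV (local path is illustrative).
client.fput_object(bucket, "owid-covid-data.csv", "./owid-covid-data.csv")
```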
There are two sources for news:
- Google News (Elasticsearch index: news_googlenews)
- Twitter (Elasticsearch index: news_tweets)
- Run the Google News crawler:

  docker-compose -f crawl.docker-compose.yml up --build crawl_google_news # and/or crawl_tweets

- In Kibana, create a news_* index pattern
- Edit the index pattern fields:
Name | Type | Format |
---|---|---|
img | string | Url (Type: Image, empty URL template) |
link | string | Url |
- Create your visualisation
Browse through the news with our web application.
- Make sure you've accepted the self-signed certificate of Elasticsearch at https://localhost:9200
- Start up the app:

  docker-compose -f news_app/docker-compose.yml up --build -d

- Discover the app at localhost:8080
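To check directly what the app reads from Elasticsearch, here is a minimal query sketch with the 7.x elasticsearch Python client; the credentials are placeholders, and the link field comes from the index pattern table above.

```python
from elasticsearch import Elasticsearch

# Placeholder credentials; the self-signed certificate is not verified here.
es = Elasticsearch(
    "https://localhost:9200",
    http_auth=("elastic", "changeme"),
    verify_certs=False,
)

# Fetch 10 news documents across both news indices.
response = es.search(index="news_*", body={"query": {"match_all": {}}, "size": 10})

for hit in response["hits"]["hits"]:
    print(hit["_index"], hit["_source"].get("link"))
```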
TODOs
Possible improvements:
- Using Dask to parallelize the processing of CSV lines in batches of 1,000 (see the sketch below)
- Removing indices only when the source processing is successful (add the new index, then remove the old one)
- Removing indices only when crawling is successful (add the new index, then remove the old one)
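A sketch of the first idea, using dask.bag to process a CSV in 1,000-line batches; the file name and the per-batch handling are placeholders.

```python
import dask.bag as db


def handle_batch(lines):
    # Placeholder batch processing: in the real pipeline this could parse the
    # rows and bulk-index them into Elasticsearch, 1,000 at a time.
    return [line.strip().split(",") for line in lines]


# Illustrative file name; read the lines, then split them into 1,000-line partitions.
with open("owid-covid-data.csv") as f:
    lines = f.readlines()

batches = db.from_sequence(lines, partition_size=1000)

# map_partitions applies handle_batch to each partition in parallel.
rows = batches.map_partitions(handle_batch).compute()
print(f"Parsed {len(rows)} rows")
```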
Useful commands
To stop everything:
docker-compose down
docker-compose -f agent/docker-compose.yml down
docker-compose -f insert.docker-compose.yml down
docker-compose -f crawl.docker-compose.yml down
To start each service, step by step:
docker-compose up -d es01 es02 es03 kibana
docker-compose up -d minio
docker-compose up -d prefect_postgres prefect_hasura prefect_graphql prefect_towel prefect_apollo prefect_ui
docker-compose -f agent/docker-compose.yml up -d --build --scale agent=3 agent