J535D165/CoronaWatchNL

GGD no longer publishing tests performed per week

Sikerdebaard opened this issue · 3 comments

The GGD has stopped publishing data on tests performed per week on the page below. This is unfortunate, because that data was both machine-readable and fairly complete.

https://ggdghor.nl/actueel-bericht/weekupdate-cijfers-coronatests-bij-de-ggden-2/

As a replacement, the GGD points to the RIVM website. Unfortunately, this means that to keep this data up to date we need to parse the HTML table on this page every Tuesday, sometime after 15:00:

https://www.rivm.nl/coronavirus-covid-19/actueel

There's no table available with data for all weeks, nor is there data on the total number of tests performed, unless we parse the epidemiological report PDF.

The code below will extract the relevant parameters from the RIVM website.

```python
import datetime

import pandas as pd

year = datetime.date.today().year

# get the first html table on the page
df = pd.read_html('https://www.rivm.nl/coronavirus-covid-19/actueel')[0]

df.columns = ['name', 'afg', 'voorg']

found_week = False  # set once the row containing the week number has been seen
most_recent_weeknum = 0
totaal_aantal_tests = -1
aantal_positief = -1
perc_positief = -1
for idx, row in df.iterrows():
    # cast to str so empty (NaN) or numeric cells do not break the string methods
    name = str(row['name']).lower()
    value = str(row['afg'])

    if 'week' in value:  # only start parsing once we encounter the word week in column afg
        most_recent_weeknum = value.strip().split(' ')[-1]  # only the numerical part of this cell, the week number
        found_week = True
        continue

    if not found_week:
        continue

    # drop the thousands separator, then turn the decimal comma into a point
    number = value.replace('.', '').replace(',', '.').replace('%', '')

    if 'totaal aantal' in name:  # total number of tests performed in most_recent_weeknum
        totaal_aantal_tests = int(number)

    if 'aantal positie' in name:  # number of positive tests in most_recent_weeknum
        aantal_positief = int(number)

    if 'percentage pos' in name:  # percentage positive tests in most_recent_weeknum
        perc_positief = float(number)

print(year, most_recent_weeknum, totaal_aantal_tests, aantal_positief, perc_positief)
```
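The number cleaning in the script above can also be pulled into a small helper, which makes it easier to test without hitting the RIVM page. This is only a sketch: `parse_dutch_number` is a hypothetical name, and the sample strings are made-up illustrations, not real figures.

```python
def parse_dutch_number(text):
    """Parse a Dutch-formatted number: '.' groups thousands, ',' marks decimals."""
    # Remove the percent sign and thousands separators, then use a decimal point.
    cleaned = str(text).strip().replace('%', '').replace('.', '').replace(',', '.')
    return float(cleaned)


# Illustrative inputs only (not taken from the RIVM table).
print(parse_dutch_number('180.123'))
print(parse_dutch_number('2,5%'))
```

Counts can then be wrapped in `int(...)`, while the percentage stays a float.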

Here's another method, which extracts this data from the coronadashboard data blob. This is probably more reliable than scraping it from the RIVM website.

```python
import datetime
import json
import zipfile
from io import BytesIO

import pandas as pd
import requests

dashboard_datablob = 'https://coronadashboard.rijksoverheid.nl/latest-data.zip'

# download the whole archive into memory and fail early on an HTTP error
r = requests.get(dashboard_datablob)
r.raise_for_status()
z = zipfile.ZipFile(BytesIO(r.content))

with z.open('json/NL.json') as fh:
    nl = json.load(fh)

rows = []
for week in nl['ggd']['values']:
    rows.append({
        # note: %U counts weeks from the first Sunday of the year, which is not the ISO week number
        'year-week': datetime.datetime.fromtimestamp(int(week['week_end_unix'])).strftime('%Y-%U'),
        'tested_pos': week['infected'],
        'tested_total': week['tested_total'],
        'percent_pos': week['infected_percentage'],
    })

df = pd.DataFrame(rows).set_index('year-week')
print(df)
```
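One caveat about the `'%Y-%U'` label: `%U` numbers weeks from the first Sunday of the year, so it can differ from the ISO 8601 week number. If ISO weeks are wanted, something like the sketch below might work. `iso_year_week` is a hypothetical helper, and interpreting `week_end_unix` in UTC is an assumption on my part, since I haven't checked which time zone the dashboard uses.

```python
import datetime


def iso_year_week(unix_ts):
    """Return 'YYYY-WW' using ISO 8601 week numbering (weeks start on Monday)."""
    # Assumption: interpret the timestamp in UTC so the result does not
    # depend on the time zone of the machine running the script.
    dt = datetime.datetime.fromtimestamp(int(unix_ts), tz=datetime.timezone.utc)
    iso_year, iso_week, _ = dt.isocalendar()
    return f'{iso_year}-{iso_week:02d}'


# 1609632000 is 2021-01-03 00:00 UTC, a Sunday that still belongs to ISO week 53 of 2020.
print(iso_year_week(1609632000))  # → 2020-53
```

With `%Y-%U` that same date would be labelled `2021-01`, so the two schemes diverge mostly around the turn of the year.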

Great scripts! It might be a good idea to make a new dataset with these test counts, because they differ from the GGD numbers and the virological daily reports (virologische dagstaten, dataset 3...). RIVM is about to publish two new open datasets this week. As far as I know, a dataset with test counts isn't among them.
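Since the RIVM page only shows the most recent weeks, such a dataset would probably have to be built up by appending each week's figures as they appear. A rough sketch of that idea, using only the standard library: the function name `append_week`, the column names in `FIELDS`, and the file layout are all placeholders, not an existing convention in this repo.

```python
import csv
import os

# Hypothetical column layout for the new dataset.
FIELDS = ['year', 'week', 'tested_total', 'tested_pos', 'percent_pos']


def append_week(csv_path, row):
    """Append one week's figures to csv_path unless that year/week is already present."""
    existing = set()
    if os.path.exists(csv_path):
        with open(csv_path, newline='') as fh:
            existing = {(r['year'], r['week']) for r in csv.DictReader(fh)}

    if (str(row['year']), str(row['week'])) in existing:
        return False  # already recorded, nothing is written

    # Write the header only when starting a new (or empty) file.
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, 'a', newline='') as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return True
```

Running this once a week with the values printed by the scraper would keep one row per week, and re-running it on the same week is a no-op.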

Hopefully, I can find some time to work on this tomorrow.

Very interesting, many thanks, I did not know about the coronadashboard.rijksoverheid.nl/latest-data.zip yet!