GGD no longer publishing tests performed per week
Sikerdebaard opened this issue · 3 comments
The GGD has stopped publishing data on tests performed per week on the page below. This is unfortunate because the data was quite machine-readable and quite complete.
https://ggdghor.nl/actueel-bericht/weekupdate-cijfers-coronatests-bij-de-ggden-2/
As a replacement, the GGD points to the RIVM website. Unfortunately, this means that to keep this data up to date we need to parse the HTML table from this page every Tuesday, sometime after 15:00:
https://www.rivm.nl/coronavirus-covid-19/actueel
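If the scrape ends up running from a scheduler, a small guard could skip runs outside that weekly window. This is only a sketch: `in_update_window` is a hypothetical helper, and it assumes the datetime passed in is already in Dutch local time and that 15:00 is a safe lower bound.

```python
import datetime

def in_update_window(now):
    """Return True on Tuesdays from 15:00 onwards.

    Assumes `now` is a datetime expressed in Dutch local time (hypothetical helper).
    """
    is_tuesday = now.weekday() == 1  # Monday is 0, so Tuesday is 1
    return is_tuesday and now.hour >= 15

# Illustrative call with the current local time of the machine running the script
print(in_update_window(datetime.datetime.now()))
```

Running the check first keeps the script from re-reading last week's table before the page has been refreshed.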
There is no table with data for all weeks, nor is there data on the total number of tests performed, unless we parse the epidemiological report PDF.
The code below will extract the relevant parameters from the RIVM website.
```python
import datetime

import pandas as pd

year = datetime.date.today().year
# Take the first HTML table on the page
df = pd.read_html('https://www.rivm.nl/coronavirus-covid-19/actueel')[0]
df.columns = ['name', 'afg', 'voorg']

startidx = None  # index of the row holding the week number; None until found
most_recent_weeknum = 0
totaal_aantal_tests = -1
aantal_positief = -1
perc_positief = -1

for idx, row in df.iterrows():
    name = str(row['name']).lower()
    afg = str(row['afg']).strip()
    if 'week' in afg:  # only start parsing once we encounter the word "week" in column afg
        most_recent_weeknum = afg.split(' ')[-1]  # keep only the numerical part of the cell: the week number
        startidx = idx
        continue
    if startidx is None:
        continue
    if 'totaal aantal' in name:  # total number of tests performed in most_recent_weeknum
        totaal_aantal_tests = int(afg.replace('.', '').replace(',', '.'))
    if 'aantal positie' in name:  # number of positive tests in most_recent_weeknum
        aantal_positief = int(afg.replace('.', '').replace(',', '.'))
    if 'percentage pos' in name:  # percentage of positive tests in most_recent_weeknum
        perc_positief = float(afg.replace('.', '').replace(',', '.').replace('%', ''))

print(year, most_recent_weeknum, totaal_aantal_tests, aantal_positief, perc_positief)
```
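The chained `replace` calls are there because the cells appear to use Dutch number formatting (a dot as thousands separator, a comma as decimal separator). A minimal sketch of that conversion as a reusable function; `parse_dutch_number` is a hypothetical name, and the sample strings are made up for illustration rather than taken from the RIVM page.

```python
def parse_dutch_number(text):
    """Convert a Dutch-formatted number string to int or float (hypothetical helper).

    Dots are treated as thousands separators, a comma as the decimal separator,
    and a trailing percent sign is dropped.
    """
    raw = str(text).strip().replace('%', '').strip()
    cleaned = raw.replace('.', '').replace(',', '.')
    # Without a decimal comma the value is a whole number
    return float(cleaned) if ',' in raw else int(cleaned)

# Made-up sample values, not actual RIVM figures
print(parse_dutch_number('180.097'))  # 180097
print(parse_dutch_number('2,5%'))     # 2.5
```

Using one function for all three cells would also make a format change on the page fail loudly with a `ValueError` instead of silently leaving the `-1` placeholders in place.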
Here's another method to extract this data, this time from the coronadashboard data blob. This is probably more reliable than scraping the RIVM website.
```python
import datetime
import json
import zipfile
from io import BytesIO

import pandas as pd
import requests

dashboard_datablob = 'https://coronadashboard.rijksoverheid.nl/latest-data.zip'
r = requests.get(dashboard_datablob)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(BytesIO(r.content))  # open the zip archive from memory
with z.open('json/NL.json') as fh:
    nl = json.load(fh)

rows = []
for week in nl['ggd']['values']:
    rows.append({
        # label each row with the year and week number of the week's end date
        'year-week': datetime.datetime.fromtimestamp(int(week['week_end_unix'])).strftime('%Y-%U'),
        'tested_pos': week['infected'],
        'tested_total': week['tested_total'],
        'percent_pos': week['infected_percentage'],
    })

pd.DataFrame(rows).set_index('year-week')
```
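One caveat on the `'%Y-%U'` label: `%U` counts weeks starting on Sunday, which is not the same as ISO 8601 week numbering and can give a different label around the turn of the year. For Sunday 2021-01-03, for example, `%Y-%U` gives `2021-01`, while the ISO label is `2020-53`. If ISO weeks are what's wanted, something along these lines might work. It is a sketch only: `iso_year_week` is a hypothetical helper, and it interprets the timestamp as UTC, whereas the script above uses the machine's local time.

```python
import datetime

def iso_year_week(unix_ts):
    """Return an ISO 8601 'YYYY-WW' label for a Unix timestamp, read as UTC (hypothetical helper)."""
    dt = datetime.datetime.fromtimestamp(int(unix_ts), tz=datetime.timezone.utc)
    iso = dt.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    return f"{iso[0]}-{iso[1]:02d}"

# Illustrative timestamp: Sunday 2021-01-03 00:00 UTC
ts = datetime.datetime(2021, 1, 3, tzinfo=datetime.timezone.utc).timestamp()
print(iso_year_week(ts))  # 2020-53
```

Swapping this in for the `strftime` call would keep the rest of the loop unchanged.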
Great scripts! It might be a good idea to make a new dataset with these test counts because it differs from GGD and virologische dagstaten (dataset 3...). RIVM is about to publish two new open datasets this week. As far as I know, a dataset with test counts isn't among them.
Hopefully, I can find some time to work on this tomorrow.
Very interesting, many thanks, I did not know about the coronadashboard.rijksoverheid.nl/latest-data.zip yet!