Discrepancy in Deaths - Gehrcke vs. Risklayer
bdiv opened this issue · 5 comments
Hello Dr. Gehrcke,
sorry to bother you again.
In my data I noticed a discrepancy between your aggregation and the Risklayer Chronik.
Data for all of Germany
Date | RL Deaths | Gehrcke |
---|---|---|
05-02-21 | 61961 | 61651 |
06-02-21 | 62128 | 61955 |
07-02-21 | 62597 | 62126 |
08-02-21 | 62630 | 62601 |
Image References:
Source
Source
This time there is no constant temporal shift.
Looking at your code I'm not quite sure how you generate the data and where this difference could come from . Do you use the Main Page and simply append the data from the current day to your files? Would you share the source URL for clarity?
If you take the "current" values from "Haupt" then the discrepancy probably comes from the time when counts for a day are finalised compared to when you sample them. Wouldn't the "Chronik" be a more stable ressource in that case?
Thanks! I didn't even get to reading this carefully, but I already want to quickly respond to
Do you use the Main Page and simply append the data from the current day to your files? Would you share the source URL for clarity?
This is currently being used:
export RISKLAYER_HISTORY_DEATHS_CSV_URL="https://docs.google.com/spreadsheets/d/e/2PACX-1vRTkw2_oVkpZ9-WQk-BRf4Pgam9aRmH62uCUr9FiY0Uxv5ixtDhwSsecc_QMrfrD4ncHsCAua2f0TJh/pub?output=csv"
Thank you for providing the Link! I looked into it a bit and these are my findings:
Time shifted values
- leaving out the decrease in deaths from 6th to 7th of February since it's probably due to corrections in risklayer
- the 9th is missing in deaths-rl-crowdsource-by-ags.csv
- values are generally shifted backwards one day
- interestingly this doesn't happen in cases-rl-crowdsource-by-ags.csv (see screenshot below) different source format perhaps since the extraction code for deaths and cases are the same if I recall correctly?
Discrepancy between the source of deaths-rl-crowdsource-by-ags.csv and the Chronik of the Risklayer Sheet
- i can align both datasources using my code 1:
- comparison to the Risklayer-Chronik:
Conclusion
- There is still something weird going on with the dates, just that this time it might be related to a difference in format?
- If both Risklayer-Chronik and [the source]](https://docs.google.com/spreadsheets/d/e/2PACX-1vRTkw2_oVkpZ9-WQk-BRf4Pgam9aRmH62uCUr9FiY0Uxv5ixtDhwSsecc_QMrfrD4ncHsCAua2f0TJh/pub?output=csv) for deaths-rl-crowdsource-by-ags.csv come from risklayer, then might the discrepancy be related to a difference in timing for the snapshot?
- Could you maybe also provide the source url for cases-rl-crowdsource-by-ags.csv ?
EDIT: - I have no detailed analysis but there is a difference in the cases as well: vgl
2302051
vs2302160
for 9th of Feb - The difference here (109) stems from the cities and counties below:
- it stands to reason that the same happens with the deaths
- I changed the range in the url from
range=AW1:CB437
torange=AW1:CB420
but the result still shows a difference:
Appendix
1: Alignment Code (sorry not that pretty but it does it's job)
import pandas as pd
gehrcke = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRTkw2_oVkpZ9-WQk-BRf4Pgam9aRmH62uCUr9FiY0Uxv5ixtDhwSsecc_QMrfrD4ncHsCAua2f0TJh/pub?output=csv")
gehrcke.head(1)[["GEN","AGS","05.02.2021","06.02.2021","07.02.2021","08.02.2021","09.02.2021","10.02.2021","current","current_time"]]
g = gehrcke[["05.02.2021","06.02.2021","07.02.2021","08.02.2021","09.02.2021","10.02.2021"]]
rl = pd.read_csv("https://docs.google.com/spreadsheets/d/1wg-s4_Lz2Stil6spQEYFdZaBEp8nWW26gVyfHqvcl8s/gviz/tq?tqx=out:csv&sheet=Chronik&range=AW1:CB437")
rl = rl.drop(rl.index[0:18]) # drop the aggregated state rows, leave only counties
rl = list(rl[["05.02","06.02","07.02","08.02","09.02","10.02"]].sum().values) # aggregate values for germany
t = pd.DataFrame(dict( dates = g.sum().index,
gehrcke = g.sum().values,
rl = rl ))
t["diff"] = t.gehrcke - t.rl
print(t)
Still didn't get time to process this really, but it looks like you found that there's a discrepancy between the actual Google sheet you're looking at and and the Google sheet CSV export I am using. I remember that Tagesspiegel is using the same Google sheet CSV export. Maybe the sheet behind the CSV export is a different one, and they are both manually maintained / "kept in sync"?
Could you maybe also provide the source url for cases-rl-crowdsource-by-ags.csv ?
Sorry for replying to late to that specific question. This is
export RISKLAYER_HISTORY_CASES_CSV_URL="https://docs.google.com/spreadsheets/d/e/2PACX-1vTiKkV3Iy-BsShsK3DSUeO9Gpen7VwsXM_haCOc8avj1PeoCIWqL4Os-Uza3jWMEUgmTrEizEV-Itq5/pub?output=csv"