jgehrcke/covid-19-germany-gae

Discrepancy in Deaths - Gehrcke vs. Risklayer

bdiv opened this issue · 5 comments

bdiv commented

Hello Dr. Gehrcke,
sorry to bother you again.

In my data I noticed a discrepancy between your aggregation and the Risklayer Chronik.

Data for all of Germany

Date RL Deaths Gehrcke
05-02-21 61961 61651
06-02-21 62128 61955
07-02-21 62597 62126
08-02-21 62630 62601

Image References:
image
Source
image
Source

This time there is no constant temporal shift.

Looking at your code I'm not quite sure how you generate the data and where this difference could come from . Do you use the Main Page and simply append the data from the current day to your files? Would you share the source URL for clarity?

If you take the "current" values from "Haupt" then the discrepancy probably comes from the time when counts for a day are finalised compared to when you sample them. Wouldn't the "Chronik" be a more stable ressource in that case?

Thanks! I didn't even get to reading this carefully, but I already want to quickly respond to

Do you use the Main Page and simply append the data from the current day to your files? Would you share the source URL for clarity?

This is currently being used:

export RISKLAYER_HISTORY_DEATHS_CSV_URL="https://docs.google.com/spreadsheets/d/e/2PACX-1vRTkw2_oVkpZ9-WQk-BRf4Pgam9aRmH62uCUr9FiY0Uxv5ixtDhwSsecc_QMrfrD4ncHsCAua2f0TJh/pub?output=csv"
bdiv commented

Thank you for providing the Link! I looked into it a bit and these are my findings:

Time shifted values

image
image

  • leaving out the decrease in deaths from 6th to 7th of February since it's probably due to corrections in risklayer
  • the 9th is missing in deaths-rl-crowdsource-by-ags.csv
  • values are generally shifted backwards one day
  • interestingly this doesn't happen in cases-rl-crowdsource-by-ags.csv (see screenshot below) different source format perhaps since the extraction code for deaths and cases are the same if I recall correctly?
    image

Discrepancy between the source of deaths-rl-crowdsource-by-ags.csv and the Chronik of the Risklayer Sheet

  • i can align both datasources using my code 1:
    image
  • comparison to the Risklayer-Chronik:
    image

Conclusion

Appendix

1: Alignment Code (sorry not that pretty but it does it's job)

import pandas as pd
gehrcke = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRTkw2_oVkpZ9-WQk-BRf4Pgam9aRmH62uCUr9FiY0Uxv5ixtDhwSsecc_QMrfrD4ncHsCAua2f0TJh/pub?output=csv")
gehrcke.head(1)[["GEN","AGS","05.02.2021","06.02.2021","07.02.2021","08.02.2021","09.02.2021","10.02.2021","current","current_time"]]
g = gehrcke[["05.02.2021","06.02.2021","07.02.2021","08.02.2021","09.02.2021","10.02.2021"]]
rl = pd.read_csv("https://docs.google.com/spreadsheets/d/1wg-s4_Lz2Stil6spQEYFdZaBEp8nWW26gVyfHqvcl8s/gviz/tq?tqx=out:csv&sheet=Chronik&range=AW1:CB437")
rl = rl.drop(rl.index[0:18]) # drop the aggregated state rows, leave only counties
rl = list(rl[["05.02","06.02","07.02","08.02","09.02","10.02"]].sum().values) # aggregate values for germany
t = pd.DataFrame(dict(  dates = g.sum().index,
                        gehrcke = g.sum().values,
                        rl = rl ))
t["diff"] = t.gehrcke - t.rl
print(t)

Still didn't get time to process this really, but it looks like you found that there's a discrepancy between the actual Google sheet you're looking at and and the Google sheet CSV export I am using. I remember that Tagesspiegel is using the same Google sheet CSV export. Maybe the sheet behind the CSV export is a different one, and they are both manually maintained / "kept in sync"?

Could you maybe also provide the source url for cases-rl-crowdsource-by-ags.csv ?

Sorry for replying to late to that specific question. This is

export RISKLAYER_HISTORY_CASES_CSV_URL="https://docs.google.com/spreadsheets/d/e/2PACX-1vTiKkV3Iy-BsShsK3DSUeO9Gpen7VwsXM_haCOc8avj1PeoCIWqL4Os-Uza3jWMEUgmTrEizEV-Itq5/pub?output=csv"

Did we make any further progress here understanding where the differences come from between the CSV exports used here, and the web view of the RL Google Sheet? Hm.

@bdiv would you say that these differences are cosmetic and negligible or significant? Honest opinion, please! :)