GoogleCloudPlatform/covid-19-open-data

docs: understanding locations

chapmanjacobd opened this issue · 1 comments

Good day,

I'm trying to understand how place_id is used across the various files. I know that place_id is just an identifier, but I have run into some puzzling things. Before getting to my questions, I'll state what I believe about the data and how it is joined together. If any of these beliefs are wrong, please correct me:

  • google-research/open-covid-19-data was started before this repo
curl -sS https://api.github.com/repos/GoogleCloudPlatform/covid-19-open-data | grep created_at
  "created_at": "2020-07-23T23:43:51Z",
curl -sS https://api.github.com/repos/google-research/open-covid-19-data | grep created_at
  "created_at": "2020-05-21T03:35:01Z",

How does mobility.csv relate to Global_Mobility_Report.csv?

They seem to describe exactly the same thing...

But the number of distinct locations differs, so it seems like they are different data products entirely:

sqlite-utils memory Global_Mobility_Report.csv "select count(distinct place_id) from t1"
[{"count(distinct place_id)": 13249}]

sqlite-utils memory mobility.csv "select count(distinct location_key) from t1"
[{"count(distinct location_key)": 7351}]

The place_id values also don't line up with aggregated.csv:

xsv select place_id aggregated.csv | sort --unique > aggregated_place_ids.csv
xsv select place_id Global_Mobility_Report.csv | sort --unique > Global_Mobility_Report_place_ids.csv

combine aggregated_place_ids.csv not Global_Mobility_Report_place_ids.csv  | count
14283
combine Global_Mobility_Report_place_ids.csv not aggregated_place_ids.csv  | count
5913
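
For anyone without xsv or moreutils installed, the same comparison can be sketched in Python. This is only a minimal sketch: the two in-memory samples below are made-up stand-ins for aggregated.csv and Global_Mobility_Report.csv, and the IDs in them are hypothetical.

```python
import csv
import io


def distinct_values(csv_file, column):
    """Return the set of non-empty values found in one column of a CSV file object."""
    return {row[column] for row in csv.DictReader(csv_file) if row.get(column)}


# Made-up stand-ins for the real files; swap in open("aggregated.csv", newline="")
# and open("Global_Mobility_Report.csv", newline="") to run on the actual data.
aggregated_sample = io.StringIO(
    "place_id,location_key\n"
    "A,US\n"
    "B,US_CA\n"
    "C,FR\n"
)
mobility_report_sample = io.StringIO(
    "place_id,date\n"
    "B,2020-07-01\n"
    "C,2020-07-01\n"
    "D,2020-07-01\n"
    "D,2020-07-02\n"
)

aggregated_ids = distinct_values(aggregated_sample, "place_id")
report_ids = distinct_values(mobility_report_sample, "place_id")

# Set difference in each direction, like `combine a not b` above.
only_in_aggregated = aggregated_ids - report_ids
only_in_report = report_ids - aggregated_ids

print(len(only_in_aggregated), len(only_in_report))
```

Both differences being non-empty is what made me think neither file is a superset of the other.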

After reading through more code, I think I get it now:

https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/e2f6c1c0840fa1dc301ed798f6a624781b453c19/src/pipelines/mobility/google_mobility.py
https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/15e2bdd4b1c7a523a74f42b3ada89f3686dbc882/src/pipelines/mobility/config.yaml

"Global_Mobility_Report.csv" is a source dataset which joins with other data, via knowledge_graph.csv, to create "mobility.csv" and "aggregates.csv"