docs: understanding locations
chapmanjacobd opened this issue · 1 comments
Good day,
I'm trying to understand the context of place_id in various files. I know that place_id is just an identifier, but I have encountered some puzzling things. Before I dive into my questions, I will start light by stating my beliefs about the data and how it is joined together. If any of these beliefs are incorrect, please correct them:
- google-research/open-covid-19-data was started before this repo
curl -sS https://api.github.com/repos/GoogleCloudPlatform/covid-19-open-data | grep created_at
"created_at": "2020-07-23T23:43:51Z",
curl -sS https://api.github.com/repos/google-research/open-covid-19-data | grep created_at
"created_at": "2020-05-21T03:35:01Z",
- The use of place_id was initially driven by the search_trends_symptoms dataset https://github.com/google-research/open-covid-19-data/search?q=place_id
- Not all place_ids in mobility.csv are expected to be found in aggregated.csv
- Not all place_ids in aggregated.csv are expected to be found in mobility.csv
How does mobility.csv relate to Global_Mobility_Report.csv?
They seem to be talking about exactly the same thing...
- https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/main/docs/table-mobility.md
- https://www.google.com/covid19/mobility/data_documentation.html
But it seems like they are different data products entirely:
sqlite-utils memory Global_Mobility_Report.csv "select count(distinct place_id) from t1"
[{"count(distinct place_id)": 13249}]
sqlite-utils memory mobility.csv "select count(distinct location_key) from t1"
[{"count(distinct location_key)": 7351}]
The same kind of mismatch shows up when comparing with aggregated.csv:
xsv select place_id aggregated.csv | sort --unique > aggregated_place_ids.csv
xsv select place_id Global_Mobility_Report.csv | sort --unique > Global_Mobility_Report_place_ids.csv
combine aggregated_place_ids.csv not Global_Mobility_Report_place_ids.csv | count
14283
combine Global_Mobility_Report_place_ids.csv not aggregated_place_ids.csv | count
5913
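For anyone without xsv and combine installed, the same comparison can be sketched in Python with only the standard library. This is a rough equivalent, not what I actually ran; the function names are made up for illustration, and it assumes both files have a header row containing a place_id column.

```python
import csv
from typing import Iterable, Set, Tuple


def distinct_values(rows: Iterable[str], column: str) -> Set[str]:
    """Collect the distinct non-empty values of one CSV column."""
    reader = csv.DictReader(rows)
    return {row[column] for row in reader if row.get(column)}


def compare_ids(left: Set[str], right: Set[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (only in left, only in right, in both)."""
    return left - right, right - left, left & right


def compare_files(path_a: str, path_b: str, column: str = "place_id") -> None:
    """Print how many identifiers are unique to each file and how many are shared."""
    with open(path_a, newline="", encoding="utf-8") as fa:
        ids_a = distinct_values(fa, column)
    with open(path_b, newline="", encoding="utf-8") as fb:
        ids_b = distinct_values(fb, column)
    only_a, only_b, both = compare_ids(ids_a, ids_b)
    print(f"only in {path_a}: {len(only_a)}")
    print(f"only in {path_b}: {len(only_b)}")
    print(f"in both: {len(both)}")
```

Calling `compare_files("aggregated.csv", "Global_Mobility_Report.csv")` should report the two one-sided counts plus the size of the overlap, which the shell version above does not show.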
After reading through more code, I think I get it now:
https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/e2f6c1c0840fa1dc301ed798f6a624781b453c19/src/pipelines/mobility/google_mobility.py
https://github.com/GoogleCloudPlatform/covid-19-open-data/blob/15e2bdd4b1c7a523a74f42b3ada89f3686dbc882/src/pipelines/mobility/config.yaml
"Global_Mobility_Report.csv" is a source dataset that is joined with other data, via knowledge_graph.csv, to create "mobility.csv" and "aggregated.csv".
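A minimal sketch of how I picture that join, which would also explain the mismatched counts: source rows whose place_id has no entry in the lookup table are simply dropped. This is not the code from google_mobility.py. The function names are hypothetical, and I am assuming a lookup table with place_id and location_key columns, which may not match the real schema.

```python
import csv
from typing import Dict, Iterable, Iterator


def build_lookup(rows: Iterable[str]) -> Dict[str, str]:
    """Map place_id -> location_key from a lookup CSV (column names are assumed)."""
    reader = csv.DictReader(rows)
    return {
        row["place_id"]: row["location_key"]
        for row in reader
        if row.get("place_id") and row.get("location_key")
    }


def attach_location_key(source_rows: Iterable[str], lookup: Dict[str, str]) -> Iterator[dict]:
    """Yield source rows that have a known place_id, with location_key added.

    Rows whose place_id is missing from the lookup are skipped, so the output
    can contain fewer distinct locations than the source file.
    """
    for row in csv.DictReader(source_rows):
        key = lookup.get(row.get("place_id", ""))
        if key is None:
            continue  # no match in the lookup table: this row is dropped
        out = dict(row)
        out["location_key"] = key
        yield out
```

Under that assumption, a place_id only reaches the output if the lookup table knows about it, and the lookup table can list locations that have no mobility data at all, so differences in both directions are expected.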