Automatically calculate country codes per quadkey & remove country_iso flag
cholmes opened this issue · 2 comments
It seems like it should be possible to automatically include a country_iso to substantially speed up the query. The current method of having the user supply it is potentially error prone, and annoying.
The idea would be to calculate the list of country_iso values for every single quadkey. This would have to be a list, since quadkeys can cross countries, and big ones could have a hundred or more countries in them. But most should be one or a handful of countries, which will most always speed up the query.
There are 16 million quadkeys at level 12, but many are likely in the ocean. We likely could use a quadkey at level 10 or even 8, as having a couple more hive partitions to help wouldn't still make it worth it.
So I think the main thing would be to make a script that generates a list of country iso codes for every quadkey. Then store that as a parquet file, and if it's not too big we could likely just include it in the open_buildings package.
If we had this then we could remove the country_iso flag, as we'd be able to always use a hive partition.
@cholmes is it worth thinking about using the dataset-links.csv here https://github.com/microsoft/GlobalMLBuildingFootprints#what-does-the-data-include
We could cross ref that with another list that has the iso-2 codes perhaps?
Interesting - yeah, that dataset will be perfect for querying the microsoft buildings as in #27 as it doesn't use iso-2 code, it uses the region name. So for that we need the lookup from quadkey to region name. I'm curious what that one does if the quadkey overlaps two regions - my guess is it just uses the centroid as they use region for the hive column so you wouldn't really want too. It's at zoom 9 I think, but good to know that it's only 6.3mb.
I think it may make sense to start mostly from scratch, as I think we want all possible countries (or regions) that are in a quadkey. Starting at zoom 9 probably makes good sense. I think the ideal is columns for iso-2, region and quadkey. I don't think it should be too hard to generate this with a script - just iterate through every quadkey and if it overlaps with one or more countries then add a row with the quadkey and the country codes. It should correspond exactly to whatever country file was used to create the data. That reminds me I should upload the one I've been using to source.coop - I just made it from overture using their recommended way in the docs.