OvertureMaps/data

Tokyo Buildings missing from January release?

marklit opened this issue · 6 comments

I've downloaded the 231 GB of building data for January. I'm trying to extract the Tokyo building footprints, but there don't appear to be any in this release. I believe Japanese building data was in Overture's previous releases, so I'm not sure why it's missing now.

$ cd /mnt/f/gis/Global/overture/2024_01/theme\=buildings/type\=building/
$ ~/duckdb /mnt/f/gis/Asia/Japan/tokyo_buildings.duckdb
INSTALL spatial; -- ST_GEOMFROMWKB needs DuckDB's spatial extension
LOAD spatial;

CREATE OR REPLACE TABLE tokyo AS
    SELECT * EXCLUDE (geometry),
           ST_GEOMFROMWKB(geometry) geom
    FROM READ_PARQUET('part*.zstd.parquet',
                      hive_partitioning=1)
    WHERE id LIKE '0842f5a%' OR id LIKE '0842f5b%';

select count(*) from tokyo;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            0 │
└──────────────┘

I checked that I can filter using an H3 prefix on the id field without issue.

SELECT id
FROM READ_PARQUET('part*.zstd.parquet',hive_partitioning=1)
WHERE id LIKE '08ba0085884%' LIMIT 1;
┌──────────────────────────────────┐
│                id                │
│             varchar              │
├──────────────────────────────────┤
│ 08ba0085884a4fff0200ab2e3ad3f37c │
└──────────────────────────────────┘

Hi @marklit - I haven't made a map of the January data yet otherwise I'd just zoom in and see if buildings are in Tokyo.

Instead I grabbed a bounding box in the Tokyo area and ran this query on Athena:

SELECT
    COUNT(1)
FROM overture_2024_01_17_alpha_0
WHERE
    theme = 'buildings'
    AND type = 'building'
    AND ST_INTERSECTS(
        ST_GEOMFROMBINARY(GEOMETRY),
        ST_ENVELOPE(
            ST_MULTIPOINT(
                ARRAY[
                    ST_POINT(139.602942, 35.575015),
                    ST_POINT(139.843876, 35.800503)                
                ]
            )
        )
    )

which returns a count of 1,179,376. Looks like the Tokyo buildings are there. If there's an area where data is missing, please let us know.

Separately, I was interested in those ID prefixes you were looking at, so I ran this query, adjusting it to read from each Overture release: January, December, November, and October.

SELECT
    id,
    ST_ASTEXT(ST_GEOMFROMBINARY(GEOMETRY))
FROM overture_2024_01_17_alpha_0
WHERE
    theme = 'buildings'
    AND type = 'building'
    AND (
        id LIKE '0842f5a%'
        OR id LIKE '0842f5b%'
    )
LIMIT
    10

Those ID prefixes are only present in the October data, not in November, December, or January. The October GERS IDs were only valid for a few cities that didn't include Tokyo, which is why they've changed since (https://overturemaps.org/download/overture-october-2023-release-notes/). They should be more stable going forward now that building GERS IDs are global.

Does this align with what you're seeing?

The GERS IDs contain H3_8s; I was mistakenly using H3_4s. Thanks for looking into it and giving me a few ideas. I'll probably just use the bounding box technique for now.
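For anyone else tripping over this, here's a rough sketch of the check I should have done, assuming the leading characters of a building GERS ID are a zero-padded 64-bit H3 cell index (the standard H3 layout puts the 4 resolution bits at positions 52-55, i.e. the third hex character):

```python
def h3_resolution(id_prefix: str) -> int:
    """Resolution nibble of a zero-padded H3 cell-index hex string.

    Assumes the standard H3 bit layout: 1 reserved bit, 4 mode bits,
    3 reserved bits, then 4 resolution bits, so the resolution is
    the 3rd hex character of the 16-character index.
    """
    return int(id_prefix[2], 16)

print(h3_resolution("0842f5a"))  # → 4: an H3_4 prefix, not an H3_8
```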

We are also looking to improve the spatial partitioning of the dataset to improve the performance of spatial queries like the one above.

@jwass I'm sure you already have plans in place, but I wanted to leave a few observations here.

I found the row group sizes very large in my research of the October release: https://tech.marksblogg.com/overture-gis-data.html It would be good to see them set to ~15K rows.

Using H3_8s for the GERS IDs makes them very granular, and they aren't great for geographic lookups. The 3 H3_4s that highlighted most of Tokyo in the post below contain ~7K H3_8s. Querying by ID prefix demands a massive WHERE clause unless you write code to find common prefixes, which will also match outside the originally intended area.
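To put numbers on that: each step down in H3 resolution splits a hexagon into roughly 7 children, so one H3_4 covers 7^(8-4) = 2,401 H3_8s:

```python
# Each step down in H3 resolution splits a hexagon into ~7 children,
# so one res-4 cell covers 7**(8 - 4) res-8 cells.
children_per_h3_4 = 7 ** (8 - 4)
total = 3 * children_per_h3_4  # the 3 H3_4s covering most of Tokyo
print(children_per_h3_4, total)  # → 2401 7203
```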

The records aren't sorted by geography, so in many cases multiple PQ files need to be read to collect a single country's data. There is a lot of wasted bandwidth pulling down files of mostly unrelated data. Here is a heatmap of the January building-footprints file that contains Tokyo's data: https://tech.marksblogg.com/tokyo-walking-tour-guide.html#tokyo-s-building-footprints I believe only 25% of that file relates to Japan, and I'd have to dig through the other 300+ building files to find the rest of Japan's data.

The 2 GB PQ file containing Tokyo building data in the blog above dedicates 1.2 GB to geometry, which is fine, but the bounding box data is ~500 MB, which is wasteful. These fields may be so large because the records aren't sorted, so the compressor's window isn't wide enough to find the best patterns. I've always seen huge reductions in compressed size in GIS datasets after sorting by longitude.

    1.2 GB geometry
  189.8 MB id
  121.4 MB bbox.maxy
  121.3 MB bbox.miny
  119.7 MB bbox.maxx
  119.6 MB bbox.minx
   82.6 MB sources.list.element.recordId
   10.7 MB updateTime
    2.9 MB height
    2.6 MB sources.list.element.confidence
    2.1 MB names.common.list.element.value
    1.1 MB sources.list.element.dataset
  584.1 kB class
  386.7 kB names.common.list.element.language
  310.8 kB names.official.list.element.value
  296.4 kB names.alternate.list.element.value
  285.1 kB names.official.list.element.language
  284.5 kB names.short.list.element.value
  281.1 kB names.short.list.element.language
  281.0 kB names.alternate.list.element.language
  273.3 kB numFloors
  150.9 kB sources.list.element.property
   93.6 kB roofShape
   82.7 kB facadeColor
   77.9 kB roofColor
   60.6 kB hasParts
   48.6 kB facadeMaterial
   48.5 kB version
   46.6 kB level
   46.2 kB roofMaterial
   44.7 kB roofOrientation
   41.2 kB roofDirection
   40.4 kB eaveHeight

Centroid x and y fields stored as float32s would have a very small disk footprint. They could be used to organise the underlying data better, and they'd give a small column that can be scanned quickly when running WHERE clauses against S3 or other cloud storage. I'm not sure if updateTime is fully populated in the above file, but it's only 10 MB; I suspect centroid x and y fields would be similar in size.
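As a quick gut-check on the precision side (a sketch with synthetic numbers, nothing Overture-specific): a float32 is half the width of a float64, and at Tokyo-ish longitudes the worst-case float32 rounding error works out to under a metre, which is plenty for clustering and coarse filtering:

```python
import struct

lon = 139.602942  # a longitude from the bounding box above

# Round-trip through float32 to measure the precision loss.
lon_f32 = struct.unpack("<f", struct.pack("<f", lon))[0]
error_deg = abs(lon_f32 - lon)

# One degree of longitude is ~91 km at Tokyo's latitude
# (111.32 km * cos(35.6 degrees)).
error_m = error_deg * 91_000
print(error_deg, error_m)
```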

A sort key across the dataset of the H3_8 followed by centroid x and y could be ideal. That way PQ files cluster nearby records together, and the zone maps inside the PQ files can still be effective.
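A minimal sketch of what I mean, with made-up field names (not Overture's actual schema):

```python
# Hypothetical records; field names and H3 values are illustrative.
records = [
    {"h3_8": "8842f5a9b1fffff", "cx": 139.70, "cy": 35.69},
    {"h3_8": "8830e06495fffff", "cx": 2.35, "cy": 48.86},
    {"h3_8": "8842f5a9b3fffff", "cx": 139.71, "cy": 35.68},
]

# Sort by H3_8 first, then centroid x/y, so rows that are close
# geographically land in the same row groups and the per-column
# zone maps (min/max statistics) stay tight.
records.sort(key=lambda r: (r["h3_8"], r["cx"], r["cy"]))
print([r["h3_8"] for r in records])
```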

@marklit Thank you again for the feedback and the very detailed writeup.

Improved spatial partitioning of the dataset is top of mind and I expect to see significant improvements in the next few releases. There's a very crude implementation now that, as you're pointing out, needs to be improved.

We have a discussion around this topic in #91 and should continue the conversation there. It links to several investigations people have done into sorting GeoParquet datasets, if you haven't seen them.

Using H3_8s for the GERS IDs are very granular and they aren't great for using for geographic look ups.

IDs really shouldn't be used for spatial queries. I feel this is going to keep coming up for many people since location is baked into the ID, but it should probably be considered an antipattern. Just use the bounding boxes.

The 2 GB PQ file containing Tokyo building data in the blog above dedicates 1.2 GB to geometry which is fine but the bounding box data is almost ~500 MB which is wasteful.
...
Having centroid x and y fields as float32s

Agree. There's discussion about this here, and the current GeoParquet bbox proposal allows for float32. We should just implement that and fix it. We'll want to keep full bounding boxes rather than just centroids, otherwise we won't be able to query well for large geometries like large bodies of water, admin areas, etc. Any spatial format that includes a spatial index likely already carries per-geometry envelopes; this isn't unique to GeoParquet (e.g. https://www.geopackage.org/spec/#gpb_format).
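One wrinkle worth spelling out (my sketch, not the planned implementation): when narrowing to float32, the values need to round outward — min down, max up — so a geometry can never poke outside its own stored bbox and get missed by a filter. For positive coordinates that can be done with bit tricks:

```python
import struct

def f32(x: float) -> float:
    """Nearest float32 value, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_down(x: float) -> float:
    """Largest float32 <= x (positive x only, for brevity)."""
    y = f32(x)
    if y <= x:
        return y
    bits = struct.unpack("<I", struct.pack("<f", y))[0]
    return struct.unpack("<f", struct.pack("<I", bits - 1))[0]

def f32_up(x: float) -> float:
    """Smallest float32 >= x (positive x only, for brevity)."""
    y = f32(x)
    if y >= x:
        return y
    bits = struct.unpack("<I", struct.pack("<f", y))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]

# Widen the float64 bbox outward to float32 bounds.
minx, maxx = 139.602942, 139.843876
print((f32_down(minx), f32_up(maxx)))
```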

Thanks again. I enjoy reading all your posts about your methodology including the struggles. We're continually improving things on our end.

@marklit Just to follow up and close the loop on this - in addition to the improved spatial partitioning from March, the bbox fields are now all float32 instead of doubles. This saves quite a bit of space and improves spatial query performance by transferring much less data when needing to grab bounding boxes.

The building footprint data (theme=buildings/type=building) in April is about 38 GB smaller than in March due to this.