onthegomap/planetiler

Improve runtime for small areas

Opened this issue · 8 comments

Planetiler takes ~30 seconds to run even for the smallest areas (like Andorra from Geofabrik). Let's see if there is any way to improve that. Here's a summary of the runtime for Andorra:

0:00:33 INF -   overall          33s cpu:1m12s gc:3s avg:2.2
0:00:33 INF -   lake_centerlines 3s cpu:12s gc:1s avg:4.4
0:00:33 INF -     read     1x(35% 0.9s done:2s)
0:00:33 INF -     process  9x(1% 0s wait:1s done:2s)
0:00:33 INF -     write    1x(0% 0s wait:1s done:2s)
0:00:33 INF -   water_polygons   12s cpu:17s avg:1.4
0:00:33 INF -     read     1x(94% 12s)
0:00:33 INF -     process  9x(0% 0s wait:12s)
0:00:33 INF -     write    1x(0% 0s wait:12s)
0:00:33 INF -   natural_earth    11s cpu:14s avg:1.3
0:00:33 INF -     read     1x(66% 7s done:4s)
0:00:33 INF -     process  9x(2% 0.2s wait:7s done:4s)
0:00:33 INF -     write    1x(0% 0s wait:8s done:4s)
0:00:33 INF -   osm_pass1        0.4s cpu:2s avg:3.7
0:00:33 INF -   osm_pass2        1s cpu:5s avg:5
0:00:33 INF -     read     1x(0% 0s)
0:00:33 INF -     process  9x(31% 0.3s)
0:00:33 INF -     write    1x(3% 0s wait:1s)
0:00:33 INF -   boundaries       0s cpu:0.1s avg:2.9
0:00:33 INF -   sort             0.1s cpu:0.7s avg:7.2
0:00:33 INF -   archive          0.5s cpu:3s avg:5.6

The biggest issues are Natural Earth (11s) and water polygons (12s), which together account for 23 of the 33 seconds, since Planetiler has to deserialize every feature for the whole planet.

One idea would be to switch Natural Earth to read the GeoPackage source and use its built-in spatial index to read only what's inside the bounding box.
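A rough sketch of what the bounding-box read could look like, assuming the GeoPackage file ships the standard RTree spatial index extension (a virtual table named `rtree_<table>_<geometry column>` with `id`, `minx`, `maxx`, `miny`, `maxy` columns). The class name `GpkgBboxQuery` and the table/column names in the usage note are hypothetical and not part of Planetiler; the sketch only builds the SQL and leaves running it to whatever SQLite driver is in use.

```java
import java.util.List;

/** Hypothetical helper (not a Planetiler API): builds a bbox-filtered GeoPackage query. */
final class GpkgBboxQuery {
  private GpkgBboxQuery() {}

  /**
   * Returns SQL with four '?' placeholders. Bind them in this order:
   * query maxX, query minX, query maxY, query minY.
   */
  static String build(String table, String geomColumn, String idColumn) {
    // Identifiers cannot be bound as parameters, so only accept plain names.
    for (String name : List.of(table, geomColumn, idColumn)) {
      if (!name.matches("[A-Za-z_][A-Za-z0-9_]*")) {
        throw new IllegalArgumentException("Unexpected identifier: " + name);
      }
    }
    // Naming convention from the GeoPackage RTree spatial index extension.
    String rtree = "rtree_" + table + "_" + geomColumn;
    // Two boxes intersect when each one starts before the other one ends, on both axes.
    return "SELECT t.* FROM \"" + table + "\" t"
      + " JOIN \"" + rtree + "\" r ON t.\"" + idColumn + "\" = r.id"
      + " WHERE r.minx <= ? AND r.maxx >= ? AND r.miny <= ? AND r.maxy >= ?";
  }
}
```

For example, `GpkgBboxQuery.build("lakes", "geom", "fid")` would join `lakes` against `rtree_lakes_geom`, so SQLite only hands back rows whose envelope overlaps the requested box instead of every feature on the planet.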

I'm not sure if we could do something similar with water polygons, since they are just a zipped shapefile with a .shp and .shx file but no .sbn or .sbx spatial index. If we convert it to a different format we could add an index, but that complicates things quite a bit since we can't just download directly from the source.

cc/ @bdon

bdon commented

At the most extreme we could define a ReadableTileArchive as another input type that is passed directly as tiled features, without touching the FeatureCollector API; the OSM or NE-derived ocean is going to be exactly the same for every planetiler output modulo tags/buffer sizes. That would make water polygons cost effectively nothing.

Otherwise we might be able to read the Shapefile index if one is included for water polygons, or migrate to another indexed format for it (GeoPackage, FlatGeobuf?).

Another hybrid option might be to compute a spatial index the first time we read a file and use it to speed up subsequent reads?
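To make the hybrid idea concrete, here is a minimal sketch, assuming the first full read records each feature's bounding box along with its byte offset in the file, and later runs consult that list before deserializing anything. `EnvelopeIndex` is a hypothetical name, persistence to disk is left out, and the lookup is a plain linear scan rather than a real R-tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not a Planetiler API): per-feature envelopes keyed by file offset. */
final class EnvelopeIndex {
  private static final class Entry {
    final long offset;
    final double minX, minY, maxX, maxY;

    Entry(long offset, double minX, double minY, double maxX, double maxY) {
      this.offset = offset;
      this.minX = minX;
      this.minY = minY;
      this.maxX = maxX;
      this.maxY = maxY;
    }
  }

  private final List<Entry> entries = new ArrayList<>();

  /** Called once per feature during the first (full) read of the source file. */
  void add(long offset, double minX, double minY, double maxX, double maxY) {
    entries.add(new Entry(offset, minX, minY, maxX, maxY));
  }

  /** Returns offsets of features whose envelope intersects the query box, in insertion order. */
  List<Long> query(double minX, double minY, double maxX, double maxY) {
    List<Long> hits = new ArrayList<>();
    for (Entry e : entries) {
      boolean overlapsX = e.minX <= maxX && e.maxX >= minX;
      boolean overlapsY = e.minY <= maxY && e.maxY >= minY;
      if (overlapsX && overlapsY) {
        hits.add(e.offset);
      }
    }
    return hits;
  }
}
```

Even a linear scan over envelopes should be far cheaper than decoding every geometry, though the index would need to be invalidated whenever the source file changes.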

Or ask the maintainers of the water polygons source if they'd be up for distributing in geopackage format/adding a spatial index.

Another piece of low-hanging fruit would be to keep the unzipped file contents around between runs.

bdon commented

Context on Geopackage etc output from osmcoastline: osmcode/osmcoastline#35 (comment)

erik commented

For the Natural Earth / geopackage case, we can also have profiles declare the limited set of tables that they're interested in and skip reading features that won't be processed.
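Something along these lines, as a sketch only: the reader intersects the tables present in the file with the set a profile declares, and never opens the rest. `TableFilter`, `tablesToRead`, and the table names in the usage note are hypothetical, and how a profile would actually declare its tables is an open design question.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch (not a Planetiler API): skip tables no profile asked for. */
final class TableFilter {
  private TableFilter() {}

  /** Keeps only the tables the profile declared, preserving the file's table order. */
  static List<String> tablesToRead(List<String> tablesInFile, Set<String> declaredByProfile) {
    return tablesInFile.stream()
      .filter(declaredByProfile::contains)
      .collect(Collectors.toList());
  }
}
```

With tables `lakes`, `roads`, and `countries` in the file and a profile that declares only `lakes` and `countries`, the reader would skip `roads` entirely instead of deserializing features that are later discarded.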

bdon commented

Proposal here #635

This addresses Natural Earth and other bring-your-own geopackages.

For water and land polygons I'd be happy to mirror those via cloud storage bucket in indexed GPKG format, or we can attempt to add that to the upstream data source.