apache/incubator-baremaps

Add support for Overturemap parquet files

bchapuis opened this issue · 10 comments

@sebr72 as discussed, I'm not really satisfied with my current experiment in the overturemap branch. The GeoParquet format contains semi-structured data, which requires some changes in the DataTable abstraction. It also requires a deep understanding of the GeoParquet format.

One avenue (probably the best) could be to use the parser available in Sedona (the project is written in Scala):
https://sedona.apache.org/latest-snapshot/tutorial/sql/#__tabbed_9_2

Another avenue could be to build upon my throw-away overturemaps branch, but I'm not sure about the effort needed to have something robust.

In both cases, adding parquet or sedona will result in a lot of new dependencies (hadoop, spark).

@sebr72 There may also be a third option, which is to rely on Parquet support in PostgreSQL. I have no experience with this extension.
https://github.com/adjust/parquet_fdw

@bchapuis I had a look at Sedona and would highlight the following:

  1. Large project mainly relying on Spark or Flink (large projects themselves)
  2. The Java examples are built around Flink, which is a lot faster than Spark, but it is not directly linked to GeoParquet
  3. The geoparquet implementation is in Scala and geared around Spark
  4. Combining Spark Sedona and Scala with baremaps will end up in an "expensive" integration for Geoparquet.

I am going to switch to looking at:
https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet

Yes, I think the suggestion of @Drabble to look into drill is a good idea. We can probably either use it or get inspiration from it for our own implementation.

@sebr72 @Drabble I will merge the current PR and organize the git history to have three separated commits with our individual contributions. For the following tasks, I suggest we make individual PRs and split the work more clearly.

  • Cleanup the sonar problems when it makes sense (@sebr72 ).
  • Add some unit tests to the GeoParquetReader (@sebr72 ).
  • Define a better organization for the packages and classes of the geoparquet module (data, hadoop, etc.).
  • Improve the allocation of objects in the GeoParquet reader (list for each field, wrapper for each value, etc.)
  • The high-level abstraction (DataTable, DataColumn, etc.) needs support for nested data structures such as groups in geoparquet and json in other data formats (#857, #860).
  • Improve the abstraction of the GeoParquetGroup (the current version uses internal classes, and the getters/setters were quickly defined to have an end-to-end example, etc.) - after doing a pass on the DataTable abstraction
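
To make the nested-data point concrete, here is a minimal sketch of a row type whose values can themselves be rows, so that a group can be read through a path of field names. Everything in it is hypothetical: the class name NestedRowSketch, its methods, and the sample field names and values are made up for illustration and are not the existing DataTable or GeoParquetGroup API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical nested row for illustration only; not the Baremaps DataTable API. */
final class NestedRowSketch {
  // Field name -> value; a value may be a scalar or another NestedRowSketch (a group).
  private final Map<String, Object> values = new LinkedHashMap<>();

  /** Sets a field and returns this row so that calls can be chained. */
  NestedRowSketch set(String name, Object value) {
    values.put(name, value);
    return this;
  }

  /** Follows the field names through nested groups; returns null if a step is missing. */
  Object get(String... path) {
    Object current = this;
    for (String name : path) {
      if (!(current instanceof NestedRowSketch)) {
        return null; // tried to descend into a scalar or a missing field
      }
      current = ((NestedRowSketch) current).values.get(name);
    }
    return current;
  }

  public static void main(String[] args) {
    // Sample data is invented for the example.
    NestedRowSketch names = new NestedRowSketch().set("primary", "Lausanne");
    NestedRowSketch row = new NestedRowSketch().set("id", "example-id").set("names", names);
    System.out.println(row.get("names", "primary"));
  }
}
```

A path-based getter like this keeps flat tables unchanged (a path of length one) while letting groups and json-like values share the same access pattern; a typed variant would probably be preferable in the real abstraction.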

@sebr72 @Drabble I merged the changes and we can now continue with individual PRs.

@bchapuis Great job on the pull request! I will look at your new one for nested groups.

I would be really interested in making an example to go from Overture data on S3 to serving MVT to a Maputnik frontend.

I think this would mean:

1. Fix the code to be able to use an S3 URL directly, e.g. s3a://overturemaps-us-west-2/release/2024-05-16-beta.0/theme=admins/type=/
2. Use the GeoParquetDataTable to write Overture data into PostgreSQL using a ProjectionTransformer to go from EPSG:4326 to EPSG:3857
3. Create a geospatial index for the geometry column
4. Create a materialised view to group the columns into a TAGS jsonb field and maybe simplifications for different zoom levels
5. Make a simple style.json and tileset.json to serve the data
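
For reference, the reprojection in step 2 comes down to the spherical Web Mercator formulas. The sketch below is a standalone illustration of that math for a single point, assuming longitude/latitude input in degrees; the class name WebMercatorSketch is hypothetical and this is not the ProjectionTransformer implementation, which would normally delegate to a projection library.

```java
/** Hypothetical standalone sketch of the EPSG:4326 -> EPSG:3857 forward projection. */
final class WebMercatorSketch {
  // WGS 84 semi-major axis in meters, used as the sphere radius by EPSG:3857.
  private static final double EARTH_RADIUS = 6378137.0;

  /** Projects a longitude/latitude pair in degrees to Web Mercator x/y in meters. */
  static double[] project(double lonDeg, double latDeg) {
    if (Math.abs(latDeg) >= 90.0) {
      // The projection is undefined at the poles (y tends to infinity).
      throw new IllegalArgumentException("latitude must be strictly between -90 and 90");
    }
    double x = EARTH_RADIUS * Math.toRadians(lonDeg);
    double y = EARTH_RADIUS * Math.log(Math.tan(Math.PI / 4.0 + Math.toRadians(latDeg) / 2.0));
    return new double[] {x, y};
  }

  public static void main(String[] args) {
    // Arbitrary sample coordinate, chosen only for the example.
    double[] xy = project(6.6323, 46.5197);
    System.out.printf("x=%.2f y=%.2f%n", xy[0], xy[1]);
  }
}
```

In practice tiles only cover latitudes up to roughly 85.05 degrees, so geometries beyond that range may need clamping or filtering before they are written to the database.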

What do you think?

Yes, the plan sounds good and can probably be addressed with multiple PRs. Maybe we can skip step 4 or use views instead of materialized views. As the daylight distribution will soon be deprecated and replaced by overturemaps, an idea could be to copy the daylight directory and use it as a basis.

We have basic support for Overture Maps now. Should we consider this issue closed and raise more issues for further improvements to the Overture Maps library?

Congratz guys 🎉