geoarrow/geoarrow-rs

How to handle Postgis geometry column

Closed this issue · 11 comments

In PostGIS, a geometry column can store multiple geometry types. Which array should I use to store them: GeometryCollectionArray, MixedGeometryArray, or WKBArray? What's the difference between them?

How should I implement a user-defined function to do geo computation? Should I split mixed array to 6 arrays (point, linestring, polygon, multipoint, multilinestring, multipolygon)?

WKB, Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection are all terms defined in OGC Simple Features.

  • WKBArray is a "serialized array". It stores binary buffers that can't be used directly but have to be parsed first. Coordinate access within a WKB array is O(n), not O(1) as it is for any of the other geoarrow-native arrays.
  • MixedGeometryArray uses an Arrow union type to be able to store variously-typed geometries, but it doesn't support a GeometryCollection, which is a list of variously-typed geometries.
  • GeometryCollectionArray as a type is able to store any geometry, even geometry collections (as long as you don't have a geometry collection nested inside another geometry collection). But I had issues with implementing features on GeometryCollections (#172) and so these aren't well supported.
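To make the "serialized vs. native" distinction concrete, here is a minimal sketch using only the standard library. `NativeLineStrings`, `linestring_to_wkb`, and `wkb_linestring_coord` are hypothetical names for illustration, not geoarrow-rs types or functions. The native layout answers a coordinate lookup with two index computations, while the WKB buffer has to have its header decoded and validated on every access. For a plain LineString that parse is cheap, but nested WKB types such as polygons carry no offset table, so a reader has to walk earlier rings to reach a given coordinate.

```rust
/// Hypothetical stand-in for a native LineString array (not the geoarrow-rs
/// struct): interleaved xy coordinates plus per-geometry offsets.
struct NativeLineStrings {
    coords: Vec<f64>,         // x0, y0, x1, y1, ...
    geom_offsets: Vec<usize>, // geometry i owns vertices geom_offsets[i]..geom_offsets[i + 1]
}

impl NativeLineStrings {
    /// O(1) lookup: two index computations, no parsing.
    /// (This sketch does not check that `vertex` stays inside geometry `geom`.)
    fn coord(&self, geom: usize, vertex: usize) -> (f64, f64) {
        let c = self.geom_offsets[geom] + vertex;
        (self.coords[2 * c], self.coords[2 * c + 1])
    }
}

/// Encode a 2D linestring as little-endian WKB:
/// byte order (1 byte), geometry type (u32), point count (u32), then xy pairs.
fn linestring_to_wkb(points: &[(f64, f64)]) -> Vec<u8> {
    let mut buf = vec![1u8]; // 1 = little endian
    buf.extend_from_slice(&2u32.to_le_bytes()); // 2 = LineString
    buf.extend_from_slice(&(points.len() as u32).to_le_bytes());
    for &(x, y) in points {
        buf.extend_from_slice(&x.to_le_bytes());
        buf.extend_from_slice(&y.to_le_bytes());
    }
    buf
}

/// Read 8 bytes at `at` as a little-endian f64, if they exist.
fn read_f64_le(buf: &[u8], at: usize) -> Option<f64> {
    let s = buf.get(at..at + 8)?;
    let mut raw = [0u8; 8];
    raw.copy_from_slice(s);
    Some(f64::from_le_bytes(raw))
}

/// Read one vertex from a little-endian 2D LineString WKB buffer.
/// The header is decoded and checked on every call before any coordinate is read.
fn wkb_linestring_coord(wkb: &[u8], vertex: usize) -> Option<(f64, f64)> {
    let head = wkb.get(0..9)?;
    if head[0] != 1 {
        return None; // this sketch only handles little-endian buffers
    }
    let geom_type = u32::from_le_bytes([head[1], head[2], head[3], head[4]]);
    let count = u32::from_le_bytes([head[5], head[6], head[7], head[8]]) as usize;
    if geom_type != 2 || vertex >= count {
        return None;
    }
    let start = 9 + 16 * vertex;
    Some((read_f64_le(wkb, start)?, read_f64_le(wkb, start + 8)?))
}
```

In the native layout the offsets also let an algorithm slice out a whole geometry without touching its neighbours, which is part of why the concrete arrays are easier to build algorithms on.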

Most external data sources don't have metadata that strongly describes their type. So loading from Postgis is effectively the same as reading from GeoJSON or parsing an arbitrary WKB column. Most of the time it'll hold a single geometry type, and you can use one of the six downcasted types (it's easy to hold e.g. point and multipoint in the same column). In the medium term, we'll parse all incoming data without a known geometry type into a MixedGeometryArray and then downcast whenever possible to a single geom type.

Should I split mixed array to 6 arrays (point, linestring, polygon, multipoint, multilinestring, multipolygon)?

That's what a MixedGeometryArray is, if you look at the definition
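As a rough illustration of that idea (not the actual geoarrow-rs definition), a mixed array can be pictured as one child array per geometry type plus a type id and an offset for each slot, in the style of an Arrow dense union. `MixedSketch` and `GeomRef` below are hypothetical names, and only two of the six children are included to keep the sketch short.

```rust
/// A borrowed view of one geometry in the mixed array (hypothetical type).
#[derive(Debug, PartialEq)]
enum GeomRef<'a> {
    Point((f64, f64)),
    LineString(&'a [(f64, f64)]),
}

/// Hypothetical, simplified mixed array: per-slot type ids and offsets
/// pointing into separately stored, single-type children.
struct MixedSketch {
    type_ids: Vec<u8>,   // 0 = point child, 1 = linestring child
    offsets: Vec<usize>, // index of the slot's geometry inside its child
    points: Vec<(f64, f64)>,
    line_strings: Vec<Vec<(f64, f64)>>,
}

impl MixedSketch {
    fn new() -> Self {
        MixedSketch {
            type_ids: Vec::new(),
            offsets: Vec::new(),
            points: Vec::new(),
            line_strings: Vec::new(),
        }
    }

    fn push_point(&mut self, p: (f64, f64)) {
        self.type_ids.push(0);
        self.offsets.push(self.points.len());
        self.points.push(p);
    }

    fn push_line_string(&mut self, ls: Vec<(f64, f64)>) {
        self.type_ids.push(1);
        self.offsets.push(self.line_strings.len());
        self.line_strings.push(ls);
    }

    fn len(&self) -> usize {
        self.type_ids.len()
    }

    /// Look up slot `i` by following its type id and offset into the right child.
    fn get<'a>(&'a self, i: usize) -> GeomRef<'a> {
        let offset = self.offsets[i];
        match self.type_ids[i] {
            0 => GeomRef::Point(self.points[offset]),
            _ => GeomRef::LineString(&self.line_strings[offset]),
        }
    }

    /// If every slot uses the same child, return its type id; such an array
    /// could be downcast to that single geometry type.
    fn single_type(&self) -> Option<u8> {
        let first = *self.type_ids.first()?;
        if self.type_ids.iter().all(|&t| t == first) {
            Some(first)
        } else {
            None
        }
    }
}
```

So the split into per-type arrays already happens inside the container; a user-defined function can dispatch per child instead of splitting the column by hand.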

Thanks for the reply. MixedGeometryArray is not usable right now; what's your plan for MixedGeometryArray?

Indeed, I've been focusing on making sure the six concrete types are stable before focusing on MixedGeometryArray. In particular, the other types are more straightforward to develop than MixedGeometryArray

There are a few areas of work for MixedGeometryArray:

  • Specification: Mixed geometry arrays are not yet well-specified in the GeoArrow spec. See geoarrow/geoarrow#23 and geoarrow/geoarrow#43. A mixed array could be specified by a few different data structures. See geoarrow/geoarrow#43 (review). I've been exploring it so far under the assumption that it'll be an Arrow UnionArray, but that has the downside that it can't be represented in Parquet.
  • Implementation: As you can see, the implementation of mixed arrays is behind other concrete arrays. This is partly because the Union type is much more confusing to follow, and partly because mixed arrays are a good second-priority after the regular arrays
  • Tests: Because of the added complexity, varied round-trip tests are a lot more important than other array types.
  • WKB IO: ensuring and testing that wkb io is working

I plan to work on all of this eventually but I don't have any timeline. If there's any part you want to work on, happy to discuss more to align the implementation

In the shorter term, if you have a WKB array, you can iterate through the array to check if all geometries can be cast to a non-mixed type, and then do the conversion to that type specifically.
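A minimal sketch of that check, using only the standard library and operating on raw WKB byte slices rather than on a geoarrow-rs `WKBArray` (the function names are hypothetical). It reads the geometry type code from each WKB header (1 = Point through 6 = MultiPolygon, 7 = GeometryCollection) and reports a single target type when one exists, promoting e.g. Point plus MultiPoint to MultiPoint. It assumes plain 2D type codes; Z/M variants, EWKB flag bits, and null slots are not handled here.

```rust
/// Read the geometry type code from a WKB header, honoring the byte-order flag.
/// Returns None for truncated buffers, unknown byte-order flags, or codes
/// outside the plain 2D range 1..=7.
fn wkb_type_code(buf: &[u8]) -> Option<u32> {
    let head = buf.get(0..5)?;
    let raw = [head[1], head[2], head[3], head[4]];
    let code = match head[0] {
        0 => u32::from_be_bytes(raw), // big endian
        1 => u32::from_le_bytes(raw), // little endian
        _ => return None,
    };
    if code >= 1 && code <= 7 {
        Some(code)
    } else {
        None
    }
}

/// Decide whether every geometry fits one non-mixed type.
/// Returns the WKB code of that type, or None if the input is empty,
/// unreadable, contains a GeometryCollection, or mixes geometry families.
fn common_type_code<'a, I>(wkbs: I) -> Option<u32>
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut family: Option<u32> = None; // 0 = points, 1 = lines, 2 = polygons
    let mut any_multi = false;
    for buf in wkbs {
        let code = wkb_type_code(buf)?;
        if code == 7 {
            return None; // a GeometryCollection needs the mixed/collection path
        }
        let this_family = (code - 1) % 3;
        match family {
            None => family = Some(this_family),
            Some(f) if f != this_family => return None,
            Some(_) => {}
        }
        any_multi |= code > 3;
    }
    // Single types are 1..=3; their multi counterparts are 4..=6.
    family.map(|f| f + 1 + if any_multi { 3 } else { 0 })
}
```

Only the first five bytes of each buffer are inspected, so the scan is cheap compared with fully parsing the column; the full parse then happens once, into the chosen concrete type.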

Yesterday I ran a benchmark for WKBArray and LineStringArray. I prepared data containing 10000000 items and ran the intersects algorithm on it.

    #[tokio::test]
    async fn wkb_bench() {
        let mut linestrings: Vec<Option<geo::Geometry>> = vec![];
        for i in 0..10000000 {
            let x = i as f64;
            let y = i as f64;
            linestrings.push(Some(geo::Geometry::LineString(line_string![(x: x, y: y)])));
        }
        linestrings.push(Some(geo::Geometry::LineString(
            line_string![(x: -21.951, y: 64.14479), (x: -21.9, y: 64.1)],
        )));
        println!("Data len: {}", linestrings.len());
        let wkb_arr = WKBArray::<i32>::from(linestrings);

        let schema_ref = Arc::new(Schema::new(vec![Field::new(
            "geom",
            wkb_arr.storage_type(),
            false,
        )]));
        let record =
            RecordBatch::try_new(schema_ref.clone(), vec![wkb_arr.into_array_ref()]).unwrap();

        let ctx = SessionContext::new();
        ctx.register_batch("mytable", record).unwrap();
        ctx.register_udf(geom_from_text::geom_from_text());
        ctx.register_udf(super::intersects());

        let now = std::time::Instant::now();
        let df = ctx
        .sql("select count(1) from mytable where st_intersects(geom, st_geomfromtext('POINT(-21.9 64.1)'))")
        .await
        .unwrap();
        df.collect().await.unwrap();
        println!("cost: {}ms", now.elapsed().as_millis());
    }

    #[tokio::test]
    async fn linestring_bench() {
        let mut linestrings: Vec<geo::LineString> = vec![];
        for i in 0..10000000 {
            let x = i as f64;
            let y = i as f64;
            linestrings.push(line_string![(x: x, y: y)]);
        }
        linestrings.push(line_string![(x: -21.951, y: 64.14479), (x: -21.9, y: 64.1)]);
        println!("Data len: {}", linestrings.len());
        let linestring_arr = LineStringArray::<i32>::from(linestrings);

        let schema_ref = Arc::new(Schema::new(vec![Field::new(
            "geom",
            linestring_arr.storage_type(),
            false,
        )]));
        let record =
            RecordBatch::try_new(schema_ref.clone(), vec![linestring_arr.into_array_ref()])
                .unwrap();

        let ctx = SessionContext::new();
        ctx.register_batch("mytable", record).unwrap();
        ctx.register_udf(geom_from_text::geom_from_text());
        ctx.register_udf(super::linestring_intersects());

        let now = std::time::Instant::now();
        let df = ctx
        .sql("select count(1) from mytable where st_intersects(geom, st_geomfromtext('POINT(-21.9 64.1)'))")
        .await
        .unwrap();
        df.collect().await.unwrap();
        println!("cost: {}ms", now.elapsed().as_millis());
    }

The result surprised me a little: the performance gap is not as big as I thought (no order of magnitude difference). So I think there is value in treating WKBArray as a first-class citizen like the concrete type arrays (implementing algorithms for WKBArray). Especially there is no consensus on how to implement mixed geometry array. What do you think?

@kylebarron Sorry to bother you, what do you think about the above? Do you still think that we shouldn't implement algorithms for WKBArray?

Linking to #291 (comment), which is mostly how I feel about this. I think it's better to focus on the non-wkb implementations and stabilize the mixed geometry array. #299 also did a bit of refactoring to improve the MixedGeometryBuilder.

no order of magnitude difference

50% slower is still a lot!

Especially there is no consensus on how to implement mixed geometry array

I don't think we necessarily need to have a full geoarrow consensus on the internals of the mixed array as long as we avoid exposing an ABI-stable interface to its data, or at least documenting it as unstable. If we operate in terms of geometry array traits and public methods, it's ok if we later switch the internal geometry representation. But in any case, no other mixed geometry array implementation currently exists to my knowledge, and so implementing it here will be good for spec discussions.

I think now that EWKB parsing support has been implemented, this can be closed?

Yeah I think so. Thanks for your great work.

Hi, I have a follow-up question which might be relevant to this ticket. Since the GeometryArray enum is fading out, what should be the correct "abstract" Geometry type to use now? Is it MixedGeometryArray? I have a gut feeling (but maybe wrong) that the MixedGeometryArray is a bit too heavy for a single pure Geometry object without any semantic attributes. Could you please provide some hints here? Thanks!

The primary abstract geometry type is Arc<dyn GeometryArrayTrait>, which you can then downcast using the data_type() method and AsGeometryArray. This is very similar to upstream arrow. You can see uses all throughout the codebase of downcasting to a concrete geometry type and doing an operation on it.
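The general shape of that pattern can be sketched with the standard library alone. Everything below is a hypothetical stand-in (`GeometryArraySketch`, `DataTypeSketch`, and the two array structs are not the geoarrow-rs trait or types); it only mirrors the "inspect the data type, then downcast through `Any`" approach that upstream arrow uses.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical data-type tag, standing in for the value data_type() returns.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataTypeSketch {
    Point,
    LineString,
}

/// Hypothetical trait standing in for the abstract geometry array trait.
trait GeometryArraySketch {
    fn data_type(&self) -> DataTypeSketch;
    fn as_any(&self) -> &dyn Any;
}

struct PointArraySketch {
    xy: Vec<(f64, f64)>,
}

struct LineStringArraySketch {
    lines: Vec<Vec<(f64, f64)>>,
}

impl GeometryArraySketch for PointArraySketch {
    fn data_type(&self) -> DataTypeSketch {
        DataTypeSketch::Point
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl GeometryArraySketch for LineStringArraySketch {
    fn data_type(&self) -> DataTypeSketch {
        DataTypeSketch::LineString
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// An operation written against the abstract type: match on the data type,
/// downcast to the concrete array, then run the type-specific code.
fn total_vertices(arr: &Arc<dyn GeometryArraySketch>) -> usize {
    match arr.data_type() {
        DataTypeSketch::Point => {
            let points = arr
                .as_any()
                .downcast_ref::<PointArraySketch>()
                .expect("data_type() reported Point");
            points.xy.len()
        }
        DataTypeSketch::LineString => {
            let lines = arr
                .as_any()
                .downcast_ref::<LineStringArraySketch>()
                .expect("data_type() reported LineString");
            lines.lines.iter().map(|l| l.len()).sum()
        }
    }
}
```

The caller only ever holds the `Arc<dyn ...>` handle, so a single geometry column costs one pointer plus the concrete array, without the union machinery of a mixed array.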