tracyhenry/Kyrix

Implement rtrees with clustering optimization

peterg17 opened this issue · 8 comments

Aka wenbo's funky rtrees.
The basic idea of this is that we will use the GiST box type, which postgres will index as a normal rtree. The main difference between this and our PsqlNativeBoxIndexer is that we take bounding boxes that are close to each other and put them into the same row.

the very initial version which indexed 1B points in 12 hours and achieved even better online performance: https://github.com/tracyhenry/Kyrix/blob/master/back-end/src/main/java/index/PsqlGridCompressIndexer.java

seems very interesting to explore different clustering schemes and compare performance

I'm actually a bit confused about how you are clustering the points if you calculate gridX & gridY by dividing by a fixed width. I don't see anywhere in the grouped table query where you add to a blob if it is in a certain range of gridX or gridY. It seems like you check if a point has the same value as the current gridX and gridY, which seems to only be a point with the same exact coordinates??

as you can see in these two lines: https://github.com/tracyhenry/Kyrix/blob/master/back-end/src/main/java/index/PsqlGridCompressIndexer.java#L167-L168
I'm calculating the "grid coordinate" of an object, which is its centroid coordinate (cx or cy) divided by grid size (gridW or gridH).

Note that grid coordinates are different from canvas coordinates. If you see the gridded canvas as a 2D grid array, the grid coordinates are just the array indexes. So by grouping points with the same grid coordinates, I'm essentially doing what you said, i.e., "add to a blob if it is in a certain range of gridX or gridY"

Let me know if that makes sense.

Ok, so you're rounding to the nearest grid coordinate I guess. I was thinking along the lines of taking coordinates like (2500, 1000) and dividing it by gridW=100 and gridH=100, and getting grid coordinates (25, 10) but then with coordinates (2510, 1010) getting (25.1, 10.1) and that being different.

So, I think I understand what you did a bit better. It seems like you have it pretty much implemented, so i'm wondering if you think I should change things in a big way or just see if there are ways to enhance the performance from what it is now. One big question I had is: do we want to keep using PostGIS for this indexer or do we want to use the native box type?

Ok, so you're rounding to the nearest grid coordinate I guess. I was thinking along the lines of taking coordinates like (2500, 1000) and dividing it by gridW=100 and gridH=100, and getting grid coordinates (25, 10) but then with coordinates (2510, 1010) getting (25.1, 10.1) and that being different.

So, I think I understand what you did a bit better. It seems like you have it pretty much implemented, so i'm wondering if you think I should change things in a big way or just see if there are ways to enhance the performance from what it is now. One big question I had is: do we want to keep using PostGIS for this indexer or do we want to use the native box type?

Yes the grid-based clustering is mostly implemented. But I suspect there will be other ways of clustering (e.g. Birch) that can achieve better performance. It'll be also interesting to compare index/online performance of different clustering methods and compare to base 2D indexers.

PostGIS v.s Native: I would go for native since it's supposed to be faster?

I think it would be very interesting to look at different ways of clustering. I'll read the Birch paper and I can maybe write up some ideas for approaches. As a side note, if you know of any papers like this that are related to the spatial indexing we're working on, please feel free to send them to me. It's good to have some context in terms of approaches others have tried.

asah commented

Peter, I don't have other papers in mind but I will send you if I come across any. Feel free to research a bit (e.g. just google clustering algorithms, since this idea works with any 2D clustering schemes). Writing down possible approaches will be great. Also, it'll be good if you can help test this indexer and see what its performance is in your testing environment. It should be relatively faster on 100M records compared to 2D indexer, both in index and online performance.

Adam brought up a good point: how to avoid moving each record out of the database to Java back-end. The current grid indexer doesn't avoid it. Although being simple, it requires pretty heavy manipulation of raw records, e.g. concatenating fields, calculating grid coordinates, etc. Writing custom JS functions and push the manipulations into Postgres would be interesting and should boost index performance.