Is it possible to wrap a spark connector on top of the osmosis pbf reader?
ericsun95 opened this issue · 2 comments
An interesting open topic: I'm curious whether it's feasible (or easy) to wrap this reader in a Spark connector, since it actually reads the file block by block.
My Spark knowledge is very thin, so take this with a pinch of salt.

The PBF format has a concept of blocks for compression purposes, but it doesn't include an index for seeking randomly into the file; you have to read one block at a time to determine where the current block ends and the next begins. So while I think it would be possible to expose a PBF file as a Dataset, it would have to be mostly single-threaded. It is possible to process blocks using multiple threads, but a single thread needs to consume the raw file.
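To make that constraint concrete, here is a rough Scala sketch of the split: a single-threaded scan over the raw file to find block boundaries, followed by parallel decoding of the blocks. It is not the osmosis `PbfReader` API, just an illustration of the PBF framing (4-byte length prefix, `BlobHeader`, then a `Blob` of `datasize` bytes). It assumes the protobuf classes generated from OSM's fileformat.proto are on the classpath under the `crosby.binary` package (older osmosis releases; newer builds use `org.openstreetmap.osmosis.osmbinary`), and that the file path is readable from every Spark executor.

```scala
import java.io.{DataInputStream, FileInputStream, RandomAccessFile}
import crosby.binary.Fileformat

// One frame in the PBF file: the byte range of a Blob plus its declared type.
final case class BlobFrame(offset: Long, length: Int, blobType: String)

// Phase 1 (single-threaded): scan the file once to discover block boundaries.
// Each frame is a 4-byte big-endian header length, a BlobHeader, then a Blob of
// header.getDatasize bytes. Without an index this scan cannot be parallelised.
def scanFrames(path: String): Vector[BlobFrame] = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    val frames = Vector.newBuilder[BlobFrame]
    var offset = 0L
    while (in.available() > 0) {
      val headerLen = in.readInt()                    // 4-byte length prefix
      val headerBytes = new Array[Byte](headerLen)
      in.readFully(headerBytes)
      val header    = Fileformat.BlobHeader.parseFrom(headerBytes)
      val blobStart = offset + 4 + headerLen
      frames += BlobFrame(blobStart, header.getDatasize, header.getType)
      var toSkip = header.getDatasize                 // jump over the Blob bytes
      while (toSkip > 0) toSkip -= in.skipBytes(toSkip)
      offset = blobStart + header.getDatasize
    }
    frames.result()
  } finally in.close()
}

// Phase 2 (parallel): ship only the (offset, length) pairs to the executors and
// let each task seek into the file and decode its own block.
// val spark  = org.apache.spark.sql.SparkSession.builder().getOrCreate()
// val path   = "planet.osm.pbf"                      // hypothetical path
// val blocks = spark.sparkContext
//   .parallelize(scanFrames(path).filter(_.blobType == "OSMData"))
//   .map { f =>
//     val raf = new RandomAccessFile(path, "r")
//     try {
//       raf.seek(f.offset)
//       val bytes = new Array[Byte](f.length)
//       raf.readFully(bytes)
//       Fileformat.Blob.parseFrom(bytes)             // decompress/decode here
//     } finally raf.close()
//   }
```

The point of the sketch is the split itself: the boundary scan is cheap (it only reads headers and skips the payloads), so doing it once on the driver and farming out the expensive decompression and entity decoding is a reasonable way to get parallelism without an index.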
Thanks for the reply. I have found https://github.com/simplexspatial/osm4scala, which supports reading PBF files in parallel with Spark.
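For reference, a minimal usage sketch of that connector. The `osm.pbf` format name and the `type` column come from the osm4scala README and may differ between versions; the file path is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Requires the osm4scala Spark connector artifact on the classpath; see the
// project's README for the coordinates matching your Spark/Scala version.
val spark = SparkSession.builder()
  .appName("osm4scala-example")
  .getOrCreate()

// "osm.pbf" is the data source short name documented by osm4scala.
val entities = spark.read
  .format("osm.pbf")
  .load("planet.osm.pbf")   // hypothetical path

entities.printSchema()

// Count entities per kind (node/way/relation); adjust the column name if your
// version's schema differs.
entities.groupBy("type").count().show()
```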