Is it possible to wrap a spark connector on top of the osmosis pbf reader?
ericsun95 opened this issue · 2 comments
An interesting open topic: I'm curious whether it's feasible (or easy) to wrap this reader in a Spark connector, since it actually reads the file block by block.
My Spark knowledge is very thin, so take this with a pinch of salt.

The PBF format has a concept of blocks for compression purposes, but it doesn't include an index for seeking randomly into the file; you have to read one block at a time to determine where the current block ends and the next begins. So while I think it would be possible to expose a PBF file as a Dataset, it would have to be mostly single-threaded. It is possible to process blocks using multiple threads, but a single thread needs to consume the raw file.
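To make that constraint concrete, here is a rough Scala sketch of the split: a single-threaded scan over the raw file to find block boundaries, followed by parallel decoding of the blocks. It is not the osmosis `PbfReader` API, just an illustration of the PBF framing (4-byte length prefix, `BlobHeader`, then a `Blob` of `datasize` bytes). It assumes the protobuf classes generated from OSM's fileformat.proto are on the classpath under the `crosby.binary` package (older osmosis releases; newer builds use `org.openstreetmap.osmosis.osmbinary`), and that the file path is readable from every Spark executor.

```scala
import java.io.{DataInputStream, FileInputStream, RandomAccessFile}
import crosby.binary.Fileformat

// One frame in the PBF file: the byte range of a Blob plus its declared type.
final case class BlobFrame(offset: Long, length: Int, blobType: String)

// Phase 1 (single-threaded): scan the file once to discover block boundaries.
// Each frame is a 4-byte big-endian header length, a BlobHeader, then a Blob of
// header.getDatasize bytes. Without an index this scan cannot be parallelised.
def scanFrames(path: String): Vector[BlobFrame] = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    val frames = Vector.newBuilder[BlobFrame]
    var offset = 0L
    while (in.available() > 0) {
      val headerLen = in.readInt()                    // 4-byte length prefix
      val headerBytes = new Array[Byte](headerLen)
      in.readFully(headerBytes)
      val header    = Fileformat.BlobHeader.parseFrom(headerBytes)
      val blobStart = offset + 4 + headerLen
      frames += BlobFrame(blobStart, header.getDatasize, header.getType)
      var toSkip = header.getDatasize                 // jump over the Blob bytes
      while (toSkip > 0) toSkip -= in.skipBytes(toSkip)
      offset = blobStart + header.getDatasize
    }
    frames.result()
  } finally in.close()
}

// Phase 2 (parallel): ship only the (offset, length) pairs to the executors and
// let each task seek into the file and decode its own block.
// val spark  = org.apache.spark.sql.SparkSession.builder().getOrCreate()
// val path   = "planet.osm.pbf"                      // hypothetical path
// val blocks = spark.sparkContext
//   .parallelize(scanFrames(path).filter(_.blobType == "OSMData"))
//   .map { f =>
//     val raf = new RandomAccessFile(path, "r")
//     try {
//       raf.seek(f.offset)
//       val bytes = new Array[Byte](f.length)
//       raf.readFully(bytes)
//       Fileformat.Blob.parseFrom(bytes)             // decompress/decode here
//     } finally raf.close()
//   }
```

The point of the sketch is the split itself: the boundary scan is cheap (it only reads headers and skips the payloads), so doing it once on the driver and farming out the expensive decompression and entity decoding is a reasonable way to get parallelism without an index.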
Thanks for the reply. I have found https://github.com/simplexspatial/osm4scala, which supports reading PBF files in parallel with Spark.
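For reference, a minimal usage sketch of that connector. The `osm.pbf` format name and the `type` column come from the osm4scala README and may differ between versions; the file path is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Requires the osm4scala Spark connector artifact on the classpath; see the
// project's README for the coordinates matching your Spark/Scala version.
val spark = SparkSession.builder()
  .appName("osm4scala-example")
  .getOrCreate()

// "osm.pbf" is the data source short name documented by osm4scala.
val entities = spark.read
  .format("osm.pbf")
  .load("planet.osm.pbf")   // hypothetical path

entities.printSchema()

// Count entities per kind (node/way/relation); adjust the column name if your
// version's schema differs.
entities.groupBy("type").count().show()
```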