roelderickx/ogr2osm

The exporting of resulting OSM files can potentially be sped up

Opened this issue · 4 comments

This needs confirmation later, but I came across this Stack Overflow discussion:

https://stackoverflow.com/questions/44560655/python-writelines-and-write-huge-time-difference

Why is there such a large difference in the file writing time for write() and writelines() even though it is the same data?


I have used ogr2osm for a while, and I have noticed that it can be quite slow on larger files. Like, unusually slow.

It seems the exporting can be sped up. Will investigate later.

Benchmarking the existing method

First, I must admit my current PC is mid-to-high tier, so things might be faster than average. But the point should still stand even for slower computers. Also, extra care must be taken because the files to be processed can be very large.


Some details:

  • Command run: python -m ogr2osm -t test_translate -o target.osm source.geojson
  • Size of data source: about 430 MB
  • Measuring the duration: adding some basic measurement at DataWriterContextManager.output using time.time()
  • All I/O is on an SSD

I ran the command 5 times.

Measured time (average): 19.455 seconds

One thing that sticks out when doing some detailed profiling:

Beginning to time the export
Writing file header
Writing nodes
Writing took (to_xml, write): 7.521965265274048, 0.6387271881103516
Writing ways
Writing took (to_xml, write): 10.0064537525177, 0.5809998512268066
Writing relations
Writing took (to_xml, write): 0, 0
Writing file footer
Time elapsed was 18.999 seconds

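It is actually the to_xml part which is slow, not the I/O: in the run above, to_xml takes about 17.5 of the 19 seconds, while the writes take about 1.2 seconds.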
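For anyone who wants to reproduce this kind of split, here is a minimal sketch of how the two steps can be timed separately. It is not the actual ogr2osm code: `timed_export` and the sample nodes are hypothetical, the standard library's `xml.etree.ElementTree` stands in for lxml, and `time.perf_counter()` is used instead of `time.time()` because it is better suited to measuring short intervals.

```python
import io
import time
import xml.etree.ElementTree as ET  # stdlib stand-in; ogr2osm itself uses lxml


def timed_export(elements, out):
    """Serialize each element and write it, timing both steps separately."""
    to_xml_time = 0.0
    write_time = 0.0
    for element in elements:
        # Time only the XML-to-string conversion.
        start = time.perf_counter()
        xml_text = ET.tostring(element, encoding="unicode")
        to_xml_time += time.perf_counter() - start

        # Time only the write to the output stream.
        start = time.perf_counter()
        out.write(xml_text + "\n")
        write_time += time.perf_counter() - start
    return to_xml_time, write_time


# Hypothetical sample data: a few OSM-style <node> elements.
nodes = [
    ET.Element("node", id=str(-i), lat="50.0", lon="4.0", visible="true")
    for i in range(1, 1001)
]
buffer = io.StringIO()
to_xml_time, write_time = timed_export(nodes, buffer)
print(f"Writing took (to_xml, write): {to_xml_time}, {write_time}")
```

Writing to an in-memory buffer here hides real disk latency, so the numbers only show the relative cost of serialization for this toy input, not what a 430 MB export would look like.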

It seems the next thing to try is some sort of multi-threading.

Hmmm. We are already using lxml for fast export.

Spawning new threads does not help because of Python's GIL, which lets only one thread run Python bytecode at a time, so CPU-bound work like this stays effectively single-threaded.

Playing around with multiprocessing did not bring many immediate results, because extra work is needed to pass values into the subprocesses. This might be viable in the long term, but it is not something that can be done in a single day.
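To illustrate what that extra work could look like, here is a rough sketch under several assumptions: the geometry is first reduced to plain tuples (which pickle cheaply), the workers build the XML themselves, and `xml.etree.ElementTree` again stands in for lxml. The names `chunk_to_xml` and `parallel_to_xml` are made up for this example and do not exist in ogr2osm.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; ogr2osm itself uses lxml
from concurrent.futures import ProcessPoolExecutor


def chunk_to_xml(chunk):
    """Worker: turn a list of (id, lat, lon) tuples into one XML string."""
    lines = []
    for node_id, lat, lon in chunk:
        element = ET.Element("node", id=str(node_id), lat=str(lat), lon=str(lon))
        lines.append(ET.tostring(element, encoding="unicode"))
    return "\n".join(lines)


def parallel_to_xml(records, workers=2, chunk_size=1000, mp_context=None):
    """Serialize plain-tuple records in worker processes, preserving order."""
    # Split into chunks so each task amortizes the cost of pickling.
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers, mp_context=mp_context) as pool:
        # map() yields results in submission order, so the output stays ordered.
        return "\n".join(pool.map(chunk_to_xml, chunks))
```

When calling `parallel_to_xml` from a script, do it under an `if __name__ == "__main__":` guard, since some start methods re-import the main module in each worker. Whether this beats the single-process path depends on how much time goes into pickling the records and the returned strings, which I have not measured.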

If we are able to somehow utilize multi-processing effectively, then perhaps there will be a significant speedup.

This just dropped a few days ago:

https://www.bitecode.dev/p/whats-up-python-the-gil-removed-a

The removal of the GIL in Python could be very useful for this speed-up: instead of spawning difficult-to-control subprocesses to parallelize XML-to-string, we may finally have an easy-to-control multi-threaded XML-to-string process to speed up exporting.
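As a rough idea of what the threaded version could look like, here is a minimal sketch. It assumes a thread pool over per-element serialization, uses `xml.etree.ElementTree` as a stand-in for lxml, and the helper names are hypothetical. On a regular GIL build this is not expected to be faster; whether it helps on a free-threaded build also depends on the XML library supporting that mode, which I have not verified.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; ogr2osm itself uses lxml
from concurrent.futures import ThreadPoolExecutor


def element_to_xml(element):
    """Serialize a single element to a string."""
    return ET.tostring(element, encoding="unicode")


def threaded_to_xml(elements, workers=4):
    """Serialize elements on a thread pool, keeping the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Threads share memory, so no pickling of elements is needed.
        return list(pool.map(element_to_xml, elements))


# Hypothetical sample data, not taken from ogr2osm.
nodes = [ET.Element("node", id=str(-i), lat="50.0", lon="4.0") for i in range(1, 101)]
xml_lines = threaded_to_xml(nodes)
print(len(xml_lines), xml_lines[0])
```

Compared with subprocesses, the appeal is that the elements can be passed to the workers as they are, with no intermediate plain-data representation.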