roelderickx/ogr2osm

Issues with encoding in windows batch

Closed this issue · 4 comments

Hi,

I was trying to use ogr2osm in a Windows batch file but ran into a lot of encoding problems: the batch always created ANSI-encoded files, whereas my workflow needs UTF-8-encoded files. I managed to solve my issue by changing the following line:
self.f = open(self.filename, 'w', buffering = -1)
to
self.f = open(self.filename, 'w', buffering = -1, encoding="utf-8")

There is already a parameter called "encoding", but it seems to be used only for the source file. Could we extend this "encoding" parameter to cover the destination file as well, or introduce another parameter for that? What are your thoughts? Or do you have a tip on how I can force the Windows batch to output UTF-8 without changing ogr2osm?
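
On the last question, one option that may work without patching ogr2osm is Python's UTF-8 mode (Python 3.7 and later), enabled with the PYTHONUTF8=1 environment variable or the -X utf8 interpreter option. The sketch below is only an illustration, not something tested with ogr2osm itself: it starts a child interpreter with PYTHONUTF8=1 and reports which encoding open() picks when none is given. The function name default_open_encoding is hypothetical.

```python
import os
import subprocess
import sys

# Child program: open a file WITHOUT an encoding argument and report
# which encoding Python selected for it.
CHILD = "import os; f = open(os.devnull, 'w'); print(f.encoding); f.close()"

def default_open_encoding(utf8_mode):
    """Return the encoding open() uses by default in a child interpreter.

    Hypothetical helper for illustration. PYTHONUTF8=1 enables Python's
    UTF-8 mode (Python 3.7+); otherwise the environment is left untouched.
    """
    env = dict(os.environ)
    if utf8_mode:
        env["PYTHONUTF8"] = "1"
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().lower()

print("with UTF-8 mode:", default_open_encoding(True))
print("platform default:", default_open_encoding(False))
```

In a batch file the equivalent would presumably be a `set PYTHONUTF8=1` line before the ogr2osm call, assuming ogr2osm runs on a Python version that supports UTF-8 mode.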

Thanks for this awesome tool =)

Thanks for your bug report. This issue looks like a duplicate of pnorman#15, but your solution is different and you have found a test case where the current method has issues.

Some observations:

  • The encoding parameter of ogr2osm only specifies the encoding of the input file, not the encoding of the output file.
  • The documentation of the Python open() function specifies that a platform-dependent default encoding (the one returned by locale.getpreferredencoding(False)) is used when the encoding parameter is omitted or None. I can't test it on Windows, but at least on Linux it is UTF-8.
  • Although it is clearly suggested, the OSM wiki page imposes no strict obligation for an OSM file to be encoded in UTF-8.
  • According to the W3C recommendation for XML, the expected encoding is UTF-8 if neither a byte order mark nor an encoding is specified, as is currently the case for ogr2osm.
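
As a minimal sketch of the second observation (not code from ogr2osm), the snippet below prints the encoding open() falls back to on the current machine and shows that the same text produces different bytes on disk under UTF-8 and under Windows-1252. The sample string and the helper name written_bytes are assumptions made for illustration.

```python
import locale
import os
import tempfile

# Encoding that open() falls back to when no encoding argument is given
# (unless Python's UTF-8 mode is enabled).
print("default text encoding:", locale.getpreferredencoding(False))

text = "Straße"  # sample value containing one non-ASCII character

def written_bytes(encoding):
    """Write `text` with the given encoding and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp(suffix=".osm")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

utf8_bytes = written_bytes("utf-8")      # "ß" becomes two bytes
cp1252_bytes = written_bytes("cp1252")   # "ß" becomes one byte
print(utf8_bytes, cp1252_bytes)
```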

Given the last observation, ogr2osm is supposed to output UTF-8 at the moment, translating from the input file encoding where necessary. To obtain consistent behaviour across operating systems it is therefore necessary to pass encoding='utf-8' as you suggested. I would then also explicitly specify the encoding in the header, i.e. <?xml version="1.0" encoding="utf-8"?>.

I can confirm the test cases still pass on Linux with your suggested modification. Can you verify whether the test cases pass under Windows as well?
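
A rough sketch of that combination, assuming a simplified writer rather than the actual ogr2osm output code: the file is opened with encoding="utf-8", the XML declaration names the same encoding, and the result is parsed back to check that a non-ASCII tag value survives. The function write_osm, the generator value and the sample node are all hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def write_osm(path, name):
    """Write a tiny OSM-like file as UTF-8 with a matching XML declaration.

    Hypothetical helper for illustration; not the ogr2osm implementation.
    """
    with open(path, "w", buffering=-1, encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="utf-8"?>\n')
        f.write('<osm version="0.6" generator="sketch">\n')
        # quoteattr() escapes the value and adds the surrounding quotes.
        f.write('<node id="-1" lat="0" lon="0"><tag k="name" v=%s/></node>\n'
                % quoteattr(name))
        f.write('</osm>\n')

fd, path = tempfile.mkstemp(suffix=".osm")
os.close(fd)
try:
    write_osm(path, "東京")
    with open(path, "rb") as f:
        raw = f.read()
    # ElementTree reads the bytes itself and honours the declared encoding.
    parsed_name = ET.parse(path).getroot().find("node/tag").get("v")
finally:
    os.remove(path)

print(parsed_name)
```

Because the encoding is fixed in both places, the bytes written no longer depend on the locale of the machine running the script.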

Thanks for the fast answer!

  • As far as I know, both Linux and Mac use UTF-8 as their default encoding, while Windows uses ANSI / Windows-1252 (at least in the German version of Windows).
  • Some OSM tools do seem to write UTF-8 in the header; here is an example from Overpass:

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="Overpass API 0.7.56.9 76e5016d">
<note>The data included in this document is from www.openstreetmap.org. The data is made available under ODbL.</note>
<meta osm_base="2021-06-09T08:10:43Z"/>

After making these changes everything runs smoothly in the batch.

Ok. I am not sure if the cram tests can be run as is under Windows, but can you try to convert at least test/shapefiles/japanese.shp and confirm if the formatted result matches test/japanese.xml?

In the test script the output is formatted using xmllint before comparing:

ogr2osm --encoding shift_jis --gis-order -f test/shapefiles/japanese.shp
xmllint --format japanese.osm > japanese.xml
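
If a shell diff is not at hand, the comparison could also be done with a few lines of Python. This is only a sketch under the assumption that both files are UTF-8; the helper name same_utf8_text is hypothetical, and the temporary files below merely stand in for japanese.xml and test/japanese.xml.

```python
import os
import tempfile

def same_utf8_text(path_a, path_b):
    """Compare two files as UTF-8 text, ignoring CRLF versus LF line endings.

    Raises UnicodeDecodeError if either file is not valid UTF-8, which is
    itself a useful signal when checking output written on Windows.
    """
    def load(path):
        with open(path, "rb") as f:
            return f.read().decode("utf-8").replace("\r\n", "\n")
    return load(path_a) == load(path_b)

def make_file(data):
    """Write raw bytes to a temporary file and return its path (demo only)."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

# Stand-in files: same content, different line endings.
lf_path = make_file("<osm>\n<tag v='東京'/>\n</osm>\n".encode("utf-8"))
crlf_path = make_file("<osm>\r\n<tag v='東京'/>\r\n</osm>\r\n".encode("utf-8"))
# Stand-in file with different content.
other_path = make_file("<osm>\n<tag v='大阪'/>\n</osm>\n".encode("utf-8"))

matches = same_utf8_text(lf_path, crlf_path)
differs = same_utf8_text(lf_path, other_path)
for p in (lf_path, crlf_path, other_path):
    os.remove(p)
print(matches, differs)
```

For the actual check the call would be same_utf8_text("japanese.xml", "test/japanese.xml"), run from the repository root after the two commands above.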

Meanwhile I managed to test the modification on Windows; the test is conclusive. The proposed changes have been merged into master.
Thanks @Meibes for your investigation.