alan-turing-institute/CleverCSV

Unicode characters cause UnicodeEncodeError from clevercsv.wrappers.write_table on Windows 10

Closed this issue · 3 comments

Hello and thank you for your work on this excellent library! I'm running on a Windows 10 machine and encountering a UnicodeEncodeError when attempting to write data that includes Unicode using clevercsv.wrappers.write_table.

It appears that adding an optional encoding argument to clevercsv.wrappers.write_table would fix this, since writing succeeds when I bypass the wrapper and use clevercsv.writer directly (workaround below).

Workaround:

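import clevercsv

# data_list is the list of rows to write; the encoding is set explicitly on open()
with open("outfile.csv", "w", newline="", encoding="utf-8") as fp:
    w = clevercsv.writer(fp)
    w.writerows(data_list)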

Stack Trace:

Traceback (most recent call last):
  File "<REDACTED>", line 143, in <module>
    report.create_csv_report()
  File "<REDACTED>", line 42, in create_csv_report
  File "<REDACTED>\lib\site-packages\clevercsv\wrappers.py", line 441, in write_table
    w.writerows(table)
  File "<REDACTED>\lib\site-packages\clevercsv\write.py", line 60, in writerows
    return self._writer.writerows(rows)
  File "<REDACTED>\local\programs\python\python37-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2033' in position 250: character maps to <undefined>

Hi @mitchgrogg, thank you for reporting this issue!

I'm trying to figure out whether this is a CleverCSV bug or perhaps simply unexpected behavior. Would you mind trying it with the Python CSV module to see if you get the same result?
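A minimal sketch of that check with the standard library csv module might look like the following. It passes encoding="cp1252" explicitly to simulate the Windows default (an assumption, so the behavior can be reproduced on any platform); the sample rows and the try_write helper are made up for illustration.

```python
import csv
import os
import tempfile

# Sample rows containing U+2033 (double prime), the character from the stack trace.
rows = [["item", "size"], ["pipe", "6\u2033"]]
path = os.path.join(tempfile.mkdtemp(), "stdlib_out.csv")


def try_write(encoding):
    """Write rows with the stdlib csv module; return the error raised, or None."""
    try:
        with open(path, "w", newline="", encoding=encoding) as fp:
            csv.writer(fp).writerows(rows)
    except UnicodeEncodeError as exc:
        return exc
    return None


cp1252_error = try_write("cp1252")  # stands in for the Windows default encoding
utf8_error = try_write("utf-8")

print("cp1252:", cp1252_error)
print("utf-8:", utf8_error)
```

If the cp1252 attempt fails here as well, the error comes from the file's encoding rather than from the CSV writer on top of it.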

When you use open without a specific encoding, Python uses whatever locale.getpreferredencoding() returns (see the docs for open). Judging from your stack trace, it might simply be that on your Windows machine this is cp1252. If that's the case, then you do indeed need to specify utf-8 explicitly when writing Unicode data.
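One way to see which encoding a given machine picks is to compare the locale's preferred encoding with the encoding attribute of a file opened without an encoding argument. This is only a diagnostic sketch; the temporary file name is arbitrary.

```python
import locale
import os
import tempfile

# The locale-based default that open() falls back to when no encoding is given.
preferred = locale.getpreferredencoding(False)

# Open a scratch file without an encoding and inspect what Python actually chose.
scratch = os.path.join(tempfile.mkdtemp(), "scratch.txt")
with open(scratch, "w", newline="") as fp:
    fp_encoding = fp.encoding

print("locale.getpreferredencoding(False):", preferred)
print("encoding chosen by open():", fp_encoding)
```

On a Windows machine with a Western European locale, both values would be expected to be cp1252, which matches the codec named in the stack trace.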

Let me know what you find. If it does turn out to be a bug in CleverCSV, or something we could document better or turn into a feature, I'd like to hear it!

It is indeed caused by the fact that Windows still defaults to the legacy cp1252 encoding, unfortunately. If I set the PYTHONUTF8=1 environment variable on my system, it works. However, that workaround only works on Python 3.7+.
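To confirm whether the PYTHONUTF8=1 setting has taken effect in a given interpreter, a short check along these lines can help. It relies on sys.flags.utf8_mode, which exists on Python 3.7+ only.

```python
import locale
import sys

# sys.flags.utf8_mode is 1 when Python runs in UTF-8 mode (PYTHONUTF8=1 or -X utf8).
utf8_mode = bool(sys.flags.utf8_mode)

# In UTF-8 mode this reports UTF-8 regardless of the system locale.
default_encoding = locale.getpreferredencoding(False)

print("UTF-8 mode enabled:", utf8_mode)
print("default text encoding:", default_encoding)
```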

I suggest adding the optional named encoding argument to clevercsv.wrappers.write_table. It seems counterintuitive that one can read_table with a specific encoding, but then not write_table that same data with a specific encoding (example below).

import clevercsv

table_list = clevercsv.wrappers.read_table('example_in.csv', encoding='utf-8')
clevercsv.wrappers.write_table(table_list, 'example_out.csv')  # This raises UnicodeEncodeError
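As a rough sketch of the proposed symmetry, the two helpers below accept the same optional encoding keyword for reading and writing. They are hypothetical stand-ins built on the standard library csv module, not CleverCSV's actual implementation or the change that was eventually merged; the names write_table_sketch and read_table_sketch are made up for illustration.

```python
import csv
import os
import tempfile


def write_table_sketch(table, filename, encoding=None):
    """Hypothetical wrapper: write rows to filename, forwarding encoding to open()."""
    with open(filename, "w", newline="", encoding=encoding) as fp:
        csv.writer(fp).writerows(table)


def read_table_sketch(filename, encoding=None):
    """Hypothetical counterpart: read all rows back using the same encoding."""
    with open(filename, "r", newline="", encoding=encoding) as fp:
        return list(csv.reader(fp))


# Sample data containing U+2033, which cp1252 cannot represent.
table = [["item", "size"], ["pipe", "6\u2033"]]
out_path = os.path.join(tempfile.mkdtemp(), "example_out.csv")

write_table_sketch(table, out_path, encoding="utf-8")
round_tripped = read_table_sketch(out_path, encoding="utf-8")
print(round_tripped == table)
```

Leaving encoding=None as the default keeps the current behavior for existing callers, while letting callers on Windows opt into utf-8 explicitly.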

I'd be happy to open a pull request with my proposed changes if you're open to that.

Thanks for checking the encoding issue and for the suggestion to add the encoding keyword to write_table. That was definitely a bug. Normally I'd be happy for you to create a PR, but since it was such a small fix I've added it myself (#28). I'll prepare an updated release of CleverCSV right away. Please let me know if you have any other suggestions or run into other problems!