jamesmishra/mysqldump-to-csv

Exception thrown while converting foreign characters

Opened this issue · 1 comment

for line in fileinput.input():

I changed this line to:

for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):

Converted it with no issues after changing it.
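For context, a self-contained sketch (not the repo's code; the file contents are made up) of why the hook_encoded("iso-8859-1") workaround never raises: ISO-8859-1 maps all 256 byte values to a character, so decoding cannot fail on any input.

```python
import fileinput
import os
import tempfile

# Simulate a dump containing a raw 0xdc byte: invalid as standalone
# UTF-8, but a valid character (U+00DC, 'Ü') in ISO-8859-1.
fd, path = tempfile.mkstemp(suffix=".sql")
with os.fdopen(fd, "wb") as f:
    f.write(b"INSERT INTO `t` VALUES (1,'\xdc');\n")

# The workaround: decode every input file as ISO-8859-1, which
# assigns a character to every possible byte, so no UnicodeDecodeError.
lines = []
for line in fileinput.input(files=[path], openhook=fileinput.hook_encoded("iso-8859-1")):
    lines.append(line)

print(repr(lines[0]))  # the 0xdc byte comes through as 'Ü'
os.remove(path)
```

Note the trade-off: the bytes survive, but multi-byte UTF-8 sequences elsewhere in the dump get mojibake'd into pairs of Latin-1 characters.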

Slightly more precise repro:

python mysqldump-to-csv/mysqldump_to_csv.py <enwiki-latest-categorylinks.sql

blows up with:

Traceback (most recent call last):
  File "/home/ciro/down/wiki/mysqldump-to-csv/mysqldump_to_csv.py", line 114, in <module>
    main()
  File "/home/ciro/down/wiki/mysqldump-to-csv/mysqldump_to_csv.py", line 104, in main
    for line in fileinput.input():
  File "/usr/lib/python3.11/fileinput.py", line 251, in __next__
    line = self._readline()
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/fileinput.py", line 372, in _readline
    return self._readline()
           ^^^^^^^^^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1980: invalid continuation byte

The likely reason is that the file contains binary data in the third column; it's a dumpster fire:

INSERT INTO `categorylinks` VALUES (10,'Redirects_from_moves','*..2NN:,@2.FBHRP:D6^A^W^Aܽ<DC>^L','2014-10-26 04:50:23','','uca-default-u-kn','page'),

Your solution makes it not blow up. I wonder if the output will be correct though, given that the file contains:

/*!40101 SET character_set_client = utf8 */;

so it is likely meant to be utf8.

Perhaps a wiser choice is:

        sys.stdin.reconfigure(errors='ignore')
        for line in fileinput.input(encoding="utf-8", errors="ignore"):

It appears to butcher the binary data, but hopefully the rest is not corrupted.
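A minimal illustration of the butchering, on a made-up fragment of the offending row (not the actual dump bytes): with errors="ignore", the undecodable byte is silently dropped, while the surrounding valid UTF-8 survives, so well-formed rows should come through intact.

```python
# Hypothetical fragment: valid UTF-8 text around one stray 0xdc byte,
# mimicking the binary collation sortkey in the third column.
raw = b"'Redirects_from_moves','*..2NN:\xdc\x0c','2014-10-26'"

# errors="ignore" drops any byte sequence that fails to decode.
decoded = raw.decode("utf-8", errors="ignore")
print(repr(decoded))  # the 0xdc byte is gone; everything else is intact
```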

surrogateescape sounds even better, but it also blows up: https://stackoverflow.com/questions/24616678/unicodedecodeerror-in-python-when-reading-a-file-how-to-ignore-the-error-and-ju
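For reference (presumably surrogateescape is the handler meant; Python has no surrogateerrors handler), a quick sketch of why it round-trips the raw bytes losslessly but can still blow up later, e.g. when the CSV output is re-encoded as strict UTF-8 on stdout:

```python
# surrogateescape maps each undecodable byte to a lone surrogate
# (U+DC80..U+DCFF), so the original bytes can be recovered on
# re-encode with the same handler.
raw = b"abc\xdcdef"
text = raw.decode("utf-8", errors="surrogateescape")

# Lossless round trip: re-encoding with surrogateescape restores 0xdc.
assert text.encode("utf-8", errors="surrogateescape") == raw

# But strict UTF-8 rejects lone surrogates, which is likely what
# blows up downstream when the decoded text is written back out.
try:
    text.encode("utf-8")
except UnicodeEncodeError as e:
    print("strict encode fails:", e.reason)
```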