Exception thrown while converting foreign characters
mysqldump-to-csv/mysqldump_to_csv.py
Line 104 in 24301df
I changed this line to:
for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):
After that change, the dump converted with no issues.
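This works because ISO-8859-1 (Latin-1) maps all 256 byte values to code points, so decoding can never fail, no matter what binary data is in the dump. A minimal sketch (using a stand-in byte sequence, not the actual bytes from the dump):

```python
# Latin-1 maps every byte 0x00..0xFF to the code point with the same
# value, so decoding arbitrary binary data never raises.
blob = b"\x2a\x01\xdc\x0c"  # hypothetical bytes, similar to the dump's binary column
text = blob.decode("iso-8859-1")
print(text == "\x2a\x01\xdc\x0c")  # True: bytes map 1:1 to code points

# The same bytes are invalid UTF-8, which is what the traceback shows:
try:
    blob.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8 fails:", e.reason)
```

The trade-off is that any genuinely UTF-8 text in the dump gets misinterpreted as Latin-1 (mojibake), which is the concern raised below.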
Slightly more precise repro:
python mysqldump-to-csv/mysqldump_to_csv.py <enwiki-latest-categorylinks.sql
blows up with:
Traceback (most recent call last):
File "/home/ciro/down/wiki/mysqldump-to-csv/mysqldump_to_csv.py", line 114, in <module>
main()
File "/home/ciro/down/wiki/mysqldump-to-csv/mysqldump_to_csv.py", line 104, in main
for line in fileinput.input():
File "/usr/lib/python3.11/fileinput.py", line 251, in __next__
line = self._readline()
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/fileinput.py", line 372, in _readline
return self._readline()
^^^^^^^^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 1980: invalid continuation byte
The likely reason is that the file contains binary data in the third column; it's a dumpster fire:
INSERT INTO `categorylinks` VALUES (10,'Redirects_from_moves','*..2NN:,@2.FBHRP:D6^A^W^Aܽ<DC>^L','2014-10-26 04:50:23','','uca-default-u-kn','page'),
Your solution makes it not blow up. I wonder if the output will be correct though, given that the file contains:
/*!40101 SET character_set_client = utf8 */;
so it is likely meant to be utf8.
Perhaps a wiser choice is:
sys.stdin.reconfigure(errors='ignore')
for line in fileinput.input(encoding="utf-8", errors="ignore"):
It appears to butcher the binary data, but hopefully the rest is not corrupted.
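To illustrate the butchering: errors="ignore" silently drops any byte that is not valid UTF-8, so valid text survives but the binary column is mangled. A sketch with a simplified stand-in row (not the real dump line):

```python
# A hypothetical row containing one byte (0xdc) that is invalid UTF-8.
raw = b"INSERT INTO `t` VALUES (10,'Redirects_from_moves','\xdc\x0c','x');"

# errors="ignore" drops the undecodable byte and keeps everything else.
decoded = raw.decode("utf-8", errors="ignore")
print("\xdc" in decoded)               # False: the bad byte is simply gone
print("Redirects_from_moves" in decoded)  # True: valid text is untouched
```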
errors="surrogateescape" sounds even better, but it also blows up: https://stackoverflow.com/questions/24616678/unicodedecodeerror-in-python-when-reading-a-file-how-to-ignore-the-error-and-ju
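For reference, errors="surrogateescape" smuggles undecodable bytes through as lone surrogate code points, so the original bytes can be recovered losslessly on re-encode; but any later step that strictly encodes the string back to UTF-8 (e.g. writing the CSV) will raise, which is presumably where it blows up. A small sketch:

```python
raw = b"abc\xdcdef"  # hypothetical data with one invalid UTF-8 byte

# Decoding never raises: 0xdc becomes the surrogate U+DCDC.
s = raw.decode("utf-8", errors="surrogateescape")

# Round trip is lossless if the same handler is used on encode.
print(s.encode("utf-8", errors="surrogateescape") == raw)  # True

# But a strict re-encode of the surrogate fails, which is one way
# the error resurfaces later in the pipeline:
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("strict encode fails")
```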