whateverforever/zettelcon

Getting UnicodeDecodeErrors

Closed this issue · 5 comments

I tried to run this script on my files, but I got a UnicodeDecodeError. I added the option encoding="utf8" to the open(...) calls, and added a try statement here:

def get_file_outlinks(path):
    with open(path, "r", encoding='utf8') as fh:
        try:
            contents = fh.read()
            print(path)
        except UnicodeDecodeError:
            print('----', path)

With this I discovered that some of my files were ANSI encoded, not utf-8 (?!?). I then converted them and ran the altered script again. This time things went smoothly. It just complained that many articles didn't have back references, which is actually quite cool, but the output could perhaps be printed a bit better; it's hard to read.

However, I downloaded the original and ran it, got UnicodeDecodeError again, line 187, function get_file_outlinks. Adding encoding="utf8" fixed this. What could be causing this? I think that adding some safeguard against encoding errors would be cool.

Full traceback of this last part:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\Labore-05-Rosangela\Anaconda3\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\Labore-05-Rosangela\Anaconda3\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "C:\Users\Labore-05-Rosangela\Dropbox\Zettelkasten - Copy\zettelcon.py", line 185, in get_file_outlinks
    contents = fh.read()
  File "C:\Users\Labore-05-Rosangela\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 5445: character maps to <undefined>
"""

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File ".\zettelcon.py", line 230, in <module>
    main()
  File ".\zettelcon.py", line 49, in main
    process_directory(**params)
  File ".\zettelcon.py", line 58, in process_directory
    res = pool.map(get_file_outlinks, files)
  File "C:\Users\Labore-05-Rosangela\Anaconda3\lib\multiprocessing\pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\Labore-05-Rosangela\Anaconda3\lib\multiprocessing\pool.py", line 657, in get
    raise self._value
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 5445: character maps to <undefined>
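
A safeguard of the kind asked for above might look like the sketch below. It is not zettelcon's actual code: `read_text_safely` is a hypothetical helper, and the fallback policy (replace undecodable bytes instead of failing) is an assumption about what would be acceptable here. It tries UTF-8 first and, if that fails, decodes again with `errors="replace"` so that one bad byte does not abort the whole run.

```python
def read_text_safely(path):
    """Return (text, was_clean) for the file at `path`.

    Hypothetical helper for illustration; not part of zettelcon.
    """
    try:
        # Strict attempt: raises UnicodeDecodeError on any invalid byte.
        with open(path, "r", encoding="utf-8") as fh:
            return fh.read(), True
    except UnicodeDecodeError:
        # Fallback: invalid bytes become U+FFFD instead of raising.
        with open(path, "r", encoding="utf-8", errors="replace") as fh:
            return fh.read(), False
```

The second element of the returned tuple lets the caller report which files were not valid UTF-8, much like the `'----'` marker printed by the try statement earlier in this comment.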

hm, interesting. What python version are you using?

Python 3.7.4

ok, I'm not sure what's going on, but here's my current hypothesis: Specifying the encoding as utf-8 is the correct move, however it needs to be specified in multiple places. Since zettelcon also writes with the default encoding (which apparently is not utf-8 on your system), zettelcon's own writes might have saved the files again in the weird encoding, thus giving problems on the next run.

I specified the encoding in all open() calls in the latest commit. Maybe give that a try.
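
The hypothesis can be checked with a small sketch. When `open()` is called without `encoding`, Python falls back to `locale.getpreferredencoding(False)`, which on many Windows setups is cp1252, the codec named in the traceback. Byte 0x90 has no mapping in Python's cp1252 codec, so any UTF-8 text containing that byte fails to decode. The helper `can_decode` and the sample character are assumptions chosen for illustration.

```python
import locale


def can_decode(data, encoding):
    """Return True if `data` (bytes) decodes cleanly with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The encoding open() uses when no encoding= argument is given.
print("default for open():", locale.getpreferredencoding(False))

# U+2190 (left arrow) is the bytes E2 86 90 in UTF-8; the last byte
# is the 0x90 that cp1252 cannot map.
sample = "\u2190".encode("utf-8")
print("utf-8 ok: ", can_decode(sample, "utf-8"))
print("cp1252 ok:", can_decode(sample, "cp1252"))
```

Passing `encoding="utf-8"` on both the reading and the writing `open()` calls removes the dependence on the system default.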

Yeah, that worked. It's weird that if I open a file without specifying any encoding, write to it, then close, it's utf-8. I don't know why that wouldn't be the case in your script. Anyways, the problem is solved, thanks.
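
One possible explanation for that observation, offered as a guess rather than a diagnosis: text made only of ASCII characters produces exactly the same bytes in cp1252 and in UTF-8, so a file written with the default encoding looks like UTF-8 until it contains a non-ASCII character. The sketch below illustrates this; `same_bytes` and the sample strings are made up for the example.

```python
def same_bytes(text):
    """Return True if `text` encodes identically in cp1252 and UTF-8."""
    return text.encode("cp1252") == text.encode("utf-8")


# ASCII-only content: both encodings give identical bytes.
print(same_bytes("plain note"))

# Accented content: the encodings diverge, and the cp1252 bytes
# are no longer valid UTF-8.
print(same_bytes("anota\u00e7\u00e3o"))
```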

cool, thanks for trying the tool out!