setnicka/ulozto-downloader

Python Unicode Encoder error with certain character in filename

Closed this issue · 7 comments

Scavy commented

I have found the following problem.
When you try to download a file from Ulozto whose name contains certain characters, Python fails to encode them when writing to the screen / log, and the script terminates.

Unicode Character 'LATIN SMALL LETTER U WITH RING ABOVE' (U+016F)
ů

The error is here:

Traceback (most recent call last):
  File "C:\ulozto-downloader\ulozto-downloader.py", line 7, in <module>
    cmd.run()
  File "C:\ulozto-downloader\uldlib\cmd.py", line 157, in run
    d.download(url, args.parts, args.password, args.output, args.temp, args.yes, args.conn_timeout, args.enforce_tor)
  File "C:\ulozto-downloader\uldlib\downloader.py", line 242, in download
    self.log("Downloading into: '{}'".format(self.output_filename))
  File "C:\ulozto-downloader\uldlib\frontend.py", line 116, in main_log
    self._log_logfile('MAIN', msg, progress=progress, level=level)
  File "C:\ulozto-downloader\uldlib\frontend.py", line 99, in _log_logfile
    self.logfile.write(f"{t} {prefix}\t[{level.name}] {msg}\n")
  File "C:\Programs\Python\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u016f' in position 104: character maps to <undefined>

I've looked at the code and can't really see how to handle it.
I suggest adding some kind of error handling at that point, so that unencodable characters get replaced with x or . or something. It's only for output.

In addition, I tried changing the code page in Windows Terminal to 65001 (UTF-8), and that didn't help.
There's a chance that it won't fail on a CZ codepage.
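A minimal sketch of why the terminal code page makes no difference here, assuming the logfile is opened in text mode without an explicit `encoding` (which the `cp1252.py` frame in the traceback suggests). In that case Python uses the locale's preferred encoding, not the console code page. `can_encode` is a hypothetical helper, not part of uldlib.

```python
import locale


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Encoding that open() uses when no encoding= argument is passed.
# On Windows this is normally the ANSI code page (e.g. cp1252),
# which `chcp 65001` in the terminal does not change.
print("default file encoding:", locale.getpreferredencoding(False))

# The two characters from this issue: U+016F and U+011B.
for codec in ("cp1252", "cp1250", "utf-8"):
    print(codec, can_encode("\u016f\u011b", codec))
```

cp1250 is the Central European Windows code page, so both characters are encodable there, which matches the guess that a CZ code page would not fail.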

And what's the URL to download? Probably a "wrong" character in there.

Scavy commented

It seems to affect all files whose names contain characters that are "illegal" in the cp1252 code page.

I ran into another character that would trigger the same error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u011b' in position 82: character maps to <undefined>

Scavy commented

It doesn't seem to be a problem with handling those filenames. The error is triggered only when the script tries to write the name to the logfile.

So what is needed is some kind of error handling at line 99 of "\uldlib\frontend.py". That should handle all kinds of unknown characters in that situation.
Unfortunately, I'm not proficient enough in Python yet to come up with a simple fix myself.
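One possible form of such error handling, sketched under assumptions: Python's built-in `open()` accepts an `errors` argument, and `errors="backslashreplace"` writes an escape sequence instead of raising. `open_tolerant_log` and the sample filename are hypothetical and not taken from uldlib.

```python
import os
import tempfile


def open_tolerant_log(path: str, encoding: str = "cp1252"):
    """Open a log for appending; unencodable characters are written as escapes."""
    return open(path, "a", encoding=encoding, errors="backslashreplace")


# Write a line containing characters that cp1252 cannot represent.
path = os.path.join(tempfile.mkdtemp(), "uld.log")
with open_tolerant_log(path) as logfile:
    logfile.write("Downloading into: 'k\u016f\u0148.zip'\n")

# Read the file back to see what actually landed on disk.
with open(path, encoding="cp1252") as logfile:
    written = logfile.read()
print(written)
```

The write no longer raises, and the original characters stay recoverable from the escapes, but the log is still not a faithful copy of the filename.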

Scavy commented

Ok, I managed to fix the problem.

The original function:

    def _log_logfile(self, prefix: str, msg: str, progress: bool, level: LogLevel):
        if progress or self.logfile is None:
            return

        t = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        self.logfile.write(f"{t} {prefix}\t[{level.name}] {msg}\n")
        self.logfile.flush()

The changed function, with error handling that replaces non-ASCII characters with ? when the write fails:

    def _log_logfile(self, prefix: str, msg: str, progress: bool, level: LogLevel):
        if progress or self.logfile is None:
            return

        t = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        log_msg = f"{t} {prefix}\t[{level.name}] {msg}\n"

        try:
            self.logfile.write(log_msg)
        except UnicodeEncodeError:
            # Replace unencodable characters with a placeholder
            log_msg = log_msg.encode('ascii', 'replace').decode('ascii')
            self.logfile.write(log_msg)
            
        self.logfile.flush()

Edit: This is just a quick hack to circumvent the problem. I think the optimal solution would be to change the logfile to UTF-8.
So I'm leaving my quick fix here for someone to rewrite into a better solution.

Edit2: Just to make it clear: this only affects how the filename is written to the logfile when it contains non-ASCII characters. So it will break anything that depends on reading the filename back from the logfile in these cases.
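To make the caveat concrete, here is the same `encode`/`decode` round trip applied to a sample message. `ascii_fallback` is a hypothetical name for the fallback branch above, and the filename is made up.

```python
def ascii_fallback(msg: str) -> str:
    """Replace every non-ASCII character with '?', as the except branch does."""
    return msg.encode("ascii", "replace").decode("ascii")


original = "Downloading into: 'k\u016f\u0148.zip'"
logged = ascii_fallback(original)
print(logged)  # Downloading into: 'k??.zip'

# The replacement is lossy: the real filename cannot be recovered.
print(logged == original)  # False
```

Once the fallback runs, every non-ASCII character in that log line is replaced, including ones the file's own encoding could have stored.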

quoing commented

> Ok, I managed to fix the problem.

It would be better to fix the file encoding rather than re-encode the output.

In my opinion it would be better to open the file with the correct encoding. The following seems like a much more generic solution. Could you retry?
https://github.com/setnicka/ulozto-downloader/blob/master/uldlib/frontend.py#L81

    self.logfile = open(logfile, 'a', encoding="utf-8")
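A small standalone check of that suggestion, as a sketch only: it uses a temporary file and a made-up filename rather than the real uldlib frontend, and `open_log` is a hypothetical wrapper around the `open()` call quoted above.

```python
import os
import tempfile


def open_log(path: str):
    """Open the logfile for appending with an explicit UTF-8 encoding."""
    return open(path, "a", encoding="utf-8")


line = "Downloading into: 'k\u016f\u0148.zip'\n"
path = os.path.join(tempfile.mkdtemp(), "uld.log")

# With UTF-8 the write succeeds regardless of the system code page.
with open_log(path) as logfile:
    logfile.write(line)

# Reading back with the same encoding returns the filename unchanged.
with open(path, encoding="utf-8") as logfile:
    read_back = logfile.read()
print(read_back == line)
```

Because UTF-8 can represent every Unicode character, no replacement step is needed and the filename in the log stays intact. Tools that read the log must also open it as UTF-8.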

Scavy commented

This is a better solution! And works very well.
I was hoping for someone to pick it up and make a correct fix, rather than my way of circumventing the problem.
Thanks! :)
Can you do a PR with the solution so that @setnicka can accept it into the code?

Fixed with #177 by @quoing, thank you.