thombashi/pathvalidate

Unicode en dash (u"\u2013") Is Not Replaced By sanitize_filename

kenlerner opened this issue · 3 comments

When running the following:
sanitized = sanitize_filename(txt, platform="Windows")

If the variable txt contains a Unicode en dash, the sanitized filename that is returned is invalid: the en dash is not replaced, and an error occurs when a file is opened using the sanitized filename.

The following change works:
sanitized = sanitize_filename(re.sub(u"\u2013", "-", txt), platform="Windows")

I think the function should replace the Unicode en dash with an ASCII hyphen.
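
A minimal sketch of the pre-processing step in the workaround above, using only the standard library. The helper name `replace_en_dash` is hypothetical and is not part of pathvalidate; its result would then be passed to `sanitize_filename` as in the snippet above.

```python
# Hypothetical helper, not part of pathvalidate: swaps the en dash for an
# ASCII hyphen before the text is handed to sanitize_filename.
EN_DASH = "\u2013"


def replace_en_dash(text: str) -> str:
    """Return text with every U+2013 EN DASH replaced by '-'."""
    return text.replace(EN_DASH, "-")


cleaned = replace_en_dash("report" + EN_DASH + "2019.txt")
```

For a single fixed character, `str.replace` does the same job as the `re.sub` call without needing the `re` module.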

@kenlerner
Thank you for your feedback.

Could I ask what made you think a Unicode dash is an invalid character for a filename?
Unicode normalization (NFC, NFKC, NFD, NFKD) leaves Unicode dashes as they are.

I understand that Unicode dashes can be confusing in file names, but they are still valid characters for file names.
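
The normalization point can be checked with the standard `unicodedata` module. This is only an illustrative sketch; the sample file name is made up for the example.

```python
import unicodedata

EN_DASH = "\u2013"
sample = "report" + EN_DASH + "2019.txt"  # made-up file name for illustration

# Apply each of the four normalization forms to the sample name.
forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFKC", "NFD", "NFKD")
}

# U+2013 has no decomposition mapping, so every form returns the input as-is.
unchanged = all(result == sample for result in forms.values())
```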

Python raised an exception when trying to create a file whose filename contained a Unicode dash. The error was the same as the one reported here:
https://stackoverflow.com/questions/55867822/when-running-python-script-i-get-%C3%A2%E2%82%AC-instead-of-a-hyphen

I can create files whose names include a Unicode dash using Python.
If that exception happens only with a specific Python version, please upgrade Python or report the problem to the official Python team.
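
A quick way to try this locally, sketched under the assumption that the filesystem and Python's filesystem encoding can represent the character (for example UTF-8 on Linux/macOS, or NTFS on Windows). The file name is made up for the example.

```python
import os
import tempfile


def create_file_with_en_dash(directory: str) -> str:
    """Create a small file whose name contains U+2013 and return its path."""
    path = os.path.join(directory, "report\u20132019.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("ok")
    return path


# Work inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    created = create_file_with_en_dash(tmp_dir)
    exists = os.path.exists(created)
    name = os.path.basename(created)
```

If `open` raises here, the traceback and the Python version would be the useful details to include in a report.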

Also, the topic at that link does not seem to be a filename problem; they simply mixed ASCII dashes and Unicode dashes as dictionary keys.
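
The dictionary-key mix-up can be reproduced in a few lines. This is a made-up illustration, not the code from the linked question.

```python
# The key is stored with an ASCII hyphen (U+002D).
prices = {"A-1": 10}

# The lookup key uses an en dash (U+2013), which looks similar but is a
# different character, so it is a different key.
lookup_key = "A\u20131"
found = lookup_key in prices

# Replacing the en dash with an ASCII hyphen makes the keys match again.
fixed_key = lookup_key.replace("\u2013", "-")
found_after_fix = fixed_key in prices
```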