Report non-POSIX paths
e3krisztian opened this issue · 4 comments
Under UNIX, file names can be arbitrary byte strings, but of course they are usually valid ASCII/Unicode.
An example archive containing non-POSIX paths is https://github.com/python/cpython/blob/main/Lib/test/testtar.tar : extracting it yields some interesting files (e.g. ustar/umlauts-*, which are simply encoded in ISO-8859-1 instead of UTF-8).
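To make the encoding mismatch concrete, here is a minimal sketch using Python's standard tarfile module. It does not touch the real testtar.tar; it builds a tiny in-memory archive whose member name is stored in ISO-8859-1 and then reads it back as UTF-8. The member name below is an illustrative assumption, not a claim about the exact contents of the CPython test archive.

```python
import io
import tarfile

# Illustrative member name (an assumption, chosen to contain umlauts).
name = "ustar/umlauts-ÄÖÜäöüß"

# Write a ustar archive whose header stores the name as ISO-8859-1 bytes.
buf = io.BytesIO()
with tarfile.open(
    fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT, encoding="iso-8859-1"
) as tar:
    tar.addfile(tarfile.TarInfo(name))  # empty regular file

# Read it back assuming UTF-8. Bytes that are not valid UTF-8 are kept
# as lone surrogates thanks to the "surrogateescape" error handler.
buf.seek(0)
with tarfile.open(
    fileobj=buf, mode="r", encoding="utf-8", errors="surrogateescape"
) as tar:
    member_name = tar.getnames()[0]

# The original on-disk bytes can be recovered losslessly.
raw_name = member_name.encode("utf-8", errors="surrogateescape")
print(raw_name)
```

The recovered bytes match the ISO-8859-1 encoding of the name, not its UTF-8 encoding, which is the situation described above.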
These files are currently skipped by unblob almost silently (they are only logged as warnings). To be able to collect these paths later, TaskResult.reports should also get a report about non-POSIX paths.
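As a starting point for discussion, such a report could look roughly like the sketch below. Everything in it is hypothetical: the class name NonUTF8PathReport, its fields, and the helper report_non_utf8_path are made up for illustration and are not part of unblob's API. It also uses a plain dataclass only to stay self-contained.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class NonUTF8PathReport:
    """Hypothetical report type, not an existing unblob class."""

    escaped_path: str  # printable form, undecodable bytes shown as \xNN
    raw_hex: str       # exact original bytes, hex-encoded


def report_non_utf8_path(raw: bytes) -> Optional[NonUTF8PathReport]:
    """Return a report if raw is not valid UTF-8, otherwise None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return NonUTF8PathReport(
            escaped_path=raw.decode("utf-8", errors="backslashreplace"),
            raw_hex=raw.hex(),
        )
    return None


report = report_non_utf8_path(b"ustar/umlauts-\xc4\xd6\xdc")
print(json.dumps(asdict(report)))
```

Storing both an escaped, printable form and the exact bytes keeps the report JSON-serializable without losing the original path.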
#545 is somewhat trying to solve this. There, a StatReport is generated for EVERY file, even the ones with non-POSIX paths.
I don't know what the expected outcome is here. Is that pull request enough? If so, this issue can be considered solved.
If not, let's discuss here what the expected behavior is (a new report type for non-POSIX paths? A new report type for not skipped Tasks with reason?)
Also, I just created #547, which suggests removing that validation altogether, so maybe we should focus on that one instead of including it in the report.
Just to be more precise: these are all POSIX-compliant paths, meaning any byte except NUL is allowed. POSIX doesn't specify a character encoding, but nowadays everyone is moving towards UTF-8. A typical place to find non-Unicode paths is older archives or FS images where a locale-dependent encoding is used (e.g. ISO-8859-1 in the initial example).
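The distinction between "valid POSIX path" and "valid UTF-8 path" can be sketched as two independent checks. The names PathKind and classify_path are hypothetical helpers for illustration only, not existing unblob code.

```python
from enum import Enum


class PathKind(Enum):
    NOT_POSIX = "contains a NUL byte"
    UTF8 = "POSIX-valid and valid UTF-8"
    OTHER_ENCODING = "POSIX-valid but not UTF-8"


def classify_path(raw: bytes) -> PathKind:
    """Classify a raw path by the two rules discussed above."""
    # POSIX rule: a path may hold any byte except NUL.
    if b"\x00" in raw:
        return PathKind.NOT_POSIX
    # Encoding is a separate question that POSIX leaves open.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return PathKind.OTHER_ENCODING
    return PathKind.UTF8


for candidate in (b"ustar/regular", b"ustar/umlauts-\xc4\xd6", b"bad\x00name"):
    print(candidate, classify_path(candidate).value)
```

Under this reading, the ISO-8859-1 names from the initial example fall into the OTHER_ENCODING bucket: they are fine as far as POSIX is concerned and only fail the UTF-8 check.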