onekey-sec/unblob

Extend --skip-magic option for better matching

7homasSutter opened this issue · 3 comments

It would be nice to have a more fine-grained option for selecting which file types should be skipped. The current --skip-magic option lets you choose specific types to skip, but it replaces the default list of magics rather than adding to it. It would be nice to be able to extend the default list instead of overwriting it. Moreover, matching on the magic prefix is confusing when different file types share the same magic bytes (for example, zip and apk files).

Is your feature request related to a problem? Please describe.
Issue #262 described the same problem, and the solution was to add the --skip-magic option. That is a nice way to solve it, but I suggest extending the feature.

The problem I'm facing is that I want to extract an Android image but don't want to extract certain file types (e.g., .apk, .ttf). However, the --skip-magic option isn't really user-friendly here, because I would need to pass one --skip-magic parameter for every file type I want to exclude, plus one for every entry in unblob's default magic list.

Consider the following example: Let's assume I have a .zip file that contains only three files (an .xlsx, an .apk, and a .jar file). We want to extract this archive with unblob but don't want to extract the .apk, .jar, and .xlsx files inside it. As a user, I would expect that adding --skip-magic "APK" --skip-magic "JAR" is sufficient to skip these two file types. However, these two parameters do not seem to match apk and jar files. Moreover, setting any --skip-magic parameter overwrites unblob's default skip-magic list. As a result, unblob extracts all the files, including the .xlsx, which is not what we want.
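To illustrate why "APK" and "JAR" would not match, here is a minimal sketch. It assumes that skip values are compared as prefixes of the libmagic description of each file; the function name and the description strings are illustrative only and are not taken from unblob's code (the actual text also depends on the libmagic version).

```python
from typing import Iterable


def should_skip(magic_description: str, skip_prefixes: Iterable[str]) -> bool:
    """Return True if the description starts with any of the skip prefixes.

    Assumption: skip values are matched as prefixes of the libmagic text.
    """
    return any(magic_description.startswith(prefix) for prefix in skip_prefixes)


# Example descriptions (assumed for illustration, not real unblob output).
descriptions = {
    "app.apk": "Android package (APK)",
    "lib.jar": "Java archive data (JAR)",
    "sheet.xlsx": "Microsoft Excel 2007+",
}

for name, description in descriptions.items():
    by_extension_name = should_skip(description, ["APK", "JAR"])
    by_magic_prefix = should_skip(description, ["Android", "Java"])
    print(name, by_extension_name, by_magic_prefix)
```

Under that assumption, "APK" appears inside the description but not at its start, so only a prefix such as "Android" matches.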

docker run --platform=linux/amd64 --rm --pull always -v /Volumes/ExtremeSSD/test/output/:/data/output -v /Volumes/ExtremeSSD/test/input/:/data/input ghcr.io/onekey-sec/unblob:latest --skip-magic "APK" --skip-magic "JAR" "/data/input/Test.apk.zip"

To overcome this problem, we have to figure out the correct magic prefix for apk and jar files. We found that passing the magics "Android" and "Java" does skip the apk and jar files. However, to also skip the .xlsx file, we would have to add another --skip-magic parameter for every default entry, because our parameters overwrite the default magic list. The list of defaults is quite long, so we would need around 20 --skip-magic parameters to cover all of them.
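The difference between the current "overwrite" behaviour and the requested "extend" behaviour can be sketched as follows. This is a hypothetical model, not unblob's implementation: `DEFAULT_SKIP_MAGIC` is a shortened stand-in for the real default list, and the `extend` parameter is invented for illustration.

```python
from typing import Iterable, List

# Hypothetical, shortened stand-in for unblob's default skip list.
DEFAULT_SKIP_MAGIC = ["Microsoft Excel", "Microsoft Word", "PNG"]


def effective_skip_list(user_values: Iterable[str], extend: bool) -> List[str]:
    """Combine user-supplied skip values with the defaults.

    extend=False models the behaviour described in this issue: any user
    value replaces the defaults. extend=True models the requested behaviour.
    """
    user_values = list(user_values)
    if not user_values:
        # Nothing supplied: the defaults apply unchanged.
        return list(DEFAULT_SKIP_MAGIC)
    if extend:
        # Keep the defaults, append user values, drop duplicates in order.
        return list(dict.fromkeys(DEFAULT_SKIP_MAGIC + user_values))
    return user_values


print(effective_skip_list(["Android", "Java"], extend=False))
print(effective_skip_list(["Android", "Java"], extend=True))
```

With `extend=False`, "Microsoft Excel" disappears from the list as soon as any user value is given, which is why the .xlsx file gets extracted in the example above.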

docker run --platform=linux/amd64 --rm --pull always -v /Volumes/ExtremeSSD/test/output/:/data/output -v /Volumes/ExtremeSSD/test/input/:/data/input ghcr.io/onekey-sec/unblob:latest --skip-magic "Android" --skip-magic "Java" --skip-magic "Microsoft Excel" "/data/input/Test.apk.zip"

I hope the example is understandable.

Describe the solution you'd like
There are two things I would like to suggest to make the --skip-magic parameter more user-friendly:

  1. Add the possibility to extend the default magic list without overwriting it.
  2. Map file extensions within unblob to a magic if it is a known file type. For instance, "APK" = "Android"

I think users should just be able to type --skip-magic "<some-file-extension>" to match a correct magic instead of having to extract the magic from a file by themselves.
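A rough sketch of what suggestion 2 could look like, assuming a simple lookup table. The table name, the function, and the fall-through behaviour are hypothetical; only the three pairings (apk/"Android", jar/"Java", xlsx/"Microsoft Excel") come from the examples in this issue.

```python
# Hypothetical mapping from file extension to magic prefix.
EXTENSION_TO_MAGIC = {
    "apk": "Android",
    "jar": "Java",
    "xlsx": "Microsoft Excel",
}


def resolve_skip_value(value: str) -> str:
    """Translate a known extension into its magic prefix.

    Values that are not known extensions are returned unchanged, so they
    would still be treated as plain magic prefixes.
    """
    key = value.lower().lstrip(".")
    return EXTENSION_TO_MAGIC.get(key, value)


for value in ["APK", ".jar", "Zip archive"]:
    print(value, "->", resolve_skip_value(value))
```

In this sketch, --skip-magic "APK" would be resolved to the prefix "Android" before matching, while an unknown value such as "Zip archive" passes through as-is.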

PS: If there is a better way to match apk files, I'm open to suggestions.

Hi @7homasSutter ! Thanks for the suggestion, we'll discuss it internally to see what would be the best course of action here. Will keep you posted.

Discussion has some relation to #243

Hi @7homasSutter ! Finally got some time to work on unblob feature requests. I opened an MR that changes the way we handle skip magic lists. See #693

Regarding the ability to filter between apk, jar, zip, etc., I think the best way to handle it would be to introduce an extension-based filter. Even more so because libmagic versions differ: some of the older versions do not differentiate between an apk and a jar, for example.

Introducing a mapping between extension and magic mime, as you suggest, would bring too much confusion, since end users would not really know whether they need to provide a magic or an extension.

We have a similar problem described in #600, where we do not want to extract .rlib files, but their magic mime is "current ar archive". If we want to keep extracting other ar archives, we need to filter on extension.
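As a rough illustration of an extension-based filter, here is a minimal sketch. The function name and its behaviour (case-insensitive match on the final suffix only) are assumptions for discussion, not the design of any existing or planned unblob option.

```python
from pathlib import Path
from typing import Iterable


def skip_by_extension(path: str, skip_extensions: Iterable[str]) -> bool:
    """Return True if the file's final suffix is in the skip set.

    Extensions are accepted with or without a leading dot and are
    compared case-insensitively.
    """
    suffix = Path(path).suffix.lower()
    normalized = {"." + ext.lower().lstrip(".") for ext in skip_extensions}
    return suffix in normalized


# Both files would share the same ar archive magic, but only one is skipped.
print(skip_by_extension("target/libcore.rlib", [".rlib"]))
print(skip_by_extension("usr/lib/libc.a", [".rlib"]))
```

Because the decision depends only on the file name, .rlib files can be skipped while other ar archives are still extracted, and the result does not change with the libmagic version.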

From my perspective, the use case you described should be solved by #693 and the introduction of a --skip-extension CLI argument. We'll see what my team members have to say about this tho :)