sindresorhus/file-type

TIF files return an End-Of-Stream error

sandercoffee opened this issue · 7 comments

Hello, I'm new here and not sure how to make a pull request correctly, so I'll share some details I found:

1. Some .TIF files return an End-Of-Stream error, which breaks validation. (BUG)

At this step, the code is meant to return a more specific format based on the TIFF tags, or fall back to the default "image/tiff". Because the error is not handled, it breaks the validation instead.

file-type/core.js

Lines 1486 to 1490 in feac593

const fileType = await this.readTiffIFD(false);
return fileType ? fileType : {
  ext: 'tif',
  mime: 'image/tiff',
};

So I wrapped the call in a try/catch, and it worked perfectly:

const tif = {
  ext: 'tif',
  mime: 'image/tiff',
};
try {
  const fileType = await this.readTiffIFD(false);
  return fileType ? fileType : tif;
} catch (_) {
  return tif;
}

Example file that triggers the error: https://drive.google.com/file/d/1UDiCM3jmi0-VJzLD0B7Zn9yoKFL5V3Ur/view?usp=sharing
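
A catch-all like the one above also hides unrelated failures. A narrower variant is sketched below, assuming the tokenizer signals truncation with a dedicated error class. The `EndOfStreamError` class here is a local stand-in for whatever the tokenizer actually throws, and `detectTiff` is a hypothetical wrapper, not part of file-type's API.

```javascript
// Local stand-in (assumption): in file-type the real class would come from
// the tokenizer library rather than being defined here.
class EndOfStreamError extends Error {
  constructor() {
    super('End-Of-Stream');
    this.name = 'EndOfStreamError';
  }
}

const genericTiff = {ext: 'tif', mime: 'image/tiff'};

// `readTiffIfd` is any async function that resolves to a more specific
// file type, resolves to undefined, or throws.
async function detectTiff(readTiffIfd) {
  try {
    return (await readTiffIfd()) ?? genericTiff;
  } catch (error) {
    if (error instanceof EndOfStreamError) {
      // Truncated IFD: the TIFF header already matched, so report plain TIFF.
      return genericTiff;
    }

    throw error; // Anything else is unexpected and should surface.
  }
}
```

With this shape, a truncated file still yields "image/tiff", while a programming error inside the IFD reader is not silently swallowed.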

2. Add support for .MSI (Microsoft Software Installer), which is currently detected as application/x-cfb (enhancement) (help wanted)

file-type/core.js

Lines 1206 to 1218 in feac593

if (this.check([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])) {
  // Detected Microsoft Compound File Binary File (MS-CFB) Format.
  return {
    ext: 'cfb',
    mime: 'application/x-cfb',
  };
}
// Increase sample size from 12 to 256.
await tokenizer.peekBuffer(this.buffer, {length: Math.min(256, tokenizer.fileInfo.size), mayBeLess: true});
// -- 15-byte signatures --

Change the code to the following:

// Increase sample size from 12 to 256.
await tokenizer.peekBuffer(this.buffer, { length: Math.min(256, tokenizer.fileInfo.size), mayBeLess: true });

if (this.check([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x3e, 0x00])) {
  // Detected Microsoft Software Installer File.
  return {
    ext: "msi",
    mime: "application/x-msi",
  };
}

if (this.check([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1])) {
  // Detected Microsoft Compound File Binary File (MS-CFB) Format.
  return {
    ext: "cfb",
    mime: "application/x-cfb",
  };
}

// -- 15-byte signatures --

⚠️ This validation is incomplete; the bytes that follow the signature need a more specific check.
⚠️ Both fixture-2.doc.cfb and fixture.msi.cfb are detected as "application/x-msi".
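
A likely reason both fixtures match: going by the MS-CFB header layout, the 18 bytes appended to the signature are the reserved header CLSID (all zeros) followed by the minor version (0x003E). Those fields look the same in practically every compound file, whatever it contains. The sketch below checks exactly those bytes on a synthetic header; the function name is made up for illustration.

```javascript
// Sketch: offsets follow the MS-CFB header layout. The function name is
// hypothetical and not part of file-type.
function extendedSignatureIsGeneric(header) {
  // Bytes 8..23: header CLSID, reserved and expected to be all zeros.
  const clsidIsZero = header.subarray(8, 24).every(byte => byte === 0);
  // Bytes 24..25: minor version, little-endian, normally 0x003E.
  const minorVersion = header[24] | (header[25] << 8);
  return clsidIsZero && minorVersion === 0x003E;
}

// A minimal synthetic header: magic, zero CLSID, minor version 0x003E.
const header = new Uint8Array(512);
header.set([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1], 0);
header.set([0x3E, 0x00], 24);

console.log(extendedSignatureIsGeneric(header)); // true for this synthetic header
```

If that reading is right, telling MSI apart from other CFB files requires looking past the header, for example at the root directory entry.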

If you can comment and collaborate with ideas I'd appreciate it :D

re: cfb checking - i'm porting this to lua, and i added support for doc/ppt/xls as well:
(the original code is from this SO post)

  if check("\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1") then
    local sector_size = bit.lshift(1, get_u16_le(pos + 30))
    local root_dir_index = get_u32_le(pos + 48)
    pos = (root_dir_index + 1) * sector_size + 81
    -- microsoft CLSIDs below
    -- versions:
    -- 5 (95)
    -- 6 (6.0-7.0)
    -- 8 (97-2003)
    -- 12 (2007?)
    -- https://raw.githubusercontent.com/decalage2/oletools/master/oletools/common/clsid.py
    if check("\x9b\x4c\x75\xf4\xf5\x64\x40\x4b\x8a\xf4\x67\x97\x32\xac\x06\x07") then
      -- Word.Document.12: f4754c9b-64f5-4b40-8af4-679732ac0607
      return "doc", "application/msword"
    elseif check("\x06\x09\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- Word.Document.8: 00020906-0000-0000-c000-000000000046
      return "doc", "application/msword"
    elseif check("\x00\x09\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- Word.Document.6: 00020900-0000-0000-c000-000000000046
      return "doc", "application/msword"
    elseif check("\x30\x08\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- Excel.Sheet.12: 00020830-0000-0000-c000-000000000046
      return "xls", "application/vnd.ms-excel"
    elseif check("\x20\x08\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- Excel.Sheet.8: 00020820-0000-0000-c000-000000000046
      return "xls", "application/vnd.ms-excel"
    elseif check("\x10\x08\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- Excel.Sheet.5: 00020810-0000-0000-c000-000000000046
      return "xls", "application/vnd.ms-excel"
    elseif check("\xf4\x55\x4f\xcf\x87\x8f\x47\x4d\x80\xbb\x58\x08\x16\x4b\xb3\xf8") then
      -- Powerpoint.Show.12: cf4f55f4-8f87-4d47-80bb-5808164bb3f8
      return "ppt", "application/vnd.ms-powerpoint"
    elseif check("\x10\x8d\x81\x64\x9b\x4f\xcf\x11\x86\xea\0\xaa\0\xb9\x29\xe8") then
      -- Powerpoint.Show.8: 64818d10-4f9b-11cf-86ea-00aa00b929e8
      return "ppt", "application/vnd.ms-powerpoint"
    elseif check("\x84\x10\x0c\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- msi: 000c1084-0000-0000-c000-000000000046
      return "msi", "application/octet-stream"
    else
      return "cfb", "application/x-cfb"
    end
  end
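
For comparison with core.js, here is a rough JavaScript sketch of the same lookup. It assumes the whole file (or at least everything up to the root directory entry) is available as a `Uint8Array`; `formatClsid` and `readRootClsid` are hypothetical helpers, not file-type APIs.

```javascript
// Turn the 16 stored CLSID bytes into the usual textual form.
function formatClsid(bytes) {
  const hex = [...bytes].map(byte => byte.toString(16).padStart(2, '0'));
  // The first three GUID fields are stored little-endian, the rest as-is.
  return [
    hex.slice(0, 4).reverse().join(''),
    hex.slice(4, 6).reverse().join(''),
    hex.slice(6, 8).reverse().join(''),
    hex.slice(8, 10).join(''),
    hex.slice(10, 16).join(''),
  ].join('-');
}

// Read the CLSID of the root directory entry of a CFB file.
function readRootClsid(file) {
  const view = new DataView(file.buffer, file.byteOffset, file.byteLength);
  const sectorSize = 1 << view.getUint16(30, true); // sector shift at offset 30
  const firstDirectorySector = view.getUint32(48, true);
  // Sector n starts at (n + 1) * sectorSize; the CLSID sits 80 bytes
  // into the root entry (the Lua version uses 81 because it is 1-indexed).
  const clsidOffset = ((firstDirectorySector + 1) * sectorSize) + 80;
  if (clsidOffset + 16 > file.length) {
    return undefined; // Sample too short to reach the root directory entry.
  }

  return formatClsid(file.subarray(clsidOffset, clsidOffset + 16));
}
```

Note that with 512-byte sectors the earliest possible CLSID offset is 592, so a 256-byte sample can never reach it; a real implementation would have to read further into the file.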

as for tif, oddly enough that image seems to work fine in the port - however fixture.tif (correctly) errors as it is severely truncated.

on a related note, it might be a good idea to reconsider what happens when a file almost has the correct structure, but is invalid (an example being the invalid png fixture) - since it's probably still useful information that it would be a png if only it weren't invalid

Thanks for your detailed feedback @sandercoffee.

Please don't mix issues. It is harder to track the status if we only work on or resolve one of the sections.

You can read GitHub's guidance on how to create a pull request: Creating a pull request

I don't find it super clear, so maybe this small summary helps:

  1. Fork this repository via the GitHub interface
  2. Clone the forked repository locally
  3. Create a new branch locally (on your computer). Try to name the branch so that it is clear what the change is about; this is not critical.
  4. Commit your changes locally
  5. Push the branch (it will be pushed to your forked repository).
  6. Turn the branch into a pull request (PR):
    1. Go to this repository; you will probably see your branch on the first page, with the option to turn it into a PR
    2. Describe the change
    3. If it resolves an issue, use something like Resolves: #560 in the description
    4. Your PR will be reviewed, unless you mark it as a Draft, which indicates you have not finished
    5. Keep adding commits if you want to make further changes

The following conventions are commonly used to name the repositories involved:

  1. upstream (this repository, the target repository you want to contribute to)
  2. origin (the fork you created of this repository)
  3. local (the local clone you have on your workstation)

Image source: Confusing Terms in the Git Terminology

See also: https://levelup.gitconnected.com/how-to-sync-forked-repositories-using-git-or-github-2933e497fa16

Just give it a try, that's how we all started.

re: resolving one of the sections... technically you could make it a checklist, and progress would update correctly... but yeah it's still a lot better to split it into multiple issues, especially if they're not very related...

if anyone's interested, here's an updated list of cfb clsids. (if it's marked as non-standard, that means i'm just guessing what the mimetype would be) (also note that the autodesk mimetype is the one autodesk uses, rather than the one registered with IANA)

    if
      check("\x9b\x4c\x75\xf4\xf5\x64\x40\x4b\x8a\xf4\x67\x97\x32\xac\x06\x07") or -- Word.Document.12: clsid f4754c9b-64f5-4b40-8af4-679732ac0607
      check("\x06\x09\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") or -- Word.Document.8: clsid 00020906-0000-0000-c000-000000000046
      check("\x00\x09\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46")    -- Word.Document.6: clsid 00020900-0000-0000-c000-000000000046
    then
      return "doc", "application/msword"
    elseif
      check("\x30\x08\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") or -- Excel.Sheet.12: clsid 00020830-0000-0000-c000-000000000046
      check("\x20\x08\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") or -- Excel.Sheet.8: clsid 00020820-0000-0000-c000-000000000046
      check("\x10\x08\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46")    -- Excel.Sheet.5: clsid 00020810-0000-0000-c000-000000000046
    then
      return "xls", "application/vnd.ms-excel"
    elseif
      check("\xf4\x55\x4f\xcf\x87\x8f\x47\x4d\x80\xbb\x58\x08\x16\x4b\xb3\xf8") or -- Powerpoint.Show.12: clsid cf4f55f4-8f87-4d47-80bb-5808164bb3f8
      check("\x10\x8d\x81\x64\x9b\x4f\xcf\x11\x86\xea\0\xaa\0\xb9\x29\xe8")        -- Powerpoint.Show.8: clsid 64818d10-4f9b-11cf-86ea-00aa00b929e8
    then
      return "ppt", "application/vnd.ms-powerpoint"
    elseif check("\x46\xf0\x06\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- TemplateMessage: clsid 0006f046-0000-0000-c000-000000000046
      return "oft", "application/vnd.ms-outlook"
    elseif check("\x0b\x0d\x02\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- MailMessage: clsid 00020d0b-0000-0000-c000-000000000046
      return "msg", "application/vnd.ms-outlook"
    elseif check("\x84\x10\x0c\0\0\0\0\0\xc0\0\0\0\0\0\0\x46") then
      -- msi: clsid 000c1084-0000-0000-c000-000000000046
      return "msi", "application/octet-stream"
    elseif -- autodesk inventor: https://knowledge.autodesk.com/search-result/caas/simplecontent/content/documentsubtype-list-common-name-inventors-name-cslid-inv-pro-2021-dev-tools.html
      -- fixtures: https://knowledge.autodesk.com/support/inventor/troubleshooting/caas/downloads/content/inventor-sample-files.html
      check("\x90\xb4\x29\x4d\xb2\x49\xd0\x11\x93\xc3\x7e\x07\x06\x00\x00\x00") or -- Part: clsid 4d29b490-49b2-11d0-93c3-7e0706000000
      check("\x03\x42\x46\x9c\xae\x9b\xd3\x11\x8b\xad\x00\x60\xb0\xce\x6b\xb4") or -- Sheet Metal Part: clsid 9c464203-9bae-11d3-8bad-0060b0ce6bb4
      check("\x19\x54\x05\x92\xfa\xb3\xd3\x11\xa4\x79\x00\xc0\x4f\x6b\x95\x31") or -- Generic Proxy Part: clsid 92055419-b3fa-11d3-a479-00c04f6b9531
      check("\x04\x42\x46\x9c\xae\x9b\xd3\x11\x8b\xad\x00\x60\xb0\xce\x6b\xb4") or -- Compatibility Proxy Part: clsid 9c464204-9bae-11d3-8bad-0060b0ce6bb4
      check("\xaf\xd3\x88\x9c\xeb\xc3\xd3\x11\xb7\x9e\x00\x60\xb0\xf1\x59\xef") or -- Catalog Proxy Part: clsid 9c88d3af-c3eb-11d3-b79e-0060b0f159ef
      check("\xd4\x80\x8d\x4d\xb0\xf5\x60\x44\x8c\xea\x4c\xd2\x22\x68\x44\x69")    -- Molded Part Document: clsid 4d8d80d4-f5b0-4460-8cea-4cd222684469
    then
      return "ipt", "application/vnd.autodesk.inventor" -- non-standard
    elseif
      check("\xe1\x81\x0f\xe6\xb3\x49\xd0\x11\x93\xc3\x7e\x07\x06\x00\x00\x00") or -- Assembly: clsid e60f81e1-49b3-11d0-93c3-7e0706000000
      check("\x54\x83\xec\x28\x24\x90\x0f\x44\xa8\xa2\x0e\x0e\x55\xd6\x35\xb0")    -- Weldment: clsid 28ec8354-9024-440f-a8a2-0e0e55d635b0
    then
      return "iam", "application/vnd.autodesk.inventor.assembly"
    elseif
      check("\x80\x3a\x28\x76\xdd\x50\xd3\x11\xa7\xe3\x00\xc0\x4f\x79\xd7\xbc") or -- Presentation: clsid 76283a80-50dd-11d3-a7e3-00c04f79d7bc
      check("\x7d\xc1\xb4\xa2\xd2\xf0\x0f\x4c\x97\x99\xdd\x5f\x71\xdf\xb2\x91")    -- Composite Presentation: clsid a2b4c17d-f0d2-4c0f-9799-dd5f71dfb291
    then
      return "ipn", "application/vnd.autodesk.inventor.presentation" -- non-standard
    elseif check("\xf1\xfd\xf9\xbb\xdc\x52\xd0\x11\x8c\x04\x08\x00\x09\x0b\xe8\xec") then
      -- Drawing: clsid bbf9fdf1-52dc-11d0-8c04-0800090be8ec
      return "idw", "application/vnd.autodesk.inventor.drawing" -- non-standard
    elseif check("\x5d\x5c\xb9\x81\x31\x8e\x65\x4f\x97\x90\xcc\xf6\xec\xab\xd1\x41") then
      -- Design View: clsid 81b95c5d-8e31-4f65-9790-ccf6ecabd141
      return "idv", "application/vnd.autodesk.inventor.designview" -- non-standard
    elseif check("\x30\xb0\xfb\x62\xc7\x24\xd3\x11\xb7\x8d\x00\x60\xb0\xf1\x59\xef") then
      -- iFeature: clsid 62fbb030-24c7-11d3-b78d-0060b0f159ef
      return "ide", "application/vnd.autodesk.inventor.ifeature" -- non-standard
    else
      return "cfb", "application/x-cfb"
    end
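
For anyone extending the list: the byte strings are not the CLSID text read left to right, because the first three groups are stored little-endian. A small sketch of the conversion follows; `clsidToBytes` is a hypothetical helper name, not something from file-type or the Lua port.

```javascript
// Convert a textual CLSID into the 16 bytes as they appear in the file.
function clsidToBytes(clsid) {
  const [data1, data2, data3, data4, data5] = clsid.split('-');
  const toBytes = hex => hex.match(/../g).map(pair => Number.parseInt(pair, 16));
  return [
    ...toBytes(data1).reverse(), // 4 bytes, little-endian
    ...toBytes(data2).reverse(), // 2 bytes, little-endian
    ...toBytes(data3).reverse(), // 2 bytes, little-endian
    ...toBytes(data4), // 2 bytes, stored as written
    ...toBytes(data5), // 6 bytes, stored as written
  ];
}

// Word.Document.12 from the list above.
const wordBytes = clsidToBytes('f4754c9b-64f5-4b40-8af4-679732ac0607');
console.log(wordBytes.map(byte => byte.toString(16).padStart(2, '0')).join(' '));
// 9b 4c 75 f4 f5 64 40 4b 8a f4 67 97 32 ac 06 07
```

The printed bytes match the first `check(...)` string in the list, which is a quick way to verify a new entry before adding it.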

Please open a different issue for the MSI requirements.

Just had the same problem with TIFF files.
Fixed by upgrading the library to the newest version.

I just wanted to thank you guys for the great effort 💪 🚀