False negative for a PDF?

Question

conjuncts opened this issue a year ago · 9 comments

I have a PDF whose type filetype is unable to recognize.

!wget -O bulk/3.pdf bulk -q https://www.nature.com/articles/s41467-023-38544-z.pdf

import filetype
out = filetype.guess("./bulk/3.pdf")
print(out) # None

Answer 1 · 2024-10-21T14:57:08.000Z

Works fine on the latest git commit and on v1.2.0, with the PDF downloaded directly from the browser.
Can you post the hex of the first 16 bytes of the file you have the issue with?

Answer 2 · 2024-10-22T00:53:48.000Z

That's strange, I tried it again with v1.2.0 and it is still None. Maybe my file somehow got modified.
3.pdf

file_path = './bulk/3.pdf'

# Open the file in binary mode and read the first 16 bytes
with open(file_path, 'rb') as file:
    first_16_bytes = file.read(16)

# Convert the bytes to hexadecimal format
hex_output = first_16_bytes.hex()
hex_output # 3c68746d6c3e3c686561643e3c6d6574

Environment: Windows 11

Answer 3 · 2024-10-22T03:43:53.000Z

Your hex:
00000000 3C 68 74 6D 6C 3E 3C 68 65 61 64 3E 3C 6D 65 74 <html><head><met
It should be:
00000000 25 50 44 46 2D 31 2E 34 0A 25 E2 E3 CF D3 0A 31 %PDF-1.4.%.....1

Answer 4 · 2024-10-22T03:56:47.000Z

Okay, it seems that I have a different version of the pdf compared to Nature's. The wget pdf has the 00000000 25 50 ... hex, but the pdf that I attached has the 00000000 3C 68 ... hex. With the attached pdf, I was able to replicate it on google colab as well.

Answer 5 · 2024-10-22T04:06:00.000Z

3.pdf has HTML in front of the actual PDF data (Firefox opens it fine).
The PDF itself is the same: I removed the HTML part and checked the hash.

Hex View  00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
 
00000000  3C 68 74 6D 6C 3E 3C 68  65 61 64 3E 3C 6D 65 74  <html><head><met
00000010  61 20 68 74 74 70 2D 65  71 75 69 76 3D 22 72 65  a http-equiv="re
00000020  66 72 65 73 68 22 20 63  6F 6E 74 65 6E 74 3D 22  fresh" content="
00000030  30 3B 75 72 6C 3D 68 74  74 70 3A 2F 2F 64 6E 73  0;url=http://dns
00000040  65 72 72 6F 72 61 73 73  69 73 74 2E 61 74 74 2E  errorassist.att.
00000050  6E 65 74 2F 73 65 61 72  63 68 2F 3F 71 3D 68 74  net/search/?q=ht
00000060  74 70 3A 2F 2F 62 75 6C  6B 25 32 46 25 32 36 73  tp://bulk%2F%26s
00000070  72 63 68 67 64 65 43 69  64 25 33 44 61 61 61 61  rchgdeCid%3Daaaa
00000080  61 61 61 61 25 32 36 74  25 33 44 30 25 32 36 62  aaaa%26t%3D0%26b
00000090  63 25 33 44 22 2F 3E 3C  2F 68 65 61 64 3E 3C 62  c%3D"/></head><b
000000A0  6F 64 79 3E 3C 73 63 72  69 70 74 20 74 79 70 65  ody><script type
000000B0  3D 22 74 65 78 74 2F 6A  61 76 61 73 63 72 69 70  ="text/javascrip
000000C0  74 22 3E 77 69 6E 64 6F  77 2E 6C 6F 63 61 74 69  t">window.locati
000000D0  6F 6E 3D 22 68 74 74 70  3A 2F 2F 64 6E 73 65 72  on="http://dnser
000000E0  72 6F 72 61 73 73 69 73  74 2E 61 74 74 2E 6E 65  rorassist.att.ne
000000F0  74 2F 73 65 61 72 63 68  2F 3F 71 3D 22 2B 65 73  t/search/?q="+es
00000100  63 61 70 65 28 77 69 6E  64 6F 77 2E 6C 6F 63 61  cape(window.loca
00000110  74 69 6F 6E 29 2B 22 26  72 3D 22 2B 65 73 63 61  tion)+"&r="+esca
00000120  70 65 28 64 6F 63 75 6D  65 6E 74 2E 72 65 66 65  pe(document.refe
00000130  72 72 65 72 29 2B 22 26  74 3D 30 26 73 72 63 68  rrer)+"&t=0&srch
00000140  67 64 65 43 69 64 3D 61  61 61 61 61 61 61 61 26  gdeCid=aaaaaaaa&
00000150  62 63 3D 22 3B 3C 2F 73  63 72 69 70 74 3E 3C 2F  bc=";</script></
00000160  62 6F 64 79 3E 3C 2F 68  74 6D 6C 3E 25 50 44 46  body></html>%PDF
00000170  2D 31 2E 34 0A 25 E2 E3  CF D3 0A 31 20 30 20 6F  -1.4.%.....1 0 o

<html>
<head>
    <meta http-equiv="refresh"
        content="0;url=http://dnserrorassist.att.net/search/?q=http://bulk%2F%26srchgdeCid%3Daaaaaaaa%26t%3D0%26bc%3D" />
</head>
<body>
    <script
        type="text/javascript">window.location = "http://dnserrorassist.att.net/search/?q=" + escape(window.location) + "&r=" + escape(document.referrer) + "&t=0&srchgdeCid=aaaaaaaa&bc=";</script>
</body>
</html>
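
The check described above (drop everything before the PDF header, then compare hashes) can be sketched roughly as follows. This is only an illustration: `strip_leading_junk` and `sha256_hex` are hypothetical helper names, not part of filetype, and the byte strings are tiny stand-ins for the real files.

```python
import hashlib

PDF_MAGIC = b"%PDF-"  # every PDF header starts with these five bytes


def strip_leading_junk(data: bytes) -> bytes:
    """Return data starting at the first PDF header, or raise if none is found."""
    offset = data.find(PDF_MAGIC)
    if offset < 0:
        raise ValueError("no %PDF- header found")
    return data[offset:]


def sha256_hex(data: bytes) -> str:
    """Hex digest used to compare the cleaned file with a known-good copy."""
    return hashlib.sha256(data).hexdigest()


# Stand-ins: a clean PDF start, and the same bytes with an HTML page in front.
clean = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj"
polluted = b"<html><head></head><body></body></html>" + clean

# After stripping the prefix, both copies hash to the same value.
assert sha256_hex(strip_leading_junk(polluted)) == sha256_hex(clean)
```

For real files, read both with `open(path, "rb").read()` and compare the digests the same way.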

Answer 6 · 2024-10-22T04:12:24.000Z

Huh, interesting. I still would prefer if filetype were to be able to recognize this as a pdf, though. For example, pymupdf is able to open the pdf no problem.

I was able to replicate the pdf miss on google colab with this code:

!pip install filetype==1.2.0
!wget -O 3.pdf https://github.com/user-attachments/files/17468433/3.pdf
import filetype
out = filetype.guess("3.pdf")
print(out) # None

Answer 7 · 2024-10-22T04:54:27.000Z

Hmm, I ran your wget command in an Ubuntu VM and got a clean PDF file.
It's not hard to add an HTML bypass to detect the PDF, but I don't know how common this is (it's probably just a fluke).
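
A bypass of the kind mentioned here could look roughly like the sketch below: instead of requiring `%PDF-` at offset 0, it accepts the header anywhere in a small leading window. This is not filetype's API. `looks_like_pdf`, `path_looks_like_pdf` and `SEARCH_WINDOW` are hypothetical names, and the 1024-byte window is an assumption modeled on the leniency of some PDF readers. In the attached 3.pdf the header sits at offset 0x16C, so it would fall inside that window.

```python
PDF_MAGIC = b"%PDF-"
SEARCH_WINDOW = 1024  # assumption: how far into the file a header is tolerated


def looks_like_pdf(data: bytes, window: int = SEARCH_WINDOW) -> bool:
    """True if a PDF header lies entirely within the first `window` bytes."""
    return data.find(PDF_MAGIC, 0, window) != -1


def path_looks_like_pdf(path: str, window: int = SEARCH_WINDOW) -> bool:
    """Same check for a file on disk; only the leading window is read."""
    with open(path, "rb") as fh:
        return looks_like_pdf(fh.read(window), window)


# Header at offset 0 (a normal PDF) and header behind an HTML prefix.
assert looks_like_pdf(b"%PDF-1.4\n")
assert looks_like_pdf(b"<html></html>%PDF-1.4\n")
# Plain HTML without any PDF header is still rejected.
assert not looks_like_pdf(b"<html><body>hello</body></html>")
```

The trade-off is a higher false-positive rate: any file that happens to contain the text `%PDF-` near its start, such as an HTML page that quotes it, would also match, which may be why a strict offset-0 check is the default.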

Answer 8 · 2025-01-24T09:12:32.000Z

I have a similar issue with the attached file (extracted from Common Crawl; there are more similar cases there). The file command identifies it as application/pdf, and pdftotext can extract text from it. Would it be possible to add a workaround in filetype.py?

WMOOS6EVZZXGYSVQPP7UORUTNEUEBO3Q.pdf

Answer 9 · 2025-01-24T13:47:02.000Z

There is garbage before the PDF header.
As far as I understand, the author of filetype.py is trying to follow the file format spec; I don't know how file works.

Hex View  00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
 
00000000  00 10 43 6C 61 73 68 20  42 79 6C 61 77 73 2E 70  ..Clash Bylaws.p
00000010  64 66 00 00 00 00 00 00  00 00 00 00 00 00 00 00  df..............
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000040  00 00 00 00 00 00 00 00  00 01 00 61 73 43 6C 68  ...........asClh
00000050  20 00 00 00 01 69 66 00  00 00 00 00 00 00 00 00   ....if.........
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ................
00000070  00 00 00 00 00 00 00 00  00 00 00 00 3D 37 00 00  ............=7..
00000080  25 50 44 46 2D 31 2E 33  0A 32 20 30 20 6F 62 6A  %PDF-1.3.2 0 obj