False negative for a PDF?
conjuncts opened this issue · 9 comments
I have a PDF that filetype fails to recognize.
!wget -O bulk/3.pdf bulk -q https://www.nature.com/articles/s41467-023-38544-z.pdf
import filetype
out = filetype.guess("./bulk/3.pdf")
print(out) # None
Works fine for me on the latest git commit and on v1.2.0, with the PDF downloaded directly from the browser.
Can you post a hex dump of the first 16 bytes of the file you're having the issue with?
That's strange, I tried it again with v1.2.0 and it is still None. Maybe my file somehow got modified.
3.pdf
file_path = './bulk/3.pdf'
# Open the file in binary mode and read the first 16 bytes
with open(file_path, 'rb') as file:
    first_16_bytes = file.read(16)
# Convert the bytes to hexadecimal format
hex_output = first_16_bytes.hex()
hex_output # 3c68746d6c3e3c686561643e3c6d6574
Environment: Windows 11
your hex
00000000 3C 68 74 6D 6C 3E 3C 68 65 61 64 3E 3C 6D 65 74 <html><head><met
should be
00000000 25 50 44 46 2D 31 2E 34 0A 25 E2 E3 CF D3 0A 31 %PDF-1.4.%.....1
Okay, it seems that I have a different version of the pdf compared to Nature's. The wget pdf has the 00000000 25 50 ... hex, but the pdf that I attached has the 00000000 3C 68 ... hex. With the attached pdf, I was able to replicate it on google colab as well.
3.pdf has HTML in front of the actual PDF data (Firefox opens it fine).
The PDF itself is the same (I removed the HTML part and checked the hash).
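The check described here (drop the HTML prefix, then compare hashes) could be reproduced roughly as follows. This is only a sketch: `strip_to_pdf_header`, `sha256_hex`, and the sample byte strings are made up for illustration and are not part of filetype.py.

```python
import hashlib

PDF_MAGIC = b"%PDF-"


def strip_to_pdf_header(data: bytes) -> bytes:
    """Return data starting at the first %PDF- marker (hypothetical helper)."""
    offset = data.find(PDF_MAGIC)
    if offset == -1:
        raise ValueError("no %PDF- marker found")
    return data[offset:]


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Stand-in bytes; with real files, read both with open(path, "rb").read().
clean = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
polluted = b"<html><head></head><body></body></html>" + clean

# True when the payload after the prefix is byte-identical to the clean file.
same_payload = sha256_hex(strip_to_pdf_header(polluted)) == sha256_hex(clean)
print(same_payload)
```

Matching digests show that only the leading bytes differ, not the PDF payload.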
Hex View 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 3C 68 74 6D 6C 3E 3C 68 65 61 64 3E 3C 6D 65 74 <html><head><met
00000010 61 20 68 74 74 70 2D 65 71 75 69 76 3D 22 72 65 a http-equiv="re
00000020 66 72 65 73 68 22 20 63 6F 6E 74 65 6E 74 3D 22 fresh" content="
00000030 30 3B 75 72 6C 3D 68 74 74 70 3A 2F 2F 64 6E 73 0;url=http://dns
00000040 65 72 72 6F 72 61 73 73 69 73 74 2E 61 74 74 2E errorassist.att.
00000050 6E 65 74 2F 73 65 61 72 63 68 2F 3F 71 3D 68 74 net/search/?q=ht
00000060 74 70 3A 2F 2F 62 75 6C 6B 25 32 46 25 32 36 73 tp://bulk%2F%26s
00000070 72 63 68 67 64 65 43 69 64 25 33 44 61 61 61 61 rchgdeCid%3Daaaa
00000080 61 61 61 61 25 32 36 74 25 33 44 30 25 32 36 62 aaaa%26t%3D0%26b
00000090 63 25 33 44 22 2F 3E 3C 2F 68 65 61 64 3E 3C 62 c%3D"/></head><b
000000A0 6F 64 79 3E 3C 73 63 72 69 70 74 20 74 79 70 65 ody><script type
000000B0 3D 22 74 65 78 74 2F 6A 61 76 61 73 63 72 69 70 ="text/javascrip
000000C0 74 22 3E 77 69 6E 64 6F 77 2E 6C 6F 63 61 74 69 t">window.locati
000000D0 6F 6E 3D 22 68 74 74 70 3A 2F 2F 64 6E 73 65 72 on="http://dnser
000000E0 72 6F 72 61 73 73 69 73 74 2E 61 74 74 2E 6E 65 rorassist.att.ne
000000F0 74 2F 73 65 61 72 63 68 2F 3F 71 3D 22 2B 65 73 t/search/?q="+es
00000100 63 61 70 65 28 77 69 6E 64 6F 77 2E 6C 6F 63 61 cape(window.loca
00000110 74 69 6F 6E 29 2B 22 26 72 3D 22 2B 65 73 63 61 tion)+"&r="+esca
00000120 70 65 28 64 6F 63 75 6D 65 6E 74 2E 72 65 66 65 pe(document.refe
00000130 72 72 65 72 29 2B 22 26 74 3D 30 26 73 72 63 68 rrer)+"&t=0&srch
00000140 67 64 65 43 69 64 3D 61 61 61 61 61 61 61 61 26 gdeCid=aaaaaaaa&
00000150 62 63 3D 22 3B 3C 2F 73 63 72 69 70 74 3E 3C 2F bc=";</script></
00000160 62 6F 64 79 3E 3C 2F 68 74 6D 6C 3E 25 50 44 46 body></html>%PDF
00000170 2D 31 2E 34 0A 25 E2 E3 CF D3 0A 31 20 30 20 6F -1.4.%.....1 0 o
<html>
<head>
<meta http-equiv="refresh"
content="0;url=http://dnserrorassist.att.net/search/?q=http://bulk%2F%26srchgdeCid%3Daaaaaaaa%26t%3D0%26bc%3D" />
</head>
<body>
<script
type="text/javascript">window.location = "http://dnserrorassist.att.net/search/?q=" + escape(window.location) + "&r=" + escape(document.referrer) + "&t=0&srchgdeCid=aaaaaaaa&bc=";</script>
</body>
</html>
Huh, interesting. I still would prefer if filetype were to be able to recognize this as a pdf, though. For example, pymupdf is able to open the pdf no problem.
I was able to replicate the pdf miss on google colab with this code:
!pip install filetype==1.2.0
!wget -O 3.pdf https://github.com/user-attachments/files/17468433/3.pdf
import filetype
out = filetype.guess("3.pdf")
print(out) # None
Hmm, I ran your wget command in an Ubuntu VM and got a clean PDF file.
It's not hard to add an HTML bypass to detect the PDF, but I don't know how common this is (it's probably just a fluke).
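A bypass along these lines might look for the `%PDF-` marker anywhere near the start of the file instead of only at offset 0. The sketch below is an assumption about how that could be done, not filetype.py's actual matcher. `find_pdf_header`, `looks_like_pdf`, and the 1024-byte window are illustrative choices; the window mirrors the leniency Adobe documents for Acrobat, which accepts a header anywhere in the first 1024 bytes.

```python
PDF_MAGIC = b"%PDF-"


def find_pdf_header(buf: bytes, window: int = 1024) -> int:
    """Offset of the first %PDF- marker lying entirely inside the first
    `window` bytes of buf, or -1 if there is none (hypothetical helper)."""
    return buf.find(PDF_MAGIC, 0, window)


def looks_like_pdf(buf: bytes, window: int = 1024) -> bool:
    """Lenient check: True if a PDF header appears within the window."""
    return find_pdf_header(buf, window) != -1


# Stand-in data: a short HTML prefix followed by a PDF header.
html_prefix = b"<html><head></head><body></body></html>"
sample = html_prefix + b"%PDF-1.4\n"

print(find_pdf_header(sample) == len(html_prefix))
print(looks_like_pdf(b"<html></html>"))
```

One option would be to call this only as a fallback when `filetype.guess` returns `None`, so strict detection stays the default. The trade-off is false positives: any file that merely contains the text `%PDF-` near its start would also match.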
I have a similar issue with the attached file (extracted from Common Crawl; there are more similar cases there). The file command identifies it as application/pdf, and pdftotext can extract text from it. Would it be possible to add a workaround in filetype.py?
There is garbage before the PDF header.
As far as I understand, the author of filetype.py is trying to follow the file format specs; I don't know how file does its detection.
Hex View 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 00 10 43 6C 61 73 68 20 42 79 6C 61 77 73 2E 70 ..Clash Bylaws.p
00000010 64 66 00 00 00 00 00 00 00 00 00 00 00 00 00 00 df..............
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040 00 00 00 00 00 00 00 00 00 01 00 61 73 43 6C 68 ...........asClh
00000050 20 00 00 00 01 69 66 00 00 00 00 00 00 00 00 00 ....if.........
00000060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070 00 00 00 00 00 00 00 00 00 00 00 00 3D 37 00 00 ............=7..
00000080 25 50 44 46 2D 31 2E 33 0A 32 20 30 20 6F 62 6A %PDF-1.3.2 0 obj
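To measure how much data precedes the header in a dump like this one, the hex view can be turned back into bytes and searched for `%PDF-`. A rough sketch follows. `parse_hex_dump` is a hypothetical helper, and it assumes every line is an offset followed by 16 two-digit hex bytes, as in this dump. Only three rows are copied in, so the skipped rows are zero-filled and do not reflect the real file.

```python
def parse_hex_dump(text: str) -> bytes:
    """Rebuild bytes from 'OFFSET b0 ... b15 ascii' lines.

    Rows are placed at their stated offsets; gaps between rows are
    zero-filled (an assumption, since those rows were not supplied).
    """
    out = bytearray()
    for line in text.strip().splitlines():
        fields = line.split()
        offset = int(fields[0], 16)
        # Fields 1..16 are the hex bytes; the rest is the ASCII column.
        row = bytes(int(tok, 16) for tok in fields[1:17])
        if len(out) < offset:
            out.extend(b"\x00" * (offset - len(out)))
        out[offset:offset + len(row)] = row
    return bytes(out)


dump = """
00000000 00 10 43 6C 61 73 68 20 42 79 6C 61 77 73 2E 70 ..Clash Bylaws.p
00000070 00 00 00 00 00 00 00 00 00 00 00 00 3D 37 00 00 ............=7..
00000080 25 50 44 46 2D 31 2E 33 0A 32 20 30 20 6F 62 6A %PDF-1.3.2 0 obj
"""

data = parse_hex_dump(dump)
offset = data.find(b"%PDF-")
print(hex(offset))
```

For this dump the header starts at 0x80, i.e. after 128 bytes of leading data.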