sindresorhus/file-type

PDF detection fails on uncaught Adobe AI check error on very small files

Jimmy89 opened this issue · 1 comment

Description

In this ticket I want to address three topics:

  1. Within the readme, the S3 tokenizer guide has not been updated for the exports in more recent versions of @tokenizer/s3: makeTokenizer must be replaced with makeChunkedTokenizerFromS3 (see the sketch after this list).
  2. I noticed that, after running my Jest tests, the S3 connection to the file fed into the tokenizer is still open, even though a file type was already detected. Is there a way to enforce closing it? (The sketch below includes my guess.)
  3. The bug described below.
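
For topics 1 and 2, here is a minimal sketch of what I would expect the updated readme example to look like. The bucket and key are placeholders, and the close() call is my assumption about how the connection could be released, not a documented @tokenizer/s3 API:

import { S3Client } from '@aws-sdk/client-s3';
import { fileTypeFromTokenizer } from 'file-type';
// The readme still imports makeTokenizer; the current export is:
import { makeChunkedTokenizerFromS3 } from '@tokenizer/s3';

const client = new S3Client({});

const tokenizer = await makeChunkedTokenizerFromS3(client, {
  Bucket: 'my-bucket', // placeholder
  Key: 'some-file.pdf', // placeholder
});

try {
  console.log(await fileTypeFromTokenizer(tokenizer));
} finally {
  // Assumption: strtok3 tokenizers expose close(); optional chaining keeps
  // this safe if the S3 tokenizer does not implement it.
  await tokenizer.close?.();
}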

I have a very small test PDF file of just 855 bytes, on which I am running file-type v20.0.0.
The PDF file lives in an S3 bucket, so I use v1.0.0 of the @tokenizer/s3 package to retrieve it.

When calling fileTypeFromTokenizer with this file, I receive an End-Of-File error every time.
Through some debugging, I traced it to this line:

throw error;

As the comment states, the error must be ignored if the file is not large enough. However, the error I receive is:

End-Of-File
        at RangeRequestTokenizer.loadRange (/node_modules/@tokenizer/range/lib/range-request-tokenizer.js:101:19)

This is not an instance of strtok3.EndOfStreamError and is therefore not caught by the error-ignoring if statement. That looks like a bug to me.
Commenting out the throw error resolved the issue for me, but that is not a permanent fix ;)
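
To illustrate, the guard around that throw looks roughly like this (a paraphrase, not the exact file-type source; the helper name, buffer size, and marker scan are approximations):

import { EndOfStreamError } from 'strtok3';

// Hypothetical paraphrase of file-type's Adobe AI signature check.
async function checkAdobeAiMarker(tokenizer) {
  try {
    const buffer = new Uint8Array(1024); // approximate read size
    await tokenizer.readBuffer(buffer, { mayBeLess: false });
    // ... scan buffer for the Adobe AI marker ...
  } catch (error) {
    // Intended: a file too short to hold the marker simply ends the check.
    // Actual: @tokenizer/range throws a plain Error('End-Of-File'), which
    // fails this instanceof test and is re-thrown, aborting detection.
    if (!(error instanceof EndOfStreamError)) {
      throw error;
    }
  }
}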

Below are snippets of my code to ease debugging.

To create the PDF:

import { fs } from "zx"; // which is the fs-extra package
import { PDFDocument } from 'pdf-lib';

const s3DestinationDir = ""; // Change to your liking

const createPdf = async (filePath, content = 'Random generated test PDF') => {
  const asset = `${s3DestinationDir}/${filePath}`;
  await fs.ensureFile(asset);
  const pdfDoc = await PDFDocument.create();
  const page = pdfDoc.addPage();
  page.drawText(content);
  const pdfBytes = await pdfDoc.save();
  await fs.writeFile(asset, pdfBytes);
  return asset;
};

await createPdf("somefile.pdf");
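
For reference: pdf-lib writes a minimal single-page document here, which is how the test file ends up at only 855 bytes, too short for the follow-up read in the Adobe AI check.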

Then, to run the detection:

import {
  S3Client,
} from "@aws-sdk/client-s3";
import { fileTypeFromTokenizer } from 'file-type';
import { makeChunkedTokenizerFromS3 } from '@tokenizer/s3';

// Initialize the S3 client.
const client = new S3Client({});

// Initialize the S3 tokenizer.
const s3Tokenizer = await makeChunkedTokenizerFromS3(client, {
  Bucket: "NAME",
  Key: "FILE",
});

const fileType = await fileTypeFromTokenizer(s3Tokenizer);
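
With a large enough, well-formed PDF this resolves to { ext: 'pdf', mime: 'application/pdf' }; with the 855-byte file above, the promise rejects with the End-Of-File error shown earlier instead.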

Existing Issue Check

  • I have searched the existing issues and could not find any related to my problem.

ESM (ECMAScript Module) Requirement Acknowledgment

  • My project is an ESM project and my package.json contains the following entry: "type": "module".

File-Type Scope Acknowledgment

  • I understand that file-type detects binary file types and not text or other formats.

Thanks for your feedback, @Jimmy89. Please raise one issue at a time; only your first "topic" is relevant here. The other two:

  1. I noticed that after running my Jest tests, the S3 connection to the file inserted in the tokenizer remains open, even though a file type was detected. Is there a way to enforce closing it?
  2. The bug described below.

Could you open a separate issue for each of these in the relevant project, @tokenizer/s3, which is likely the source of the problem? You’re welcome to keep the broader discussion in this issue.