OCR PDF documents in Node.js 🐱
Optional:
tesseract is only needed if you are testing out the tess method on the OCR class. The tess method is much faster than the recognize method on the OCR class, which uses tesseract.js, but yields less information.
IMPORTANT:
a11ycat-ocr expects the ImageMagick tools to be available in $PATH. If you are testing the tess method on the OCR class, then tesseract must also be in $PATH.
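A quick way to confirm these requirements before building is to look for the executables on $PATH from Node.js. The following is a minimal sketch, not part of a11ycat-ocr: findOnPath is a hypothetical helper, and the executable names are assumptions (ImageMagick 7 installs magick, ImageMagick 6 installs convert).

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper: return the full path of the first executable
// called `name` found on PATH, or null if there is none.
function findOnPath(name, envPath = process.env.PATH || '') {
  // On Windows, executables carry an extension listed in PATHEXT
  const extensions = process.platform === 'win32'
    ? (process.env.PATHEXT || '.EXE;.CMD;.BAT').split(';')
    : ['']
  for (const dir of envPath.split(path.delimiter)) {
    if (!dir) continue
    for (const ext of extensions) {
      const candidate = path.join(dir, name + ext)
      try {
        fs.accessSync(candidate, fs.constants.X_OK)
        if (fs.statSync(candidate).isFile()) return candidate
      } catch (_) {
        // Not present or not executable in this directory; keep looking
      }
    }
  }
  return null
}

// Assumed executable names: `magick` (ImageMagick 7) or `convert` (ImageMagick 6)
const imageMagick = findOnPath('magick') || findOnPath('convert')
const tesseract = findOnPath('tesseract')

console.log('ImageMagick:', imageMagick || 'not found in $PATH')
console.log('tesseract:', tesseract || 'not found in $PATH (only needed for the tess method)')
```

If either line reports "not found", install the tool or adjust $PATH before running the usage example.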
- Build the project from the repository
git clone https://github.com/devnoot/a11ycat-ocr.git a11ycat-ocr
cd a11ycat-ocr
npm install
npm run build
- Include the OCR class in your project
const { A11yCat } = require('../../dist/index')
const { resolve } = require('path')
const ocr = new A11yCat.OCR()
async function main() {
  try {
    // Set the path to the PDF you want to OCR
    const pdfPath = '/path/to/my.pdf'
    // Set a destination directory for the PDF images
    const destinationDir = resolve(process.cwd(), 'tmp')
    // Convert the PDF to a series of images
    const generatedImages = await ocr.convertPdfToImages(pdfPath, destinationDir)
    // Run OCR on the first generated image
    const textFile = await ocr.tess(generatedImages[0])
  } catch (error) {
    // Report the failure and exit with a non-zero status
    console.error(error)
    process.exitCode = 1
  }
}
main()
Tests are located in test/spec. Tests should use data from test/data/images and test/data/pdfs.
Because there are some large PDFs in the test dataset, the test suite can take a very long time to run, depending on the host computer.
npm test
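To get a feel for how heavy the test dataset is before starting a full run, it can help to list the PDFs by size. This is a minimal sketch, not part of the project: listPdfsBySize is a hypothetical helper, and it assumes the script is run from the repository root so that test/data/pdfs resolves correctly.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper: list the PDFs in `dir`, smallest first
function listPdfsBySize(dir) {
  return fs.readdirSync(dir)
    .filter((name) => name.toLowerCase().endsWith('.pdf'))
    .map((name) => {
      const file = path.join(dir, name)
      return { file, bytes: fs.statSync(file).size }
    })
    .sort((a, b) => a.bytes - b.bytes)
}

// Assumes the current working directory is the repository root
const pdfDir = path.resolve(process.cwd(), 'test/data/pdfs')
if (fs.existsSync(pdfDir)) {
  for (const { file, bytes } of listPdfsBySize(pdfDir)) {
    console.log(`${(bytes / 1024).toFixed(1)} KiB  ${file}`)
  }
}
```

The smallest files at the top of the list are reasonable candidates for a quick manual check with the usage example before committing to the whole suite.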