arnaudvalle/pdf-scan-to-text

Converting a (badly) scanned PDF to text


pdf-scan-to-text

Convert a (badly) scanned PDF to formatted text.

Step 1: Automator 🤖🔧

Use Automator to split the PDF into one document per page.
Use Automator to convert each single-page PDF into an image.

Step 2: AI magic 🧠✨🔮

For each image, extract the text with tesseract.js.
Write whatever text was extracted to a file.

npm run detect
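
As a rough illustration of what a detect step like this might look like, here is a minimal sketch. It is not the repository's actual script. The directory names, the helper names (`textPathFor`, `listImages`, `detectAll`, `recognizeWithTesseract`) and the one-text-file-per-image layout are all assumptions, and the tesseract.js call assumes a version (v2 or later) where `Tesseract.recognize` resolves to an object with `data.text`.

```javascript
const fs = require('fs');
const path = require('path');

// Assumption: page images use one of these extensions.
const IMAGE_EXTENSIONS = new Set(['.png', '.jpg', '.jpeg', '.tif', '.tiff']);

// Map an image path to the text file that will hold its OCR output.
function textPathFor(imagePath, outDir) {
  const base = path.basename(imagePath, path.extname(imagePath));
  return path.join(outDir, `${base}.txt`);
}

// List the image files directly inside a directory (non-recursive).
function listImages(imageDir) {
  return fs.readdirSync(imageDir)
    .filter((name) => IMAGE_EXTENSIONS.has(path.extname(name).toLowerCase()))
    .map((name) => path.join(imageDir, name));
}

// Run OCR on every image, one at a time, and write one text file per image.
// `recognize` is any async function taking an image path and returning text.
async function detectAll(imageDir, outDir, recognize) {
  fs.mkdirSync(outDir, { recursive: true });
  for (const imagePath of listImages(imageDir)) {
    const text = await recognize(imagePath);
    fs.writeFileSync(textPathFor(imagePath, outDir), text, 'utf8');
  }
}

// Recognizer backed by tesseract.js.
// Assumption: v2+ result shape, where the text lives at result.data.text.
async function recognizeWithTesseract(imagePath) {
  const Tesseract = require('tesseract.js'); // loaded lazily, only when OCR runs
  const { data } = await Tesseract.recognize(imagePath, 'eng');
  return data.text;
}
```

With hypothetical `images` and `text` directories, this could be started with `detectAll('images', 'text', recognizeWithTesseract)`. Passing the recognizer in as a parameter keeps the file handling independent of the OCR engine, so it can be tried out with a stub function first.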

Step 3: 📄🔗=📕

Merge all files together

npm run merge
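
The merge step could be sketched as follows, assuming one `.txt` file per page with the page number in the file name (for example `page-1.txt`, `page-2.txt`). The function names and the file naming scheme are hypothetical, not taken from the repository. The main point is the numeric sort: a plain alphabetical sort would place `page-10.txt` before `page-2.txt`.

```javascript
const fs = require('fs');
const path = require('path');

// Sort names so that "page-2.txt" comes before "page-10.txt".
function sortByPageNumber(names) {
  return [...names].sort((a, b) => a.localeCompare(b, undefined, { numeric: true }));
}

// Concatenate every .txt file in `textDir` into `outFile`, in page order,
// with a blank line between pages. Returns the merged text.
function mergeTextFiles(textDir, outFile) {
  const names = fs.readdirSync(textDir).filter((name) => name.endsWith('.txt'));
  const pages = sortByPageNumber(names).map((name) =>
    fs.readFileSync(path.join(textDir, name), 'utf8').trim()
  );
  const merged = pages.join('\n\n') + '\n';
  fs.writeFileSync(outFile, merged, 'utf8');
  return merged;
}
```

In this sketch the output file should live outside `textDir` (or not end in `.txt`), otherwise a second run would pick up the previous result as if it were a page.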

Format to HTML

npm run format
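
One possible shape for the formatting step is sketched below. It assumes the simplest useful rule: blank lines separate paragraphs, and line breaks inside a paragraph are joined with spaces. The actual script may well apply different or richer rules, and `escapeHtml` and `textToHtml` are illustrative names only.

```javascript
// Escape characters that have special meaning in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Turn plain text into HTML paragraphs: blank lines separate paragraphs,
// and single line breaks inside a paragraph are joined with spaces.
function textToHtml(text) {
  return text
    .split(/\r?\n\s*\r?\n/)
    .map((block) =>
      block
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter(Boolean)
        .join(' ')
    )
    .filter((paragraph) => paragraph.length > 0)
    .map((paragraph) => `<p>${escapeHtml(paragraph)}</p>`)
    .join('\n');
}
```

Escaping happens after the paragraphs are assembled, so stray `<` or `&` characters produced by the OCR cannot break the resulting markup.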