pdf-scan-to-text

Convert a (badly) scanned PDF to formatted text.

Step 1: Automator 🤖🔧

  • Use Automator to split the PDF (1 document per page)
  • Use Automator to transform each PDF page into an image

Step 2: AI magic 🧠✨🔮

  • For each image, extract the text with tesseract.js
  • Write the extracted text to a file

Run both with: npm run detect
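The detection step could look roughly like the sketch below. It is not the repository's actual script: the function names (`textPathFor`, `detectAll`, `runWithTesseract`), the directory layout, and the `.txt` naming scheme are all hypothetical. The tesseract.js call assumes version 5 or later, where `createWorker(lang)` resolves to a ready-to-use worker; older versions need extra load and initialize calls.

```javascript
const fs = require('fs/promises');
const path = require('path');

// Hypothetical naming scheme: "page-3.png" becomes "<outDir>/page-3.txt".
function textPathFor(imageName, outDir) {
  const base = path.basename(imageName, path.extname(imageName));
  return path.join(outDir, `${base}.txt`);
}

// Run `recognize` on every image in imageDir and write one text file per image.
// `recognize` is any async function that maps an image path to a string,
// which keeps the file handling independent of the OCR engine.
async function detectAll(imageDir, outDir, recognize) {
  await fs.mkdir(outDir, { recursive: true });
  const images = (await fs.readdir(imageDir))
    .filter((name) => /\.(png|jpe?g|tiff?)$/i.test(name));
  for (const name of images) {
    const text = await recognize(path.join(imageDir, name));
    await fs.writeFile(textPathFor(name, outDir), text, 'utf8');
  }
  return images.length;
}

// Recognizer backed by tesseract.js (assumption: v5 or later).
async function runWithTesseract(imageDir, outDir, lang = 'eng') {
  const { createWorker } = require('tesseract.js'); // loaded only when called
  const worker = await createWorker(lang);
  try {
    return await detectAll(imageDir, outDir, async (imagePath) => {
      const { data } = await worker.recognize(imagePath);
      return data.text;
    });
  } finally {
    await worker.terminate(); // always release the worker
  }
}
```

With page images in a hypothetical `pages/` folder, `runWithTesseract('pages', 'text')` would write one text file per image into `text/`. Pages are processed one at a time to keep memory use predictable; badly scanned pages may still produce noisy output that needs manual cleanup.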

Step 3: 📄🔗=📕

  • Merge all files together: npm run merge
  • Format to HTML: npm run format
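As a rough illustration of the merge and format steps, here is a minimal sketch. The helper names (`byPageNumber`, `mergeTextFiles`, `escapeHtml`, `textToHtml`) and the assumption that each page was saved as a `.txt` file with its page number in the name are hypothetical; the repository's own scripts may work differently. The HTML formatting shown is deliberately simple: blank lines separate paragraphs, and line breaks inside a paragraph are rejoined.

```javascript
const fs = require('fs');
const path = require('path');

// Natural sort so that "page-2.txt" comes before "page-10.txt".
function byPageNumber(a, b) {
  return a.localeCompare(b, undefined, { numeric: true });
}

// Concatenate every .txt file in inDir, in page order, separated by blank lines.
function mergeTextFiles(inDir) {
  return fs.readdirSync(inDir)
    .filter((name) => name.endsWith('.txt'))
    .sort(byPageNumber)
    .map((name) => fs.readFileSync(path.join(inDir, name), 'utf8').trim())
    .join('\n\n');
}

// Escape the characters that would otherwise be read as markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Treat blank-line-separated chunks as paragraphs and rejoin
// lines that were broken by the scan's layout.
function textToHtml(text) {
  return text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((p) => `<p>${escapeHtml(p.replace(/\s*\n\s*/g, ' '))}</p>`)
    .join('\n');
}
```

For example, `textToHtml(mergeTextFiles('text'))` would return one `<p>` element per paragraph for the pages in a hypothetical `text/` folder. The numeric sort matters because a plain alphabetical sort would place page 10 before page 2.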