PDF Extractor v0.1.0

A tool to scrape text from pdf's.

Software Goals

Todo's

0.1.0

Install libraries
Setup portable Tesseract
Secure sensitive info
Develop Structure
Setup webserver

v 0.2.0

Prototype Frontend
Add File Uploader
Add PDF display
Prototype Drawing Tools using canvas or other
Develop Job File (stores key:coordinate pairs for the snippets to cut)

v 0.3.0

Convert draw objects into a job file
Add sharp to cut image snippets using job file
Add Tesseract to run OCR on image snippets

v 0.4.0

Develop persistence for extracted data
Add Remove Cache to UI, for cleanup of old jobs
Develop some software adjustments to scanned text based on errors encountered

v 0.5.0

Prototype exporting of scanned data
Bugfixes
Add Testing

v 0.6.0 - v 1.0.0

Additional Feature Development
Development Roadmap
Developer Documentation
User Documentation
Usage Examples
Pull Request Requirements
Revisit License
Release into the wild
OpenCV installer or portable
OpenCV functionality

User Workflow:

- User opens application
- User loads a pdf file from URL or local
- PDF uploads and displays.
- Editor tools become available (Ignore Area, Extract Value, Extract Table Row, Grid Lines)
- User adds grid lines, boxes (non tabled values) and boxGroups (table row/values) to the UI overlay
- User gives each box and boxGroup id's. boxGroups are indexed
- User saves overlay graphics into a job file.
- User runs job.
- On completion, User uploads, exports or saves

App Logic:

- App loads job file as a template
- App creates folder with filename to hold working images
- App breaks pdf up into pages and then snippets containing the text
- App saves all the images to be processed
- App generates a list of all the snippets to run OCR on
- App runs OCR on each snippet.
- App stores data as JSON key/value pairs
- App completes job and returns data to UI

UI Features:

- PDF uploader
- PDF viewer
- Canvas/WebGL overlay for editor
- File Wizard

Libraries/Frameworks/Runtime:

- Node.js - Runtime
- Express - Middleware
- React - Frontend
- Something for drawing tools/interface (Pixi.js, d3.js, etc) - Drawing Tools
- Internal Storage (SQL lite?, json) - Data Persistence
- Axios or FS - PDF Loader
- Sharp - Image manipulation, snippet generation
- TesseractOCR & Tesseract.js for node - Scan Text
- nw.js - Standalone software, installers, etc
- OpenCV - to deskew scanned pdfs

App Wants:

- 100% coverage testing
- Standalone exe or installer for entire applcation
- Portable TesseractOCR or some auto installer
- Upload to MySQL server (schema generator?)
- Export to XLS, CSV, DBF?, JSON
- Save to Google Drive, Local Drive
- Distribution to local network if app runs on distributed mode

JoeCrash/pdf-extractor

PDF Extractor v0.1.0

Software Goals

Todo's

User Workflow:

App Logic:

UI Features:

Libraries/Frameworks/Runtime:

App Wants: