Dead simple PDF text reader.
npm install pdf-text-reader
You'll probably get peer dependency warnings. You are safe to completely ignore these. You only need them if you want to use pdf.js
's more complicated specific types. To do that just install the correct version (as per the warning) of the pdfjs-dist package. Installing it is completely unnecessary if you're simply reading PDFs from file paths or URLs.
import {readPdfText} from 'pdf-text-reader';
async function run() {
const pages = await readPdfText('path/to/pdf/file.pdf');
console.log(pages[0].lines);
}
run();
See src/index.ts (in the git repo) or dist/index.d.ts (in the npm package) for detailed argument and return value typing.
This uses Mozilla's pdf.js through the pdfjs-dist npm package. As such, any valid input to pdf.js's getDocument
function are valid inputs to this package's readPdfText
function. See pdfjs-dist in DefinitelyTyped for info on that or, for just the type information, read src/index.ts in this repo.
This package simply reads the output of pdfjs.getDocument
and sorts it into lines based on the text vertical position in the document. It also inserts spaces for text on the same line that is far apart horizontally and new lines in between lines that are far apart vertically.
Example:
The text below in a PDF will be read as having spaces in between them even if the space characters aren't in the PDF.
cell 1 cell 2 cell 3
The number of spaces to insert is calculated by a naive but extraordinarily simple calculation of Math.ceil(distance-between-text/text-height)
.
If you need lower level parsing control, you can also use the exported parsePageItems
function. This only reads one page at a time as seen below. This function is used by readPdfText
so the output will be identical for the same pdf page.
You must also have the pdfjs-dist
npm package to use this.
import {parsePageItems} from 'pdf-text-reader';
import * as pdfjs from 'pdfjs-dist';
async function run() {
const doc = await pdfjs.getDocument('myDocument.pdf').promise;
const page = await doc.getPage(1);
const content = await page.getTextContent();
const items = content.items;
const parsedPage = parsePageItems(items);
console.log(parsedPage.lines);
}
run();
See src/index.ts (in the git repo) or dist/index.d.ts (in the npm package) for detailed argument and return value typing.