ffalt/pdf.js-extract

How to extract columns?

KristijanS99 opened this issue · 1 comment

I am trying to extract columns, but I can't find a documented way to do so anywhere, the way there is for extracting rows. Can anyone provide a working example or a walkthrough for this?

ffalt commented

There is no easy answer; it depends heavily on the structure of your PDF files.
Have a look at this approach: extracting by coordinates
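
As a rough illustration of the coordinate idea (not the linked example itself), the sketch below picks out one column by keeping only the text items whose left edge falls inside a given x range. It assumes items shaped like the `content` entries that pdf.js-extract returns for a page (`x`, `y`, `str`, ...), and that `y` grows downward. The `extractColumn` helper, the sample data and the x range are made up for illustration; in a real PDF you would have to measure the range yourself.

```javascript
// Sample items in the shape of pdf.js-extract `page.content` entries.
// The values are invented for this sketch, not taken from a real PDF.
const sampleItems = [
  { x: 50, y: 100, str: "Name", width: 30, height: 10 },
  { x: 200, y: 100, str: "Price", width: 28, height: 10 },
  { x: 50, y: 120, str: "Apple", width: 32, height: 10 },
  { x: 200, y: 120, str: "1.20", width: 20, height: 10 },
  { x: 50, y: 140, str: "Pear", width: 25, height: 10 },
  { x: 200, y: 140, str: "0.90", width: 20, height: 10 },
];

// Hypothetical helper: keep items whose left edge lies in [xMin, xMax),
// then read them top to bottom (assumes y increases down the page).
function extractColumn(items, xMin, xMax) {
  return items
    .filter((item) => item.x >= xMin && item.x < xMax)
    .sort((a, b) => a.y - b.y)
    .map((item) => item.str);
}

// The "Price" column is assumed to start somewhere between x = 180 and x = 260.
console.log(extractColumn(sampleItems, 180, 260));
```

With real data, `sampleItems` would be replaced by the `content` array of a page from the extraction result.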

A more generic way to extract columns is still not generic at all. It requires you to measure the column widths in the PDF, merge text by distance criteria, and filter out unwanted data such as repeating headers or page numbers. example
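
The three steps could look roughly like the sketch below. It is only one possible reading of the description, not the linked example: column start positions are assumed to have been measured by hand, items are merged into rows when their `y` values differ by less than a tolerance, and any later row identical to the first (header) row is dropped. All names (`columnIndex`, `buildRows`, `dropRepeatedHeader`), the sample items and the thresholds are hypothetical.

```javascript
// Invented items in the shape of pdf.js-extract `page.content` entries.
const rawItems = [
  { x: 50, y: 100, str: "Name" },
  { x: 200, y: 100.4, str: "Price" },
  { x: 50, y: 120, str: "Green" },
  { x: 82, y: 120.3, str: "apple" }, // same cell as "Green"
  { x: 200, y: 120, str: "1.20" },
  { x: 50, y: 300, str: "Name" }, // repeated header
  { x: 200, y: 300, str: "Price" },
  { x: 50, y: 320, str: "Pear" },
  { x: 200, y: 320, str: "0.90" },
];

// columnStarts: measured left edges of the columns, in ascending order.
// Returns the index of the last column that starts at or before x.
function columnIndex(x, columnStarts) {
  let index = 0;
  for (let i = 0; i < columnStarts.length; i++) {
    if (x >= columnStarts[i]) index = i;
  }
  return index;
}

// Group items into rows by y distance, then merge the text of each cell.
function buildRows(items, columnStarts, yTolerance) {
  const sorted = [...items].sort((a, b) => a.y - b.y || a.x - b.x);
  const rows = [];
  let current = null;
  for (const item of sorted) {
    // Start a new row when the item is too far below the current row.
    if (!current || Math.abs(item.y - current.y) > yTolerance) {
      current = { y: item.y, cells: columnStarts.map(() => []) };
      rows.push(current);
    }
    current.cells[columnIndex(item.x, columnStarts)].push(item);
  }
  // Within a cell, order fragments left to right and join them with spaces.
  return rows.map((row) =>
    row.cells.map((cell) =>
      cell.sort((a, b) => a.x - b.x).map((item) => item.str).join(" ")
    )
  );
}

// Treat the first row as the header and drop later rows identical to it.
function dropRepeatedHeader(rows) {
  if (rows.length === 0) return rows;
  const header = JSON.stringify(rows[0]);
  return rows.filter((row, i) => i === 0 || JSON.stringify(row) !== header);
}

const table = dropRepeatedHeader(buildRows(rawItems, [50, 200], 2));
console.log(table);
```

The column starts (`[50, 200]`) and the row tolerance (`2`) are exactly the parts that have to be tuned per document, which is why this approach does not generalize well.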