Reading columns correctly.

Question

Opened this issue 3 years ago · 1 comment

I would like to thank you for working on this great package. It's extremely useful and has plenty of applications. I hope you continue to work on and maintain it.

I noticed that one of the limitations (as you mentioned) is text fragmentation when the text in a PDF is laid out in columns (e.g., most scientific articles). I came across the function tabulizer::extract_text(file), which can read multiple columns. I wonder if you could use something similar in your package to fix that issue. The tabulizer function will still cause issues with tables and image/table captions, but at least it gets the flow of the main text correct.
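
For reference, a minimal sketch of the call mentioned above. It assumes the tabulizer package (and the Java runtime it depends on) is installed; the file name `article.pdf` is a hypothetical placeholder, not a file from this project.

```r
# Hypothetical path: replace with a real multi-column PDF.
file <- "article.pdf"

# Guard the call so the snippet does nothing when tabulizer
# (which needs Java) or the example file is unavailable.
if (requireNamespace("tabulizer", quietly = TRUE) && file.exists(file)) {
  # extract_text() returns the extracted text as a character vector.
  txt <- tabulizer::extract_text(file)

  # Print the first few hundred characters to check the reading order.
  cat(substr(txt[1], 1, 500))
}
```

Comparing this output with the package's current output on the same two-column article should show whether the main text is read column by column.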

Thank you.

Answer 1 · 2021-11-16T16:27:58.000Z

Thanks for the comments. One of my graduate students and I are currently working on expanding this package and a companion package, and this feature is one of the elements we are working to improve. I don't plan to use the tabulizer package, as it has some pretty strict dependencies (i.e., Java). However, look for improvements to multi-column PDF handling coming soon.