lebebr01/pdfsearch

Reading columns correctly.


I would like to thank you for working on this great package. It's extremely useful and has plenty of applications. I hope you continue to work on and maintain it.

I noticed that one of the limitations (as you mentioned) is text fragmentation when the text in a PDF is laid out in columns (e.g., most scientific articles). I came across the function tabulizer::extract_text(file), which can read multiple columns. I wonder if you could use something similar in your package to fix that issue. The tabulizer function will still cause issues with tables and images/table captions, but at least it will get the flow of the main text correct.

Thank you.

Thanks for the comments. One of my graduate students and I are currently working on expanding this package and a companion package. One of the elements we are working to improve is this feature. I don't plan to use the tabulizer package, as it has some pretty strict dependencies (i.e., Java). However, look for some improvements to multiple-column PDFs coming soon.