HazyResearch/pdftotree

Switch from Tabula to Camelot?

HiromuHota opened this issue · 2 comments

Is your feature request related to a problem? Please describe.

Switching from Tabula to Camelot has two advantages:

  1. Tabula is Java, Camelot is Python. Switching to Camelot frees us from Java.
  2. Camelot seems to perform better at table recognition.

Describe the solution you'd like

I'd like to switch from Tabula to Camelot if it makes more sense.
Currently, pdftotree detects the table "area" (using either ml, vision, or heuristic) and uses Tabula for table recognition.
I'd have to figure out whether Camelot takes an area argument like Tabula does.

Describe alternatives you've considered

It should be fine even if Camelot does not take an area argument, as long as it detects tables well on its own.

Additional context

According to https://arxiv.org/pdf/1911.10683.pdf,

Camelot is the best off-the-shelf tool in this comparison.

In general, simplifying dependencies sounds like a big win to me, especially if performance is comparable or better.

@lukehsiao thanks for your thoughts.

I just confirmed that Camelot allows specifying table areas (and pages).
https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-table-areas
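For reference, here is a rough sketch of what passing a detected area to Camelot might look like. It assumes the detected area is in Tabula's convention (top, left, bottom, right, in points, measured from the top-left of the page), whereas Camelot's `table_areas` expects `"x1,y1,x2,y2"` strings (left-top and right-bottom corners) in PDF coordinate space, where the origin is at the bottom-left. The helper names `to_camelot_area` and `extract_tables`, and the choice of `flavor="stream"`, are hypothetical and not part of pdftotree.

```python
def to_camelot_area(top, left, bottom, right, page_height):
    """Convert a Tabula-style area (top-left origin) into a Camelot
    table_areas string "x1,y1,x2,y2" (PDF space, bottom-left origin)."""
    # Flip the y axis: PDF space measures y upward from the bottom edge.
    y1 = page_height - top      # top edge of the table
    y2 = page_height - bottom   # bottom edge of the table
    return f"{left},{y1},{right},{y2}"


def extract_tables(pdf_path, page, areas, page_height):
    """Run Camelot on one page, restricted to the given areas.

    `areas` is a list of (top, left, bottom, right) tuples (assumed format).
    """
    import camelot  # imported lazily so the helper above works without it

    table_areas = [to_camelot_area(*area, page_height) for area in areas]
    return camelot.read_pdf(
        pdf_path,
        pages=str(page),          # Camelot takes pages as a string, e.g. "1"
        flavor="stream",          # assumption; "lattice" suits ruled tables
        table_areas=table_areas,
    )
```

Each returned table exposes a pandas DataFrame via `.df`, so the downstream conversion should only need the coordinate flip shown above, provided the detected areas really are in top-left-origin points.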