Only horizontal lines, columns too close to each other
abelsonlive opened this issue · 4 comments
While using tabula-extractor to parse this PDF (pages 1-151), I ran into some interesting issues:
- While there are no visible 'ruling lines', the rows are colored differently – something that I suspect shows up in many cases.
- The extraction performed much, much better when specifying the pixel positions of each column.
- I eventually merged the multi-line cells using a Python script. I imagine there might be a way to add a flag in tabula-extractor for a general "merge-down" or "merge-up" post-processing step. It could probably follow the logic of this function:
    def merge(data, id_col='id', direction='down'):
        for i, row in enumerate(data):
            # every row should have an `id_col`;
            # if it doesn't, we're at a multi-line cell
            if row[id_col] == '':
                # determine merge index based on the `direction` arg
                if direction == 'down':
                    merge_idx = i + 1
                elif direction == 'up':
                    merge_idx = i - 1
                # find non-empty cells to merge
                merge_keys = [k for k, v in row.items() if v != '']
                for k in merge_keys:
                    # merge multi-line cells, keeping the upper text first
                    if direction == 'down':
                        data[merge_idx][k] = '%s %s' % (data[i][k], data[merge_idx][k])
                    else:
                        data[merge_idx][k] = '%s %s' % (data[merge_idx][k], data[i][k])
                # delete the row we merged; this shifts the list, so the
                # row right after it is skipped in the current pass
                del data[i]
        return data
I also had to run this multiple times to catch those instances in which there were 3 or more lines in an individual cell.
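The repeated passes could probably be avoided by building a new list instead of deleting rows while iterating. Below is a minimal sketch of that idea, not anything that exists in tabula-extractor: the names `merge_multiline` and `_fold` are made up for illustration, and it assumes each row is a dict of strings where an empty `id_col` marks a continuation line.

```python
def _fold(top, bottom):
    """Join two row dicts cell by cell, upper text first, skipping empty cells."""
    keys = list(top) + [k for k in bottom if k not in top]
    return {
        k: ' '.join(p for p in (top.get(k, ''), bottom.get(k, '')) if p != '')
        for k in keys
    }


def merge_multiline(rows, id_col='id', direction='down'):
    """Return a new list in which id-less rows are folded into a neighbour.

    direction='down': id-less rows are merged into the next row that has an id.
    direction='up':   id-less rows are merged into the previous row.
    """
    if direction not in ('down', 'up'):
        raise ValueError("direction must be 'down' or 'up'")
    merged = []
    pending = None  # id-less rows waiting for the next row with an id ('down' only)
    for row in rows:
        if row.get(id_col, '') != '':
            merged.append(_fold(pending, row) if pending is not None else dict(row))
            pending = None
        elif direction == 'up' and merged:
            merged[-1] = _fold(merged[-1], row)
        elif direction == 'down':
            pending = _fold(pending, row) if pending is not None else dict(row)
        else:
            merged.append(dict(row))  # 'up' with nothing above: keep as-is
    if pending is not None:
        merged.append(pending)  # trailing fragment with no row below it
    return merged
```

Because consecutive id-less rows accumulate before being attached, a cell spanning three or more lines is handled in one call, and the input list is left unmodified.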
Awesome, @abelsonlive! Thanks for the report.
> While there are no visible 'ruling lines', the rows are colored differently – something that I suspect shows up in many cases.
Our current spreadsheet algorithm takes ruling lines into account only if they form a full grid. We should also consider the case where there are only horizontal ruling lines.
> The extraction performed much, much better when specifying the pixel positions of each column.
There isn't much space between columns in that PDF (e.g., row 9 on the first page), so the extractor can't reliably detect the column widths. That's exactly why there's an option for specifying the columns' positions.
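To illustrate why explicit positions help with tightly packed columns, here is a rough sketch of how text chunks in one row could be assigned to cells once the boundaries are known. It is only an illustration under assumptions, not the extractor's actual code: `split_into_columns`, the `(x_left, text)` chunk format, and the coordinates are all hypothetical.

```python
from bisect import bisect_right


def split_into_columns(chunks, boundaries):
    """Assign the text chunks of one row to cells using fixed column boundaries.

    chunks:     list of (x_left, text) pairs for a single row
    boundaries: sorted x positions that separate adjacent columns
    """
    cells = [[] for _ in range(len(boundaries) + 1)]
    for x_left, text in sorted(chunks):
        # bisect_right counts how many boundaries lie at or left of the chunk,
        # which is the index of the column the chunk falls into
        cells[bisect_right(boundaries, x_left)].append(text)
    return [' '.join(cell) for cell in cells]


# Hypothetical row: two chunks land in the middle column even though
# the gap to the neighbouring columns is small.
row = [(10, 'A1'), (52, 'Main'), (80, 'St'), (120, '42')]
print(split_into_columns(row, boundaries=[50, 110]))
```

With fixed boundaries, no gap detection is needed, so narrow spacing between columns no longer matters.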
@jazzido: Can you give a brief description of what this new technique is? Or, what this picture is showing?
@lukehsiao, it's an implementation of a classic technique in document analysis and segmentation. The basic idea is to calculate the vertical and horizontal projections (sum of heights and widths) of the glyphs and then analyze the resulting profiles (green and red curves) to segment the area of interest. In our implementation, we place a row (column) separator wherever there is a change of slope in the horizontal (vertical) profile.
There is some code in tabula-java that implements this, but it's not integrated with the extraction algorithms that we currently use.
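For readers unfamiliar with projection profiles, the sketch below shows the general idea in a simplified form. It is not the tabula-java code: it assumes glyphs are given as hypothetical `(x0, y0, x1, y1)` boxes in whole page units, and it places separators in the middle of empty gaps in the profile, which is a simpler rule than the change-of-slope analysis described above.

```python
def projection_profile(boxes, axis, length):
    """Project glyph boxes onto one axis.

    axis='x' sums glyph heights over each x position (used to find columns);
    axis='y' sums glyph widths over each y position (used to find rows).
    """
    profile = [0.0] * length
    for x0, y0, x1, y1 in boxes:
        if axis == 'x':
            start, end, weight = x0, x1, y1 - y0
        else:
            start, end, weight = y0, y1, x1 - x0
        for pos in range(max(0, int(start)), min(length, int(end))):
            profile[pos] += weight
    return profile


def find_gaps(profile):
    """Return (start, end) index pairs of empty runs lying between inked runs."""
    gaps, start, seen_ink = [], None, False
    for i, value in enumerate(profile):
        if value == 0:
            if seen_ink and start is None:
                start = i
        else:
            if start is not None:
                gaps.append((start, i))
                start = None
            seen_ink = True
    return gaps


def separators(boxes, axis, length):
    """Place one separator in the middle of every gap in the profile."""
    return [(s + e) / 2 for s, e in find_gaps(projection_profile(boxes, axis, length))]


# Four hypothetical glyph boxes arranged as a 2x2 table on a 50x40 page.
boxes = [(0, 0, 10, 8), (30, 0, 40, 8), (0, 20, 10, 28), (30, 20, 40, 28)]
print(separators(boxes, 'x', 50))  # column separators
print(separators(boxes, 'y', 40))  # row separators
```

Gap detection like this breaks down when columns nearly touch, as in the PDF from this issue, which is one reason a slope-based analysis of the profile can be more robust.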