tabulapdf/tabula-extractor

Only horizontal lines, columns too close to each other

abelsonlive opened this issue · 4 comments

While using tabula-extractor to parse this PDF (pages 1 - 151), I ran into some interesting issues:

  1. While there are no visible 'ruling lines', the rows are colored differently – something that I suspect shows up in many cases.
  2. The extraction performed much, much better when specifying the pixel positions of each column.
  3. I eventually merged the multi-line cells using a Python script. I imagine there might be a way to add a flag in tabula-extractor for a general "merge-down" or "merge-up" post-processing step. It could probably follow the logic of this function:
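def merge(data, id_col='id', direction='down'):
  if direction not in ('down', 'up'):
    raise ValueError("direction must be 'down' or 'up'")
  step = 1 if direction == 'down' else -1
  # visit each multi-line row before the row it merges into
  i = 0 if direction == 'down' else len(data) - 1
  while 0 <= i < len(data):
    row, merge_idx = data[i], i + step
    # every row should have an `id_col`,
    # if it doesn't then it means we're
    # at a multiline cell
    if row[id_col] == '' and 0 <= merge_idx < len(data):
      # find non-empty cells to merge
      merge_keys = [k for k, v in row.items() if v != '']
      for k in merge_keys:
        # keep the text in reading order
        if direction == 'down':
          parts = (row[k], data[merge_idx][k])
        else:
          parts = (data[merge_idx][k], row[k])
        # merge multi-line cells, skipping empty targets
        data[merge_idx][k] = ' '.join(p for p in parts if p != '')
      # delete the row we merged, once all its cells are moved
      del data[i]
      # 'down': the next row has shifted into slot i; 'up': step back to the target
      if direction == 'up':
        i -= 1
    else:
      i += step
  return data

I originally had to run this multiple times to catch those instances in which there were 3 or more lines in an individual cell; visiting each partial row before its merge target handles them in one pass.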

Awesome, @abelsonlive! Thanks for the report.

> While there are no visible 'ruling lines', the rows are colored differently – something that I suspect shows up in many cases.

Our current spreadsheet algorithm takes ruling lines into account only if they form a full grid. We should also consider the case where there are only horizontal ruling lines.

> The extraction performed much, much better when specifying the pixel positions of each column.

There isn't much space between columns in that PDF (e.g., row 9 on the first page), so the extractor can't reliably detect the widths of the columns. That's exactly why there's an option for specifying the columns' positions.
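
To illustrate why explicit column positions help, here is a minimal Python sketch (not tabula-extractor's actual code). Given the x coordinates of the column boundaries, each text chunk is assigned to a column by its left edge, so no gap between columns has to be detected. The `(x, text)` tuples, the `split_into_columns` name and the sample coordinates are all made up for the example.

```python
from bisect import bisect_right

def split_into_columns(chunks, boundaries):
    """Assign the (x, text) chunks of one row to columns.

    `boundaries` are the x coordinates separating adjacent columns,
    so N boundaries produce N + 1 columns.
    """
    boundaries = sorted(boundaries)
    cells = [[] for _ in range(len(boundaries) + 1)]
    for x, text in sorted(chunks):
        # bisect_right counts the boundaries at or to the left of x,
        # which is the index of the column that contains x.
        cells[bisect_right(boundaries, x)].append(text)
    return [' '.join(parts) for parts in cells]

# Hypothetical row: 'Acme' and 'Corp' sit close together but in the same column.
row = [(12.0, 'A-17'), (58.5, 'Acme'), (61.0, 'Corp'), (120.3, '1,250')]
print(split_into_columns(row, boundaries=[50.0, 110.0]))
# -> ['A-17', 'Acme Corp', '1,250']
```

Because the boundaries are fixed up front, tightly packed columns no longer depend on the extractor guessing where one column ends and the next begins.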

I'm playing with a new technique for segmenting tables. This case is successfully handled with no parameter tweaking at all:

[Image: m27-1]

@jazzido: Can you give a brief description of what this new technique is? Or, what this picture is showing?

@lukehsiao, it's an implementation of a classic technique in document analysis and segmentation. The basic idea is to calculate the vertical and horizontal projections (sum of heights and widths) of the glyphs and then analyze the resulting profiles (green and red curves) to segment the area of interest. In our implementation, we place a row (column) separator wherever there is a change of slope in the horizontal (vertical) profile.

There is some code in tabula-java that implements this, but it's not integrated with the extraction algorithms that we currently use.
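
For readers who want a feel for the projection idea described above, here is a rough Python sketch. It is not the tabula-java code. It assumes glyphs arrive as hypothetical `(left, top, width, height)` boxes, bins the profile at integer x positions, and reads "change of slope" as "the profile stops falling and then starts rising", placing a column separator in the middle of each such valley. Both function names are invented for the example.

```python
def vertical_profile(glyphs, page_width):
    """Sum of the heights of all glyphs covering each integer x position."""
    profile = [0.0] * page_width
    for left, top, width, height in glyphs:
        for x in range(max(0, int(left)), min(page_width, int(left + width))):
            profile[x] += height
    return profile

def valley_separators(profile):
    """Midpoints of valleys: stretches where the profile fell and later rises."""
    separators = []
    start = None  # first index of the current valley floor, if any
    for i in range(1, len(profile)):
        if profile[i] < profile[i - 1]:
            start = i  # slope is negative: the floor starts (or restarts) here
        elif profile[i] > profile[i - 1] and start is not None:
            # slope turned positive: the floor spans start .. i - 1
            separators.append((start + i - 1) / 2)
            start = None
    return separators

# Two text lines, three columns (made-up coordinates).
glyphs = [
    (2, 0, 8, 10), (14, 0, 10, 10), (30, 0, 8, 10),
    (2, 12, 6, 10), (14, 12, 9, 10), (30, 12, 8, 10),
]
print(valley_separators(vertical_profile(glyphs, 40)))
# -> [11.5, 26.5]
```

Row separators would come from the same procedure applied to the horizontal profile (sum of glyph widths per y position). A practical version would probably smooth the profile or ignore shallow dips first, since this sketch treats every dip as a separator.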