HazyResearch/pdftotree

ValueError: min() arg is an empty sequence

gtholpadiperitusai opened this issue · 3 comments

Describe the bug
When I run pdftotree on a PDF file, I get a runtime exception: ValueError: min() arg is an empty sequence.

To Reproduce
Steps to reproduce the behavior:

  1. Download this PDF file: performance-smart-networks.pdf

  2. Execute the following code: html = pdftotree.parse(pdf_file="performance-smart-networks.pdf", html_path=None, model_type=None, model_path=None, favor_figures=True, visualize=False)

Expected behavior
The variable html should contain the HTML mark-up with the text from the PDF.

Error Logs/Screenshots
Here is the full error stack trace:

~/anaconda3/lib/python3.6/site-packages/pdftotree/core.py in parse(pdf_file, html_path, model_type, model_path, favor_figures, visualize)
     63     if not extractor.is_scanned():
     64         log.info("Digitized PDF detected, building tree structure...")
---> 65         pdf_tree = extractor.get_tree_structure(model_type, model, favor_figures)
     66         log.info("Tree structure built, creating html...")
     67         pdf_html = extractor.get_html_tree()

~/anaconda3/lib/python3.6/site-packages/pdftotree/TreeExtract.py in get_tree_structure(self, model_type, model, favor_figures)
    236                 ref_page_seen,
    237                 tables[page_num],
--> 238                 favor_figures,
    239             )
    240         return self.tree

~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/pdf_parsers.py in parse_tree_structure(elems, font_stat, page_num, ref_page_seen, tables, favor_figures)
    760     # Figures for this page
    761     figures_page = get_figures(
--> 762         mentions, elems.layout.bbox, page_num, boxes_figures, page_width, page_height
    763     )
    764 

~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/pdf_parsers.py in get_figures(boxes, page_bbox, page_num, boxes_figures, page_width, page_height)
   1244 
   1245     for fig_box in boxes_figures:
-> 1246         node_fig = Node(fig_box)
   1247         nodes_figures.append(node_fig)
   1248 

~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/node.py in __init__(self, elems)
     45         #     # self.sum_elem_bbox = self.sum_elem_bbox + len(elem.get_text())
     46         self.table_area_threshold = 0.7
---> 47         self.set_bbox(bound_elems(elems))
     48         # self.table_indicator = True
     49         self.type_counts = Counter(map(elem_type, elems))

~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/vector_utils.py in bound_elems(elems)
    119     Finds the minimal bbox that contains all given elems
    120     """
--> 121     group_x0 = min(map(lambda l: l.x0, elems))
    122     group_y0 = min(map(lambda l: l.y0, elems))
    123     group_x1 = max(map(lambda l: l.x1, elems))

ValueError: min() arg is an empty sequence

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Python: 3.6.4 (Anaconda distribution)
  • pdftotree Version: v0.4.0

I have run into the same issue with my PDF, and it is not a scanned document. I have checked all the bug fixes and the problem still persists.

I had a similar error and found out that my PDF contained empty LTFigures. These empty objects cause your error: an empty LTFigure has no child elements, so there are no l.x0, l.y0, l.x1 or l.y1 values to collect. The mapping is therefore empty, and min() fails with "min() arg is an empty sequence".
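The failure mode can be reproduced without pdftotree. The sketch below is only an illustration: bound_elems here is a simplified stand-in modeled on the lines visible in the traceback (not a copy of the library's function), and Box and try_bound are hypothetical names standing in for pdfminer layout objects and a small test harness.

```python
from collections import namedtuple

# Hypothetical stand-in for a layout element; only the bbox fields matter here.
Box = namedtuple("Box", ["x0", "y0", "x1", "y1"])


def bound_elems(elems):
    """Simplified stand-in: smallest bbox that contains all given elems."""
    group_x0 = min(map(lambda l: l.x0, elems))
    group_y0 = min(map(lambda l: l.y0, elems))
    group_x1 = max(map(lambda l: l.x1, elems))
    group_y1 = max(map(lambda l: l.y1, elems))
    return (group_x0, group_y0, group_x1, group_y1)


def try_bound(elems):
    """Return the bbox, or the error message if bounding fails."""
    try:
        return bound_elems(elems)
    except ValueError as exc:
        return str(exc)


non_empty = [Box(10, 20, 30, 40), Box(5, 25, 35, 38)]
empty_figure = []  # assumption: a figure with no children iterates like an empty list

print(try_bound(non_empty))     # bbox covering both boxes
print(try_bound(empty_figure))  # the ValueError message raised by min()
```

The exact wording of the ValueError message may differ between Python versions, but the cause is the same: min() receives an iterable with no items.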

I solved it by not adding empty LTFigures while collecting the elements of the PDF. You need to add a single if statement in the function processor(m) of the module pdf_utils.py (pdftotree.utils.pdf.pdf_utils). See # ADD THIS.

def processor(m):
        # Normalizes the coordinate system to be consistent with
        # image library conventions (top left as origin)
        if isinstance(m, LTComponent):
            m.set_bbox(normalize_bbox(m.bbox, height, scaler))

            if isinstance(m, LTCurve):
                m.pts = normalize_pts(m.pts, height, scaler)
                # Only keep longer lines here
                if isinstance(m, LTLine) and max(m.width, m.height) > pts_thres:
                    segments.append(m)
                    return
                # Here we exclude straight lines from curves
                curves.append(m)
                return

            if isinstance(m, LTFigure):
                if len(m) > 0: # ADD THIS
                    figures.append(m)
                    return

            # Collect stats on the chars
            if isinstance(m, LTChar):
                chars.append(m)
                # fonts could be rotated 90/270 degrees
                font_size = _font_size_of(m)
                font_size_counter[font_size] += 1
                return

            if isinstance(m, LTTextLine):
                mention_text = keep_allowed_chars(m.get_text()).strip()
                # Skip empty and invalid lines
                if mention_text:
                    # TODO: add subscript detection and use latex underscore
                    # or superscript
                    m.clean_text = mention_text
                    m.font_name, m.font_size = _font_of_mention(m)
                    mentions.append(m)
                return
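The guard marked # ADD THIS above boils down to filtering out containers that have no children. A minimal standalone sketch of that idea follows, assuming figure objects support len() the way pdfminer's layout containers do. FakeFigure and keep_non_empty are hypothetical names used only for illustration and are not part of pdftotree.

```python
class FakeFigure:
    """Hypothetical stand-in for a container-like figure (supports len and iteration)."""

    def __init__(self, children):
        self._objs = list(children)

    def __len__(self):
        return len(self._objs)

    def __iter__(self):
        return iter(self._objs)


def keep_non_empty(figures):
    """Drop figures without children so later bbox computations never see an empty one."""
    return [fig for fig in figures if len(fig) > 0]


figures = [FakeFigure(["img"]), FakeFigure([]), FakeFigure(["a", "b"])]
kept = keep_non_empty(figures)
print([len(fig) for fig in kept])
```

Filtering at collection time, as the patch above does, keeps the downstream code unchanged. Guarding inside the bbox helper instead would also work, but every caller would then need to handle a missing bbox.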

Duplicate of #42