ValueError: min() arg is an empty sequence
gtholpadiperitusai opened this issue · 3 comments
Describe the bug
When I run pdftotree on a PDF file, I get a runtime exception: ValueError: min() arg is an empty sequence.
To Reproduce
Steps to reproduce the behavior:
- Download this PDF file: performance-smart-networks.pdf
- Execute the following code:
  import pdftotree
  html = pdftotree.parse(pdf_file="performance-smart-networks.pdf", html_path=None, model_type=None, model_path=None, favor_figures=True, visualize=False)
Expected behavior
The variable html should contain the HTML markup with the text from the PDF.
Error Logs/Screenshots
Here is the full error stack trace:
~/anaconda3/lib/python3.6/site-packages/pdftotree/core.py in parse(pdf_file, html_path, model_type, model_path, favor_figures, visualize)
63 if not extractor.is_scanned():
64 log.info("Digitized PDF detected, building tree structure...")
---> 65 pdf_tree = extractor.get_tree_structure(model_type, model, favor_figures)
66 log.info("Tree structure built, creating html...")
67 pdf_html = extractor.get_html_tree()
~/anaconda3/lib/python3.6/site-packages/pdftotree/TreeExtract.py in get_tree_structure(self, model_type, model, favor_figures)
236 ref_page_seen,
237 tables[page_num],
--> 238 favor_figures,
239 )
240 return self.tree
~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/pdf_parsers.py in parse_tree_structure(elems, font_stat, page_num, ref_page_seen, tables, favor_figures)
760 # Figures for this page
761 figures_page = get_figures(
--> 762 mentions, elems.layout.bbox, page_num, boxes_figures, page_width, page_height
763 )
764
~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/pdf_parsers.py in get_figures(boxes, page_bbox, page_num, boxes_figures, page_width, page_height)
1244
1245 for fig_box in boxes_figures:
-> 1246 node_fig = Node(fig_box)
1247 nodes_figures.append(node_fig)
1248
~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/node.py in __init__(self, elems)
45 # # self.sum_elem_bbox = self.sum_elem_bbox + len(elem.get_text())
46 self.table_area_threshold = 0.7
---> 47 self.set_bbox(bound_elems(elems))
48 # self.table_indicator = True
49 self.type_counts = Counter(map(elem_type, elems))
~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/vector_utils.py in bound_elems(elems)
119 Finds the minimal bbox that contains all given elems
120 """
--> 121 group_x0 = min(map(lambda l: l.x0, elems))
122 group_y0 = min(map(lambda l: l.y0, elems))
123 group_x1 = max(map(lambda l: l.x1, elems))
ValueError: min() arg is an empty sequence
Environment (please complete the following information):
- OS: Ubuntu 18.04
- Python: 3.6.4 (Anaconda distribution)
- pdftotree: v0.4.0
I have run into the same issue with my PDF, and it is not a scanned document. I have checked all the bug fixes and the problem still persists.
I had a similar error and found that my PDF contained empty LTFigures. These empty objects cause the error: an empty figure has no child elements, so there is no l.x0, l.y0, l.x1 or l.y1 to read, the mapping is empty, and min() fails with "arg is an empty sequence".
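To make the failure mode concrete, here is a minimal sketch that does not need pdftotree or pdfminer. Box is a hypothetical stand-in for a layout element with a bounding box, a plain list stands in for an LTFigure, and bound_elems is a simplified copy of the function named in the traceback (the y1 line is assumed by symmetry), so treat it as an illustration rather than the library's code.

```python
from collections import namedtuple

# Hypothetical stand-in for a pdfminer layout element with a bounding box.
Box = namedtuple("Box", ["x0", "y0", "x1", "y1"])


def bound_elems(elems):
    # Simplified copy of the failing function: the minimal bbox
    # that contains all given elems.
    group_x0 = min(map(lambda l: l.x0, elems))
    group_y0 = min(map(lambda l: l.y0, elems))
    group_x1 = max(map(lambda l: l.x1, elems))
    group_y1 = max(map(lambda l: l.y1, elems))
    return (group_x0, group_y0, group_x1, group_y1)


def try_bound(elems):
    """Return the bbox, or the name of the exception that was raised."""
    try:
        return bound_elems(elems)
    except ValueError as exc:
        return type(exc).__name__


non_empty_figure = [Box(1, 2, 5, 6), Box(0, 3, 4, 9)]
empty_figure = []  # a figure with no children iterates like an empty list

print(try_bound(non_empty_figure))  # (0, 2, 5, 9)
print(try_bound(empty_figure))      # ValueError
```

The exact wording of the ValueError message can differ between Python versions, so the sketch only reports the exception type.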
I solved it by not adding empty LTFigures while collecting the elements of the PDF. You need to add a single if statement to the function processor(m) in the module pdf_utils.py (pdftotree.utils.pdf.pdf_utils). See # ADD THIS.
def processor(m):
    # Normalizes the coordinate system to be consistent with
    # image library conventions (top left as origin)
    if isinstance(m, LTComponent):
        m.set_bbox(normalize_bbox(m.bbox, height, scaler))

        if isinstance(m, LTCurve):
            m.pts = normalize_pts(m.pts, height, scaler)
            # Only keep longer lines here
            if isinstance(m, LTLine) and max(m.width, m.height) > pts_thres:
                segments.append(m)
                return
            # Here we exclude straight lines from curves
            curves.append(m)
            return

        if isinstance(m, LTFigure):
            if len(m) > 0:  # ADD THIS
                figures.append(m)
            return

        # Collect stats on the chars
        if isinstance(m, LTChar):
            chars.append(m)
            # fonts could be rotated 90/270 degrees
            font_size = _font_size_of(m)
            font_size_counter[font_size] += 1
            return

        if isinstance(m, LTTextLine):
            mention_text = keep_allowed_chars(m.get_text()).strip()
            # Skip empty and invalid lines
            if mention_text:
                # TODO: add subscript detection and use latex underscore
                # or superscript
                m.clean_text = mention_text
                m.font_name, m.font_size = _font_of_mention(m)
                mentions.append(m)
            return
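The effect of the len(m) > 0 check can be sketched in isolation. In the example below, Box, keep_non_empty and the list-based figures are hypothetical stand-ins rather than pdftotree or pdfminer objects, and bound_elems is again a simplified copy of the function from the traceback. It only illustrates that filtering out empty containers before computing bounding boxes avoids the exception.

```python
from collections import namedtuple

# Hypothetical stand-in for a layout element with a bounding box.
Box = namedtuple("Box", ["x0", "y0", "x1", "y1"])


def bound_elems(elems):
    # Simplified copy of the function from the traceback.
    return (
        min(e.x0 for e in elems),
        min(e.y0 for e in elems),
        max(e.x1 for e in elems),
        max(e.y1 for e in elems),
    )


def keep_non_empty(figures):
    """Hypothetical helper mirroring the len(m) > 0 check: drop empty figures."""
    return [fig for fig in figures if len(fig) > 0]


# Three figures; the middle one has no children.
figures = [
    [Box(0, 0, 10, 10)],
    [],
    [Box(5, 5, 8, 8), Box(2, 6, 9, 7)],
]

kept = keep_non_empty(figures)
bboxes = [bound_elems(fig) for fig in kept]

print(len(kept))  # 2
print(bboxes)     # [(0, 0, 10, 10), (2, 5, 9, 8)]
```

Without the filter, the list comprehension would raise ValueError on the second figure, which is the same failure as in the stack trace above.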
Duplicate of #42