A Rectangle must have a non-negative width (from RegEx text detection)
DrPlanecraft opened this issue · 4 comments
Hello Again!
I have an issue similar to my previous report. This time, however, the failure is in RegularExpressionTextExtraction; the document has passed through SimpleLineOfTextExtraction.
To reproduce, run the following code without the try-except:
from borb.toolkit import RegularExpressionTextExtraction
from borb.pdf import PDF
missedMatches = [('lactobacilli', ('that', 'lactobacilli', '–', 'good', 'bacteria')), ('–', ('that', 'lactobacilli', '–', 'good', 'bacteria')), ('system', ('that', 'live', 'in', 'the', 'digestive', 'system')), ('Shirota-', ('root', 'of', 'the', 'business', 'activities.', 'In', 'addition', 'to', 'these', 'core', 'ideas,', 'Shirota-')), ('ism', ('ism', 'also', 'encompasses', 'the', 'virtues', 'of', 'sincerity,', 'care', 'for', 'the', 'community,')), ('price', ('A', 'price', 'anyone')), ('can', ('can', 'afford')), ('afford', ('can', 'afford')), ('–', ('–', 'were', 'able', 'to', 'inhibit', 'the', 'growth')), ('6', ('6',)), ('7', ('7',)), ('L.', ('L.', 'casei', 'strain', 'Shirota')), ('exclusive', ('is', 'exclusive', 'only', 'to', 'Yakult')), ('discovered', ('discovered', 'by', 'our')), ('exclusive', ('Shirota.', 'It', 'is', 'exclusive')), ('cannot', ('cannot', 'be', 'found', 'in')), ('found', ('cannot', 'be', 'found', 'in')), ('any', ('any', 'other', 'cultured')), ('other', ('any', 'other', 'cultured')), ('drinks.', ('milk', 'drinks.')), ('–', ('–',)), ('Intestinal', ('A', 'Healthy', 'Intestinal')), ('Tract,', ('Tract,', 'Healthy', 'Life')), ('Life', ('Tract,', 'Healthy', 'Life')), ('Masses', ('the', 'Masses')), ('F', ('F',)), ('I', ('I',)), ('R', ('R',)), ('S', ('S',)), ('T', ('T',)), ('P', ('P',)), ('R', ('R',)), ('O', ('O',)), ('D', ('D',)), ('U', ('U',)), ('C', ('C',)), ('E', ('E',)), ('D', ('D',)), ('I', ('I',)), ('N', ('N',)), ('1', ('1',)), ('9', ('9',)), ('3', ('3',)), ('5', ('5',))]
with open("Artwork 2.pdf", "rb") as wkwk:
    artwork = PDF.loads(file=wkwk, event_listeners=[])
print(artwork.get_page(0))
print("\nArtwork:\n")
for word, sentence in missedMatches:  # Artwork matches missed
    sentence = " ".join(sentence).strip().replace("'", "’").replace('-', "–")
    print(sentence)
    try:
        # Same call as the one shown in the traceback below.
        extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, artwork)
    except AssertionError:
        print("triggered")
        continue  # nothing to inspect for this sentence
    print(extractedSentence[1][0].get_bounding_boxes()[0].get_x())
    print(extractedSentence[1][0].get_bounding_boxes()[0].get_y())
    print(extractedSentence[1][0].get_bounding_boxes()[0].get_height())
    print(extractedSentence[1][0].get_bounding_boxes()[0].get_width())
    print(extractedSentence)
    print("\n")
Traceback (most recent call last):
File "C:\Users\Lenovo\OneDrive\Documents\LI ZHUOXI\ITE- College West\Lessons\Industrial Attachment Program\IAP Higher Nitec AI Applications\HumanKind Design Pte Ltd\AI_Proofreading\operations.py", line 185, in findOnPDF
extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, self.artwork)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 371, in get_matches_for_pdf
CanvasStreamProcessor(page, Canvas(), []).read(page_source, [cse])
File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\canvas_stream_processor.py", line 305, in read
raise e
File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\canvas_stream_processor.py", line 299, in read
operator.invoke(self, operands, event_listeners)
File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\operator\text\show_text.py", line 49, in invoke
l._event_occurred(tri)
File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 322, in _event_occurred
self._render_text(event)
File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 334, in _render_text
for e in text_render_info.split_on_glyphs():
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\event\chunk_of_text_render_event.py", line 172, in split_on_glyphs
e._baseline_bounding_box = Rectangle(
^^^^^^^^^^
File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\geometry\rectangle.py", line 29, in __init__
assert width >= 0, "A Rectangle must have a non-negative width."
^^^^^^^^^^
AssertionError: A Rectangle must have a non-negative width.
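For context, the last frame of the traceback is a constructor guard. The sketch below uses a hypothetical stand-in class, `SketchRectangle` (not borb's own `Rectangle`), with the assertion copied from the traceback, to show that any negative width ends in this error regardless of the other arguments. `try_build` is likewise a hypothetical helper. Why `split_on_glyphs` arrives at a negative width for this document is not established here.

```python
from dataclasses import dataclass


@dataclass
class SketchRectangle:
    """Hypothetical stand-in; only the width check is taken from the traceback."""

    x: float
    y: float
    width: float
    height: float

    def __post_init__(self) -> None:
        # Same guard and message as rectangle.py line 29 in the traceback.
        assert self.width >= 0, "A Rectangle must have a non-negative width."


def try_build(x: float, y: float, width: float, height: float):
    """Return the rectangle, or the assertion message if the guard rejects it."""
    try:
        return SketchRectangle(x, y, width, height)
    except AssertionError as error:
        return str(error)
```

A width of zero passes the guard; only strictly negative values are rejected.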
Expected behaviour
I want to get the locations of all regex matches so that I can draw boxes on the PDF itself.
Desktop (please complete the following information):
- OS: Windows 11
- borb version 2.1.19.2
- Artwork 2.pdf
Additional context
Edit: replaced the linked document with a mostly valid document
Can you please make your example as minimal as possible? Rather than attempting to match everything in the list for instance, you could limit your example to the first failing match.
Kind regards,
Joris Schellekens
Do I need to cut down the PDF as well?
If not, here is the updated code:
from borb.toolkit import RegularExpressionTextExtraction
from borb.pdf import PDF
missedMatches = [('lactobacilli', ('that', 'lactobacilli', '–', 'good', 'bacteria'))]
with open("Artwork 2.pdf", "rb") as file:
    artwork = PDF.loads(file=file)
for word, sentence in missedMatches:  # Artwork matches missed
    sentence = " ".join(sentence).strip().replace("'", "’").replace('-', "–")
    print(sentence)
    extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, artwork)
    print(extractedSentence)
    print("\n")
I apologise for any syntax or formatting errors, as I am writing this reply on a mobile phone.
@jorisschellekens, I have edited the main post to update the linked PDF. I realised that the previous PDF was 0 bytes, which produced a separate error.
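One way to reduce the original list to the first failing match, as requested above, is to run each pattern in isolation and stop at the first one that raises. This is only a minimal sketch: `first_failing`, `run`, and `fake_run` are hypothetical names, and `fake_run` is a stand-in used purely for demonstration. In the report, `run` would wrap the `get_matches_for_pdf` call shown in the traceback.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_failing(
    patterns: Iterable[str],
    run: Callable[[str], object],
) -> Optional[Tuple[str, AssertionError]]:
    """Return the first pattern for which run() raises AssertionError, plus the error."""
    for pattern in patterns:
        try:
            run(pattern)
        except AssertionError as error:
            return pattern, error
    return None  # every pattern ran without an assertion failure


# Stand-in for the real extraction call; it fails for one arbitrary input.
def fake_run(pattern: str) -> None:
    assert pattern != "second", "A Rectangle must have a non-negative width."
```

Once the first failing pattern is known, the reproduction script can hard-code that single string instead of iterating over the whole list.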