jorisschellekens/borb

A Rectangle must have a non-negative width (from RegEx text detection)

DrPlanecraft opened this issue · 4 comments

Hello Again!

I have an issue similar to my previous report. However, this time it is in RegularExpressionTextExtraction; it has passed through SimpleLineOfTextExtraction.

To reproduce, run the following code without the try-catch:

from borb.toolkit import RegularExpressionTextExtraction
from borb.pdf import PDF
missedMatches = [('lactobacilli', ('that', 'lactobacilli', '–', 'good', 'bacteria')), ('–', ('that', 'lactobacilli', '–', 'good', 'bacteria')), ('system', ('that', 'live', 'in', 'the', 'digestive', 'system')), ('Shirota-', ('root', 'of', 'the', 'business', 'activities.', 'In', 'addition', 'to', 'these', 'core', 'ideas,', 'Shirota-')), ('ism', ('ism', 'also', 'encompasses', 'the', 'virtues', 'of', 'sincerity,', 'care', 'for', 'the', 'community,')), ('price', ('A', 'price', 'anyone')), ('can', ('can', 'afford')), ('afford', ('can', 'afford')), ('–', ('–', 'were', 'able', 'to', 'inhibit', 'the', 'growth')), ('6', ('6',)), ('7', ('7',)), ('L.', ('L.', 'casei', 'strain', 'Shirota')), ('exclusive', ('is', 'exclusive', 'only', 'to', 'Yakult')), ('discovered', ('discovered', 'by', 'our')), ('exclusive', ('Shirota.', 'It', 'is', 'exclusive')), ('cannot', ('cannot', 'be', 'found', 'in')), ('found', ('cannot', 'be', 'found', 'in')), ('any', ('any', 'other', 'cultured')), ('other', ('any', 'other', 'cultured')), ('drinks.', ('milk', 'drinks.')), ('–', ('–',)), ('Intestinal', ('A', 'Healthy', 'Intestinal')), ('Tract,', ('Tract,', 'Healthy', 'Life')), ('Life', ('Tract,', 'Healthy', 'Life')), ('Masses', ('the', 'Masses')), ('F', ('F',)), ('I', ('I',)), ('R', ('R',)), ('S', ('S',)), ('T', ('T',)), ('P', ('P',)), ('R', ('R',)), ('O', ('O',)), ('D', ('D',)), ('U', ('U',)), ('C', ('C',)), ('E', ('E',)), ('D', ('D',)), ('I', ('I',)), ('N', ('N',)), ('1', ('1',)), ('9', ('9',)), ('3', ('3',)), ('5', ('5',))]
with open("Artwork 2.pdf","rb") as wkwk:
    artwork = PDF.loads(file=wkwk,event_listeners=[])

print(artwork.get_page(0))
print("\nArtwork:\n")
for word, sentence in missedMatches: # Artwork Matches missed
    sentence = " ".join(sentence).strip().replace("'","’").replace('-', "–")
    print(sentence)

    try:
        # same call as in the traceback below
        extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, artwork)
    except AssertionError:
        # the call raised, so extractedSentence was never assigned; skip this sentence
        print("triggered")
        continue

    # first bounding box of the first match (assumes a match exists under index 1)
    box = extractedSentence[1][0].get_bounding_boxes()[0]
    print(box.get_x())
    print(box.get_y())
    print(box.get_height())
    print(box.get_width())

    print(extractedSentence)
    print("\n")

Traceback (most recent call last):
  File "C:\Users\Lenovo\OneDrive\Documents\LI ZHUOXI\ITE- College West\Lessons\Industrial Attachment Program\IAP Higher Nitec AI Applications\HumanKind Design Pte Ltd\AI_Proofreading\operations.py", line 185, in findOnPDF
    extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, self.artwork)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 371, in get_matches_for_pdf
    CanvasStreamProcessor(page, Canvas(), []).read(page_source, [cse])
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\canvas_stream_processor.py", line 305, in read
    raise e
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\canvas_stream_processor.py", line 299, in read
    operator.invoke(self, operands, event_listeners)
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\operator\text\show_text.py", line 49, in invoke
    l._event_occurred(tri)
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 322, in _event_occurred
    self._render_text(event)
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\toolkit\text\regular_expression_text_extraction.py", line 334, in _render_text
    for e in text_render_info.split_on_glyphs():
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\event\chunk_of_text_render_event.py", line 172, in split_on_glyphs
    e._baseline_bounding_box = Rectangle(
                               ^^^^^^^^^^
  File "C:\Users\Lenovo\anaconda3\envs\HumanKind\Lib\site-packages\borb\pdf\canvas\geometry\rectangle.py", line 29, in __init__
    assert width >= 0, "A Rectangle must have a non-negative width."
           ^^^^^^^^^^
AssertionError: A Rectangle must have a non-negative width.
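The assertion comes from the `Rectangle` constructor, which rejects a negative width. As a rough illustration of how such a value can arise (a sketch only, not borb's actual code): if a glyph's width is computed as `x1 - x0` and the end coordinate lies to the left of the start, the result is negative. `span_to_box` below is a hypothetical helper that orders the endpoints before computing the width.

```python
from typing import Tuple


def span_to_box(x0: float, x1: float) -> Tuple[float, float]:
    """Return (left, width) for a horizontal span, with width always >= 0.

    Hypothetical helper for illustration; it is not part of borb.
    """
    left = min(x0, x1)
    width = abs(x1 - x0)
    return left, width


# A span whose end lies left of its start would give a negative width
# if the width were computed naively as x1 - x0.
left, width = span_to_box(10.0, 7.0)
assert width >= 0
print(left, width)
```

Whether reordering the endpoints is the right fix depends on why the coordinates are reversed in the first place (for example a mirrored text matrix), which this sketch does not address.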

Expected behaviour
I want to get the locations of all regex matches so I can draw boxes on the PDF itself.

Additional context
Edit: replaced the linked document with a mostly valid document

Can you please make your example as minimal as possible? Rather than attempting to match everything in the list for instance, you could limit your example to the first failing match.

Kind regards,
Joris Schellekens
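One way to narrow the list down to the first failing entry is to stop at the first `AssertionError`. A minimal sketch: `first_failure` is a hypothetical helper, and `search` stands in for whatever call does the matching (for example the `get_matches_for_pdf` call shown in the traceback). The `fake_search` function only simulates a failure so the sketch runs without a PDF.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def first_failure(items: Iterable[T], search: Callable[[T], object]) -> Optional[T]:
    """Return the first item for which search(item) raises AssertionError, else None."""
    for item in items:
        try:
            search(item)
        except AssertionError:
            return item
    return None


# Stand-in for the real search: fails on any text that contains an en dash.
def fake_search(text: str) -> None:
    assert "–" not in text, "simulated failure"


sentences = ["can afford", "that lactobacilli – good bacteria", "–"]
print(first_failure(sentences, fake_search))
```

The returned entry can then be used on its own as the minimal reproduction.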

Do I need to cut down the PDF as well?

If not, here is the updated code:

from borb.toolkit import RegularExpressionTextExtraction
from borb.pdf import PDF
missedMatches = [('lactobacilli', ('that', 'lactobacilli', '–', 'good', 'bacteria'))]

with open("Artwork 2.pdf","rb") as file:
    artwork = PDF.loads(file=file)

for word, sentence in missedMatches: # Artwork Matches missed
    sentence = " ".join(sentence).strip().replace("'","’").replace('-', "–")
    print(sentence)

    extractedSentence = RegularExpressionTextExtraction(sentence).get_matches_for_pdf(sentence, artwork)

    print(extractedSentence)
    print("\n")
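
A side note on the pattern itself: because the joined sentence is passed as a regular expression, characters such as `.` in `L. casei` or `drinks.` act as metacharacters rather than literal text. If a literal match is intended, the text can be escaped first. A minimal sketch using only the standard library; `build_pattern` is a hypothetical helper name, and the replacements mirror the ones in the code above.

```python
import re
from typing import Sequence


def build_pattern(words: Sequence[str]) -> str:
    """Join words into a sentence and escape it for use as a literal regex."""
    sentence = " ".join(words).strip().replace("'", "’").replace("-", "–")
    return re.escape(sentence)


pattern = build_pattern(("L.", "casei", "strain", "Shirota"))
# The escaped pattern matches the literal text, but not text where the
# "." position holds some other character.
assert re.search(pattern, "the L. casei strain Shirota is")
assert not re.search(pattern, "the Lx casei strain Shirota is")
```

This is unrelated to the `AssertionError` itself; it only makes the matches stricter.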

I apologise for any syntax/formatting errors, as I am writing this reply on a mobile phone.

@jorisschellekens, I have edited the main post to update the linked PDF. I realised that the previous PDF was 0 bytes, which produced a separate error.

In the latest version of borb, this does not throw an error.