DerwenAI/pytextrank

Bugfix for scrubber sample code which fails when scrubbing "two"

0dB opened this issue · 2 comments

0dB commented

The code in the "Scrubber" section of https://derwen.ai/docs/ptr/sample/ has a small bug: when you add a token that also occurs as a single-token span in the document, like "two", the while loop consumes the entire span, and span[0] then raises an IndexError. Easy fix:

Starting from this (using my tokens instead of the ones on the page):

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while span[0].text in ("every", "other", "the", "two"): # NB: different tokens than on the page; fails once the span is fully consumed
            span = span[1:]
        return span.text
    return scrubber_func

just add a len(span) > 1 guard, replacing

while span[0].text in ("every", "other", "the", "two"):

with

while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):

to get

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func
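The fix can be exercised against real Span objects without downloading a model, since spacy.blank("en") ships a tokenizer. A minimal sketch, using the token list from this issue:

```python
import spacy
from spacy.tokens import Span

def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        # Loop only while at least one token remains, so an all-prefix
        # span like "two" is returned as-is instead of raising IndexError.
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]  # slicing a Span yields a shorter Span
        return span.text
    return scrubber_func

nlp = spacy.blank("en")  # tokenizer only; no model download needed
scrub = prefix_scrubber()

print(scrub(nlp("the two sentences")[0:3]))  # "sentences"
print(scrub(nlp("two")[0:1]))                # "two" -- no crash
```

With the original, unguarded loop the second call would strip "two", leave an empty span, and fail on span[0].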

Now, for the sample used on that page, I get

0.13134098, 05, sentences, [sentences, the two sentences, sentences, two sentences, the sentences]
0.07117996, 02, sentence, [every sentence, every other sentence]

and the line for "two" is still fine

0.00000000, 02, two, [two, two]

You are welcome to use the token list I used, ("every", "other", "the", "two"), it gives even more merged results than the example on the page.
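For completeness, the sample page wires the scrubber into the pipeline through spaCy's misc registry; a sketch of that hookup with the fixed function (the registry name "prefix_scrubber" follows the page's convention):

```python
import spacy
from spacy.tokens import Span

@spacy.registry.misc("prefix_scrubber")
def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        # Guarded loop from the fix above: never consume the whole span.
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func

# With a full pipeline (assumes pytextrank and en_core_web_sm are installed):
#   nlp = spacy.load("en_core_web_sm")
#   nlp.add_pipe("textrank", config={"scrubber": {"@misc": "prefix_scrubber"}})

# The registered function can be fetched back and used directly:
scrub = spacy.registry.misc.get("prefix_scrubber")()
print(scrub(spacy.blank("en")("the two sentences")[0:3]))  # "sentences"
```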

0dB commented

I have created a PR.

Thank you kindly @0dB, looks great.
I'm working to resolve the CI issue and get this merged.