Bugfix for scrubber sample code which fails when scrubbing "two"
The code in the "Scrubber" section of https://derwen.ai/docs/ptr/sample/ has a small bug: when you add a token that also exists as a single-term span in the file, like "two", the while loop will consume the whole span, and span[0] will then raise an IndexError. Easy fix:
In the original code (using my tokens instead of the ones on the page):
def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while span[0].text in ("every", "other", "the", "two"):  # ATTN: different tokens; will fail in the original code
            span = span[1:]
        return span.text
    return scrubber_func
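For reference, the failure can be reproduced in isolation, without pytextrank at all; this minimal sketch only relies on spaCy's Span slicing and indexing behavior:

import spacy

nlp = spacy.blank("en")
doc = nlp("two")
span = doc[0:1]  # a Span holding just the token "two"
span = span[1:]  # what the while loop does when every token matches
span[0]          # raises IndexError, because the span is now empty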
just add len(span) > 1 and in front of the condition, i.e. replace
while span[0].text in ("every", "other", "the", "two"):
with
while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
to get
def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func
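For context, here is a minimal end-to-end sketch of how the fixed scrubber plugs into a pipeline. It assumes pytextrank's registry-based scrubber configuration and the en_core_web_sm model; the input text is a made-up stand-in, not the file used on the page:

import spacy
import pytextrank  # noqa: F401  (registers the "textrank" pipeline component)
from spacy.tokens import Span

@spacy.registry.misc("prefix_scrubber")
def prefix_scrubber():
    def scrubber_func(span: Span) -> str:
        while len(span) > 1 and span[0].text in ("every", "other", "the", "two"):
            span = span[1:]
        return span.text
    return scrubber_func

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank", config={"scrubber": {"@misc": "prefix_scrubber"}})

doc = nlp("The two sentences are merged. Every other sentence stays.")  # stand-in text
for phrase in doc._.phrases:
    print(f"{phrase.rank:.8f}, {phrase.count:02d}, {phrase.text}, {phrase.chunks}")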
Now, for the sample used on that page, I get:
0.13134098, 05, sentences, [sentences, the two sentences, sentences, two sentences, the sentences]
0.07117996, 02, sentence, [every sentence, every other sentence]
and the line for "two" is still fine:
0.00000000, 02, two, [two, two]
You are welcome to use the token list I used, ("every", "other", "the", "two"); it gives even more merged results than the example on the page.
I have created a PR.