segment-any-text/wtpsplit

Accuracy: Error in Split (EN)

Qubitium opened this issue · 1 comment

Environment: Linux with a 4090 GPU. We found unexpected output from split.

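from wtpsplit import WtP

# Load the model, then move it to the GPU in half precision.
wtp = WtP("wtp-canine-s-12l-no-adapters")
wtp.half().to("cuda")

# Short list-style input with single newlines between lines.
txt = """Title: A monkey's Tale
Rating: O
Words: 104"""
r = wtp.split(txt, "en")
print(r)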

Actual:

['Title: ', 'A ', "monkey's Tale\n", 'Rating: O\n', 'Words: 104']

Expected:

["Title: A monkey's Tale\n", 'Rating: O\n', 'Words: 104']

The input contains only single newlines. We did not expect "A monkey's Tale" to be split into two sentences.
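The expectation above can be written as a small check. This is only a sketch, assuming every newline in the input is a hard boundary and that no split should occur inside a line. The helper name `mid_line_breaks` is hypothetical and not part of wtpsplit; it only inspects the list that `split` returns.

```python
def mid_line_breaks(segments):
    """Return the segments that end in the middle of a line.

    Assumption: every segment except the last should end with a newline,
    which holds for short list-style inputs like the one in this issue.
    """
    return [s for s in segments[:-1] if not s.endswith("\n")]


# Outputs copied from this issue.
actual = ['Title: ', 'A ', "monkey's Tale\n", 'Rating: O\n', 'Words: 104']
expected = ["Title: A monkey's Tale\n", 'Rating: O\n', 'Words: 104']

print(mid_line_breaks(actual))    # segments that break inside a line
print(mid_line_breaks(expected))  # empty when only newlines are boundaries
```

A check like this could serve as a regression test for list-style inputs, since it does not depend on which model produced the segments.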

Perhaps a few training samples with this type of short, list-style format would eliminate such corner cases.

Hi!

There was a subtle bug in the hash embeddings that affected some texts in some models. It is fixed in 1.3.0, which should now give the expected output ["Title: A monkey's Tale\n", 'Rating: O\n', 'Words: 104'].
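To confirm that an environment already has the fix, the installed version can be compared against 1.3.0. The following is a minimal sketch using only the standard library. The helper `has_hash_embedding_fix` is a hypothetical name, and the parsing assumes a simple dotted version string, reading only the leading digits of each component.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def has_hash_embedding_fix(installed, fixed=(1, 3, 0)):
    """Return True if a dotted version string is at least `fixed`.

    Assumption: only the leading digits of the first three components
    matter; suffixes such as "rc1" are ignored.
    """
    nums = []
    for part in installed.split(".")[:3]:
        match = re.match(r"\d+", part)
        nums.append(int(match.group()) if match else 0)
    nums += [0] * (3 - len(nums))  # pad "1.3" to (1, 3, 0)
    return tuple(nums) >= fixed


try:
    installed = version("wtpsplit")
    print(installed, has_hash_embedding_fix(installed))
except PackageNotFoundError:
    print("wtpsplit is not installed")
```

If the check reports an older version, upgrading the package with pip should pick up the fix described above.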