Accuracy: Error in Split (EN)
Qubitium opened this issue · 1 comment
Qubitium commented
Linux with a 4090 GPU. We found unexpected output from wtp.split.
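from wtpsplit import WtP
# Load the model, then move it to the GPU in half precision.
wtp = WtP("wtp-canine-s-12l-no-adapters")
wtp.half().to("cuda")
# Three short lines separated by single newlines.
txt = """Title: A monkey's Tale
Rating: O
Words: 104"""
# Split using the English ("en") language code.
r = wtp.split(txt, "en")
print(r)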
Actual:
['Title: ', 'A ', "monkey's Tale\n", 'Rating: O\n', 'Words: 104']
Expected:
["Title: A monkey's Tale\n", 'Rating: O\n', 'Words: 104']
The input contains only single newlines. We did not expect "A monkey's Tale" to be split into two sentences.
Perhaps a few training samples in this kind of short, list-like format would eliminate such corner cases.
bminixhofer commented
Hi!
There was a subtle bug in the hash embeddings which affected some texts in some models. This is fixed in 1.3.0, which should now give the expected output: ["Title: A monkey's Tale\n", 'Rating: O\n', 'Words: 104'].
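For anyone hitting the same output, a quick way to tell whether the installed package already includes the fix is to compare its version against 1.3.0. The snippet below is only a rough sketch using the standard library: the helper names (parse_release, has_fix, installed_has_fix) are made up for illustration and are not part of wtpsplit, and the version parsing is deliberately naive (it ignores pre-release suffixes such as "rc1").

```python
from importlib.metadata import PackageNotFoundError, version

# Release that contains the hash-embedding fix, per the comment above.
FIXED_IN = (1, 3, 0)


def parse_release(v: str) -> tuple:
    """Turn a version string like '1.3.0' into (1, 3, 0).

    Naive on purpose: keeps only the leading digits of each dotted
    part and pads missing parts with zeros.
    """
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '1.3' compares equal to '1.3.0'.
    while len(parts) < len(FIXED_IN):
        parts.append(0)
    return tuple(parts)


def has_fix(v: str) -> bool:
    """True if version string v is at least FIXED_IN."""
    return parse_release(v) >= FIXED_IN


def installed_has_fix(dist: str = "wtpsplit") -> bool:
    """Check the installed distribution; False if it is not installed."""
    try:
        return has_fix(version(dist))
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("fix present:", installed_has_fix())
```

If this reports False, upgrading the package and re-running the snippet from the original report should show whether the title line is still being split.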