segment-any-text/wtpsplit

Any string that isn't a multiple of 4 causes an assert failure

intelliqua opened this issue · 2 comments

Hi,

Any input string whose length isn't a multiple of 4 causes an assertion failure at line 548 in models.py:
"assert char_encoding.shape[1] % self.conv.stride[0] == 0"

stride is initialised to config.downsampling_rate (4) in modeling_canine.py in the transformers library.

Sample code that triggers the assertion failure (the input string is 35 characters long):
from wtpsplit import WtP
wtp = WtP("wtp-canine-s-12l")
wtp.split("This is a test This is another test", lang_code="en")

Sample code that works (the added full stop brings the input string length to 36):
from wtpsplit import WtP
wtp = WtP("wtp-canine-s-12l")
wtp.split("This is a test This is another test.", lang_code="en")
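For anyone stuck on an affected version, the arithmetic behind the two samples can be sketched in plain Python. This is only an illustration under assumptions: `padding_needed` and `pad_to_multiple` are hypothetical helpers, not part of wtpsplit or transformers, and the character count is used as a stand-in for `char_encoding.shape[1]`, which matches the two samples above but may not hold for every input.

```python
DOWNSAMPLING_RATE = 4  # value of config.downsampling_rate reported in this issue


def padding_needed(length: int, multiple: int = DOWNSAMPLING_RATE) -> int:
    """Return how many extra characters bring `length` up to a multiple of `multiple`."""
    return (-length) % multiple


def pad_to_multiple(text: str, multiple: int = DOWNSAMPLING_RATE, pad_char: str = " ") -> str:
    """Right-pad `text` so that len(text) is divisible by `multiple` (hypothetical workaround)."""
    return text + pad_char * padding_needed(len(text), multiple)


failing = "This is a test This is another test"   # 35 characters
working = "This is a test This is another test."  # 36 characters

print(len(failing) % DOWNSAMPLING_RATE)  # 3, so the assert would fail
print(len(working) % DOWNSAMPLING_RATE)  # 0, so the assert would pass
print(len(pad_to_multiple(failing)))     # 36
```

Padding changes the input text, so it is a stopgap at best; upgrading to a release that contains the fix is the cleaner route.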

oof that's a big one, sorry about that. It's a symptom of being lazy and only testing wtp-bert-mini in CI.

It's fixed in v1.2.3. Can you confirm it works now?

Thanks! That was quick. Yes, it is fixed 👍