Single words incorrectly segmented into character sequences
lhcoder opened this issue · 4 comments
@bminixhofer Hello,
I've encountered an issue with the code at wtpsplit/wtpsplit/utils/__init__.py (line 452 in 0f675f7).
When processing texts like "Abstract", "Hello", and "Hello Hello", the predicted token_logits values are consistently high. As a result, the model tends to split these words into individual character sequences, e.g., ["H", "e", "l", "l", "o"], rather than keeping them as whole tokens.
Perhaps it would be more effective to replace np.min(token_logits) with a fixed smaller value (e.g., -10) to prevent this overly aggressive splitting. A sketch of the idea is shown below.
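For illustration only, a minimal sketch of the suggested change; the actual code in wtpsplit/wtpsplit/utils/__init__.py around line 452 may look different, and the example array is made up:

```python
import numpy as np

# made-up logits for a very short input, where every value is high
token_logits = np.array([3.2, 2.9, 3.1, 3.0, 2.8])

# current behavior: the fallback/threshold value tracks the minimum observed
# logit, which stays high when all logits are high, so every position ends up
# looking like a boundary
fill_value = np.min(token_logits)

# suggested behavior: use a fixed low constant instead
fill_value = -10.0
```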
Hi, which model are you using so you are getting this behavior?
sat-12l-sm
I see, thanks for the info. I'm not sure if replacing np.min(token_logits) is a sustainable solution; it may have undesired consequences. In general, segmenting such very short sequences is uncommon, so I assume they are somewhat out of domain. In contrast, "Abstract Abstract." works just fine. To work around this, you could add a simple filter based on string length, along the lines of the sketch below.
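A minimal sketch of such a filter, assuming the SaT API from the wtpsplit README; the length threshold is an arbitrary choice:

```python
from wtpsplit import SaT

sat = SaT("sat-12l-sm")

def split_with_length_filter(text: str, min_len: int = 10) -> list[str]:
    # Very short inputs are returned as-is instead of being segmented,
    # since they are unlikely to contain more than one sentence anyway.
    if len(text) < min_len:
        return [text]
    return sat.split(text)

print(split_with_length_filter("Hello"))               # ['Hello']
print(split_with_length_filter("Abstract Abstract."))  # handled by the model
```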
Another solution would be to use non-SM models, e.g. sat-12l. SM models have been fine-tuned on a diverse set of sentences, so it is not surprising that sequences that don't resemble sentences at all fall out of domain. However, the pre-training objective of sat-12l has been more diverse, so it should be more robust, handling short sequences as well as sentences and paragraphs.
Indeed, in my tests, it split your test cases just fine, e.g.:
>>> sat.split("Hello Hello")
['Hello Hello']
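(For completeness, a sketch of how the non-SM model above could be loaded, following the usage shown in the wtpsplit README:)

```python
from wtpsplit import SaT

sat = SaT("sat-12l")             # non-SM checkpoint discussed above
# sat.half().to("cuda")          # optional, for GPU use
print(sat.split("Hello Hello"))
```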
Checked again, and it should actually not be a problem! I pushed a fix in v2.0.8. Thanks for both raising this and providing a solution! I will close the issue.