Numbers are not segmented the same way depending on the Script/Language

Question

ManyTheFish opened this issue a year ago · 13 comments

Description

The string hello 95532387 is segmented as "hello", " ", "95532387", whereas the string สปริงเกียร์ร่วม 95532387 is segmented as "สปร\u{e34}ง", "เก\u{e35}ยร\u{e4c}", "ร\u{e48}วม", " ", "9", "5", "5", "3", "2", "3", "8", "7".

Expected behavior

The number 95532387 should always be segmented as "95532387".

How to solve the issue

Before calling a specialized segmenter during the segmentation process, we should check whether the interleaved string contains only numbers. If it does, we return Some(s) directly; otherwise, we call the specialized segmenter.
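
The check described above can be sketched in isolation. This is only an illustration, not Charabia's actual code: `specialized_segment` and `segment_with_numeric_guard` are hypothetical names, and the stand-in segmenter simply splits per character to mimic the digit-by-digit output reported in this issue.

```rust
/// Hypothetical stand-in for a script-specific segmenter (Thai, Japanese, ...).
/// It returns one piece per character, which reproduces the unwanted
/// "9", "5", "5", ... behaviour for digit runs.
fn specialized_segment(s: &str) -> Vec<&str> {
    s.char_indices()
        .map(move |(i, c)| &s[i..i + c.len_utf8()])
        .collect()
}

/// Return the piece untouched when it contains only numeric characters,
/// and defer to the specialized segmenter otherwise.
fn segment_with_numeric_guard(s: &str) -> Vec<&str> {
    // `all` is true for an empty iterator, so empty input is excluded explicitly.
    if !s.is_empty() && s.chars().all(|c| c.is_numeric()) {
        vec![s]
    } else {
        specialized_segment(s)
    }
}
```

With this guard, a call such as `segment_with_numeric_guard("95532387")` keeps the number as a single piece, while non-numeric input still reaches the specialized segmenter.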

---- segmenter::japanese::test::segmenter_segment_str stdout ----
thread 'segmenter::japanese::test::segmenter_segment_str' panicked at charabia/src/segmenter/japanese.rs:144:5:
assertion `left == right` failed: 
Segmenter JapaneseSegmenter didn't segment the text as expected.

help: the `segmented` text provided to `test_segmenter!` does not corresponds to the output of the tested segmenter, it's probably due to a bug in the segmenter or a mistake in the provided segmented text.

  left: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1", "2", "3", "4", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
 right: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1234", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]

The number "1234" is getting split into individual digit strings. Could you help me figure out what further changes are needed? I tried a few things, but none of them worked.

Answer 9 · 2024-04-03T08:25:52.000Z

Hello @239yash,
Why don't you go for something simpler, like:

if s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation()) {

You may know that the separators are customizable in Charabia, meaning that they have already been processed before calling this part of the code.
Let's say the . is part of the separators, then the text 123.456 will be preprocessed as ["123", ".", "456"].

For the test case, it's expected that the digit characters will now be joined together.

Answer 10 · 2024-05-02T07:36:52.000Z

Hello @ManyTheFish,
I am trying my luck on this as my first issue, and it led me to 2 questions:

1. Are the floating point numbers currently supported?
   In the Latin segmenter tests, 32.3 is expected to become ["32", ".", "3"].
2. The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue?

It looks like I got to the same point as @239yash. I would be happy to collaborate if they are still working on this issue.

Answer 11 · 2024-05-02T14:02:21.000Z

Hello @42plamusse,

Are the floating point numbers currently supported? In the Latin segmenter tests, 32.3 is expected to become ["32", ".", "3"].

Yes, you're right, but the test uses the default separator set, including . which separates the lemmas linked by it. However, this set is customizable, and . could be removed from the list, meaning that it shouldn't be separated anymore, so it's possible to receive ["32.3"] but not by default.
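
To illustrate how the separator set changes the outcome, here is a minimal sketch. It does not use Charabia's API: `split_keeping_separators` and `is_number_like` are hypothetical helpers, and `is_ascii_punctuation` is assumed as the punctuation test.

```rust
/// Hypothetical helper: split `text` on the given separator characters,
/// keeping every separator as its own piece.
fn split_keeping_separators<'a>(text: &'a str, separators: &[char]) -> Vec<&'a str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if separators.contains(&c) {
            // Emit the text accumulated before the separator, if any.
            if start < i {
                pieces.push(&text[start..i]);
            }
            // Emit the separator itself.
            pieces.push(&text[i..i + c.len_utf8()]);
            start = i + c.len_utf8();
        }
    }
    // Emit whatever remains after the last separator.
    if start < text.len() {
        pieces.push(&text[start..]);
    }
    pieces
}

/// The check suggested earlier in the thread: only numeric or
/// (ASCII) punctuation characters.
fn is_number_like(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation())
}
```

In this sketch, `split_keeping_separators("32.3", &['.', ' '])` yields the three pieces "32", ".", "3", whereas dropping '.' from the separator list leaves "32.3" whole, and that piece then satisfies `is_number_like`.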

The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue?

Yes, you're right, the test macro should be updated so that the expected output of segmenter_segment_str is defined separately from the expected output of segment. 🤔

On the other hand, we could remove numbers from the tests of the specialized tokenizers.
