meilisearch/charabia

Numbers are not segmented the same way depending on the Script/Language

ManyTheFish opened this issue · 13 comments

Description

The string hello 95532387 is segmented as "hello", " ", "95532387" where the string สปริงเกียร์ร่วม 95532387 will be segmented as "สปร\u{e34}ง", "เก\u{e35}ยร\u{e4c}", "ร\u{e48}วม", " ", "9", "5", "5", "3", "2", "3", "8", "7".

Expected behavior

The number 95532387 should always be segmented as "95532387".

How to solve the issue

Before calling a specialized segmenter during the segmentation process, we should check if the interleaved string only contains numbers, if not we call the specialized segmenter else we return directly Some(s).

Related


Hey! 👋
Before starting any implementation, make sure that you read the CONTRIBUTING.md file.
In addition to the recurrent rules, you can find some guides to easily implement a Segmenter or a Normalizer.
Thanks a lot for your Contribution! 🤝

If i want to work on this, does it need to be assigned to me or just fork and start working?

Hey @Abastien1734,
We do not assign people to issues; you can start working straight away!
Happy to know you’re interested in the product. Thanks! 😁

muffpy commented

@Abastien1734 are you still working on this or can I take over? 🙂

@muffpy you can take over

@ManyTheFish Hi, I was checking for the implementation part in the code, which you pointed out in the description of this issue. I was thinking of changes, something like this in the mod.rs file-

                if s.chars().all(char::is_numeric) {
                    Some(s)
                } else {
                    self.current = self.segmenter.segment_str(s);
                    self.next() 
                }

Please help me, if I am going in the right direction.

Hello @239yash,
What you suggested could work.
However, it's not sufficient for floating points; you could have dots in the provided string, but it wouldn't match your case.
Could you make sure to add some tests in the codebase?

@ManyTheFish I will add the case for handling floating point numbers too, If the current approach doesn't work out. Let me add that, Will add the test cases also after implementation. Thanks!

@ManyTheFish

 if s.parse::<f64>().is_ok() {
                        Some(s)
                    } else {
                        self.current = self.segmenter.segment_str(s);
                        self.next()
                    }

I have tested for floating point numbers, but it's not working out as the string "1234.5" is already broken in two pieces when passed to the above code block i.e. - "1234" and "5". This logic is working fine for pure integer strings. Can you help me with this?

Should I raise a PR for this, so that you can have a look into this?

One more thing, I noticed while running test cases for the given below languages, the test segmenter_segment_str is also failing.

segmenter::japanese::test::segmenter_segment_str
segmenter::khmer::test::segmenter_segment_str
segmenter::thai::test::segmenter_segment_str

The error printed is -

---- segmenter::japanese::test::segmenter_segment_str stdout ----
thread 'segmenter::japanese::test::segmenter_segment_str' panicked at charabia/src/segmenter/japanese.rs:144:5:
assertion `left == right` failed: 
Segmenter JapaneseSegmenter didn't segment the text as expected.

help: the `segmented` text provided to `test_segmenter!` does not corresponds to the output of the tested segmenter, it's probably due to a bug in the segmenter or a mistake in the provided segmented text.

  left: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1", "2", "3", "4", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
 right: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1234", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]

The number "1234" is getting split into individual digit strings. Can you help me with what extra changes could be done for this? I tried a few things but failed in that.

Hello @239yash,
Why don't you go for something simpler, like:

if s.chars().all(|c| c.is_numeric() || c.is_ponctuation()) {

You may know that the separators are customizable in Charabia, meaning that they have already been processed before calling this part of the code.
Let's say the . is part of the separators, then the text 123.456 will be preprocessed as ["123", ".", "456"].

For the test case, it's expected that the digit characters will now be joined together.

Hello @ManyTheFish,
I trying my luck on this first issue and it led me to 2 questions:

  • Are the floating point numbers currently supported ?
    In the latin segmenter tests 32.3 is expected to become ["32", ".", "3"].
  • The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue ?

It looks like I got to the same point than @239yash, I will be happy to collaborate if they are still on this issue.

Hello @42plamusse,

Are the floating point numbers currently supported ? In the latin segmenter tests 32.3 is expected to become ["32", ".", "3"].

Yes, you're right, but the test uses the default separator set, including . which separates the lemmas linked by it. However, this set is customizable, and . could be removed from the list, meaning that it shouldn't be separated anymore, so it's possible to receive ["32.3"] but not by default.

The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue?

Yes you're right, the test macro should be updated to separate segmenter_segment_str expected output from segment. 🤔

In an another hand, we could remove numbers from the tests in the specialized tokenizer.