Characters are replaced from the original string

Question

Opened this issue 3 months ago · 4 comments

Hi team

We're using this to split text into sentences, but we found that some characters are replaced after splitting,

e.g. character codes 32 (space) and 160 (non-breaking space).
How can I keep the original characters? I need to compare the output against the original text.
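
To pin down which character changed, the original text can be compared with the output one code point at a time. This is a minimal sketch in plain JavaScript that does not depend on the library; `firstDifference` is a hypothetical helper name introduced here for illustration.

```javascript
// Hypothetical helper (not part of any library): find the first position
// where two strings differ and report the code point on each side.
function firstDifference(original, output) {
  const a = Array.from(original); // iterate by code point, not UTF-16 unit
  const b = Array.from(output);
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) {
      return { index: i, original: a[i].codePointAt(0), output: b[i].codePointAt(0) };
    }
  }
  if (a.length === b.length) return null; // strings are identical
  // One string is a prefix of the other: report the first extra character.
  return {
    index: len,
    original: a[len]?.codePointAt(0) ?? null,
    output: b[len]?.codePointAt(0) ?? null,
  };
}

// Example: a non-breaking space (160) that came back as a regular space (32).
console.log(firstDifference('a\u00A0b', 'a b'));
// → { index: 1, original: 160, output: 32 }
```

Passing the original text and the sentences joined back together would show the first index where they diverge, assuming nothing else (such as trimmed whitespace between sentences) shifts the positions first.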

Answer 1 · 2024-08-27T18:33:13.000Z

hey yarnping - sure, I'm happy to help. You're right, it should never miss characters after splitting sentences.
Can you create an example of it failing?
thanks

Answer 2 · 2024-08-29T10:44:17.000Z

thank you, here's the pic from Sublime Text

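<script>
    // Assumes the library is already loaded on the page as the global `nlp`.
    const text = "“I . . . maybe. I must say, the line between excellent career choice and critical life screwup is getting a bit blurry.”";
    const doc = nlp(text);
    // Split into sentences and get them back as an array of strings.
    const sentences = doc.sentences().out('array');
    // Print the original next to the first sentence to compare characters.
    console.log(text);
    console.log(sentences[0]);
</script>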

Answer 3 · 2024-09-04T15:13:42.000Z

hey yarnping, I think the unicode characters that were giving you trouble may be missing from your example text. This case works as expected for me:

nlp(`I . . . maybe. I must say, the line between `).debug()

maybe the GitHub UI cleaned them up somehow? let me know if I can help reproduce this problem
thanks
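
If a web form is stripping the unusual characters, one workaround is to escape them before posting so they survive copy and paste. A rough sketch in plain JavaScript, assuming the text can be run through a small helper first; `escapeNonAscii` is a hypothetical name, not a library function.

```javascript
// Hypothetical helper: replace every character outside printable ASCII
// (0x20 to 0x7E) with a visible \u{...} escape, leaving the rest untouched.
function escapeNonAscii(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    return cp >= 0x20 && cp <= 0x7e
      ? ch
      : `\\u{${cp.toString(16).toUpperCase()}}`;
  }).join('');
}

// Example: non-breaking spaces hidden between the dots become visible.
console.log(escapeNonAscii('I\u00A0.\u00A0.\u00A0. maybe'));
// → I\u{A0}.\u{A0}.\u{A0}. maybe
```

The `\u{...}` form is also valid inside a JavaScript string or template literal, so for text without backslashes or quote characters the escaped output can be pasted straight into a test case.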

Answer 4 · 2024-09-17T14:48:16.000Z

example.txt
sure, here's the example text