spencermountain/compromise

characters being replaced from original string

Hi team,

We're using this to split text into sentences, but we found that some characters have been replaced after splitting,

e.g., character codes 32 (space) and 160 (non-breaking space).
How can I keep the original characters? I need to do some comparison work against the original text.

hey yarnping - sure, I'm happy to help. You're right, it should never miss characters after splitting sentences.
Can you create an example of it failing?
thanks

thank you, here's the pic from Sublime Text
image

<script>
        const text = "“I . . . maybe. I must say, the line between excellent career choice and critical life screwup is getting a bit blurry.”";
        const doc = nlp(text);
        const sentences = doc.sentences().out('array');
        // log the original string next to the first split sentence
        console.log(text);
        console.log(sentences[0]);
</script>
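
To pin down exactly which character changed, one option is to compare the original string and the split sentence position by position and print the character codes. This is only a minimal sketch in plain JavaScript: `firstDifference` is a hypothetical helper, not part of compromise, and the `output` string below simply simulates a result in which a non-breaking space (160) has become a regular space (32).

```javascript
// Hypothetical helper (not part of compromise): find the first position
// where two strings differ and report the character code on each side.
function firstDifference(a, b) {
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) {
      return { index: i, original: a.charCodeAt(i), output: b.charCodeAt(i) };
    }
  }
  if (a.length === b.length) {
    return null; // the strings are identical
  }
  // One string is a prefix of the other: report the first extra character.
  return {
    index: len,
    original: a.length > len ? a.charCodeAt(len) : null,
    output: b.length > len ? b.charCodeAt(len) : null,
  };
}

// Simulated data: the original has a non-breaking space (U+00A0) after "I".
const original = 'I\u00A0. . . maybe.';
const output = 'I . . . maybe.';
console.log(firstDifference(original, output));
// → { index: 1, original: 160, output: 32 }
```

In the script above, `text` and `sentences[0]` could be passed in place of the simulated strings.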

hey yarnping, I think the unicode characters that were giving you trouble may be missing from your example text. This case works as expected for me:

nlp(`I . . . maybe. I must say, the line between `).debug()

maybe the GitHub UI cleaned them up somehow? let me know if I can help reproduce this problem
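
If pasting into a web form does strip the special characters, one way to keep them is to escape everything outside printable ASCII before posting. A rough sketch, assuming plain JavaScript; `escapeNonAscii` is a hypothetical helper, not a compromise API:

```javascript
// Hypothetical helper: replace every character outside printable ASCII
// (space through tilde) with a \uXXXX escape so it survives copy/paste.
// Works per UTF-16 code unit, so the result is a valid JS string literal body.
function escapeNonAscii(str) {
  return str.replace(/[^\x20-\x7E]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Simulated input containing a non-breaking space (U+00A0) after "I".
console.log(escapeNonAscii('I\u00A0. . . maybe.'));
// prints: I\u00A0. . . maybe.
```

The escaped text can then be pasted back inside a quoted string literal, and the original characters are restored when the code runs.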
thanks

example.txt
sure, here's the example text

image