Expose Span Information from FindTopNMostFrequentLangs
akihiroota87 commented
When calling FindTopNMostFrequentLangs(text,num_langs), it would be helpful to know the ranges of text that each result applies to. For example, if you had the string "Hello, my name is 三船 敏郎. It's a pleasure to meet you.", it would be helpful to know that English applies to indices 0-16 and 24-52, while Japanese applies to indices 17-23. I propose the following:
- Add a vector<pair<int,int>> to LangChunkStats that keeps track of the ranges of text the language applies to. The vector can be populated using script_span.offset and script_span.text_bytes.
- Add the vector to Result when populating the results vector.
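A minimal sketch of the first bullet, to make the proposal concrete. It does not use the library's real types: SpanSketch, LangChunkStatsSketch, AddChunk, and the byte_ranges member are hypothetical names, and only the offset and text_bytes fields mentioned above are modeled. It also assumes both fields are measured in bytes, as the name text_bytes suggests.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the script span: only the two fields named in
// this issue are modeled, and both are assumed to be byte counts.
struct SpanSketch {
  int offset;      // byte offset of the span within the input text
  int text_bytes;  // length of the span in bytes
};

// Hypothetical stand-in for LangChunkStats with the proposed member added.
// byte_sum and num_chunks are placeholder bookkeeping fields for this sketch.
struct LangChunkStatsSketch {
  int byte_sum = 0;
  int num_chunks = 0;
  // Proposed addition: half-open [begin, end) byte ranges for this language.
  std::vector<std::pair<int, int>> byte_ranges;
};

// Records one detected chunk for `lang`, appending its range to the stats.
void AddChunk(std::map<std::string, LangChunkStatsSketch>& stats,
              const std::string& lang, const SpanSketch& span) {
  assert(span.offset >= 0 && span.text_bytes >= 0);
  LangChunkStatsSketch& s = stats[lang];
  s.byte_sum += span.text_bytes;
  s.num_chunks += 1;
  s.byte_ranges.emplace_back(span.offset, span.offset + span.text_bytes);
}
```

One thing to keep in mind with this approach: the ranges would be byte ranges, not character indices. In UTF-8 each of the four kanji in the example string takes 3 bytes, so the byte ranges would be wider than the character ranges.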
These small changes would give the caller more detailed information about the language of each section of text when multiple languages are detected.
chrisosaurus commented
Issue #8 seems to also mention this.