Expose Span Information from FindTopNMostFrequentLangs

Question


Opened this issue 6 years ago · 1 comment

When calling FindTopNMostFrequentLangs(text, num_langs), it would be helpful to know which ranges of the text each result applies to. For example, given the string "Hello, my name is 三船敏郎. It's a pleasure to meet you.", it would be helpful to know that English applies to indices 0-16 and 24-52, while Japanese applies to indices 17-23. I propose the following:

Add a vector<pair<int,int>> field to LangChunkStats that keeps track of the ranges of text the language applies to. The vector can be populated from script_span.offset and script_span.text_bytes.
Add the vector to Result when populating the results vector.
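
To make the first step concrete, here is a minimal sketch of how the ranges could be collected. It does not use the library's real types: SpanSketch, LangChunkStatsSketch, and RecordSpan are hypothetical stand-ins that model only the two span fields named above (offset and text_bytes), and the merge step assumes spans are recorded in increasing offset order.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the script span; only the two fields named in
// the proposal are modelled here.
struct SpanSketch {
  int offset;      // byte offset of the span within the input text
  int text_bytes;  // length of the span in bytes
};

// Hypothetical stand-in for LangChunkStats with the proposed field added.
struct LangChunkStatsSketch {
  int num_bytes = 0;
  // Half-open byte ranges [begin, end) attributed to this language.
  std::vector<std::pair<int, int>> byte_ranges;
};

// Records that `span` was identified as `language`. A span that starts
// exactly where the previous range for the same language ended is merged
// into that range (assumes spans arrive in increasing offset order).
void RecordSpan(const std::string& language, const SpanSketch& span,
                std::unordered_map<std::string, LangChunkStatsSketch>* stats) {
  assert(span.offset >= 0 && span.text_bytes >= 0);
  LangChunkStatsSketch& entry = (*stats)[language];
  entry.num_bytes += span.text_bytes;

  const int begin = span.offset;
  const int end = span.offset + span.text_bytes;
  if (!entry.byte_ranges.empty() && entry.byte_ranges.back().second == begin) {
    entry.byte_ranges.back().second = end;  // extend the previous range
  } else {
    entry.byte_ranges.emplace_back(begin, end);
  }
}
```

Because offset and text_bytes count bytes, the ranges collected this way would be byte ranges rather than character indices. In UTF-8 the four characters 三船敏郎 occupy 12 bytes, so a caller who wants character positions would need to convert.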

These small changes would give the caller more detailed information about the language of each section of text when multiple languages are detected.

Answer 1 · 2019-04-04T00:40:45.000Z

Issue #8 seems to also mention this.