google/cld3

Expose Span Information from FindTopNMostFrequentLangs


When calling FindTopNMostFrequentLangs(text, num_langs), it would be helpful to know which ranges of the text each result applies to. For example, given the string "Hello, my name is 三船 敏郎. It's a pleasure to meet you.", the caller could learn that English applies to indices 0-16 and 24-52, while Japanese applies to indices 17-23. I propose the following:

  1. Add a vector<pair<int,int>> to LangChunkStats that tracks the ranges of text the language applies to. The vector can be populated from script_span.offset and script_span.text_bytes.
  2. Add the vector to Result when populating the results vector.
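
To make the two steps concrete, here is a minimal sketch of the bookkeeping I have in mind. It does not use the real CLD3 types: SpanInfo, LangRanges and CollectRanges are hypothetical stand-ins for the script span, the extended Result and the accumulation loop. It assumes offset and text_bytes are byte values, so the ranges it produces are byte ranges rather than character indices.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one script span and the language predicted for it.
struct SpanInfo {
  std::string language;  // language predicted for this span
  int offset;            // assumed: byte offset of the span in the input text
  int text_bytes;        // assumed: length of the span in bytes
};

// Hypothetical stand-in for a Result extended with range information.
struct LangRanges {
  std::string language;
  std::vector<std::pair<int, int>> byte_ranges;  // half-open [begin, end)
};

// Groups spans by language, keeping first-seen order, and records the byte
// range of each span. A span that starts exactly where the previous range of
// the same language ended is merged into that range.
std::vector<LangRanges> CollectRanges(const std::vector<SpanInfo>& spans) {
  std::vector<LangRanges> out;
  for (const SpanInfo& span : spans) {
    // Find the entry for this language, or create one.
    LangRanges* entry = nullptr;
    for (LangRanges& candidate : out) {
      if (candidate.language == span.language) {
        entry = &candidate;
        break;
      }
    }
    if (entry == nullptr) {
      out.push_back(LangRanges{span.language, {}});
      entry = &out.back();
    }

    const int begin = span.offset;
    const int end = span.offset + span.text_bytes;
    if (!entry->byte_ranges.empty() &&
        entry->byte_ranges.back().second == begin) {
      entry->byte_ranges.back().second = end;  // extend the adjacent range
    } else {
      entry->byte_ranges.emplace_back(begin, end);
    }
  }
  return out;
}
```

Half-open ranges keep the merge check to a single comparison. Because the ranges are in bytes, a caller who needs character indices would still have to convert them for multi-byte UTF-8 text such as the Japanese name in the example above.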

These small changes would let the caller see which language applies to each section of the text when multiple languages are detected.

Issue #8 seems to also mention this.