
Word/phrase level timestamp support possible?


Hi, I'd like to use word/phrase-level timestamps, as shown in the issues above. Is there any possibility of supporting them?


The recognition result should contain duration and offset attributes; see here: https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_recognition_result.rs#L19-L20

Right now these are defined as strings (I should probably change them to a proper type), but it should work. Did you try it? Does it return these attributes?

Yes, I can use the offset and duration of the entire utterance. However, I would like to get each word with its own offset and duration, as shown below.

    {
        "Id": "791d3f8a724846f69e9d9256947d2479",
        "RecognitionStatus": "Success",
        "Offset": 500000,
        "Duration": 13000000,
        "DisplayText": "What's the weather like?",
        "NBest": [
            {
                "Confidence": 0.97660327,
                "Lexical": "what's the weather like",
                "ITN": "what's the weather like",
                "MaskedITN": "what's the weather like",
                "Display": "What's the weather like?",
                "Words": [
                    {
                        "Word": "what's",
                        "Offset": 500000,
                        "Duration": 3900000
                    },
                    {
                        "Word": "the",
                        "Offset": 4500000,
                        "Duration": 1300000
                    },
                    {
                        "Word": "weather",
                        "Offset": 5900000,
                        "Duration": 2900000
                    },
                    {
                        "Word": "like",
                        "Offset": 8900000,
                        "Duration": 4600000
                    }
                ]
            }
        ]
    }
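
For reference, the Offset and Duration values in this response are expressed in 100-nanosecond ticks, so an Offset of 500000 means 50 ms into the audio. Here is a minimal Rust sketch of the conversion. `WordTiming` and `ticks_to_duration` are hypothetical names used only for illustration; they are not part of the crate.

```rust
use std::time::Duration;

/// Hypothetical holder for one entry of the "Words" array (not a crate type).
#[derive(Debug, Clone, PartialEq)]
struct WordTiming {
    word: String,
    offset_ticks: u64,
    duration_ticks: u64,
}

/// Converts 100-nanosecond ticks, as used by Offset and Duration, to a `Duration`.
fn ticks_to_duration(ticks: u64) -> Duration {
    Duration::from_nanos(ticks.saturating_mul(100))
}

impl WordTiming {
    /// Time at which the word starts, measured from the start of the audio.
    fn start(&self) -> Duration {
        ticks_to_duration(self.offset_ticks)
    }

    /// Time at which the word ends (offset plus duration).
    fn end(&self) -> Duration {
        ticks_to_duration(self.offset_ticks.saturating_add(self.duration_ticks))
    }
}
```

For the first word above (`offset_ticks: 500_000`, `duration_ticks: 3_900_000`), `start()` is 50 ms and `end()` is 440 ms.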

According to Azure-Samples/cognitive-services-speech-sdk#665, if I call RequestWordLevelTimestamps and set OutputFormat to Detailed, I can get the word level timestamp.

I called RequestWordLevelTimestamps and set OutputFormat to Detailed, but I could not get NBest.
So, to get NBest, we would need to add an NBest field here.


There is no need to extend the SpeechRecognitionResult struct in any way. Just do exactly what they advise in the above-mentioned issue 665, i.e.:

  1. Enable request_word_level_timestamps on your speech config object.
  2. Set the output format to OutputFormat::Detailed via set_get_output_format on your speech config object.
  3. In the recognized callback (the JSON is available only in the recognized callback, not in the recognizing callback!), read the event result property like this: event.result.properties.get_property(PropertyId::SpeechServiceResponseJsonResult, "N/A")
  4. That's it! You will get the very same JSON as described above. I just tested it and it works fine.

Let me know if you have any problems with it. I just used one of the provided examples and made it work with the tweaks mentioned above.
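
The JSON string returned in step 3 can then be post-processed to pull out the per-word timings. Below is a deliberately naive, standard-library-only sketch that scans for the `Word`, `Offset` and `Duration` keys inside the first `Words` array. `extract_words` and its helpers are hypothetical names, not crate APIs. The sketch assumes the key order shown earlier in this thread and no escaped quotes inside words; a real JSON parser such as serde_json would be a sturdier choice.

```rust
/// Returns the text after the first occurrence of `key` in `s`, if any.
fn after<'a>(s: &'a str, key: &str) -> Option<&'a str> {
    s.find(key).map(|i| &s[i + key.len()..])
}

/// Parses the unsigned integer following `"<name>":` and returns it
/// together with the text that comes after the digits.
fn number_field<'a>(s: &'a str, name: &str) -> Option<(u64, &'a str)> {
    let rest = after(s, &format!("\"{}\":", name))?.trim_start();
    let end = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    let value = rest[..end].parse::<u64>().ok()?;
    Some((value, &rest[end..]))
}

/// Naively extracts (word, offset_ticks, duration_ticks) triples from the
/// detailed-result JSON. Assumes Word, Offset, Duration appear in that order.
fn extract_words(json: &str) -> Vec<(String, u64, u64)> {
    let mut out = Vec::new();
    // Only look inside the first "Words" array.
    let mut rest = match after(json, "\"Words\":") {
        Some(r) => r,
        None => return out,
    };
    // Cut the slice at the closing bracket of that array.
    rest = &rest[..rest.find(']').unwrap_or(rest.len())];

    while let Some(r) = after(rest, "\"Word\":") {
        // The word itself is the next double-quoted string.
        let r = match after(r, "\"") {
            Some(r) => r,
            None => break,
        };
        let end = match r.find('"') {
            Some(e) => e,
            None => break,
        };
        let word = r[..end].to_string();
        let r = &r[end + 1..];
        let (offset, r) = match number_field(r, "Offset") {
            Some(v) => v,
            None => break,
        };
        let (duration, r) = match number_field(r, "Duration") {
            Some(v) => v,
            None => break,
        };
        out.push((word, offset, duration));
        rest = r;
    }
    out
}
```

Passing the string read from PropertyId::SpeechServiceResponseJsonResult to `extract_words` yields one tuple per recognized word, with offsets and durations still in the service's raw units.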

I tried the above method and got the desired result.
Thank you so much!

Glad to help; closing the issue now.