jabber-tools/cognitive-services-speech-sdk-rs

Word/phrase level timestamp support possible?

Closed this issue · 5 comments

Azure-Samples/cognitive-services-speech-sdk#665
Hi, I'd like to use word/phrase-level timestamps as shown in the issue above. Is there any possibility of supporting this?

hi,

recognition result should contain duration and offset attributes, see here: https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_recognition_result.rs#L19-L20

right now these are defined as strings (I should probably change them to a proper type), but it should work. Did you try it? Does it return these attributes?

Yes, I can use the offset and duration of the entire utterance. However, I would like to use each word and its offset and duration as shown below.

{
    "Id": "791d3f8a724846f69e9d9256947d2479",
    "RecognitionStatus": "Success",
    "Offset": 500000,
    "Duration": 13000000,
    "DisplayText": "What's the weather like?",
    "NBest": [
        {
            "Confidence": 0.97660327,
            "Lexical": "what's the weather like",
            "ITN": "what's the weather like",
            "MaskedITN": "what's the weather like",
            "Display": "What's the weather like?",
            "Words": [
                {
                    "Word": "what's",
                    "Offset": 500000,
                    "Duration": 3900000
                },
                {
                    "Word": "the",
                    "Offset": 4500000,
                    "Duration": 1300000
                },
                {
                    "Word": "weather",
                    "Offset": 5900000,
                    "Duration": 2900000
                },
                {
                    "Word": "like",
                    "Offset": 8900000,
                    "Duration": 4600000
                }
            ]
        },
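
For readers interpreting the numbers above: the Speech service reports Offset and Duration in 100-nanosecond ticks, so 500000 is 50 ms and 13000000 is 1.3 s. Below is a minimal sketch of converting those values into `std::time::Duration`. The `WordTiming` struct and the `ticks_to_duration` helper are hypothetical names made up for illustration; they are not part of cognitive-services-speech-sdk-rs, and the values are copied by hand from the JSON rather than parsed from it.

```rust
use std::time::Duration;

/// Hypothetical type mirroring one entry of the "Words" array above.
/// It is NOT part of cognitive-services-speech-sdk-rs.
#[derive(Debug, Clone, PartialEq)]
struct WordTiming {
    word: String,
    offset_ticks: u64,
    duration_ticks: u64,
}

/// Converts a tick count (one tick = 100 ns) into a `Duration`.
/// Assumes realistic audio lengths, so `ticks * 100` does not overflow u64.
fn ticks_to_duration(ticks: u64) -> Duration {
    Duration::from_nanos(ticks * 100)
}

impl WordTiming {
    /// Time at which the word starts, measured from the start of the audio.
    fn start(&self) -> Duration {
        ticks_to_duration(self.offset_ticks)
    }

    /// Time at which the word ends (offset plus duration).
    fn end(&self) -> Duration {
        ticks_to_duration(self.offset_ticks + self.duration_ticks)
    }
}

fn main() {
    // Values copied by hand from the "Words" array in the JSON above.
    let words = vec![
        WordTiming { word: "what's".into(), offset_ticks: 500_000, duration_ticks: 3_900_000 },
        WordTiming { word: "the".into(), offset_ticks: 4_500_000, duration_ticks: 1_300_000 },
        WordTiming { word: "weather".into(), offset_ticks: 5_900_000, duration_ticks: 2_900_000 },
        WordTiming { word: "like".into(), offset_ticks: 8_900_000, duration_ticks: 4_600_000 },
    ];

    // Print each word with its start and end time in seconds.
    for w in &words {
        println!(
            "{:<8} {:.2}s -> {:.2}s",
            w.word,
            w.start().as_secs_f64(),
            w.end().as_secs_f64()
        );
    }
}
```

In a real project the JSON would more likely be deserialized with a crate such as serde_json instead of being typed in by hand; that is left out here to keep the sketch dependency-free.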

According to Azure-Samples/cognitive-services-speech-sdk#665, if I call RequestWordLevelTimestamps and set OutputFormat to Detailed, I can get word-level timestamps.
https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_config.rs#L245-L250
https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_config.rs#L324-L335

I called RequestWordLevelTimestamps and set OutputFormat to Detailed, but I could not get NBest.
So it seems that to get NBest, an NBest field needs to be added here:
https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/blob/main/src/speech/speech_recognition_result.rs#L19-L20

hi

There is no need to enhance the SpeechRecognitionResult struct in any way. Just do exactly the same as they advise in the above-mentioned issue 665, i.e.:

  1. call request_word_level_timestamps on your speech config object
  2. call set_get_output_format with OutputFormat::Detailed on your speech config object
  3. in the recognized callback (the JSON is available only in the recognized callback, not in the recognizing callback!), read the event result property like this: event.result.properties.get_property(PropertyId::SpeechServiceResponseJsonResult, "N/A")
  4. That's it! You will get the very same JSON as described above. I just tested it and it works fine.

Let me know if you have any problems with it. I just used one of the provided examples, with the above-mentioned tweaks, to make this work.

I tried the above method and got the desired result.
Thank you so much!

glad to help, closing the issue now.