tazz4843/whisper-rs

max_tokens and split_on_word params don't work

Closed this issue · 10 comments

I'm trying to use set_max_tokens together with set_split_on_word to limit the number of words per sentence. But when I set split_on_word to true and max_tokens to anything greater than 0, the transcription finishes very quickly and produces gibberish, with only 2 sentences for a long audio file.

In the original whisper.cpp CLI program it works as expected, with a max length per line.

@thewh1teagle Hi. As far as I can see, in whisper.cpp split_on_word works with max_len, not with the max_tokens parameter. They also implicitly enable token_timestamps when max_len > 0:

wparams.token_timestamps = params.output_wts || params.output_jsn_full || params.max_len > 0;
wparams.max_len          = params.output_wts && params.max_len == 0 ? 60 : params.max_len;
wparams.split_on_word    = params.split_on_word;

Later on, split_on_word takes effect only when token_timestamps == true and max_len > 0:

https://github.com/ggerganov/whisper.cpp/blob/bf4cb4abad4e35c74b387df034cc4ac7b22e5fe6/whisper.cpp#L6224

So try enabling the token_timestamps and split_on_word flags and setting max_len to the desired maximum segment length in characters. Hope it helps.

@arizhih
Thanks!
I'm looking to split per word, so users can easily choose the max words per sentence. It's useful for creating video captions, where the width on screen is limited.
Splitting per character is harder and less accurate.
Is there a way to achieve this through word splitting?

I have another idea.
I can enable token timestamps and take as many words as I want. However, it may be less accurate and may split in the middle of a sentence. Does whisper.cpp split sentences more intelligently by default?

By default, whisper produces from 1 to N segments of varying length.

When you set token_timestamps and max_len, whisper splits large segments into multiple segments, each no longer than max_len. If you add split_on_word, each segment may be slightly longer (it extends to the end of the last word).

It doesn't affect how it produces sentences at all, just how it returns segments.
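
To make the description above concrete, here is a minimal sketch in plain Rust of the wrapping behaviour as this comment describes it. It is a simplified model, not whisper.cpp's actual implementation: the function name `wrap_segments` is hypothetical, lengths are counted in characters, and a token that starts with a space is assumed to begin a new word.

```rust
/// Hypothetical, simplified model of segment wrapping (not whisper.cpp code).
/// `tokens` are text pieces in order; a piece starting with ' ' begins a new word.
fn wrap_segments(tokens: &[&str], max_len: usize, split_on_word: bool) -> Vec<String> {
    let mut segments: Vec<String> = Vec::new();
    let mut current = String::new();

    for &tok in tokens {
        // Would appending this token push the segment past max_len characters?
        let would_exceed = current.chars().count() + tok.chars().count() > max_len;
        // Without split_on_word any token boundary is a valid break point;
        // with it, only a boundary in front of a new word is.
        let may_break = !split_on_word || tok.starts_with(' ');

        if !current.is_empty() && would_exceed && may_break {
            segments.push(std::mem::take(&mut current));
        }
        current.push_str(tok);
    }

    if !current.is_empty() {
        segments.push(current);
    }
    segments
}
```

With the tokens `" not"`, `" wh"`, `"ome"`, `"ver"`, `"."` and `max_len = 6`, this model cuts "whomever" in the middle when `split_on_word` is false and keeps it whole (at the cost of a longer segment) when it is true. In both cases only the segment boundaries change, not the recognized text.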

Thanks, so I understand that it's not the right way to produce max words per sentence.
I thought of a simpler way: get token timestamps from whisper, then build the sentences the way I want, with a max number of words per sentence.
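
The idea in this comment could be sketched roughly as follows. This is an illustration only: the `Word` and `Caption` types and the `group_into_captions` function are hypothetical names, the input is assumed to already be whole words with start/stop times in the same units as the transcript, and closing a caption early at `.`, `?` or `!` is an added assumption to avoid running across sentence ends.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Word {
    start: i64,
    stop: i64,
    text: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Caption {
    start: i64,
    stop: i64,
    text: String,
}

/// Builds one caption from a non-empty run of words.
fn flush(words: &[&Word]) -> Caption {
    Caption {
        start: words[0].start,
        stop: words[words.len() - 1].stop,
        text: words.iter().map(|w| w.text.trim()).collect::<Vec<_>>().join(" "),
    }
}

/// Groups words into captions of at most `max_words` words,
/// also closing a caption when a word ends a sentence.
fn group_into_captions(words: &[Word], max_words: usize) -> Vec<Caption> {
    let limit = max_words.max(1); // treat 0 as 1 so every caption holds a word
    let mut captions = Vec::new();
    let mut current: Vec<&Word> = Vec::new();

    for word in words {
        current.push(word);
        let ends_sentence = matches!(word.text.chars().last(), Some('.' | '?' | '!'));
        if current.len() >= limit || ends_sentence {
            captions.push(flush(&current));
            current.clear();
        }
    }
    if !current.is_empty() {
        captions.push(flush(&current));
    }
    captions
}
```

Each caption takes its start from its first word and its stop from its last word, so the timing still comes entirely from whisper's timestamps.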

However, when using token timestamps it produces incorrect tokens, or at least it looks incorrect, since it counts symbols as single tokens.

regular
[
    {
        "start": 0,
        "stop": 520,
        "text": " It's whoever, not whomever. That's whomever. No whomever is never actually right."
    },
    {
        "start": 520,
        "stop": 934,
        "text": " Well sometimes it's right. Michael is right. It's a made-up word used to trick"
    },
    {
        "start": 934,
        "stop": 1418,
        "text": " students. No actually whomever is the formal version of the word. Obviously"
    },
    {
        "start": 1418,
        "stop": 1792,
        "text": " it's a real word, but I don't know when to use it correctly. Not a native speaker."
    },
    {
        "start": 1792,
        "stop": 2200,
        "text": " I know what's right, but I'm not gonna say because you're all jerks who didn't"
    },
    {
        "start": 2200,
        "stop": 2540,
        "text": " come see my band last night. Do you really know which one is correct? I don't know."
    },
    {
        "start": 2540,
        "stop": 2942,
        "text": " It's whom when it's the object of the sentence and who when is the subject. That"
    },
    {
        "start": 2942,
        "stop": 4942,
        "text": " sounds right. Well it sounds right but is it? How did Ryan use it as an object? As an object. Ryan used me as an object. How did he use it again? It was Ryan wanted Michael the subject to explain the computer system, the object, to whomever, meaning us, the indirect object, which is the correct usage of the word."
    }
]
token timestamps
[
    {
        "start": 0,
        "stop": 14,
        "text": " It"
    },
    {
        "start": 14,
        "stop": 28,
        "text": "'s"
    },
    {
        "start": 28,
        "stop": 79,
        "text": " whoever"
    },
    {
        "start": 93,
        "stop": 93,
        "text": ","
    },
    {
        "start": 94,
        "stop": 115,
        "text": " not"
    },
    {
        "start": 122,
        "stop": 129,
        "text": " wh"
    },
    {
        "start": 129,
        "stop": 147,
        "text": "ome"
    },
    {
        "start": 152,
        "stop": 173,
        "text": "ver"
    },
    {
        "start": 173,
        "stop": 200,
        "text": "."
    },
    {
        "start": 200,
        "stop": 223,
        "text": " That"
    },
    {
        "start": 223,
        "stop": 233,
        "text": "'s"
    },
    {
        "start": 234,
        "stop": 245,
        "text": " wh"
    },
    {
        "start": 245,
        "stop": 262,
        "text": "ome"
    },
    {
        "start": 262,
        "stop": 278,
        "text": "ver"
    },
    {
        "start": 279,
        "stop": 298,
        "text": "."
    },
    {
        "start": 304,
        "stop": 313,
        "text": " No"
    },
    {
        "start": 313,
        "stop": 326,
        "text": " wh"
    },
    {
        "start": 326,
        "stop": 345,
        "text": "ome"
    },
    {
        "start": 345,
        "stop": 364,
        "text": "ver"
    },
    {
        "start": 364,
        "stop": 365,
        "text": " is"
    },
    {
        "start": 380,
        "stop": 410,
        "text": " never"
    },
    {
        "start": 410,
        "stop": 463,
        "text": " actually"
    },
    {
        "start": 463,
        "stop": 496,
        "text": " right"
    },
    {
        "start": 496,
        "stop": 520,
        "text": "."
    },
    {
        "start": 520,
        "stop": 544,
        "text": " Well"
    },
    {
        "start": 544,
        "stop": 597,
        "text": " sometimes"
    },
    {
        "start": 597,
        "stop": 609,
        "text": " it"
    },
    {
        "start": 609,
        "stop": 615,
        "text": "'s"
    },
    {
        "start": 623,
        "stop": 649,
        "text": " right"
    },
    {
        "start": 649,
        "stop": 658,
        "text": "."
    },
    {
        "start": 667,
        "stop": 706,
        "text": " Michael"
    },
    {
        "start": 707,
        "stop": 718,
        "text": " is"
    },
    {
        "start": 718,
        "stop": 741,
        "text": " right"
    },
    {
        "start": 752,
        "stop": 765,
        "text": "."
    },
    {
        "start": 765,
        "stop": 777,
        "text": " It"
    },
    {
        "start": 777,
        "stop": 788,
        "text": "'s"
    },
    {
        "start": 788,
        "stop": 794,
        "text": " a"
    },
    {
        "start": 794,
        "stop": 818,
        "text": " made"
    },
    {
        "start": 818,
        "stop": 819,
        "text": "-"
    },
    {
        "start": 831,
        "stop": 834,
        "text": "up"
    },
    {
        "start": 834,
        "stop": 855,
        "text": " word"
    },
    {
        "start": 858,
        "stop": 879,
        "text": " used"
    },
    {
        "start": 886,
        "stop": 894,
        "text": " to"
    },
    {
        "start": 894,
        "stop": 931,
        "text": " trick"
    },
    {
        "start": 936,
        "stop": 990,
        "text": " students"
    },
    {
        "start": 990,
        "stop": 1008,
        "text": "."
    },
    {
        "start": 1010,
        "stop": 1012,
        "text": " No"
    },
    {
        "start": 1037,
        "stop": 1079,
        "text": " actually"
    },
    {
        "start": 1095,
        "stop": 1095,
        "text": " wh"
    },
    {
        "start": 1095,
        "stop": 1116,
        "text": "ome"
    },
    {
        "start": 1132,
        "stop": 1137,
        "text": "ver"
    },
    {
        "start": 1137,
        "stop": 1151,
        "text": " is"
    },
    {
        "start": 1151,
        "stop": 1172,
        "text": " the"
    },
    {
        "start": 1172,
        "stop": 1214,
        "text": " formal"
    },
    {
        "start": 1214,
        "stop": 1263,
        "text": " version"
    },
    {
        "start": 1263,
        "stop": 1277,
        "text": " of"
    },
    {
        "start": 1277,
        "stop": 1298,
        "text": " the"
    },
    {
        "start": 1298,
        "stop": 1326,
        "text": " word"
    },
    {
        "start": 1326,
        "stop": 1347,
        "text": "."
    },
    {
        "start": 1347,
        "stop": 1417,
        "text": " Obviously"
    },
    {
        "start": 1418,
        "stop": 1428,
        "text": " it"
    },
    {
        "start": 1428,
        "stop": 1435,
        "text": "'s"
    },
    {
        "start": 1440,
        "stop": 1443,
        "text": " a"
    },
    {
        "start": 1443,
        "stop": 1464,
        "text": " real"
    },
    {
        "start": 1464,
        "stop": 1485,
        "text": " word"
    },
    {
        "start": 1485,
        "stop": 1494,
        "text": ","
    },
    {
        "start": 1494,
        "stop": 1505,
        "text": " but"
    },
    {
        "start": 1509,
        "stop": 1512,
        "text": " I"
    },
    {
        "start": 1522,
        "stop": 1530,
        "text": " don"
    },
    {
        "start": 1530,
        "stop": 1538,
        "text": "'t"
    },
    {
        "start": 1547,
        "stop": 1561,
        "text": " know"
    },
    {
        "start": 1561,
        "stop": 1582,
        "text": " when"
    },
    {
        "start": 1582,
        "stop": 1592,
        "text": " to"
    },
    {
        "start": 1592,
        "stop": 1607,
        "text": " use"
    },
    {
        "start": 1607,
        "stop": 1617,
        "text": " it"
    },
    {
        "start": 1617,
        "stop": 1664,
        "text": " correctly"
    },
    {
        "start": 1664,
        "stop": 1678,
        "text": "."
    },
    {
        "start": 1678,
        "stop": 1694,
        "text": " Not"
    },
    {
        "start": 1694,
        "stop": 1698,
        "text": " a"
    },
    {
        "start": 1699,
        "stop": 1730,
        "text": " native"
    },
    {
        "start": 1730,
        "stop": 1761,
        "text": " speaker"
    },
    {
        "start": 1767,
        "stop": 1792,
        "text": "."
    },
    {
        "start": 1792,
        "stop": 1798,
        "text": " I"
    },
    {
        "start": 1800,
        "stop": 1823,
        "text": " know"
    },
    {
        "start": 1823,
        "stop": 1848,
        "text": " what"
    },
    {
        "start": 1848,
        "stop": 1860,
        "text": "'s"
    },
    {
        "start": 1860,
        "stop": 1881,
        "text": " right"
    },
    {
        "start": 1889,
        "stop": 1903,
        "text": ","
    },
    {
        "start": 1904,
        "stop": 1910,
        "text": " but"
    },
    {
        "start": 1923,
        "stop": 1927,
        "text": " I"
    },
    {
        "start": 1927,
        "stop": 1939,
        "text": "'m"
    },
    {
        "start": 1939,
        "stop": 1957,
        "text": " not"
    },
    {
        "start": 1957,
        "stop": 1988,
        "text": " gonna"
    },
    {
        "start": 1988,
        "stop": 2005,
        "text": " say"
    },
    {
        "start": 2005,
        "stop": 2023,
        "text": " because"
    },
    {
        "start": 2050,
        "stop": 2067,
        "text": " you"
    },
    {
        "start": 2067,
        "stop": 2085,
        "text": "'re"
    },
    {
        "start": 2085,
        "stop": 2103,
        "text": " all"
    },
    {
        "start": 2103,
        "stop": 2120,
        "text": " jer"
    },
    {
        "start": 2125,
        "stop": 2133,
        "text": "ks"
    },
    {
        "start": 2133,
        "stop": 2148,
        "text": " who"
    },
    {
        "start": 2157,
        "stop": 2175,
        "text": " didn"
    },
    {
        "start": 2177,
        "stop": 2199,
        "text": "'t"
    },
    {
        "start": 2206,
        "stop": 2218,
        "text": " come"
    },
    {
        "start": 2218,
        "stop": 2231,
        "text": " see"
    },
    {
        "start": 2231,
        "stop": 2240,
        "text": " my"
    },
    {
        "start": 2240,
        "stop": 2258,
        "text": " band"
    },
    {
        "start": 2258,
        "stop": 2276,
        "text": " last"
    },
    {
        "start": 2276,
        "stop": 2293,
        "text": " night"
    },
    {
        "start": 2301,
        "stop": 2312,
        "text": "."
    },
    {
        "start": 2312,
        "stop": 2321,
        "text": " Do"
    },
    {
        "start": 2321,
        "stop": 2334,
        "text": " you"
    },
    {
        "start": 2334,
        "stop": 2361,
        "text": " really"
    },
    {
        "start": 2361,
        "stop": 2379,
        "text": " know"
    },
    {
        "start": 2379,
        "stop": 2402,
        "text": " which"
    },
    {
        "start": 2402,
        "stop": 2411,
        "text": " one"
    },
    {
        "start": 2417,
        "stop": 2424,
        "text": " is"
    },
    {
        "start": 2424,
        "stop": 2456,
        "text": " correct"
    },
    {
        "start": 2456,
        "stop": 2457,
        "text": "?"
    },
    {
        "start": 2471,
        "stop": 2473,
        "text": " I"
    },
    {
        "start": 2473,
        "stop": 2486,
        "text": " don"
    },
    {
        "start": 2486,
        "stop": 2504,
        "text": "'t"
    },
    {
        "start": 2504,
        "stop": 2507,
        "text": " know"
    },
    {
        "start": 2524,
        "stop": 2540,
        "text": "."
    },
    {
        "start": 2540,
        "stop": 2551,
        "text": " It"
    },
    {
        "start": 2551,
        "stop": 2561,
        "text": "'s"
    },
    {
        "start": 2574,
        "stop": 2584,
        "text": " whom"
    },
    {
        "start": 2591,
        "stop": 2608,
        "text": " when"
    },
    {
        "start": 2608,
        "stop": 2619,
        "text": " it"
    },
    {
        "start": 2619,
        "stop": 2630,
        "text": "'s"
    },
    {
        "start": 2630,
        "stop": 2647,
        "text": " the"
    },
    {
        "start": 2647,
        "stop": 2682,
        "text": " object"
    },
    {
        "start": 2682,
        "stop": 2693,
        "text": " of"
    },
    {
        "start": 2693,
        "stop": 2710,
        "text": " the"
    },
    {
        "start": 2710,
        "stop": 2756,
        "text": " sentence"
    },
    {
        "start": 2756,
        "stop": 2773,
        "text": " and"
    },
    {
        "start": 2773,
        "stop": 2790,
        "text": " who"
    },
    {
        "start": 2790,
        "stop": 2813,
        "text": " when"
    },
    {
        "start": 2813,
        "stop": 2824,
        "text": " is"
    },
    {
        "start": 2824,
        "stop": 2841,
        "text": " the"
    },
    {
        "start": 2841,
        "stop": 2879,
        "text": " subject"
    },
    {
        "start": 2881,
        "stop": 2905,
        "text": "."
    },
    {
        "start": 2917,
        "stop": 2942,
        "text": " That"
    },
    {
        "start": 2942,
        "stop": 2969,
        "text": " sounds"
    },
    {
        "start": 2969,
        "stop": 2992,
        "text": " right"
    },
    {
        "start": 2997,
        "stop": 3005,
        "text": "."
    },
    {
        "start": 3005,
        "stop": 3016,
        "text": " Well"
    },
    {
        "start": 3026,
        "stop": 3032,
        "text": " it"
    },
    {
        "start": 3032,
        "stop": 3059,
        "text": " sounds"
    },
    {
        "start": 3059,
        "stop": 3082,
        "text": " right"
    },
    {
        "start": 3082,
        "stop": 3095,
        "text": " but"
    },
    {
        "start": 3095,
        "stop": 3103,
        "text": " is"
    },
    {
        "start": 3104,
        "stop": 3113,
        "text": " it"
    },
    {
        "start": 3113,
        "stop": 3126,
        "text": "?"
    },
    {
        "start": 3126,
        "stop": 3139,
        "text": " How"
    },
    {
        "start": 3139,
        "stop": 3152,
        "text": " did"
    },
    {
        "start": 3152,
        "stop": 3170,
        "text": " Ryan"
    },
    {
        "start": 3170,
        "stop": 3183,
        "text": " use"
    },
    {
        "start": 3183,
        "stop": 3192,
        "text": " it"
    },
    {
        "start": 3192,
        "stop": 3201,
        "text": " as"
    },
    {
        "start": 3201,
        "stop": 3210,
        "text": " an"
    },
    {
        "start": 3210,
        "stop": 3237,
        "text": " object"
    },
    {
        "start": 3237,
        "stop": 3250,
        "text": "?"
    },
    {
        "start": 3250,
        "stop": 3256,
        "text": " As"
    },
    {
        "start": 3260,
        "stop": 3268,
        "text": " an"
    },
    {
        "start": 3268,
        "stop": 3295,
        "text": " object"
    },
    {
        "start": 3295,
        "stop": 3320,
        "text": "."
    },
    {
        "start": 3335,
        "stop": 3358,
        "text": " Ryan"
    },
    {
        "start": 3358,
        "stop": 3392,
        "text": " used"
    },
    {
        "start": 3392,
        "stop": 3409,
        "text": " me"
    },
    {
        "start": 3409,
        "stop": 3426,
        "text": " as"
    },
    {
        "start": 3426,
        "stop": 3442,
        "text": " an"
    },
    {
        "start": 3442,
        "stop": 3464,
        "text": " object"
    },
    {
        "start": 3503,
        "stop": 3521,
        "text": "."
    },
    {
        "start": 3521,
        "stop": 3547,
        "text": " How"
    },
    {
        "start": 3547,
        "stop": 3566,
        "text": " did"
    },
    {
        "start": 3573,
        "stop": 3587,
        "text": " he"
    },
    {
        "start": 3598,
        "stop": 3614,
        "text": " use"
    },
    {
        "start": 3627,
        "stop": 3633,
        "text": " it"
    },
    {
        "start": 3633,
        "stop": 3675,
        "text": " again"
    },
    {
        "start": 3676,
        "stop": 3708,
        "text": "?"
    },
    {
        "start": 3708,
        "stop": 3729,
        "text": " It"
    },
    {
        "start": 3730,
        "stop": 3763,
        "text": " was"
    },
    {
        "start": 3763,
        "stop": 3808,
        "text": " Ryan"
    },
    {
        "start": 3808,
        "stop": 3836,
        "text": " wanted"
    },
    {
        "start": 3840,
        "stop": 3878,
        "text": " Michael"
    },
    {
        "start": 3878,
        "stop": 3896,
        "text": " the"
    },
    {
        "start": 3896,
        "stop": 3952,
        "text": " subject"
    },
    {
        "start": 3964,
        "stop": 3976,
        "text": " to"
    },
    {
        "start": 3976,
        "stop": 4036,
        "text": " explain"
    },
    {
        "start": 4036,
        "stop": 4045,
        "text": " the"
    },
    {
        "start": 4051,
        "stop": 4085,
        "text": " computer"
    },
    {
        "start": 4085,
        "stop": 4105,
        "text": " system"
    },
    {
        "start": 4112,
        "stop": 4121,
        "text": ","
    },
    {
        "start": 4121,
        "stop": 4136,
        "text": " the"
    },
    {
        "start": 4136,
        "stop": 4182,
        "text": " object"
    },
    {
        "start": 4188,
        "stop": 4202,
        "text": ","
    },
    {
        "start": 4214,
        "stop": 4218,
        "text": " to"
    },
    {
        "start": 4218,
        "stop": 4231,
        "text": " wh"
    },
    {
        "start": 4241,
        "stop": 4259,
        "text": "ome"
    },
    {
        "start": 4259,
        "stop": 4281,
        "text": "ver"
    },
    {
        "start": 4289,
        "stop": 4300,
        "text": ","
    },
    {
        "start": 4300,
        "stop": 4359,
        "text": " meaning"
    },
    {
        "start": 4359,
        "stop": 4375,
        "text": " us"
    },
    {
        "start": 4375,
        "stop": 4391,
        "text": ","
    },
    {
        "start": 4391,
        "stop": 4424,
        "text": " the"
    },
    {
        "start": 4424,
        "stop": 4503,
        "text": " indirect"
    },
    {
        "start": 4506,
        "stop": 4568,
        "text": " object"
    },
    {
        "start": 4568,
        "stop": 4584,
        "text": ","
    },
    {
        "start": 4591,
        "stop": 4636,
        "text": " which"
    },
    {
        "start": 4641,
        "stop": 4659,
        "text": " is"
    },
    {
        "start": 4659,
        "stop": 4690,
        "text": " the"
    },
    {
        "start": 4690,
        "stop": 4755,
        "text": " correct"
    },
    {
        "start": 4755,
        "stop": 4755,
        "text": " usage"
    },
    {
        "start": 4755,
        "stop": 4755,
        "text": " of"
    },
    {
        "start": 4755,
        "stop": 4755,
        "text": " the"
    },
    {
        "start": 4755,
        "stop": 4755,
        "text": " word"
    },
    {
        "start": 4755,
        "stop": 4755,
        "text": "."
    }
]

Created with Vibe app.

or at least it looks incorrect, since it counts symbols as single tokens.

That's because a token is not a word. Whisper has a vocabulary of roughly 52,000 tokens, and every word is built from these tokens.
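
Since words are assembled from tokens, the token-level output can be folded back into words on the caller's side. The sketch below is one possible approach, not something whisper.cpp or whisper-rs provides: the `Token` type and `merge_tokens_into_words` function are hypothetical, and it assumes (as the transcript above suggests) that a token beginning with a space starts a new word, while any other token, including punctuation, continues the previous one.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Token {
    start: i64,
    stop: i64,
    text: String,
}

/// Merges sub-word tokens into word-level entries.
/// Assumption: a token whose text starts with ' ' opens a new word.
fn merge_tokens_into_words(tokens: &[Token]) -> Vec<Token> {
    let mut words: Vec<Token> = Vec::new();

    for tok in tokens {
        if !tok.text.starts_with(' ') {
            // Continuation piece ("ome", "ver", "'s", "."): extend the last word.
            if let Some(last) = words.last_mut() {
                last.text.push_str(&tok.text);
                last.stop = tok.stop;
                continue;
            }
        }
        // Either a new word, or a continuation with nothing before it.
        words.push(tok.clone());
    }
    words
}
```

Applied to the tokens `" wh"`, `"ome"`, `"ver"`, `"."` from the transcript, this yields a single entry `" whomever."` spanning from the start of the first piece to the stop of the last.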

Maybe if you set max_len to 1 and enable the split_on_word option, it will produce one segment for each word.

Same result with:

  params.set_token_timestamps(true);
  params.set_split_on_word(true);
  params.set_max_len(1);
transcript.json
[
    {
        "start": 0,
        "stop": 14,
        "text": " It"
    },
    {
        "start": 14,
        "stop": 28,
        "text": "'s"
    },
    {
        "start": 28,
        "stop": 79,
        "text": " whoever"
    },
    {
        "start": 93,
        "stop": 93,
        "text": ","
    },
    {
        "start": 94,
        "stop": 115,
        "text": " not"
    },
    {
        "start": 122,
        "stop": 129,
        "text": " wh"
    },
    {
        "start": 129,
        "stop": 147,
        "text": "ome"
    },
    {
        "start": 152,
        "stop": 173,
        "text": "ver"
    },
    {
        "start": 173,
        "stop": 200,
        "text": "."
    },
    {
        "start": 200,
        "stop": 223,
        "text": " That"
    },
    {
        "start": 223,
        "stop": 233,
        "text": "'s"
    },
    {
        "start": 234,
        "stop": 245,
        "text": " wh"
    },
    {
        "start": 245,
        "stop": 262,
        "text": "ome"
    },
    {
        "start": 262,
        "stop": 278,
        "text": "ver"
    },
    {
        "start": 279,
        "stop": 298,
        "text": "."
    },
    {
        "start": 304,
        "stop": 313,
        "text": " No"
    },
    {
        "start": 313,
        "stop": 326,
        "text": " wh"
    },
    {
        "start": 326,
        "stop": 345,
        "text": "ome"
    },
    {
        "start": 345,
        "stop": 364,
        "text": "ver"
    },
    {
        "start": 364,
        "stop": 365,
        "text": " is"
    },
    {
        "start": 380,
        "stop": 410,
        "text": " never"
    },
    {
        "start": 410,
        "stop": 463,
        "text": " actually"
    },
    {
        "start": 463,
        "stop": 496,
        "text": " right"
    },
    {
        "start": 496,
        "stop": 520,
        "text": "."
    },
    {
        "start": 520,
        "stop": 544,
        "text": " Well"
    },
    {
        "start": 544,
        "stop": 597,
        "text": " sometimes"
    },
    {
        "start": 597,
        "stop": 609,
        "text": " it"
    },
    {
        "start": 609,
        "stop": 615,
        "text": "'s"
    },
    {
        "start": 623,
        "stop": 649,
        "text": " right"
    },
    {
        "start": 649,
        "stop": 658,
        "text": "."
    },
    {
        "start": 667,
        "stop": 706,
        "text": " Michael"
    },
    {
        "start": 707,
        "stop": 718,
        "text": " is"
    },
    {
        "start": 718,
        "stop": 741,
        "text": " right"
    },
    {
        "start": 752,
        "stop": 765,
        "text": "."
    },
    {
        "start": 765,
        "stop": 777,
        "text": " It"
    },
    {
        "start": 777,
        "stop": 788,
        "text": "'s"
    },
    {
        "start": 788,
        "stop": 794,
        "text": " a"
    },
    {
        "start": 794,
        "stop": 818,
        "text": " made"
    },
    {
        "start": 818,
        "stop": 819,
        "text": "-"
    },
    {
        "start": 831,
        "stop": 834,
        "text": "up"
    },
    {
        "start": 834,
        "stop": 855,
        "text": " word"
    },
    {
        "start": 858,
        "stop": 879,
        "text": " used"
    },
    {
        "start": 886,
        "stop": 894,
        "text": " to"
    },
    {
        "start": 894,
        "stop": 931,
        "text": " trick"
    },
    {
        "start": 936,
        "stop": 990,
        "text": " students"
    },
    {
        "start": 990,
        "stop": 1008,
        "text": "."
    },
    {
        "start": 1010,
        "stop": 1012,
        "text": " No"
    },
    {
        "start": 1037,
        "stop": 1079,
        "text": " actually"
    },
    {
        "start": 1095,
        "stop": 1095,
        "text": " wh"
    },
    {
        "start": 1095,
        "stop": 1116,
        "text": "ome"
    },
    {
        "start": 1132,
        "stop": 1137,
        "text": "ver"
    },
    {
        "start": 1137,
        "stop": 1151,
        "text": " is"
    },
    {
        "start": 1151,
        "stop": 1172,
        "text": " the"
    },
    {
        "start": 1172,
        "stop": 1214,
        "text": " formal"
    },
    {
        "start": 1214,
        "stop": 1263,
        "text": " version"
    },
    {
        "start": 1263,
        "stop": 1277,
        "text": " of"
    },
    {
        "start": 1277,
        "stop": 1298,
        "text": " the"
    },
    {
        "start": 1298,
        "stop": 1326,
        "text": " word"
    },
    {
        "start": 1326,
        "stop": 1347,
        "text": "."
    },
    {
        "start": 1347,
        "stop": 1417,
        "text": " Obviously"
    },
    {
        "start": 1418,
        "stop": 1428,
        "text": " it"
    },
    {
        "start": 1428,
        "stop": 1435,
        "text": "'s"
    },
    {
        "start": 1440,
        "stop": 1443,
        "text": " a"
    },
    {
        "start": 1443,
        "stop": 1464,
        "text": " real"
    },
    {
        "start": 1464,
        "stop": 1485,
        "text": " word"
    },
    {
        "start": 1485,
        "stop": 1494,
        "text": ","
    },
    {
        "start": 1494,
        "stop": 1505,
        "text": " but"
    },
    {
        "start": 1509,
        "stop": 1512,
        "text": " I"
    },
    {
        "start": 1522,
        "stop": 1530,
        "text": " don"
    },
    {
        "start": 1530,
        "stop": 1538,
        "text": "'t"
    },
    {
        "start": 1547,
        "stop": 1561,
        "text": " know"
    },
    {
        "start": 1561,
        "stop": 1582,
        "text": " when"
    },
    {
        "start": 1582,
        "stop": 1592,
        "text": " to"
    },
    {
        "start": 1592,
        "stop": 1607,
        "text": " use"
    },
    {
        "start": 1607,
        "stop": 1617,
        "text": " it"
    },
    {
        "start": 1617,
        "stop": 1664,
        "text": " correctly"
    },
    {
        "start": 1664,
        "stop": 1678,
        "text": "."
    },
    {
        "start": 1678,
        "stop": 1694,
        "text": " Not"
    },
    {
        "start": 1694,
        "stop": 1698,
        "text": " a"
    },
    {
        "start": 1699,
        "stop": 1730,
        "text": " native"
    },
    {
        "start": 1730,
        "stop": 1761,
        "text": " speaker"
    },
    {
        "start": 1767,
        "stop": 1792,
        "text": "."
    },
    {
        "start": 1792,
        "stop": 1798,
        "text": " I"
    },
    {
        "start": 1800,
        "stop": 1823,
        "text": " know"
    },
    {
        "start": 1823,
        "stop": 1848,
        "text": " what"
    },
    {
        "start": 1848,
        "stop": 1860,
        "text": "'s"
    },
    {
        "start": 1860,
        "stop": 1881,
        "text": " right"
    },
    {
        "start": 1889,
        "stop": 1903,
        "text": ","
    },
    {
        "start": 1904,
        "stop": 1910,
        "text": " but"
    },
    {
        "start": 1923,
        "stop": 1927,
        "text": " I"
    },
    {
        "start": 1927,
        "stop": 1939,
        "text": "'m"
    },
    {
        "start": 1939,
        "stop": 1957,
        "text": " not"
    },
    {
        "start": 1957,
        "stop": 1988,
        "text": " gonna"
    },
    {
        "start": 1988,
        "stop": 2005,
        "text": " say"
    },
    {
        "start": 2005,
        "stop": 2023,
        "text": " because"
    },
    {
        "start": 2050,
        "stop": 2067,
        "text": " you"
    },
    {
        "start": 2067,
        "stop": 2085,
        "text": "'re"
    },
    {
        "start": 2085,
        "stop": 2103,
        "text": " all"
    },
    {
        "start": 2103,
        "stop": 2120,
        "text": " jer"
    },
    {
        "start": 2125,
        "stop": 2133,
        "text": "ks"
    },
    {
        "start": 2133,
        "stop": 2148,
        "text": " who"
    },
    {
        "start": 2157,
        "stop": 2175,
        "text": " didn"
    },
    {
        "start": 2177,
        "stop": 2199,
        "text": "'t"
    },
    {
        "start": 2206,
        "stop": 2218,
        "text": " come"
    },
    {
        "start": 2218,
        "stop": 2231,
        "text": " see"
    },
    {
        "start": 2231,
        "stop": 2240,
        "text": " my"
    },
    {
        "start": 2240,
        "stop": 2258,
        "text": " band"
    },
    {
        "start": 2258,
        "stop": 2276,
        "text": " last"
    },
    {
        "start": 2276,
        "stop": 2293,
        "text": " night"
    },
    {
        "start": 2301,
        "stop": 2312,
        "text": "."
    },
    {
        "start": 2312,
        "stop": 2321,
        "text": " Do"
    },
    {
        "start": 2321,
        "stop": 2334,
        "text": " you"
    },
    {
        "start": 2334,
        "stop": 2361,
        "text": " really"
    },
    {
        "start": 2361,
        "stop": 2379,
        "text": " know"
    },
    {
        "start": 2379,
        "stop": 2402,
        "text": " which"
    },
    {
        "start": 2402,
        "stop": 2411,
        "text": " one"
    },
    {
        "start": 2417,
        "stop": 2424,
        "text": " is"
    },
    {
        "start": 2424,
        "stop": 2456,
        "text": " correct"
    },
    {
        "start": 2456,
        "stop": 2457,
        "text": "?"
    },
    {
        "start": 2471,
        "stop": 2473,
        "text": " I"
    },
    {
        "start": 2473,
        "stop": 2486,
        "text": " don"
    },
    {
        "start": 2486,
        "stop": 2504,
        "text": "'t"
    },
    {
        "start": 2504,
        "stop": 2507,
        "text": " know"
    },
    {
        "start": 2524,
        "stop": 2540,
        "text": "."
    },
    {
        "start": 2540,
        "stop": 2551,
        "text": " It"
    },
    {
        "start": 2551,
        "stop": 2561,
        "text": "'s"
    },
    {
        "start": 2574,
        "stop": 2584,
        "text": " whom"
    },
    {
        "start": 2591,
        "stop": 2608,
        "text": " when"
    },
    {
        "start": 2608,
        "stop": 2619,
        "text": " it"
    },
    {
        "start": 2619,
        "stop": 2630,
        "text": "'s"
    },
    {
        "start": 2630,
        "stop": 2647,
        "text": " the"
    },
    {
        "start": 2647,
        "stop": 2682,
        "text": " object"
    },
    {
        "start": 2682,
        "stop": 2693,
        "text": " of"
    },
    {
        "start": 2693,
        "stop": 2710,
        "text": " the"
    },
    {
        "start": 2710,
        "stop": 2756,
        "text": " sentence"
    },
    {
        "start": 2756,
        "stop": 2773,
        "text": " and"
    },
    {
        "start": 2773,
        "stop": 2790,
        "text": " who"
    },
    {
        "start": 2790,
        "stop": 2813,
        "text": " when"
    },
    {
        "start": 2813,
        "stop": 2824,
        "text": " is"
    },
    {
        "start": 2824,
        "stop": 2841,
        "text": " the"
    },
    {
        "start": 2841,
        "stop": 2879,
        "text": " subject"
    },
    {
        "start": 2881,
        "stop": 2905,
        "text": "."
    },
    {
        "start": 2917,
        "stop": 2942,
        "text": " That"
    },
    {
        "start": 2942,
        "stop": 2964,
        "text": " That"
    },
    {
        "start": 2964,
        "stop": 2993,
        "text": " sounds"
    },
    {
        "start": 2997,
        "stop": 3023,
        "text": " right"
    },
    {
        "start": 3026,
        "stop": 3042,
        "text": "."
    },
    {
        "start": 3042,
        "stop": 3047,
        "text": " Well"
    },
    {
        "start": 3052,
        "stop": 3057,
        "text": ","
    },
    {
        "start": 3057,
        "stop": 3062,
        "text": " it"
    },
    {
        "start": 3062,
        "stop": 3076,
        "text": " sounds"
    },
    {
        "start": 3077,
        "stop": 3089,
        "text": " right"
    },
    {
        "start": 3089,
        "stop": 3094,
        "text": ","
    },
    {
        "start": 3094,
        "stop": 3101,
        "text": " but"
    },
    {
        "start": 3101,
        "stop": 3106,
        "text": " is"
    },
    {
        "start": 3106,
        "stop": 3111,
        "text": " it"
    },
    {
        "start": 3111,
        "stop": 3121,
        "text": "?"
    },
    {
        "start": 3122,
        "stop": 3137,
        "text": " How"
    },
    {
        "start": 3137,
        "stop": 3152,
        "text": " did"
    },
    {
        "start": 3152,
        "stop": 3171,
        "text": " Ryan"
    },
    {
        "start": 3171,
        "stop": 3186,
        "text": " use"
    },
    {
        "start": 3186,
        "stop": 3196,
        "text": " it"
    },
    {
        "start": 3196,
        "stop": 3205,
        "text": ","
    },
    {
        "start": 3205,
        "stop": 3215,
        "text": " as"
    },
    {
        "start": 3215,
        "stop": 3223,
        "text": " an"
    },
    {
        "start": 3227,
        "stop": 3254,
        "text": " object"
    },
    {
        "start": 3254,
        "stop": 3272,
        "text": "?"
    },
    {
        "start": 3272,
        "stop": 3280,
        "text": " As"
    },
    {
        "start": 3280,
        "stop": 3288,
        "text": " an"
    },
    {
        "start": 3288,
        "stop": 3309,
        "text": " object"
    },
    {
        "start": 3309,
        "stop": 3324,
        "text": "."
    },
    {
        "start": 3324,
        "stop": 3353,
        "text": " Ryan"
    },
    {
        "start": 3353,
        "stop": 3382,
        "text": " used"
    },
    {
        "start": 3382,
        "stop": 3396,
        "text": " me"
    },
    {
        "start": 3396,
        "stop": 3410,
        "text": " as"
    },
    {
        "start": 3410,
        "stop": 3424,
        "text": " an"
    },
    {
        "start": 3424,
        "stop": 3466,
        "text": " object"
    },
    {
        "start": 3494,
        "stop": 3494,
        "text": "."
    },
    {
        "start": 3502,
        "stop": 3506,
        "text": " Is"
    },
    {
        "start": 3506,
        "stop": 3516,
        "text": " he"
    },
    {
        "start": 3520,
        "stop": 3549,
        "text": " right"
    },
    {
        "start": 3549,
        "stop": 3580,
        "text": " about"
    },
    {
        "start": 3580,
        "stop": 3605,
        "text": " that"
    },
    {
        "start": 3605,
        "stop": 3609,
        "text": "?"
    },
    {
        "start": 3627,
        "stop": 3640,
        "text": " How"
    },
    {
        "start": 3640,
        "stop": 3654,
        "text": " did"
    },
    {
        "start": 3654,
        "stop": 3663,
        "text": " he"
    },
    {
        "start": 3663,
        "stop": 3677,
        "text": " use"
    },
    {
        "start": 3677,
        "stop": 3686,
        "text": " it"
    },
    {
        "start": 3686,
        "stop": 3709,
        "text": " again"
    },
    {
        "start": 3709,
        "stop": 3726,
        "text": "?"
    },
    {
        "start": 3726,
        "stop": 3735,
        "text": " It"
    },
    {
        "start": 3735,
        "stop": 3749,
        "text": " was"
    },
    {
        "start": 3749,
        "stop": 3775,
        "text": "..."
    },
    {
        "start": 3794,
        "stop": 3814,
        "text": " Ryan"
    },
    {
        "start": 3814,
        "stop": 3847,
        "text": " wanted"
    },
    {
        "start": 3847,
        "stop": 3885,
        "text": " Michael"
    },
    {
        "start": 3885,
        "stop": 3897,
        "text": ","
    },
    {
        "start": 3897,
        "stop": 3914,
        "text": " the"
    },
    {
        "start": 3914,
        "stop": 3952,
        "text": " subject"
    },
    {
        "start": 3952,
        "stop": 3960,
        "text": ","
    },
    {
        "start": 3964,
        "stop": 3975,
        "text": " to"
    },
    {
        "start": 3975,
        "stop": 4014,
        "text": " explain"
    },
    {
        "start": 4014,
        "stop": 4031,
        "text": " the"
    },
    {
        "start": 4031,
        "stop": 4076,
        "text": " computer"
    },
    {
        "start": 4076,
        "stop": 4105,
        "text": " system"
    },
    {
        "start": 4109,
        "stop": 4120,
        "text": ","
    },
    {
        "start": 4120,
        "stop": 4137,
        "text": " the"
    },
    {
        "start": 4137,
        "stop": 4170,
        "text": " object"
    },
    {
        "start": 4170,
        "stop": 4194,
        "text": "."
    },
    {
        "start": 4214,
        "stop": 4227,
        "text": " Thank"
    },
    {
        "start": 4227,
        "stop": 4242,
        "text": " you"
    },
    {
        "start": 4247,
        "stop": 4265,
        "text": "."
    },
    {
        "start": 4265,
        "stop": 4278,
        "text": " To"
    },
    {
        "start": 4278,
        "stop": 4291,
        "text": " wh"
    },
    {
        "start": 4291,
        "stop": 4310,
        "text": "ome"
    },
    {
        "start": 4310,
        "stop": 4329,
        "text": "ver"
    },
    {
        "start": 4329,
        "stop": 4340,
        "text": ","
    },
    {
        "start": 4358,
        "stop": 4388,
        "text": " meaning"
    },
    {
        "start": 4388,
        "stop": 4401,
        "text": " us"
    },
    {
        "start": 4401,
        "stop": 4411,
        "text": ","
    },
    {
        "start": 4418,
        "stop": 4429,
        "text": " the"
    },
    {
        "start": 4433,
        "stop": 4486,
        "text": " indirect"
    },
    {
        "start": 4486,
        "stop": 4524,
        "text": " object"
    },
    {
        "start": 4525,
        "stop": 4546,
        "text": ","
    },
    {
        "start": 4549,
        "stop": 4573,
        "text": " which"
    },
    {
        "start": 4573,
        "stop": 4584,
        "text": " is"
    },
    {
        "start": 4584,
        "stop": 4600,
        "text": " the"
    },
    {
        "start": 4600,
        "stop": 4636,
        "text": " correct"
    },
    {
        "start": 4641,
        "stop": 4661,
        "text": " usage"
    },
    {
        "start": 4668,
        "stop": 4677,
        "text": " of"
    },
    {
        "start": 4677,
        "stop": 4693,
        "text": " the"
    },
    {
        "start": 4693,
        "stop": 4715,
        "text": " word"
    },
    {
        "start": 4715,
        "stop": 4736,
        "text": "."
    }
]

Maybe I have a mistake in how I consume the segments.
This is how I create the word segments:

core/src/model.rs#L134

Yes, you are reading tokens, but you need to read the segment text instead. Try this:
let text = state.full_get_segment_text_lossy(s).context("failed to get segment")?;

Note that I said word segments. I already use get_segment_text there, in the else branch. Do I need to use get_segment_text inside the loop over num_tokens as well?

My proposal was to use max_len 1 together with split_on_word. I think that with these options each segment will be a single word.

So you don't need to use tokens at all, only segments.

for s in 0..num_segments {
    let text = state.full_get_segment_text_lossy(s).context("failed to get segment")?;
    let start = state.full_get_segment_t0(s).context("failed to get start timestamp")?;
    let stop = state.full_get_segment_t1(s).context("failed to get end timestamp")?;
    segments.push(Segment { text, start, stop });
}

If this doesn't help, tomorrow I'll give you an example of how to create words from tokens.
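For reference, here is a minimal sketch of what that token-based fallback could look like. It is not code from whisper-rs or from the project in this issue. It assumes tokens shaped like the entries in the dump above (`text`, `start`, `stop`) and assumes that a leading space in the token text marks the start of a new word, which is how the dump appears to behave (" jer" followed by "ks"). The function name `merge_tokens_into_words` is hypothetical.

```rust
/// Same shape as the entries in the dump above (assumed field types).
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    text: String,
    start: i64,
    stop: i64,
}

/// Merge sub-word tokens into words.
/// Assumption: a token whose text starts with a space begins a new word;
/// any other token (e.g. "ks", "'t", ".") continues the previous word.
fn merge_tokens_into_words(tokens: &[Segment]) -> Vec<Segment> {
    let mut words: Vec<Segment> = Vec::new();
    for token in tokens {
        let starts_word = token.text.starts_with(' ');
        if !starts_word {
            if let Some(word) = words.last_mut() {
                // Continuation: extend the text and the end timestamp.
                word.text.push_str(&token.text);
                word.stop = token.stop;
                continue;
            }
        }
        // New word (or the very first token): drop the leading space.
        words.push(Segment {
            text: token.text.trim_start().to_string(),
            start: token.start,
            stop: token.stop,
        });
    }
    words
}

fn main() {
    // A few tokens copied from the dump above.
    let tokens = vec![
        Segment { text: " jer".into(), start: 2103, stop: 2120 },
        Segment { text: "ks".into(), start: 2125, stop: 2133 },
        Segment { text: " who".into(), start: 2133, stop: 2148 },
        Segment { text: " didn".into(), start: 2157, stop: 2175 },
        Segment { text: "'t".into(), start: 2177, stop: 2199 },
    ];
    for word in merge_tokens_into_words(&tokens) {
        println!("{} -> {}: {}", word.start, word.stop, word.text);
    }
}
```

With the sample above, " jer" and "ks" are joined into "jerks" spanning 2103 to 2133. Punctuation tokens such as "." are attached to the preceding word under this rule, which may or may not be what you want for captions.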

@arizhih

It worked!
I tried so many options there but didn't think of this one.

Thank you so much :)