Add arguments `time_off` and `duration` to transcriber

Question

me-kell opened this issue 3 months ago · 2 comments

Currently the transcriber always processes the whole input file, from beginning to end.

It would be very useful to be able to pass a start time offset and/or a duration to the transcriber.

Here is a proposal for how to do it:

Add (ffmpeg's) arguments time_off and duration in python/vosk/transcriber/cli.py after line 46.

parser.add_argument("--time_off", "-ss", default=None, type=int, help="start time offset")
parser.add_argument("--duration", "-d", default=None, type=int, help="duration")
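As a rough, self-contained sketch of how these two options would parse (not the actual contents of cli.py): the `build_parser` helper is hypothetical, and `type=float` is my assumption rather than part of the proposal above, chosen because ffmpeg also accepts fractional seconds.

```python
import argparse


def build_parser():
    # Hypothetical stand-alone parser; the real cli.py defines many more options.
    parser = argparse.ArgumentParser(description="Sketch of the proposed options")
    # float instead of int is an assumption: it allows offsets such as 12.5 seconds.
    parser.add_argument("--time_off", "-ss", default=None, type=float,
                        help="start time offset in seconds")
    parser.add_argument("--duration", "-d", default=None, type=float,
                        help="duration in seconds")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration only.
args = build_parser().parse_args(["-ss", "90", "-d", "30"])
print(args.time_off, args.duration)
```

Both options default to `None`, so omitting them keeps the current behaviour of processing the whole file.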

Pass the arguments time_off and duration to ffmpeg in function resample_ffmpeg in python/vosk/transcriber/transcriber.py (line 115):

        cmd = shlex.split("ffmpeg -nostdin -loglevel quiet "
                "-i \'{}\' -ar {} -ac 1 {} {} -f s16le -".format(
                    str(infile), 
                    SAMPLE_RATE, 
                    f'-ss {self.args.time_off}' if self.args.time_off is not None else '', # add this
                    f'-t {self.args.duration}' if self.args.duration is not None else ''   # and this
                    ))
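One possible variation, shown only as a sketch: building the command as a list avoids the manual quoting of the file name and the empty placeholders. The function name `build_ffmpeg_cmd` and the `SAMPLE_RATE` value of 16000 are assumptions for illustration, not taken from transcriber.py.

```python
SAMPLE_RATE = 16000  # assumed value, for illustration only


def build_ffmpeg_cmd(infile, sample_rate=SAMPLE_RATE, time_off=None, duration=None):
    """Return the ffmpeg argument list, adding -ss / -t only when requested."""
    cmd = ["ffmpeg", "-nostdin", "-loglevel", "quiet",
           "-i", str(infile),
           "-ar", str(sample_rate), "-ac", "1"]
    if time_off is not None:
        # Placed after -i, so ffmpeg treats it as an output option,
        # matching the position used in the proposal above.
        cmd += ["-ss", str(time_off)]
    if duration is not None:
        cmd += ["-t", str(duration)]
    # Raw signed 16-bit little-endian PCM written to stdout.
    cmd += ["-f", "s16le", "-"]
    return cmd


print(build_ffmpeg_cmd("my interview.wav", time_off=90, duration=30))
```

The resulting list can be passed directly to `subprocess.Popen` without `shlex.split`. Note that ffmpeg also accepts `-ss` before `-i`, where it seeks in the input instead of decoding and discarding; which placement suits the transcriber better would need testing.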

The function resample_ffmpeg_async could be adapted similarly.

Answer 1 · 2024-03-06T16:23:42.000Z

Hi, thank you for the proposal! It looks nice, but what is the use case, please? I can't imagine why a user would need to start from a certain offset instead of just processing the whole file.

Answer 2 · 2024-03-06T20:48:34.000Z

Some use cases:

You have a recording of an interview and a list of the start times of every question and answer. You may want to assign the transcribed parts to their respective time points (question and answer).
You have a music radio program in which the presenter comments after every two or three songs. You may want to transcribe only the presenter, not the songs.
And last but not least: you have an audio file with different languages spoken by different speakers. You may want to transcribe different parts of the audio in different languages, using the corresponding language and model for each part.
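For the interview case, a small sketch of how a list of start times could be turned into the (time_off, duration) pairs that the proposed options expect. The helper `segments_from_starts` and the example times are hypothetical.

```python
def segments_from_starts(starts, total_length):
    """Turn segment start times (in seconds) into (time_off, duration) pairs."""
    # Append the total length so the last segment also gets an end point.
    bounds = sorted(starts) + [total_length]
    return [(start, end - start) for start, end in zip(bounds, bounds[1:])]


# Example: three segments in a 300-second recording.
for time_off, duration in segments_from_starts([0, 75, 190], 300):
    print(f"--time_off {time_off} --duration {duration}")
```

Each pair would then correspond to one transcriber run, optionally with a different model per segment for the multi-language case.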