Create an SRT subtitle file along with the text output

Question

gtasteve opened this issue 9 months ago · 3 comments

Describe the feature

It would be great to have an SRT file to go along with the output MP3 from the /web app.

Answer 1 · 2025-02-27T12:46:49.000Z

`import time

import sounddevice as sd
from kokoro_onnx import Kokoro
from kokoro_onnx.tokenizer import Tokenizer
from scipy.io.wavfile import write


def generate_audio(text, voice, speed, lang):
    """
    Generate audio from the given text using the specified voice and language.
    """
    tokenizer = Tokenizer()
    kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

    # Convert the text to phonemes, then synthesize from the phonemes
    phonemes = tokenizer.phonemize(text, lang=lang)
    samples, sample_rate = kokoro.create(
        phonemes, voice=voice, speed=speed, lang=lang, is_phonemes=True
    )
    return samples, sample_rate


def play_audio(samples, sample_rate):
    """
    Play the generated audio and return the wall-clock playback time in seconds.
    """
    print("Playing audio...")
    start_time = time.time()
    sd.play(samples, sample_rate)
    sd.wait()  # Block until playback has finished
    end_time = time.time()
    duration = end_time - start_time
    return duration


def generate_subtitles(text, output_file, duration):
    """
    Generate subtitles for the audio in the .srt format with 10 words per line.
    """
    words = text.split()
    num_lines = (len(words) + 9) // 10  # Number of 10-word lines, rounded up
    start_time = 0
    # Seconds shown per line, padded by 10%; every line gets the same share
    line_duration = duration / num_lines * 1.1
    # UTF-8 so that characters such as the curly apostrophe are written correctly
    with open(output_file, "w", encoding="utf-8") as f:
        for i in range(num_lines):
            # Timestamps are truncated to whole seconds (milliseconds are always 000)
            start_time_str = time.strftime("%H:%M:%S", time.gmtime(start_time))
            start_time_str = f"{start_time_str},000"
            end_time_str = time.strftime(
                "%H:%M:%S", time.gmtime(start_time + line_duration)
            )
            end_time_str = f"{end_time_str},000"
            line_words = words[i * 10:(i + 1) * 10]  # Words for the current line
            line_text = " ".join(line_words)
            f.write(f"{i + 1}\n")
            f.write(f"{start_time_str} --> {end_time_str}\n")
            f.write(line_text + "\n\n")
            start_time += line_duration


def main():
    text = "Ere the half-hour ended, five o’clock struck; school was dismissed, and all were gone into the refectory to tea. I now ventured to descend: it was deep dusk; I retired into a corner and sat down on the floor. The spell by which I had been so far supported began to dissolve; reaction took place, and soon, so overwhelming was the grief that seized me, I sank prostrate with my face to the ground."
    voice = "af_heart"
    speed = 1.0
    lang = "en-us"
    output_file = "subtitles"

    samples, sample_rate = generate_audio(text, voice, speed, lang)
    duration = play_audio(samples, sample_rate)
    generate_subtitles(text, f"{output_file}.srt", duration)
    write(f"{output_file}.wav", sample_rate, samples)
    print(f"Subtitles saved to {output_file}.srt")
    print(f"Audio saved to {output_file}.wav")


if __name__ == "__main__":
    main()
`

Answer 2 · 2025-03-23T18:56:57.000Z

Disclaimer: I am not a developer on this project, just someone who would also be interested in an SRT file.

This code looks really useful @zhongtanru but it plays the file to get the length, and it assumes that all words (lines) are of equal length when spoken. Ideally the SRT, or at least timestamps, would be generated at the time of synthesis so they can be accurate.
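
To illustrate what I mean by timestamps from synthesis: if the text were synthesized one sentence (or line) at a time, each chunk's length in samples would give its exact duration, with no playback needed. Here is a rough sketch of just the timing part. The names `srt_timestamp` and `srt_from_chunks` are made up for this example, and I'm assuming the caller already has a list of `(text, sample_count)` pairs from whatever synthesis call is used; none of this is part of the project.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_from_chunks(chunks, sample_rate):
    """Build SRT text from (text, sample_count) pairs given in playback order.

    Each cue starts where the previous chunk's audio ends, so the timing
    follows the real length of every chunk instead of an equal split.
    """
    blocks = []
    position = 0  # Running offset into the audio, in samples
    for index, (text, sample_count) in enumerate(chunks, start=1):
        start = position / sample_rate
        position += sample_count
        end = position / sample_rate
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Each block ends with a newline, so joining with "\n" leaves a blank line between cues
    return "\n".join(blocks)
```

With the script above, `sample_count` would be `len(samples)` from one `kokoro.create(...)` call per chunk, and the per-chunk arrays would still need to be concatenated before writing the WAV file. Any pause inserted between chunks would have to be added to `position` as well.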

Answer 3 · 2025-03-24T02:11:29.000Z

I'm aware of its limitations, but I haven't found a more effective way to generate the subtitles.
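
One cheap refinement that stays within the same approach might be to give each line a share of the total duration proportional to its character count, rather than an equal share. This is only a heuristic sketch (character count is a rough stand-in for spoken length, not real alignment), and `allocate_cues` is a hypothetical helper name, not something from kokoro_onnx.

```python
def allocate_cues(text, total_seconds, words_per_line=10):
    """Split text into lines and assign each a time span weighted by its length.

    Returns a list of (start_seconds, end_seconds, line_text) tuples.
    """
    words = text.split()
    lines = [
        " ".join(words[i:i + words_per_line])
        for i in range(0, len(words), words_per_line)
    ]
    total_chars = sum(len(line) for line in lines)

    cues = []
    elapsed_chars = 0  # Characters covered by the lines placed so far
    for line in lines:
        start = total_seconds * elapsed_chars / total_chars
        elapsed_chars += len(line)
        end = total_seconds * elapsed_chars / total_chars
        cues.append((start, end, line))
    return cues
```

The total can be taken as `len(samples) / sample_rate` from the synthesized audio, which avoids having to play the file to measure it.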