Converts an SRT subtitle file to an SSML file with speech durations.
Note: If you are looking for a more comprehensive script that also generates synced and translated dubs, visit my other repo.
- Using TTS to generate speech for a video using only subtitles
- Automated translation and dubbing of videos while keeping the dub in sync. You can simply translate the text portions of the subtitles before feeding the file into the script. Each translated line keeps the timing of the original speech, so the generated speech should theoretically be a drop-in replacement for the original.
- It takes the text lines from the subtitle file and puts each on a separate line within the `speak` tag.
- It takes the start and end timestamps for each subtitle line and calculates the time difference in milliseconds, then uses that as the `duration` attribute of the `prosody` tag. This tells the TTS how long it should take to say the line, so it stays in sync with the original video.
  - Warning: Not many neural TTS services support this duration feature, so this may not work as expected.
- It also calculates the time difference between the end of one subtitle line and the beginning of the next, and uses that as the `time` attribute of the `break` tag at the end of each text line. This also keeps it in sync with the original video.
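
The timing arithmetic described above can be sketched in Python as follows. This is a minimal illustration, not the script's actual code: the helper names `to_ms` and `build_lines` and the `(start, end, text)` cue layout are assumptions made for the example.

```python
import re

# Matches an SRT timestamp such as 00:00:05,240 (hours:minutes:seconds,milliseconds).
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_ms(timestamp: str) -> int:
    """Convert an SRT timestamp to milliseconds (hypothetical helper)."""
    match = TIMESTAMP.fullmatch(timestamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def build_lines(cues):
    """Turn (start, end, text) cues into prosody/break lines.

    The text is assumed to be XML-escaped already.
    """
    lines = []
    for i, (start, end, text) in enumerate(cues):
        # Length of the spoken line: end minus start.
        duration = to_ms(end) - to_ms(start)
        line = f'<prosody duration="{duration}ms">{text}</prosody>'
        if i + 1 < len(cues):
            # Pause: end of this cue to the start of the next one.
            gap = to_ms(cues[i + 1][0]) - to_ms(end)
            line += f'<break time="{gap}ms"/>'
        lines.append(line)
    return lines


cues = [
    ("00:00:13,480", "00:00:14,250", "veniam, quis nostrud"),
    ("00:00:14,340", "00:00:19,930", "Duis aute irure dolor"),
]
for line in build_lines(cues):
    print(line)
```

With these two cues, the first line gets a 770 ms duration followed by a 90 ms break, and the last line gets a 5590 ms duration with no trailing break. The sketch does not handle overlapping cues, which would produce a negative gap.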
- Language
- TTS Voice Name
- SSML Version
- xmlns Attributes for the `speak` tag
- Whether to include the `xmlns:xsi` and `xsi:schemaLocation` attributes
- Input and Output file names (Defaults: `subtitles.srt` for input and `SSML.txt` for output)
- Duration Attribute Name (Allows support for Amazon Polly, which uses `amazon:max-duration` instead of `duration`)
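
As a rough illustration of how these options could shape the output, the sketch below builds the opening tags and one `prosody` line from a settings dictionary. The dictionary keys and the function names `opening_tags` and `prosody_line` are hypothetical, not the script's actual variable names; only the attribute values are taken from this README.

```python
# Hypothetical settings; the key names are made up for this sketch.
settings = {
    "language": "en-US",
    "voice": "en-US-DavisNeural",
    "ssml_version": "1.0",
    "xmlns": 'xmlns="http://www.w3.org/2001/10/synthesis"',
    "duration_attribute": "duration",
}


def opening_tags(cfg):
    """Build the <speak> and <voice> opening tags from the settings."""
    return (
        f'<speak {cfg["xmlns"]} version="{cfg["ssml_version"]}" '
        f'xml:lang="{cfg["language"]}"><voice name="{cfg["voice"]}">'
    )


def prosody_line(cfg, duration_ms, text):
    """Wrap already-escaped text in a prosody tag using the configured attribute name."""
    return f'<prosody {cfg["duration_attribute"]}="{duration_ms}ms">{text}</prosody>'


print(opening_tags(settings))
print(prosody_line(settings, 4910, "This is an example."))

# Switching the attribute name, for example for Amazon Polly:
polly = {**settings, "duration_attribute": "amazon:max-duration"}
print(prosody_line(polly, 4910, "This is an example."))
```

Keeping the attribute name in one setting means the rest of the generation logic does not need to know which TTS service the output targets.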
1
00:00:00,140 --> 00:00:05,050
This is an example of a subtitle file with a bunch of random words I've added with various timestamps.

2
00:00:05,240 --> 00:00:13,290
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim

3
00:00:13,480 --> 00:00:14,250
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

4
00:00:14,340 --> 00:00:19,930
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla.

5
00:00:20,130 --> 00:00:23,419
Now some examples of some escaped characters such as & and ' and " and < and > just to name a few
<?xml version="1.0" encoding="UTF-8"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" version="1.0" xml:lang="en-US"><voice name="en-US-DavisNeural">
<prosody duration="4910ms">This is an example of a subtitle file with a bunch of random words I've added with various timestamps.</prosody><break time="190ms"/>
<prosody duration="8050ms">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim</prosody><break time="190ms"/>
<prosody duration="770ms">veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</prosody><break time="90ms"/>
<prosody duration="5590ms">Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla.</prosody><break time="200ms"/>
<prosody duration="3289ms">Now some examples of some escaped characters such as &amp; and &apos; and &quot; and &lt; and &gt; just to name a few</prosody>
</voice></speak>
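
Subtitle text can contain characters that are special in XML, so it has to be escaped before it is placed inside a `prosody` tag. One way to do this in Python, shown here as an assumption rather than the script's confirmed approach, is `xml.sax.saxutils.escape` from the standard library:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "such as & and ' and \" and < and >"

# escape() handles &, < and > by default; the extra mapping also covers quotes.
escaped = escape(raw, {"'": "&apos;", '"': "&quot;"})
print(escaped)

# Sanity check: the escaped text parses as XML and round-trips to the original.
element = ET.fromstring(f"<prosody>{escaped}</prosody>")
assert element.text == raw
```

Escaping the quotes is not strictly required inside element content, but it is harmless and keeps the text safe if it is ever reused inside an attribute value.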