Library for converting the various transcript file formats to a common format.
Originally designed to help users of the Podcast Namespace podcast:transcript
tag.
This is a Node.js module available through npm or yarn.
npm install transcriptator
yarn add transcriptator
There are three primary methods and two types. See the jsdoc for additional information.
The convertFile
function accepts the transcript file data and parses it in to an array of Segment
.
If transcriptFormat
is not defined, will use determineFormat
to attempt to identify the type.
convertFile(data: string, transcriptFormat: TranscriptFormat = undefined): Array<Segment>
The determineFormat
function accepts the transcript file data and attempts to identify the TranscriptFormat
.
determineFormat(data: string): TranscriptFormat
The TranscriptFormat
enum defines the allowable transcript types supported by Transcriptator.
The Segment
type defines the segment/cue of the transcript.
To change the way the startTime
and endTime
are formatted in startTimeFormatted
and endTimeFormatted
,
register a custom formatter to be used instead.
The formatter function shall accept a single argument as a number and return the value formatted as a string.
import { TimestampFormatter } from "transcriptator"
function customFormatter(timestamp) {
return timestamp.toString()
}
TimestampFormatter.registerCustomFormatter(customFormatter)
Additional options are available for combining or formatting two or more segments
To change the options, use the Options.setOptions
function.
The options only need to be specified once and will be used when parsing any transcript data.
To restore options to their default value, call Options.restoreDefaultSettings
.
The IOptions
interface used by Options
defines options for combining and formatting parsed segments.
combineEqualTimes
: boolean- Combine segments if the
Segment.startTime
,Segment.endTime
, andSegment.speaker
match between the current and prior segments - Can be used with
combineSegments
. ThecombineEqualTimes
rule is applied first. - Can be used with
speakerChange
. ThespeakerChange
rule is applied last. - Cannot be used with
combineSpeaker
- Default: false
- Combine segments if the
combineEqualTimesSeparator
: string- Character to use when
combineEqualTimes
is true. - Default:
\n
- Character to use when
combineSegments
: boolean- Combine segments where speaker is the same and concatenated
body
fits in thecombineSegmentsLength
- Can be used with
combineEqualTimes
. ThecombineSegments
rule is applied first. - Can be used with
speakerChange
. ThespeakerChange
rule is applied last. - Cannot be used with
combineSpeaker
- Default: false
- Combine segments where speaker is the same and concatenated
combineSegmentsLength
: number- Max length of body text to use when
combineSegments
is true - Default: See
DEFAULT_COMBINE_SEGMENTS_LENGTH
- Max length of body text to use when
combineSpeaker
: boolean- Combine consecutive segments from the same speaker.
- Note: If this is enabled,
combineEqualTimes
andcombineSegments
will not be applied. - Warning: if the transcript does not contain speaker information, resulting segment will contain entire transcript text.
- Default: false
speakerChange
: boolean- Only include
Segment.speaker
when speaker changes - May be used in combination with
combineSpeaker
,combineEqualTimes
, orcombineSegments
- Default: false
- Only include
import { Options } from "transcriptator"
Options.setOptions({
combineSegments: true,
combineSegmentsLength: 32,
})
Transcripts which follow the SRT/SubRip format
1
00:00:00,780 --> 00:00:06,210
Adam Curry: podcasting 2.0 March
4 2023 Episode 124 on D flat
2
00:00:06,210 --> 00:00:12,990
formable hello everybody welcome
to a delayed board meeting of
The timestamp may contain the hour and minutes but is not required. The millisecond may be separated with either a comma or decimal.
Attempts to find the speaker's name from the beginning of the first line of each segment.
References:
HTML data in format below are considered to be transcripts.
The elements cite
, time
, and p
are used to define a segment.
The cite
element is not required. The order is also not required.
The elements may either be a child of the document directly or a direct child of the html
or body
element.
Elements do not need to be on separate lines.
Example 1
<html>
<body>
<cite>Alban:</cite>
<time>0:00</time>
<p>
It is so stinking nice to like, show up and record this show. And Travis has already put together an
outline. Kevin's got suggestions, I throw my thoughts into the mix. And then Travis goes and does all the
work from there, too. It's out into the wild. And I don't see anything. That's an absolute joy for at least
two thirds of the team. Yeah, I mean, exactly.
</p>
<cite>Kevin:</cite>
<time>0:30</time>
<p>
You guys remember, like two months ago, when you were like, We're going all in on video Buzzcast. I was
like, that's, I mean, I will agree and commit and disagree, disagree and commit, I'll do something. But I
don't want to do this.
</p>
</body>
</html>
Example 2
<p>
It is so stinking nice to like, show up and record this show. And Travis has already put together an outline.
Kevin's got suggestions, I throw my thoughts into the mix. And then Travis goes and does all the work from there,
too. It's out into the wild. And I don't see anything. That's an absolute joy for at least two thirds of the team.
Yeah, I mean, exactly.
</p>
<time>0:00</time>
<p>
You guys remember, like two months ago, when you were like, We're going all in on video Buzzcast. I was like,
that's, I mean, I will agree and commit and disagree, disagree and commit, I'll do something. But I don't want to do
this.
</p>
<time>0:30</time>
JSON data in one of the formats below are considered to be transcripts.
In both formats, the data does not need to be in pretty print format.
Format 1
{
"version": "1.0.0",
"segments": [
{
"speaker": "Alban",
"startTime": 0.0,
"endTime": 4.8,
"body": "It is so stinking nice to"
},
{
"speaker": "Alban",
"startTime": 0.0,
"endTime": 4.8,
"body": "like, show up and record this"
}
]
}
There must be a segments
list of objects containing speaker
, startTime
, endTime
, and body
.
The startTime
and endTime
are assumed to be in seconds.
Format 2
[
{
"start": 1,
"end": 5000,
"text": "Subtitles: @marlonrock1986 (^^V^^)"
},
{
"start": 25801,
"end": 28700,
"text": "It's another hot, sunny day today\nhere in Southern California."
}
]
The top level element must be a list of objects containing start
, end
, and text
.
The start
and end
are assumed to be in milliseconds.
Attempts to find the speaker's name from the beginning of the text
value.
Transcripts which follow the WebVTT/VTT format
WEBVTT
1
00:00:00.001 --> 00:00:05.000
Subtitles: @marlonrock1986 (^^V^^)
2
00:00:25.801 --> 00:00:28.700
It's another hot, sunny day today
here in Southern California.
The index number is optional:
WEBVTT
00:00:00.000 --> 00:00:11.840
Buenas, bienvenidas de vuelta a KDE Express. Esta vez para no perder el ritmo volvemos a la
00:00:11.840 --> 00:00:16.800
versión movilidad que no tenemos a los compañeros disponibles y hoy quería haceros un especial
The timestamp may contain the hour and minutes but is not required. The millisecond may be separated with either a comma or decimal.
Attempts to find the speaker's name from the beginning of the first line of each segment.
References:
Transcripts used for testing are excerpts from the following shows.
- Podcasting 2.0
- podcasting_20_episode_124.srt (from Episode 124)
- Buzzcast
- buzzcast.html
- buzzcast.srt
- buzzcast.json
- How to Start a Podcast
- how_to_start_a_podcast.json
- how_to_start_a_podcast.html
- Podnews Daily (2024-01-25)
- podnews_daily_2024-01-25.vtt
- Podnews Weekly Review (2023-03-17)
- podnews_weekly_review_2023-03-17.html
- Podnews Weekly Review (2023-05-05)
- podnews_weekly_review_2023-05-05.json
- Podnews Weekly Review (2024-01-19)
- podnews_weekly_review_2024-01-19.vtt
- subtitle.js
- LaLaLand.vtt
- LaLaLand.json
- KDE Express
- kde_express-16_kde_en_telegram.vtt
Please see the Contribution Guide