Extend subtitles parser to recognize the format of automatically generated subtitles
stanislaw opened this issue · 2 comments
stanislaw commented
The subtitles are parsed correctly when they are not auto-generated (and we already have a working unit test for this 🥳).
When they are auto-generated, the subtitle files contain more structure: cue settings after the timings, inline word timestamps, and lines with funny <c>...</c>
tags that have to be recognized by the parser.
Here is an example:
WEBVTT
Kind: captions
Language: de

00:00:00.000 --> 00:00:02.869 align:start position:0%
hallo <00:00:00.430><c>ich </c><00:00:00.860><c>bin </c><00:00:01.290><c>david </c><00:00:01.720><c>gründer </c><00:00:02.150><c>der </c><00:00:02.580><c>lingus</c>

00:00:02.869 --> 00:00:02.879 align:start position:0%
hallo ich bin david gründer der lingus

00:00:02.879 --> 00:00:05.390 align:start position:0%
hallo ich bin david gründer der lingus
organic <00:00:03.336><c>und </c><00:00:03.793><c>erfinder </c><00:00:04.250><c>der </c><00:00:04.707><c>bilingue</c>
Seems directly relevant:
kamui-fin commented
So far I got the parser to avoid including that additional markup in PR #31. We still need to write code to avoid duplicate lines, correct?
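For anyone following along, here is a minimal sketch of what stripping that markup could look like. It is not necessarily what PR #31 does; the names `INLINE_TAG` and `strip_inline_markup` are made up for illustration, and it assumes the only inline markup is plain `<c>`/`</c>` tags and `<HH:MM:SS.mmm>` word timestamps, as in the example above.

```python
import re

# Hypothetical pattern: matches <c>, </c>, and inline word timestamps
# such as <00:00:00.430>. Other WebVTT tags are not handled here.
INLINE_TAG = re.compile(r"</?c>|<\d{2}:\d{2}:\d{2}\.\d{3}>")


def strip_inline_markup(text: str) -> str:
    """Remove inline tags, then collapse any leftover runs of whitespace."""
    without_tags = INLINE_TAG.sub("", text)
    return " ".join(without_tags.split())


raw = (
    "hallo <00:00:00.430><c>ich </c><00:00:00.860><c>bin </c>"
    "<00:00:01.290><c>david </c><00:00:01.720><c>gründer </c>"
    "<00:00:02.150><c>der </c><00:00:02.580><c>lingus</c>"
)
print(strip_inline_markup(raw))
```

Lines without markup pass through unchanged, so the same function can be applied to every cue line regardless of whether the file is auto-generated.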
kamui-fin commented
Just added the duplicate-removal code. Let me know if I can close this now.
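A rough sketch of one way the duplicate removal could work, in case it helps with review. This is an illustration under assumptions, not a copy of the code I pushed: `drop_repeated_lines` is a hypothetical name, and it assumes the markup has already been stripped and that a repeated line always directly follows its first occurrence, as in the example above.

```python
from typing import Iterable, List


def drop_repeated_lines(lines: Iterable[str]) -> List[str]:
    """Keep a line only if it differs from the last line that was kept."""
    result: List[str] = []
    for line in lines:
        text = line.strip()
        # Skip blank lines and lines that repeat the previously kept line.
        if text and (not result or result[-1] != text):
            result.append(text)
    return result


# Cue text lines from the example above, after the markup has been removed.
cleaned = [
    "hallo ich bin david gründer der lingus",
    "hallo ich bin david gründer der lingus",
    "hallo ich bin david gründer der lingus",
    "organic und erfinder der bilingue",
]
print(drop_repeated_lines(cleaned))
```

Comparing only against the previously kept line means a sentence that is legitimately spoken again later in the video is not dropped.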