bug: decimal/thousand-formatted numbers conflict with timestamp regex
lincolnthalles opened this issue · 3 comments
A valid region-formatted number in content can raise errors.
1490
01:58:16,824 --> 01:58:23,789
<i> Império Austro-Húngaro:
1.567.202. </i>
  File "srt.py", line 436, in _check_contiguity
    raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 109990, but started at char 110005 (unmatched content: '1.567.202. </i>')
In this case, it's possible to circumvent this error by removing the dot from the timestamp regex, but, as the preceding comment in the source code states, this will break compatibility with non-strict subtitles.
Line 19 in 434d0c1
Suggested possible fixes are:
- create a 'strict' flag to instantiate parse() (or a parse_strict() method), that would use only strict regex patterns;
- add boundaries in order to make the timestamp regex less greedy, so it never matches the contents.
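To illustrate the trade-off behind the first suggestion, here is a minimal sketch comparing a lenient timestamp pattern with a strict one. Both patterns are assumptions written for this example, not the regexes used in srt.py, and `looks_like_timestamp` is a hypothetical helper.

```python
import re

# Illustrative assumptions only: these are NOT the library's actual regexes.
# Lenient: any of ",", "." or ":" may separate the fields, fields can be any width.
LENIENT_TS = re.compile(r"[0-9]+[,.:][0-9]+[,.:][0-9]+[,.:]?[0-9]*")
# Strict: exactly HH:MM:SS,mmm.
STRICT_TS = re.compile(r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}")


def looks_like_timestamp(line, pattern):
    """Return True if the line begins with something the pattern accepts."""
    return pattern.match(line) is not None


for line in ("01:58:16,824 --> 01:58:23,789", "1.567.202. </i>"):
    print(
        repr(line),
        "lenient:", looks_like_timestamp(line, LENIENT_TS),
        "strict:", looks_like_timestamp(line, STRICT_TS),
    )
```

Under these assumed patterns, the lenient one accepts both the real timing line and the number line from the subtitle content, while the strict one accepts only the timing line. The cost of the strict form is that timestamps written with dots or other separators would no longer be recognised.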
This subtitle works just fine for me:
srt develop % cat test.srt
1490
01:58:16,824 --> 01:58:23,789
<i> Império Austro-Húngaro:
1.567.202. </i>
srt develop % srt normalise -i test.srt
1
01:58:16,824 --> 01:58:23,789
<i> Império Austro-Húngaro:
1.567.202. </i>
Please provide more information on how to reproduce, and your srt version.
srt version is 3.5.3.
My bad, I provided a bad sample. The issue was caused on my side: I pre-process the file and join the lines with "\n", which resulted in:
1490
01:58:03,632 --> 01:58:10,590
<i>Império Austro-Húngaro:
1.567.202.</i>
That format triggers the error:
CRITICAL:srt_tools.utils:Parsing failed, maybe you need to pass a different encoding with --encoding?
Traceback (most recent call last):
  File "/home/lincoln/.local/bin/srt-normalise", line 28, in <module>
    main()
  File "/home/lincoln/.local/bin/srt-normalise", line 19, in main
    output = srt_tools.utils.compose_suggest_on_fail(args.input, strict=args.strict)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt_tools/utils.py", line 208, in compose_suggest_on_fail
    return srt.compose(subs, strict=strict, eol=os.linesep, in_place=True)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 476, in compose
    return "".join(subtitle.to_srt(strict=strict, eol=eol) for subtitle in subtitles)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 476, in <genexpr>
    return "".join(subtitle.to_srt(strict=strict, eol=eol) for subtitle in subtitles)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 294, in sort_and_reindex
    for sub_num, subtitle in enumerate(sorted(subtitles), start=start_index):
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 375, in parse
    _check_contiguity(srt, expected_start, actual_start, ignore_errors)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 436, in _check_contiguity
    raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 355, but started at char 369 (unmatched content: '1.567.202.</i>')
I've been using this library for years, and this error is very rare.
In further investigation, I found that the specific error triggers are:
- empty line before the number / blank line inside the subtitle content
- a last line starting with a number of more than 7 digits that uses commas or dots as decimal or thousands separators
1
00:00:00,000 --> 00:00:00,916
Anything

123,456,789,123,456

2
00:00:01,000 --> 00:00:02,236
Whatever

123.456.789
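The two triggers listed above can be checked before handing a file to the library. This is a rough sketch using only the standard library. `LENIENT_TS` is an assumed stand-in for a lenient timestamp pattern (not the regex from srt.py), `risky_lines` is a hypothetical helper, and the sample string assumes a blank line sits before each number, as the first trigger describes.

```python
import re

# Assumed lenient timestamp pattern, for illustration only (not srt.py's regex).
LENIENT_TS = re.compile(r"[0-9]+[,.:][0-9]+[,.:][0-9]+[,.:]?[0-9]*")


def risky_lines(text):
    """Return lines that follow a blank line and begin like a lenient timestamp."""
    lines = text.split("\n")
    hits = []
    for prev, line in zip(lines, lines[1:]):
        if prev.strip() == "" and LENIENT_TS.match(line):
            hits.append(line)
    return hits


sample = (
    "1\n00:00:00,000 --> 00:00:00,916\nAnything\n\n123,456,789,123,456\n\n"
    "2\n00:00:01,000 --> 00:00:02,236\nWhatever\n\n123.456.789\n"
)
print(risky_lines(sample))
```

The index line "2" also follows a blank line, but it is not reported because it has no separators and so does not resemble a timestamp under the assumed pattern.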
So, as I was in fact feeding an invalid SRT to the library, I'm not sure this needs to be fixed. On the other hand, this library's tolerance of non-strict SRT files is a feature.
Thanks for your time, and feel free to close this.
Thanks for the update. Blank lines are illegal in SRT content, so this is a wontfix. Any attempt to parse an SRT with blank lines in the content is best effort. Any parse_strict or similar method would also thus reject your incoming SRT block.
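Since blank lines inside content are invalid, one option is to remove them in pre-processing before parsing. The following is a heuristic sketch, not part of the srt library. `drop_blank_content_lines` is a hypothetical helper that treats a blank line as a block separator only when an index line and a timing line follow it.

```python
import re

INDEX_RE = re.compile(r"^[0-9]+$")


def drop_blank_content_lines(text):
    """Remove blank lines that are not followed by an index line and a timing line."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        if line.strip():
            out.append(line)
            continue
        # Look ahead: a genuine separator is followed by "N" and then "... --> ...".
        nxt = lines[i + 1].strip() if i + 1 < len(lines) else ""
        after = lines[i + 2] if i + 2 < len(lines) else ""
        if INDEX_RE.match(nxt) and "-->" in after:
            out.append("")
    return "\n".join(out) + "\n"


broken = (
    "1\n00:00:00,000 --> 00:00:00,916\nAnything\n\n123,456,789,123,456\n\n"
    "2\n00:00:01,000 --> 00:00:02,236\nWhatever\n\n123.456.789\n"
)
print(drop_blank_content_lines(broken))
```

The heuristic can misfire when content itself contains a bare number followed by a line with "-->", so the result is best treated as a cleanup to review rather than a guaranteed repair.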