cdown/srt

bug: decimal/thousand-formatted numbers conflict with timestamp regex

lincolnthalles opened this issue · 3 comments

A valid, locale-formatted number in subtitle content can raise a parse error:

1490
01:58:16,824 --> 01:58:23,789
<i> Império Austro-Húngaro:
1.567.202. </i>

File "srt.py", line 436, in _check_contiguity
    raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 109990, but started at char 110005 (unmatched content: '1.567.202. </i>')

This particular error can be avoided by removing the dot from the timestamp delimiter regex but, as the comment preceding it in the source code states, doing so would break compatibility with non-strict subtitles.

srt/srt.py

Line 19 in 434d0c1

RGX_TIMESTAMP_MAGNITUDE_DELIM = r"[,.:，．。：]"
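A minimal sketch of why a dotted number can pass for a timestamp under a permissive delimiter class. Both patterns below are simplified assumptions for illustration (ASCII delimiters only, hypothetical names LENIENT_TS and STRICT_TS), not the library's actual regexes:

```python
import re

# Assumption: ASCII-only stand-in for the permissive delimiter class.
LENIENT_DELIM = r"[,.:]"

# Hypothetical lenient timestamp: three digit groups joined by any delimiter,
# optionally followed by one more delimiter and digit group.
LENIENT_TS = re.compile(
    LENIENT_DELIM.join([r"[0-9]+"] * 3) + LENIENT_DELIM + r"?[0-9]*"
)

# Hypothetical strict timestamp: exactly HH:MM:SS,mmm.
STRICT_TS = re.compile(r"[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}")


def looks_like_timestamp(text, pattern):
    """Return True if `text` starts with something `pattern` accepts."""
    return pattern.match(text) is not None


for candidate in ("01:58:16,824", "1.567.202. </i>"):
    print(
        candidate,
        looks_like_timestamp(candidate, LENIENT_TS),
        looks_like_timestamp(candidate, STRICT_TS),
    )
```

Under these assumptions the lenient pattern accepts both strings, while the strict one accepts only the real timestamp. The trade-off is that the strict pattern also rejects non-standard but common forms such as 01:58:16.824.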

Suggested possible fixes are:

  • add a 'strict' flag to parse() (or a separate parse_strict() method) that uses only strict regex patterns;
  • add boundaries to make the timestamp regex less greedy, so that it never matches subtitle content.

cdown commented

This subtitle works just fine for me:

srt develop % cat test.srt
1490
01:58:16,824 --> 01:58:23,789
<i> Império Austro-Húngaro:
1.567.202. </i>

srt develop % srt normalise -i test.srt 
1
01:58:16,824 --> 01:58:23,789
<i> Império Austro-Húngaro:
1.567.202. </i>

Please provide more information on how to reproduce, and your srt version.

lincolnthalles commented

My srt version is 3.5.3.

My bad, I provided a bad sample. The issue was caused on my end by pre-processing the file and joining the lines with "\n", which resulted in:

1490

01:58:03,632 --> 01:58:10,590

<i>Império Austro-Húngaro:

1.567.202.</i>

That format triggers the error:

CRITICAL:srt_tools.utils:Parsing failed, maybe you need to pass a different encoding with --encoding?
Traceback (most recent call last):
  File "/home/lincoln/.local/bin/srt-normalise", line 28, in <module>
    main()
  File "/home/lincoln/.local/bin/srt-normalise", line 19, in main
    output = srt_tools.utils.compose_suggest_on_fail(args.input, strict=args.strict)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt_tools/utils.py", line 208, in compose_suggest_on_fail
    return srt.compose(subs, strict=strict, eol=os.linesep, in_place=True)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 476, in compose
    return "".join(subtitle.to_srt(strict=strict, eol=eol) for subtitle in subtitles)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 476, in <genexpr>
    return "".join(subtitle.to_srt(strict=strict, eol=eol) for subtitle in subtitles)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 294, in sort_and_reindex
    for sub_num, subtitle in enumerate(sorted(subtitles), start=start_index):
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 375, in parse
    _check_contiguity(srt, expected_start, actual_start, ignore_errors)
  File "/home/lincoln/.local/lib/python3.10/site-packages/srt.py", line 436, in _check_contiguity
    raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 355, but started at char 369 (unmatched content: '1.567.202.</i>')
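
For reference, the doubled line breaks in the sample above are what a join on "\n" produces when each line still carries its own line ending. A minimal sketch, assuming the lines were obtained with something like str.splitlines(keepends=True) (the actual pre-processing code is not shown in this thread):

```python
raw = (
    "1490\n"
    "01:58:03,632 --> 01:58:10,590\n"
    "<i>Império Austro-Húngaro:\n"
    "1.567.202.</i>\n"
)

# Assumption: the pre-processing step kept each line's trailing newline.
lines = raw.splitlines(keepends=True)

# Joining on "\n" adds a second newline after every line, which puts a
# blank line inside the subtitle content.
doubled = "\n".join(lines)

# The line endings are already present, so joining on "" preserves the block.
intact = "".join(lines)

print(repr(doubled))
print(intact == raw)
```

If the lines had been stripped of their endings first, joining on "\n" would be the right call; the problem only appears when both the lines and the separator contribute a newline.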

I've been using this library for years, and this error is very rare.

In further investigation, I found that the specific error triggers are:

  • a blank line inside the subtitle content, immediately before the number;
  • a last line starting with a number of 7 or more digits that uses commas or dots as separators.

1
00:00:00,000 --> 00:00:00,916
Anything

123,456,789,123,456

2
00:00:01,000 --> 00:00:02,236
Whatever

123.456.789
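
The two triggers above can be checked for before parsing. The following is only a heuristic sketch: TIMESTAMP_LIKE is a simplified, ASCII-only assumption about what a lenient parser might accept as a timestamp, and find_risky_lines is a hypothetical helper, not part of the srt library:

```python
import re

# Assumption: simplified approximation of a lenient timestamp, i.e. three
# digit groups joined by ',', '.' or ':'. Not the library's own regex.
TIMESTAMP_LIKE = re.compile(r"[0-9]+[,.:][0-9]+[,.:][0-9]+")


def find_risky_lines(text):
    """Return 1-based numbers of lines that follow a blank line and start
    with text that could be mistaken for a timestamp."""
    risky = []
    lines = text.split("\n")
    for i in range(1, len(lines)):
        previous_is_blank = lines[i - 1].strip() == ""
        if previous_is_blank and TIMESTAMP_LIKE.match(lines[i]):
            risky.append(i + 1)
    return risky


sample = (
    "1\n00:00:00,000 --> 00:00:00,916\nAnything\n\n123,456,789,123,456\n\n"
    "2\n00:00:01,000 --> 00:00:02,236\nWhatever\n\n123.456.789\n"
)
print(find_risky_lines(sample))  # [5, 11]
```

A well-formed file, where every blank line is followed by a subtitle index, yields an empty list under this heuristic.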

Since I was in fact feeding an invalid SRT to the library, I'm not sure this needs to be fixed. On the other hand, this library's tolerance of non-strict SRT files is a feature.

Thanks for your time, and feel free to close this.

cdown commented

Thanks for the update. Blank lines are illegal in SRT content, so this is a wontfix. Any attempt to parse an SRT with blank lines in the content is best effort. Any parse_strict or similar method would therefore also reject your incoming SRT block.