Parsing SRT data with leading whitespace

Question

senpos opened this issue 5 years ago · 6 comments

Hi,

Platform: Windows 7 64-bit
Python version: 3.7.3 64-bit
Library version: 1.11.0

I am trying to parse this file (./demo.srt) with the following code:

import srt

with open(r"demo.srt") as fd:
    subs = srt.parse(fd)
    for line in subs:
        print(line)

I receive the following error:

D:\test_srt>d:/test_srt/venv/Scripts/activate.bat

(venv) D:\test_srt>d:/test_srt/venv/Scripts/python.exe d:/test_srt/demo.py
Traceback (most recent call last):
  File "d:/test_srt/demo.py", line 5, in <module>
    for line in subs:
  File "d:\test_srt\venv\lib\site-packages\srt.py", line 341, in parse
    _raise_if_not_contiguous(srt, expected_start, actual_start)
  File "d:\test_srt\venv\lib\site-packages\srt.py", line 377, in _raise_if_not_contiguous
    raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 0, but started at char 2 (unmatched content: '\n\n')

pysrt handles this file without any problems, and POEdit works with it as well.
So I guess the SRT file is valid.

I would be grateful if you could take a look at this. Thanks for your work.

Answer 1 · 2019-07-25T06:25:56.000Z

It looks like everything works if I remove those newlines at the beginning of the file.
Shouldn't they be handled automatically?
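
For anyone stuck on a version of srt without built-in handling, the manual fix described above can be done in code before parsing. This is only a minimal sketch: `strip_leading_whitespace` is a hypothetical helper name, and the sample text is made up to mirror the two leading newlines from the error message.

```python
def strip_leading_whitespace(text):
    """Drop blank lines and other whitespace before the first subtitle block.

    Hypothetical helper, not part of the srt library.
    """
    return text.lstrip()


# Made-up sample that starts with two newlines, like the reported file.
sample = "\n\n1\n00:00:01,000 --> 00:00:02,000\nHello\n"

cleaned = strip_leading_whitespace(sample)

# The cleaned string now starts at the first subtitle index and can be
# handed to srt.parse(), which accepts a string.
print(repr(cleaned[:2]))
```

Only the start of the text is touched, so the blank lines that separate subtitle blocks stay intact.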

Answer 2 · 2019-07-25T13:03:19.000Z

There's no formal SRT spec, so there's no real definition of what's valid or not. I've never seen this particular case, so srt never learned to handle it :-)

However, it should be easy enough to add functionality to deal with it. Thanks for the report!

Answer 3 · 2019-07-25T13:33:24.000Z

0d7e5fd adds support for this. I'll merge once CI passes.

Answer 4 · 2020-12-13T16:22:40.000Z

Platform: Windows 7 32bit
Python Version: 3.8.2
Library Version: srt 3.4.1

It is still showing this error:
raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 0, but started at char 3 (unmatched content: 'ï»¿')

Can you help me with it?

Answer 5 · 2020-12-14T11:49:53.000Z

@JafarAbbas33 If you have a new issue, please open a new issue instead of commandeering an old one. However, ï»¿ is how a UTF-8 BOM appears when its bytes are decoded as ISO-8859-1. You need to read the file with the right encoding.
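
To see where those three characters come from, here is a small standard-library sketch. The subtitle bytes are made up for illustration; only the BOM itself matches the error above.

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Decoded as ISO-8859-1, each of those bytes becomes its own character.
as_latin1 = bom.decode("iso-8859-1")

# Made-up SRT bytes, as saved by an editor that writes a BOM.
raw = bom + b"1\n00:00:01,000 --> 00:00:02,000\nHello\n"

wrong = raw.decode("iso-8859-1")  # BOM survives as three stray characters
right = raw.decode("utf-8-sig")   # BOM is recognised and dropped

print(repr(wrong[:4]))
print(repr(right[:2]))
```

With the wrong codec the text no longer begins with the subtitle index, so the parser reports unmatched content at char 0.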

Answer 6 · 2020-12-15T13:19:05.000Z

Yes, you are right. Sorry. By the way, anyone with the same problem can use something like:
with open(fname, encoding='utf-8-sig') as f:
    subs = list(srt.parse(f))  # utf-8-sig drops the BOM before srt sees the text