cdown/srt

Parsing SRT data with leading whitespace

senpos opened this issue · 6 comments

Hi,

Platform: Windows 7 64-bit
Python version: 3.7.3 64-bit
Library version: 1.11.0

I am trying to parse this file (./demo.srt) with the following code:

import srt

with open(r"demo.srt") as fd:
    subs = srt.parse(fd)
    for line in subs:
        print(line)

I receive the following error:

D:\test_srt>d:/test_srt/venv/Scripts/activate.bat

(venv) D:\test_srt>d:/test_srt/venv/Scripts/python.exe d:/test_srt/demo.py
Traceback (most recent call last):
  File "d:/test_srt/demo.py", line 5, in <module>
    for line in subs:
  File "d:\test_srt\venv\lib\site-packages\srt.py", line 341, in parse
    _raise_if_not_contiguous(srt, expected_start, actual_start)
  File "d:\test_srt\venv\lib\site-packages\srt.py", line 377, in _raise_if_not_contiguous
    raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 0, but started at char 2 (unmatched content: '\n\n')

pysrt handles this file without any problems, and POEdit works with it as well.
So I guess the file is valid SRT.

Would be thankful if you take a look at this and thanks for your work.

It looks like everything works if I remove the newlines at the beginning of the file.
Shouldn't they be handled automatically?
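
For anyone stuck on a version without the fix, a minimal workaround sketch: read the whole file into a string and strip the leading whitespace before parsing. The helper name `strip_leading_whitespace` and the sample subtitle text are made up for illustration and are not part of srt.

```python
def strip_leading_whitespace(text):
    """Drop blank lines and whitespace that precede the first subtitle block."""
    return text.lstrip()


# Hypothetical sample mimicking a file that starts with two empty lines.
raw = "\n\n1\n00:00:01,000 --> 00:00:02,000\nHello\n\n"
cleaned = strip_leading_whitespace(raw)

# The cleaned text now begins directly with the first subtitle index.
print(repr(cleaned[:2]))
```

The cleaned string can then be handed to `srt.parse` in place of the file object, assuming the rest of the file is well formed.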

cdown commented

There's no formal SRT spec, so there's no real definition of what's valid or not. I've never seen this particular case, so srt never learned to handle it :-)

However, it should be easy enough to add functionality to deal with it. Thanks for the report!

cdown commented

0d7e5fd adds support for this. I'll merge once CI passes.

Platform: Windows 7 32bit
Python Version: 3.8.2
Library Version: srt 3.4.1

It is still showing this error:
raise SRTParseError(expected_start, actual_start, unmatched_content)
srt.SRTParseError: Expected contiguous start of match or end of input at char 0, but started at char 3 (unmatched content: 'ï»¿')

Can you help me with it?

cdown commented

@JafarAbbas33 If you have a new issue, please open a new issue instead of commandeering an old one. However, ï»¿ is a UTF-8 BOM in ISO-8859-1. You need to read the file with the right encoding.

Yes, you are right. Sorry. By the way, anyone with the same problem can use something like:
with open(fname, encoding='utf-8-sig') as f:
    subs = list(srt.parse(f))
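
To see why the encoding matters, here is a small standard-library sketch (no srt needed) that decodes the same bytes three ways. The sample subtitle bytes are invented for illustration.

```python
import codecs

# Hypothetical file contents: a UTF-8 BOM followed by one subtitle block.
data = codecs.BOM_UTF8 + "1\n00:00:01,000 --> 00:00:02,000\nHello\n".encode("utf-8")

# Decoding with a single-byte encoding turns the 3 BOM bytes into 3 junk characters.
wrong = data.decode("iso-8859-1")

# Plain utf-8 keeps the BOM as a single U+FEFF character at the start.
plain = data.decode("utf-8")

# utf-8-sig strips the BOM, so the text starts at the subtitle index.
right = data.decode("utf-8-sig")

print(repr(wrong[:3]), repr(plain[0]), repr(right[0]))
```

The three junk characters from the wrong decoding match the "started at char 3" in the error above. Passing `encoding='utf-8-sig'` to `open` applies the same BOM stripping when reading from disk, and it also reads UTF-8 files without a BOM correctly.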