cdown/srt

SRT tooling should allow using non-system encodings

Closed this issue · 4 comments

Hi there,

Just a heads up — I just tried using this library on a collection of SRT files I have, and I'm getting an SRTParseError exception on each file. I've attached one of the files here:

srttest.txt

cdown commented

Hey, sorry, I didn't see this until now.

The reason you're getting an SRTParseError is because this file contains a UTF-8 BOM, which is unsupported (srt tools open files with encoding utf-8, not utf-8-sig).

I've been considering adding some way for people to flag other encodings with the command line tools, let me look into it.

cdown commented

(Specifically, it is because argparse.FileType() does not accept an encoding parameter)

cdown commented

I think the most reasonable way to solve this is to set your encoding correctly in your environment, to be honest.

Here is a half-baked attempt at solving the issue with an -e parameter, but doing this for output is a bit more ugly (requiring some copying of code for now):

commit 32063fa63de57a9a7a452a907161ac3636a03067
Author: Chris Down <chris@chrisdown.name>
Date:   Tue Sep 27 00:22:36 2016 +0100

    Add -e/--encoding

diff --git srt_tools/srt-fix-subtitle-indexing srt_tools/srt-fix-subtitle-indexing
index 2c44419..a01d6c0 100755
--- srt_tools/srt-fix-subtitle-indexing
+++ srt_tools/srt-fix-subtitle-indexing
@@ -6,8 +6,8 @@ import srt

 def main():
     args = srt_tools.utils.basic_parser().parse_args()
-    subtitles_in = srt.parse(args.input.read())
-    output = srt.compose(subtitles_in, strict=args.strict)
+    srt_tools.utils.set_basic_args(args)
+    output = srt.compose(args.input, strict=args.strict)
     args.output.write(output)


diff --git srt_tools/utils.py srt_tools/utils.py
index f08a5d6..13602a4 100755
--- srt_tools/utils.py
+++ srt_tools/utils.py
@@ -1,32 +1,39 @@
 #!/usr/bin/env python

 import argparse
+import srt
 import sys
 import logging
 import itertools
+import collections
+
+
+log = logging.getLogger(__name__)


 def basic_parser(multi_input=False):
     parser = argparse.ArgumentParser(description=__doc__)

+    # Cannot use argparse.FileType as we need to know the encoding from the
+    # args
+
     if multi_input:
         parser.add_argument(
             '--input', '-i', metavar='FILE',
-            action='append', type=argparse.FileType('r'),
+            action='append',
             help='the files to process',
             required=True,
         )
     else:
         parser.add_argument(
             '--input', '-i', metavar='FILE',
-            default=sys.stdin, type=argparse.FileType('r'),
+            default=sys.stdin,
             help='the file to process (default: stdin)',
         )

     parser.add_argument(
         '--output', '-o', metavar='FILE',
         default=sys.stdout,
-        type=argparse.FileType('w'),
         help='the file to write to (default: stdout)',
     )
     parser.add_argument(
@@ -40,9 +47,26 @@ def basic_parser(multi_input=False):
         const=logging.DEBUG, default=logging.WARNING,
         help='enable debug logging',
     )
+    parser.add_argument(
+        '--encoding', '-e',
+        help='the encoding to read/write files in (default: '
+             'from your environment)',
+    )
     return parser


+def set_basic_args(args):
+    if isinstance(args.input, collections.MutableSequence):
+        for i, input_filename in enumerate(args.input):
+            with open(input_filename, encoding=args.encoding) as input_f:
+                args.input[i] = srt.parse(input_f.read())
+    elif args.input is sys.stdin:
+        log.warning('-e/--encoding has no effect on %s', stream)
+    else:
+        with open(args.input, encoding=args.encoding) as input_f:
+            args.input = srt.parse(input_f.read())
+
+
 def sliding_window(seq, width=2):
     seq_iter = iter(seq)
     sliced = tuple(itertools.islice(seq_iter, width))
cdown commented

Added in dea16c2. Confirmed it works with your test input:

srt encoding % PYTHONPATH=. srt_tools/srt-fixed-timeshift --seconds 5 -i ~/Downloads/srttest.txt -o /tmp/srttest-out.srt -e utf-8-sig
srt encoding % diff -u ~/Downloads/srttest.txt /tmp/srttest-out.srt | head -20                                                       
--- /home/cdown/Downloads/srttest.txt	2016-12-25 09:44:45.677276478 +0000
+++ /tmp/srttest-out.srt	2016-12-25 09:44:55.577107368 +0000
@@ -1,169 +1,169 @@
 1
-00:00:04,500 --> 00:00:07,250
+00:00:09,500 --> 00:00:12,250
 Hello, my name is Yorgui Loeffler, and I'm in Quebec, dang it!
 
 2
-00:00:07,670 --> 00:00:10,930
+00:00:12,670 --> 00:00:15,930
 In this lesson, I will be showing you a few things that are important to me.
 
 3
-00:00:13,720 --> 00:00:15,870
+00:00:18,720 --> 00:00:20,870
 Always scratch with a pick!
 
 4
-00:00:45,200 --> 00:00:49,290

Merry Christmas! :-)