pbs/pycaption

Extract text from caption

Is there a way to extract only the text from a caption?

Example:

caps = u'''1
00:00:01,500 --> 00:00:12,345
Small caption'''

reader = SRTReader()
reader.some_method(caps)  # hypothetical method; desired result: u'Small caption'

There's a way, but it's not that simple.

If you don't know the format of the captions, you can do this (it also works if you do know the format):

from pycaption import detect_format
my_text = <your unicode string containing the caption>

# detect_format returns a reader class (not an instance), or None if the format is not recognized
reader = detect_format(my_text)
caption_set = reader().read(my_text)

raw_content = [[node.content for node in caption.nodes if node.type_ != node.STYLE] for caption in caption_set.get_captions('en-US')]
# At this point we've got a list of "rows", one per caption.
# Each "row" is a list of text strings and None values; a None means we're supposed to have a new line there.

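# From here you can do this, if you don't care about the line breaks
# (the None entries have to be skipped, because join() only accepts strings):
text_rows = [u''.join(content for content in row if content is not None) for row in raw_content]

# Or this, if you DO care about the line breaks
# (testing for None instead of `unicode` also works on Python 3):
text_rows = [u''.join(u'\n' if content is None else content for content in row) for row in raw_content]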
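
If the comprehensions above are hard to read, the same idea can be written as a small loop that looks at each node's type instead of at its content. This is only a rough sketch: `Node`, the `TEXT`/`STYLE`/`BREAK` values, `caption_text` and its `line_break` parameter are stand-ins made up for illustration, not pycaption's API, so it runs without the library installed.

```python
from collections import namedtuple

# Stand-in for a pycaption caption node (assumption: only the two
# attributes used in this thread, `type_` and `content`, are modelled).
Node = namedtuple("Node", ["type_", "content"])
TEXT, STYLE, BREAK = 1, 2, 3  # illustrative values, not read from pycaption


def caption_text(nodes, line_break="\n"):
    """Join the text nodes of one caption, putting `line_break` at each break."""
    parts = []
    for node in nodes:
        if node.type_ == TEXT:
            parts.append(node.content)
        elif node.type_ == BREAK:
            parts.append(line_break)
        # STYLE nodes carry no text of their own, so they are skipped.
    return "".join(parts)


# One caption made of a style node, two text nodes and a break between them.
caption = [
    Node(STYLE, {"italics": True}),
    Node(TEXT, "Small"),
    Node(BREAK, None),
    Node(TEXT, "caption"),
]

print(repr(caption_text(caption)))                  # breaks kept as newlines
print(repr(caption_text(caption, line_break=" ")))  # breaks flattened to spaces
```

With real captions you would presumably pass `caption.nodes` for each caption from `caption_set.get_captions('en-US')` and compare `node.type_` against the constants on the node itself, assuming it exposes `TEXT` and `BREAK` alongside the `STYLE` one used above. Passing `line_break=" "` avoids gluing the last word of one line to the first word of the next, which is what an empty separator does.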

Hope you like list comprehensions :P
But anyway, if you can't figure this out, just post the part that's giving you trouble here.

Thanks. The above worked. I only had to make the following modification:

caption_set = reader().read(<your unicode string containing the caption>)

Yeah... that. I corrected it eventually, but good thing you caught it.