Extract text from caption
Closed this issue · 3 comments
ksindi commented
Is there a way to only extract the text from caption?
Example:
caps = u'''1
00:00:01,500 --> 00:00:12,345
Small caption'''
reader = SRTReader()
reader.some_method(caps) # Small caption
vladiibine commented
There's a way, but it's not that simple.
If you don't know the format of the captions, you can do this (it also works if you do know the format):
from pycaption import detect_format
my_text = <your unicode string containing the caption>
reader = detect_format(my_text)
caption_set = reader().read(my_text)
raw_content = [[node.content for node in caption.nodes if node.type_ != node.STYLE] for caption in caption_set.get_captions('en-US')]
# At this point we've got a list of "rows".
# Each "row" is a list of text strings and None values. A None comes from a line break node, so a new line belongs there.
# From here you can do this:
text_rows = [u''.join(content for content in row if content is not None) for row in raw_content] # if you don't care about the line breaks
# Or this:
text_rows = [u''.join([content if content is not None else u'\n' for content in row]) for row in raw_content] # if you DO care about the line breaks
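If the one-liners are hard to read, the same joining step can be wrapped in a small helper. This is only a sketch: `join_rows`, `keep_line_breaks` and `sample_rows` are names made up for illustration, not part of pycaption. It assumes each row holds only text strings and None values, as described above. The sample rows are written by hand so the snippet runs without pycaption installed.

```python
def join_rows(raw_content, keep_line_breaks=True):
    """Build one string per caption from rows of text/None values.

    A None entry is treated as a line break: it becomes a newline when
    keep_line_breaks is True and is dropped otherwise.
    """
    separator = u'\n' if keep_line_breaks else u''
    return [
        u''.join(separator if content is None else content for content in row)
        for row in raw_content
    ]


# Hand-written stand-in for raw_content (hypothetical data, not pycaption output)
sample_rows = [
    [u'Small caption'],
    [u'First line', None, u'second line'],
]

print(join_rows(sample_rows))
print(join_rows(sample_rows, keep_line_breaks=False))
```

With real captions you would pass raw_content from the snippet above instead of sample_rows.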
Hope you like list comprehensions :P
But anyway, if you can't figure this out, just post the part that's giving you trouble here.
ksindi commented
Thanks. The above worked. I only had to make the following modification:
caption_set = reader().read(<your unicode string containing the caption>)
vladiibine commented
Yeah... that. I corrected it eventually, but good thing you caught it.