pbs/pycaption

Invalid characters in WebVTT Text

Closed · 1 comment

According to the WebVTT specification, "A WebVTT cue text span consists of one or more characters other than U+000A LINE FEED (LF) characters, U+000D CARRIAGE RETURN (CR) characters, U+0026 AMPERSAND characters (&), and U+003C LESS-THAN SIGN characters (<)."
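To make the quoted rule concrete, here is a minimal sketch of a check for a single cue text span. The names `FORBIDDEN_IN_SPAN` and `is_valid_cue_text_span` are made up for illustration and are not part of pycaption. The sketch covers only the sentence quoted above, not the rest of the cue text grammar (tags, character references, and so on).

```python
# Characters that the quoted rule excludes from a WebVTT cue text span.
FORBIDDEN_IN_SPAN = frozenset({"\n", "\r", "&", "<"})


def is_valid_cue_text_span(text: str) -> bool:
    """Return True if `text` is non-empty and contains no forbidden character.

    Hypothetical helper: it checks only the quoted sentence of the
    specification, nothing else.
    """
    return bool(text) and not any(ch in FORBIDDEN_IN_SPAN for ch in text)
```

Note that, read literally, the quoted sentence forbids `<` and `&` but does not list `>`.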

Am I incorrect in understanding that unless the span in question is one of a very limited set of elements (class, italics, bold, underline, ruby, voice, language, or timestamp), characters such as < and > should be escaped to &lt; and &gt;, respectively? The WebVTT sample used in testing currently does not escape them, and the JavaScript WebVTT parser throws an error when the < character is used like this.

Fixed in 6586819. There is still a sample with the illegal < character, but we chose to keep it to ensure that, although the WebVTTWriter never outputs it, the WebVTTReader can still read it without failing. We allow it on read because many players also do (e.g. Chrome, Firefox and Safari), even though the specification does indeed not allow it.
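The "strict on write, lenient on read" behaviour described above could look roughly like the sketch below. This is not the code from the commit: `escape_cue_text` and `unescape_cue_text` are hypothetical helpers, and they assume the input is plain cue text with no tags (running `escape_cue_text` over markup such as `<i>` would escape the tag itself).

```python
def escape_cue_text(text: str) -> str:
    """Escape plain cue text for output (hypothetical writer-side helper).

    The ampersand is replaced first so that the entities produced by the
    later replacements are not escaped a second time.
    """
    return (
        text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


def unescape_cue_text(text: str) -> str:
    """Decode the three entities above (hypothetical reader-side helper).

    A raw '<' that was never escaped is passed through unchanged rather
    than rejected, which is the lenient part. '&amp;' is decoded last so
    that '&amp;lt;' becomes the literal text '&lt;' and not '<'.
    """
    return (
        text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
    )
```

With these helpers, reading a non-conforming cue such as `a < b` and writing it back out would produce `a &lt; b`, so the illegal character is accepted on input but never emitted.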