alexandercerutti/sub37

Improve adapters returned interface

alexandercerutti opened this issue · 2 comments

As per the first version of sub37, the whole architecture expects Adapters always to return a ParseResult, containing data and perhaps some ParseError.

The idea was to have ParseResult and ParseError exposed by the BaseAdapter so that every adapter could use them without importing anything else and without making @sub37/server export other material.

However, the current situation turns out to be somewhat inconsistent:

  1. BaseAdapter/index.ts exports BaseAdapter (default), ParseResult, and ParseError classes. We should technically move Parse* to a different file;

  2. BaseAdapter exposes ParseResult as an instance property, but not ParseError;

  3. Related to the points above, tests and the WebVTT adapter use ParseResult from that module, while BaseAdapter should have been the entry point to access it.
    This happens because BaseAdapter.ParseResult is actually a method that returns a new ParseResult and therefore cannot be used as the right-hand operand of an instanceof comparison: we must use instanceof ParseResult, while we should be able to use instanceof BaseAdapter.ParseResult;

  4. BaseAdapter exposes both ParseResult and ParseError types by using a namespace;

  5. The ParseError class is exported from the BaseAdapter/index.ts module, but it is not used as a class (this is a consequence of point 3), so it is a useless class.
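
One possible direction for points 2, 3, and 5 is to expose the constructors themselves as static members of BaseAdapter, so that they can be used as the right-hand operand of instanceof. This is only a minimal sketch, not sub37's actual code: the fields of ParseResult and ParseError, the parse signature, and DummyAdapter are assumptions made for illustration.

```typescript
// Hypothetical shapes: the real sub37 classes may carry different fields.
class ParseError {
  readonly error: Error;
  readonly isCritical: boolean;

  constructor(error: Error, isCritical: boolean) {
    this.error = error;
    this.isCritical = isCritical;
  }
}

class ParseResult<T> {
  readonly data: T[];
  readonly errors: ParseError[];

  constructor(data: T[], errors: ParseError[] = []) {
    this.data = data;
    this.errors = errors;
  }
}

abstract class BaseAdapter {
  // Exposing the constructors (instead of factory methods) keeps
  // `instanceof BaseAdapter.ParseResult` usable by adapters and tests.
  static readonly ParseResult = ParseResult;
  static readonly ParseError = ParseError;

  abstract parse(rawContent: string): ParseResult<string>;
}

// Hypothetical adapter: one non-empty line stands for one cue.
class DummyAdapter extends BaseAdapter {
  parse(rawContent: string): ParseResult<string> {
    const lines = rawContent.split("\n").filter((line) => line.length > 0);
    return new BaseAdapter.ParseResult(lines);
  }
}

const result = new DummyAdapter().parse("Hello\nWorld");
const isParseResult = result instanceof BaseAdapter.ParseResult;
```

With this layout an adapter only needs to import BaseAdapter, which matches the original intent of not exporting extra material from @sub37/server.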

Fixing the whole situation is quite a breaking change, so it should be done in the next major.

Looking at this again over a year later, while working on the TTML Adapter, made me realize that adapters could potentially take a long time to parse a track and return all the Cues.

I'm working on a MacBook Pro M1 machine, which is quite fast using Chrome (a TTML track of The Office 1x1 from Netflix, which is quite verbose, gets compiled within 130-180ms).

I acknowledge this "benchmark" is not reliable at all and, in fact, testing on some Intel machines showed that synchronous parsing could take over 500ms, which becomes quite critical as the browser's main thread is blocked for that long.

Safari on iOS seems to take less than 50ms, which is incredible to me.

Since this package is also meant to run on televisions (CTVs), I expect much higher parsing times on such low-end devices (yes, they have very limited hardware).

However, as of today, sub37 has no mechanism to address this. An early idea was to allow async tracks. Still, postponing the parsing to either microtasks or "macrotasks" (timers) doesn't necessarily prevent blocking the main thread, since the parsing itself would still run as a single uninterrupted block.

To improve this aspect, I think a "parsing job" should be chunked and streamed.

This means introducing a concept of "resumability" of the work (whether it is stateless or stateful on the adapter's side).

Technically, this could be achieved with generators and yield, but since CTV browsers (in my experience) might still run an incomplete ES6 implementation or plain ES5, yield is not an option.

Even polyfilling generators might not be a good choice, as it could reduce performance (tests to be done), according to some comments in Shaka-Player (I read them once, but I don't remember where they are).
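
As a rough illustration of stateful resumability without generators, a parsing job could be a plain object that keeps a cursor between calls. Everything here is hypothetical (the Cue shape, ResumableParser, and its next method), and splitting by line is just a stand-in for real parsing work.

```typescript
// Hypothetical cue shape, only for illustration.
interface Cue {
  content: string;
}

interface ParseStep {
  cues: Cue[];
  done: boolean;
}

// A stateful, resumable job: each call to `next` continues from where the
// previous call stopped. No generators or `yield` are involved, so the
// pattern can be transpiled to ES5 without a generator runtime.
class ResumableParser {
  private cursor = 0;
  private readonly blocks: string[];

  constructor(rawContent: string) {
    // Stand-in for real tokenization: one non-empty line equals one cue.
    this.blocks = rawContent
      .split("\n")
      .filter((line) => line.trim().length > 0);
  }

  next(maxCues: number): ParseStep {
    const end = Math.min(this.cursor + maxCues, this.blocks.length);
    const cues: Cue[] = [];

    for (let i = this.cursor; i < end; i++) {
      cues.push({ content: this.blocks[i] });
    }

    this.cursor = end;
    return { cues, done: this.cursor >= this.blocks.length };
  }
}

const parser = new ResumableParser("first\nsecond\nthird");
const firstStep = parser.next(2); // two cues, done: false
const secondStep = parser.next(2); // one cue, done: true
```

A stateless variant would instead return the cursor to the caller and accept it back as an argument on the next call.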

So, creating a streaming system would mean that the returned structure should allow @sub37/server to subscribe to it, like an observable, in order to receive new cues without considering time as an important factor.
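
A minimal sketch of such a subscribable structure could look like the following. CueStream, subscribe, push, and complete are hypothetical names and not part of the current sub37 API; the Cue shape is also an assumption.

```typescript
// Hypothetical cue shape, only for illustration.
interface Cue {
  content: string;
}

type ChunkListener = (cues: Cue[]) => void;
type DoneListener = () => void;

class CueStream {
  private chunkListeners: ChunkListener[] = [];
  private doneListeners: DoneListener[] = [];

  // Registers listeners and returns a function that removes them again.
  subscribe(onChunk: ChunkListener, onDone?: DoneListener): () => void {
    this.chunkListeners.push(onChunk);
    if (onDone) {
      this.doneListeners.push(onDone);
    }

    return () => {
      this.chunkListeners = this.chunkListeners.filter((l) => l !== onChunk);
      this.doneListeners = this.doneListeners.filter((l) => l !== onDone);
    };
  }

  // Called by the adapter whenever a chunk of cues is ready.
  push(cues: Cue[]): void {
    for (const listener of this.chunkListeners) {
      listener(cues);
    }
  }

  // Called by the adapter once the whole track has been parsed.
  complete(): void {
    for (const listener of this.doneListeners) {
      listener();
    }
  }
}

// Server side: collect cues as they arrive, whenever that happens.
const received: Cue[] = [];
let completions = 0;
const stream = new CueStream();

const unsubscribe = stream.subscribe(
  (cues) => {
    received.push(...cues);
  },
  () => {
    completions += 1;
  },
);

stream.push([{ content: "a" }, { content: "b" }]);
stream.push([{ content: "c" }]);
stream.complete();
unsubscribe();
stream.push([{ content: "not delivered" }]);
```

The server never asks how long a chunk took to produce: it only reacts to whatever the adapter pushes.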

In order to create job chunks, we should measure how many milliseconds it takes for a cue to get fully parsed, and then establish how many cues we can ship in a chunk and, hence, how many cues we can parse within a pre-established amount of time.

This "pre-established" amount could perhaps be the 16ms of an animation frame or whatever time we want. We could even say "100ms" (a completely arbitrary value, but within 100ms users perceive things as immediate).
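
The arithmetic behind this could be sketched as follows. The function names, the sampling approach, and the injectable clock are assumptions for illustration; Date.now has a coarse resolution, so a more precise timer might be preferable where available.

```typescript
// Returns how many cues fit into one chunk, given a measured cost per cue.
function cuesPerChunk(budgetMs: number, msPerCue: number): number {
  if (msPerCue <= 0) {
    // Cost too small to be measured: chunking is effectively unnecessary.
    return Infinity;
  }

  // Always ship at least one cue, otherwise the job would never progress.
  return Math.max(1, Math.floor(budgetMs / msPerCue));
}

// Measures the average cost (in ms) of parsing one cue over a small sample.
// `parseOne` is a hypothetical callback that parses the cue at `index`.
function measureMsPerCue(
  parseOne: (index: number) => void,
  sampleSize: number,
  now: () => number = Date.now,
): number {
  const start = now();

  for (let i = 0; i < sampleSize; i++) {
    parseOne(i);
  }

  return (now() - start) / sampleSize;
}

const FRAME_BUDGET_MS = 16;

// With a measured cost of 0.5ms per cue, 32 cues fit in a 16ms budget.
const chunkSize = cuesPerChunk(FRAME_BUDGET_MS, 0.5);
```

On a slow device where a single cue costs more than the whole budget, the chunk size falls back to one cue, so the budget is exceeded but the job still makes progress.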

This chunking and streaming system could also partially open the door to streaming track data over the network for live videos, where the text is generated live. Today this has some limitations in terms of requirements: added chunks are still required to have a header, which makes no sense to me (and I created it).

Another limitation of the current interface could be the impossibility of emitting document-related details.

Each track (TTML's in particular, but WebVTT could have one too) has what is called a "document", which is a representation of the root of everything contained in it.

Something that never came up until developing the TTML adapter is that there could be some "global" attributes that belong to the document and cannot strictly be assigned to single Cues (which is the case of "global styles" for WebVTT).

An example is tts:extent, which defines the width and the height of the rendering area and which, in our case, should change how the renderer behaves.

Future subtitle formats could have something like this, so it might make sense to let @sub37/server handle this different kind of data.
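
One way to picture this is a tagged union of events, where document-level details travel separately from cues. This is only a sketch under assumptions: AdapterEvent, DocumentDetails, RendererState, and applyEvent are hypothetical names, and the extent values are kept as raw strings instead of being parsed.

```typescript
// Hypothetical cue shape, only for illustration.
interface Cue {
  content: string;
}

interface Extent {
  width: string;
  height: string;
}

// Hypothetical document-level details; `extent` mirrors TTML's tts:extent.
interface DocumentDetails {
  extent?: Extent;
}

// An adapter could emit either document-level details or a chunk of cues.
type AdapterEvent =
  | { type: "document"; details: DocumentDetails }
  | { type: "cues"; cues: Cue[] };

interface RendererState {
  extent: Extent | null;
  cues: Cue[];
}

// Server side: route each kind of event to the part of the state it affects.
function applyEvent(state: RendererState, event: AdapterEvent): RendererState {
  if (event.type === "document") {
    return { extent: event.details.extent ?? state.extent, cues: state.cues };
  }

  return { extent: state.extent, cues: state.cues.concat(event.cues) };
}

const events: AdapterEvent[] = [
  { type: "document", details: { extent: { width: "640px", height: "480px" } } },
  { type: "cues", cues: [{ content: "Hello" }] },
  { type: "cues", cues: [{ content: "World" }] },
];

const initialState: RendererState = { extent: null, cues: [] };
const finalState = events.reduce<RendererState>(applyEvent, initialState);
```

Because the union is tagged, a new kind of document-level data could be added later as another variant without touching how cues are delivered.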