w3c/media-and-entertainment

Frame accurate seeking of HTML5 MediaElement

Opened this issue · 90 comments

I've heard a couple of companies point out that one of the problems that makes it hard (at least harder than it could be) to do post-production of videos in Web browsers is that there is no easy way to process media elements on a frame-by-frame basis, whereas that is the usual default in Non-Linear Editors (NLEs).

The currentTime property takes a time, not a frame number or an SMPTE timecode. Converting between times and frame numbers is doable, but it assumes one knows the framerate of the video, which is not exposed to Web applications (a generic NLE would thus not know about it). Plus that framerate may actually vary over time.

Also, internal rounding of time values may mean that one seeks to the end of the previous frame instead of the beginning of a specific video frame.

Digging around, I've found a number of discussions and issues around the topic, most notably:

  1. A long thread from 2011 on Frame accuracy / SMPTE, which led to improvements in the precision of seeks in browser implementations:
    https://lists.w3.org/Archives/Public/public-whatwg-archive/2011Jan/0120.html
  2. A list of use cases from 2012 for seeking to specific frames. Not sure if these use cases remain relevant today:
    https://www.w3.org/Bugs/Public/show_bug.cgi?id=22678
  3. A question from 2013 on whether there was interest to expose "versions of currentTime, fastSeek(), duration, and the TimeRanges accessors, in frames, for video data":
    https://www.w3.org/Bugs/Public/show_bug.cgi?id=8278#c3
  4. A proposal from 2016 to add a rational time value for seek() to solve rounding issues (still open as of June 2018):
    whatwg/html#609

There have probably been other discussions around the topic.

I'm raising this issue to collect practical use cases and requirements for the feature, and gauge interest from media companies to see a solution emerge. It would be good to precisely identify what does not work today, what minimal updates to media elements could solve the issue, and what these updates would imply from an implementation perspective.

There have probably been other discussions around the topic.

Yes. Similar discussions happened during the MSE project: https://www.w3.org/Bugs/Public/show_bug.cgi?id=19676

There's some interesting research here, with a survey of current browser behaviour.

The current lack of frame accuracy effectively closes off entire fields of possibilities from the web, such as non-linear video editing, but it also has unfortunate effects on things as simple as subtitle rendering.

I should also mention that there is some uncertainty about the precise meaning of currentTime - particularly when you have a media pipeline where the frame/sample coming out of the end may be 0.5s further along the media timeline than the ones entering the media pipeline. Some people think currentTime reflects what is coming out of the display/speakers/headphones. Some people think it should reflect the time where video and graphics are composited, as this is easy to test and suits apps trying to sync graphics to video or audio. Simple implementations may re-use a time available in a media decoder.

Daiz commented

what minimal updates to media elements could solve the issue

Related to the matter of frame accuracy on the whole, one idea would be to add a new property to VideoElement called .currentFrameTime which would hold the presentation time value of the currently displayed frame. As mentioned in the research repository of mine (also linked above), .currentTime is not actually sufficient right now in any browser for determining the currently displayed frame even if you know the exact framerate of the video. .currentFrameTime could at least solve this particular issue, and could also be used for monitoring the exact screen refreshes when displayed frames change.
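A minimal sketch of how such a property might be consumed, assuming it existed — `.currentFrameTime` is the hypothetical name proposed above, not something any browser implements:

```js
// Hypothetical: video.currentFrameTime is assumed to hold the presentation
// time of the frame currently on screen (not a real property today).
const video = document.querySelector('video');
let lastFrameTime = null;

function watchFrames() {
  if (video.currentFrameTime !== lastFrameTime) {
    lastFrameTime = video.currentFrameTime;
    // The displayed frame changed on (or just before) this repaint:
    // update subtitles, overlays, background colour, etc. here.
  }
  requestAnimationFrame(watchFrames);
}
requestAnimationFrame(watchFrames);
```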

Related to the matter of frame accuracy on the whole, one idea would be to add a new property to VideoElement called .currentFrameTime which would hold the presentation time value of the currently displayed frame.

The currently displayed frame can be hard to determine, e.g. if the UA is running on a device without a display with video being output over HDMI or (perhaps) a remote playback scenario ( https://w3c.github.io/remote-playback/ ).

Remote playback cases are always going to be best effort to keep the video element in sync with the remote playback state. For video editing use cases, remote playback is not as relevant (except maybe to render the final output).

There are a number of implementation constraints that are going to make it challenging to provide a completely accurate instantaneous frame number or presentation timestamp in a modern browser during video playback.

  • The JS event loop will run in a different thread than the one painting pixels on the screen. There will be buffering and jitter in the intermediate thread hops.
  • The event loop often runs at a different frequency than the underlying video, so frames will span multiple loops.
  • Video is often decoded, painted, and composited asynchronously in hardware or software outside of the browser. There may not be frame-accurate feedback on the exact paint time of a frame.

Some estimates could be made based on knowing the latency of the downstream pipeline. It might be more useful to surface the last presentation timestamp submitted to the renderer and the estimated latency until frame paint.

It may also be more feasible to surface the final presentation timestamp/time code when a seek is completed. That seems more useful from a video editing use case.

Understanding the use cases here and what exactly you need to know would help guide concrete feedback from browsers.

Daiz commented

One of the main use cases for me would be the ability to synchronize content changes outside video to frame changes in the video. As a simple example, the test case in the frame-accurate-ish repo shows this with the background color change. In my case the main thing would be the ability to accurately synchronize custom subtitle rendering with frame changes. Being even one or two screen refreshes off becomes a noticeable issue when you want to ensure subtitles appear/disappear with scene changes - even a frame or two of subtitles hanging on the screen after a scene change happens is very noticeable and ugly to look at during playback.

It depends on the inputs to the custom subtitle rendering algorithm. How do you determine when to render a text cue?

Daiz commented

Currently, I'm using video.currentTime and doing calculations based on the frame rate to try to have cues appear/disappear when the displayed frame changes (which is the behavior I want to achieve). As mentioned before, this is not sufficient for frame-accurate rendering even if you know the exact frame rate of the video. There are ways to improve the accuracy with some non-standard properties (like video.mozPaintedFrames in Firefox), but even then the results aren't perfect.
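For reference, a sketch of that kind of calculation, assuming a constant frame rate known out of band; as noted, internal rounding of currentTime means the result can still be off by one frame in current browsers:

```js
// Assumes a constant frame rate known out of band, e.g. 23.976 = 24000/1001.
const FPS_NUM = 24000;
const FPS_DEN = 1001;

function estimatedFrameNumber(video) {
  // A small epsilon guards against currentTime landing a hair before a frame
  // boundary, but cannot fully compensate for browser-internal rounding.
  return Math.floor((video.currentTime * FPS_NUM) / FPS_DEN + 1e-6);
}
```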

It depends on the inputs to the custom subtitle rendering algorithm. How do you determine when to render a text cue?

Perhaps @palemieux could comment on how the imsc.js library handles this?

One of the main use cases for me would be the ability to synchronize content changes outside video to frame changes in the video. As a simple example, the test case in the frame-accurate-ish repo shows this with the background color change. In my case the main thing would be the ability to accurately synchronize custom subtitle rendering with frame changes. Being even one or two screen refreshes off becomes a noticeable issue when you want to ensure subtitles appear/disappear with scene changes - even a frame or two of subtitles hanging on the screen after a scene change happens is very noticeable and ugly to look at during playback.

This highlights the importance of being clear what currentTime means as hardware-based implementations or devices outputting via HDMI may have several frames difference between the media time of the frame being output from the display and the frame being composited with graphics.

With the timingsrc [1] library we are able to sync content changes outside the video with errors <10ms (less than a frame).

The library achieves this by

  1. using an interpolated clock approximating currentTime (timingobject)
  2. synchronizing video (mediasync) relative to a timing object (errors about 7ms)
  3. synchronizing javascript cues (sequencer - based on setTimeout) relative to the same timing object (errors about 1ms)

This still leaves delays from DOM changes to on-screen rendering.

In any case, this should typically be sub-framerate sync.

This assumes that currentTime is a good representation of the reality of video presentation. If it isn't, but you know how wrong it is, you can easily compensate.

Not sure if this is relevant to the original issue, which I understood to be about accurate frame stepping - not sync during playback?

Ingar Arntzen

[1] https://webtiming.github.io/timingsrc/

how the imsc.js library handles this

@jpiesing I can't speak for @palemieux obviously but my understanding is that imsc.js does not play back video and therefore does not do any alignment; it merely identifies the times at which the presentation should change.

However it is integrated into the dash.js player which does need to synchronise the subtitle presentation with the media. I believe it uses Text Track Cues, and from what I've seen they can be up to 250ms late depending on when the Time Marches On algorithm happens to be run, which can be as infrequent as every 250ms, and in my experience often is.

As @Daiz points out, that's not nearly accurate enough.

What @nigelmegitt said :)

What is needed is a means of displaying/hiding HTML (or TTML) snippets at precise offsets on the media timeline.

What is needed is a means of displaying/hiding HTML (or TTML) snippets at precise offsets on the media timeline.

@palemieux this is exactly what I described above.

The sequencer of the timingsrc library does this. It may be used with any data, including HTML or TTML.

Not sure if this is relevant to the original issue, which I understood to be about accurate frame stepping - not sync during playback?

@ingararntzen It is a different use case, but a good one nonetheless. Presumably, frame accurate time reporting would help with synchronised media playback across multiple devices, particularly where different browser engines are involved, each with a different pipeline delay. But, you say you're already achieving sub-frame rate sync in your library, based on currentTime, so maybe not?

@ingararntzen forgive my lack of detailed knowledge, but the approach you describe does raise some questions at least in my mind:

  • does it change the event handling model so that it no longer uses Time Marches On?
  • What happens if the event handler for event n completes after event n+1 should begin execution?
  • Does the timing object synchronise against the video or does it cause the video to be synchronised with it? In other words, in the case of drift, what moves to get back into alignment?
  • How does the interpolating clock deal with non-linear movements along the media timeline in the video, such as pause, fast forward and rewind?

Just questions for my understanding, I'm not trying to be negative!

Daiz commented

On the matter of "sub-framerate sync", I would like to point out that for the purposes of high quality media playback, this is not enough. Things like subtitle scene bleeds (where a cue remains visible after a scene change occurs in the video) are noticeable and ugly even if they remain on-screen for just an extra 15-30 milliseconds (i.e. less than a single 24FPS frame, which is ~42ms) after a scene change occurs. Again, you can clearly see this yourself with the background color change in this test case (which has various tricks applied to increase accuracy) - it is very clear when the sync is even slightly off. Desktop video playback software outside browsers does not have issues in this regard, and I would really like to be able to replicate that on the web as well.

@nigelmegitt These are excellent questions, thank you 👍

does it change the event handling model so that it no longer uses Time Marches On?

Yes. The sequencer is separate from the media element (which also means that you can use it for use cases where you don't have a media element). It takes direction from a timing object, which is basically just a thin wrapper around the system clock. The sequencer uses setTimeout() to schedule enter/exit events at the correct time.
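A rough sketch of that idea (this is not the timingsrc API itself, just the principle): schedule each enter/exit callback with setTimeout based on the distance to the target media time, and re-arm whenever playback is interrupted:

```js
// Sketch only: schedule one callback at a given media-time offset, assuming
// playback continues at the current rate until then.
function scheduleAt(video, mediaTime, callback) {
  const rate = video.playbackRate || 1;
  const delayMs = ((mediaTime - video.currentTime) / rate) * 1000;
  if (delayMs <= 0) { callback(); return null; }
  return setTimeout(callback, delayMs);
}

// A real sequencer would also cancel and re-arm pending timeouts on
// 'seeking', 'ratechange', 'pause' and 'play'.
```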

What happens if the event handler for event n completes after event n+1 should begin execution?

Being run in the JS environment, sequencer timeouts may be subject to delay if there are many other activities going on (just like any app code). The sequencer guarantees the correct ordering, and will report how much it was delayed. If something like the sequencer were implemented natively by browsers, this situation could be improved further, I suppose. The sequencer itself is lightweight, and you may use multiple sequencers for different data sources and/or different timing objects.

Does the timing object synchronise against the video or does it cause the video to be synchronised with it? In other words, in the case of drift, what moves to get back into alignment?

Excellent question! The model does not mandate one or the other. You may 1) continuously update the timing object from currentTime, or 2) continuously monitor and adjust currentTime to match the timing object (e.g. using a variable playbackRate).

Method 1) is fine if you only have one media element, you are doing sync only within one webpage, and you are ok with letting the media element be the master of whatever else you want to synchronize. In other scenarios you'll need method 2), for at least (N-1) synchronized things. We use method 1) only occasionally.

The timingsrc has a mediasync function for method 2) and a reversesync function for method 1) (...I think)

How does the interpolating clock deal with non-linear movements along the media timeline in the video, such as pause, fast forward and rewind?

The short answer: using mediasync or reversesync you don't have to think about that, it's all taken care of.

Some more details:
The mediasync library creates an interpolated clock internally as an approximation of currentTime. It can distinguish the natural increments and jitter of currentTime from hard changes by listening to events (e.g. seeks, playback rate changes, etc.).
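A simplified sketch of such an interpolated clock, using only APIs that exist today (the real mediasync implementation is considerably more sophisticated, with skew and jitter filtering):

```js
// Approximates the media position between (jittery, infrequent) timeupdate
// events by extrapolating the last sample with performance.now().
class InterpolatedMediaClock {
  constructor(video) {
    this.video = video;
    this.sample();
    // Hard changes (seeks, rate changes, pause/play) invalidate the sample.
    for (const ev of ['timeupdate', 'seeking', 'seeked', 'ratechange', 'play', 'pause']) {
      video.addEventListener(ev, () => this.sample());
    }
  }
  sample() {
    this.basePosition = this.video.currentTime;
    this.baseWallClock = performance.now();
  }
  now() {
    if (this.video.paused) return this.basePosition;
    const elapsed = (performance.now() - this.baseWallClock) / 1000;
    return this.basePosition + elapsed * this.video.playbackRate;
  }
}
```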

@chrisn

Presumably, frame accurate time reporting would help with synchronised media playback across multiple devices, particularly where different browser engines are involved, each with a different pipeline delay. But, you say you're already achieving sub-frame rate sync in your library, based on currentTime, so maybe not?

So, while the results are pretty good, there is no way to ensure that they are always that good (or that they will stay this good), unless these issues are put on the agenda through standardization work.

There are a number of ways to improve/simplify sync.

  • as you say, exposing accurate information on downstream delays, frame count, media offset is always a good thing.
  • currentTime values are also not timestamped, which means that you don't really know when they were sampled internally.
  • The jitter of currentTime is terrible.
  • Good sync depends on an interpolated clock. I guess this would also make it easier to convert back and forth between media offset and frame numbers.
  • there are also possible improvements to seekTo and playbackRate which would help considerably

you don't have to think about that

@ingararntzen in this forum we certainly do want to think about the details of how the thing works so we can assure ourselves that eventual users genuinely do not have to think about them. Having been "bitten" by the impact of timeupdate and Time Marches On we need to get it right next time!

Having noted that Time Marches On can conformantly not be run frequently enough to meet subtitle and caption use cases, it does have a lot of other things going for it, like smooth handling of events that take too long to process.

In the spirit of making the smallest change possible to resolve it, here's an alternative proposal:

  • Change the minimum frequency to 50 times per second, instead of 4 times per second.

I would expect that to be enough to get frame accuracy at 25fps.

@nigelmegitt - sure thing - I was more thinking of the end user here - not you guys :)

If you want me to go more into details that's ok too :)

Assuming that framerates are uniform is going to go astray at some point, as mp4 can contain media with different rates.
The underlying structure has Movie time and Media time - the former is usually an arbitrary fraction, the latter a ratio specifically designed to represent the timescale of the actual samples, so for US-originated video this will be 1001/30000.

Walking through the media rates and getting frame times is going to give you glitches with longer files.

If you want to construct an API like this I'd suggest mirroring what QuickTime did - this had 2 parts: the movie export API, which would give you callbacks for each frame rendered in sequence, telling you the media and movie times.
Or the GetNextInterestingTime() API which you could call iteratively and it would do the work of walking the movie, track edits and media to get you the next frame or keyframe.

Mozilla did make seekToNextFrame, but that was deprecated:
https://developer.mozilla.org/en-US/docs/Web/API/HTMLMediaElement/seekToNextFrame

@Daiz For your purposes, is it more important to have a frame counter, or an accurate currentTime?
What do you believe currentTime should represent?

Daiz commented

@mfoltzgoogle That depends - what exactly do you mean by a frame counter? As in, a value that would tell me the absolute frame number of the currently displayed frame, like if I have a 40000 frame long video with a constant frame rate of 23.976 FPS, and when currentTime is about 00:12:34.567 (754.567s), this hypothetical frame counter would have a value of 18091? This would most certainly work and be useful for me.

To reiterate, for me the most important use case for frame accuracy right now would be to accurately snap subtitle cue changes to frame changes. A frame counter like described above would definitely work for this. Though since I personally work on premium VOD content where I'm in full control of the content pipeline, accurate currentTime (assuming that it means that with a constant frame rate / full frame rate information I would be able to reliably calculate the currently displayed frame number) would also work. But I think the kind of frame counter described above would be a better fit as more general purpose functionality.

We would need to consider skipped frames, buffering states, splicing MSE buffers, and variable FPS video to nail down the algorithm to advance the "frame counter", but let's go with that as a straw-man. Say, adding a .frameCounter read-only property to <video>.

When you observe the .frameCounter for a <video> element, say in requestAnimationFrame, which frame would that correspond to?

@mfoltzgoogle Instead of a "frame counter", which is video-centric, I would consider adding a combination of timelineOffset and timelineRate, with timelineOffset being an integer and timelineRate a rational, i.e. two integers. The absolute offset (in seconds) is then given by timelineOffset divided by timelineRate. If timelineRate is set to the frame rate, then timelineOffset is equal to an offset in # of frames. This can be adapted to other kinds of essence that do not have "frames".
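To make the proposal concrete (these property names are the suggestion above, not an existing API):

```js
// Hypothetical: timelineOffset is an integer count of edit units, and
// timelineRate is a rational number of units per second (rateNum / rateDen).
function timelineToSeconds(timelineOffset, rateNum, rateDen) {
  return (timelineOffset * rateDen) / rateNum;
}

// If timelineRate is set to the frame rate (e.g. 24000/1001), timelineOffset
// is effectively a frame number, and positions can be expressed exactly in
// integers rather than as rounded floating-point times.
console.log(timelineToSeconds(18091, 24000, 1001)); // ≈ 754.5 s
```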

Daiz commented

When you observe the .frameCounter for a <video> element, say in requestAnimationFrame, which frame would that correspond to?

For frame accuracy purposes, it should obviously correspond to the currently displayed frame on the screen.

Also, something that I wanted to say is that I understand there's a lot of additional complexity to this subject under various playback scenarios, and that it's probably not possible to guarantee frame accuracy under all scenarios. However, I don't think that should stop us from pursuing frame accuracy where it would indeed be possible. Like if I have just a normal browser window in full control of video playback, playing video on a normal screen attached to my computer, even having frame accuracy just there alone would be a huge win in my books.

The underlying structure has Movie time and Media time - the former is usually an arbitrary fraction, the latter a ratio specifically designed to represent the timescale of the actual samples, so for US-originated video this will be 1001/30000.

@kevinmarks-b "media time" is also used elsewhere as a generic term for "the timeline related to the media", independently of the syntax used, i.e. it can be expressed as an arbitrary fraction or a number of frames etc, for example in TTML.

the most important use case for frame accuracy right now would be to accurately snap subtitle cue changes to frame changes. A frame counter like described above would definitely work for this.

@Daiz I agree the use case is important and would like to achieve the same result, but I disagree that a frame counter would work. In fact, a frame counter would absolutely not work!

The reason is that we typically distribute a single subtitle file but have multiple profiles of video encoding, where one approach to managing the bitrate adaptively is to vary the frame rate. I think our lowest frame rate profile is about 6.25 fps. In that situation, quantizing subtitle times to the frame rate is a very bad idea. For more on this, see EBU-TT-D annex E.

That's why we use media time expressions with decimal fractions of seconds, and arrange that those time expressions work against the media at some canonical frame rate, such as 25fps, in the knowledge that they will also work at other frame rates.

Daiz commented

@nigelmegitt Do note that I was primarily talking about my use case - I do the same thing with multiple profiles and single subtitle file(s), but I keep the frame rate consistent across all the variants.

Still, even for your use case with varying frame rates I'd expect the frame counter to be useful, since even if you couldn't use the frame numbers themselves, you could still observe the value for the exact moments when frames change and act on that. And if you have information about the loaded chunks (and which ones are lower framerate), then it shouldn't be too hard to make use of the frame number itself either. This really applies in general with variable frame rate: as long as you have full information about the variations exposed to JS (which could even be, e.g., pre-formatted data sent by the server to the application, not necessarily something provided by VideoElement/MediaElement itself), it should theoretically be possible to always be up to date on where you are in the video, both frame- and time-wise.

This is the difficulty when the subtitles are outside the composition engine, and this is where losing QT's multi-dataref abstraction for media hurts. Text tracks did make it into mp4, but I don't think the ability to edit them dynamically did.
@nigelmegitt for your decimation use case, having the subtitles on the timescale of the highest framerate video makes sense.

The baseline assumption that captions and subtitles should obscure the video is also odd to me - it's a hangover from analogue displays and title-safe areas with overscan. With the abundance of pixels we have now, rendering the captions or subtitles in a separate screenspace that doesn't obscure the action seems hugely preferable, and would mitigate the composition issues.

Daiz commented

The baseline assumption that captions and subtitles should obscure the video is also odd to me

With the abundance of pixels we have now, rendering the captions or subtitles in a separate screenspace that doesn't obscure the action seems hugely preferable

Rendering the subtitles outside the video area is generally a pretty terrible idea from both readability and ergonomic standpoints. When the subtitles are on screen, in decent font, decent font size, with decent margins and decent styling, then you can read them by basically just glancing while keeping your primary focus on the video itself the whole time, which is important because you are watching constantly progressing and moving video at the same time. Not to mention that video content is often watched from a much larger distance (eg. TV screens). If the subtitles are outside the video frame, suddenly you have to constantly move your eyes in order to read them, which would make for a terrible viewing experience all around.

To borrow a demonstration from an old Twitter thread of mine, here's an example of bad subtitle styling:

styling_bad

Some of the issues:

  1. Way too little vertical margin - more likely to require active eye movement in order to read the subtitles at all
  2. Small font size - making the text too small will require more focusing in order to take in the text
  3. Non-optimal styling in general - the border is thin and there's no shadow to "elevate" the subs from video, can result in subs blending to BG thus making them harder to read
  4. No line breaking - it's faster to read two short lines in a Z-like motion than moving your eyes over a wide horizontal line

Here's the same subs again with the aforementioned issues fixed:

styling_good

From a pixel counting perspective, these subs indeed obscure more of the video than the former example (or if you placed the subs outside the video frame entirely), but from a user experience standpoint, it actually enables the viewer to focus much better on the content since they don't have to actively divert their attention and focus away from the video to read the subtitles.

Apologies for the long and mostly off-topic post, but I think it's important to point out that "why don't we just render the subs off screen" is not a good strategy to pursue and recommend.

In response to @nigelmegitt's comment [1] suggesting increasing the timeupdate frequency to 50 Hz, I'd just like to point out two things:

  1. As transferring the playback state (e.g. currentTime) from player to JS takes more than 0 time, the value received in JS will always be outdated. We see this very clearly as massive jitter in the currentTime reporting. Going to 50 Hz will make this error smaller, but it will still be there.

  2. Triggering events with event handlers at 50 Hz will increase resource usage by a lot, even if there is no real reason for it. For example, updating a progress bar at 50 Hz is just silly; similarly, a subtitle that is shown for 5.4 seconds will have executed the code for checking whether it should be removed 269 times for no reason.

In my opinion, using a sequencer as @ingararntzen suggests is a much better idea - as long as the play state continues normally, timeouts will ensure minimal resource usage and high precision. This means that we need a timestamp added to the event (the time when it was generated). This holds for all media events.

[1] #4 (comment)

@Snarkdoof - as you indicate, if timeUpdate events were timestamped by the media player (using a common clock - say performance.now), then it would become much easier to maintain a precise, interpolated media clock in JS. Importantly, the precision of such an interpolated clock would not be highly sensitive to the frequency of the timeUpdate event (so we could avoid turning that up unnecessarily).

In addition, a precise interpolated media clock is a good basis for using timeouts to fire enter/exit events for subtitles etc. If one would prefer enter/exit events to be generated natively within the media element, using timeouts instead of poll-based time-marches-on should be an attractive solution there as well.

So, adding timestamps to media events seems to me like a simple yet important step forward.
Similarly, there should be timestamps associated with control events such as play, pause, seekTo, ...
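For illustration, if such a timestamp were exposed on the event (both field names below are hypothetical, following the suggestion above), dead-reckoning the true position becomes trivial:

```js
const video = document.querySelector('video');

video.addEventListener('timeupdate', (e) => {
  // Hypothetical fields: e.mediaPosition is the sampled currentTime, and
  // e.mediaTimeStamp is the performance.now() value at the moment the
  // player sampled it internally.
  const ageSeconds = (performance.now() - e.mediaTimeStamp) / 1000;
  const positionNow = e.mediaPosition + ageSeconds * video.playbackRate;
  // positionNow is largely insensitive to event-delivery delay and jitter.
});
```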

you could still observe the value for the exact moments when frames change and act on that

@Daiz It is more important to align with audio primarily, and video frames secondarily. Otherwise you end up with the quantisation problem where you only update subtitles on frame boundaries, and the system breaks down at low frame rates. Whatever design we end up with can't require alignment with encoded video frames in all cases for that reason. Note that playback systems sometimes generate false frames after decoding, which generates a new set of frame boundaries!

The baseline assumption that captions and subtitles should obscure the video is also odd to me - it's a hangover from analogue displays and title-safe areas with overscan. With the abundance of pixels we have now, rendering the captions or subtitles in a separate screenspace that doesn't obscure the action seems hugely preferable, and would mitigate the composition issues.

@kevinmarks-b Testing with the audience has shown that in limited cases they do prefer subtitles outside the video, particularly when the video is embedded in a bigger page and is relatively small. In general though the audience does prefer subtitles super-imposed over the video. I speculate that smaller eye movements are easier and result in less of the video content being missed during the eye movement.

@Snarkdoof , in response to #4 (comment), you make good points, thank you!

@nigelmegitt that is interesting - have you tested that for letterboxed content? Putting subtitles in the black rather than over the picture?

Daiz commented

@kevinmarks-b I can't necessarily speak for in general but I personally prefer for the subtitles to remain in the actual video area even with 2.35:1 content, though I tend to pair that with a smaller vertical margin than what I'd use with 16:9 content:

subvideoframe

@kevinmarks-b I'm not aware of tests with letterboxed content, though I know it is a practice that is used for example in DVD players that play 16:9 video in a 4:3 aspect ratio video output. I've not seen many implementations that work well though - there is often some overlap between the subtitles and the active image, which is somewhat jarring.

More importantly, letterboxing tends to be an exceptional case rather than the norm, so it is not a general solution.

Thinking more about #4 (comment) :

transferring the playback state (e.g. currentTime) from player to JS takes more than 0 time, the value received in JS will always be outdated.

Can this not be modelled or measured adaptively though to minimise the impact?

Triggering events with event handlers at 50 Hz will increase resource usage by a lot, even if there is no real reason for it.

@Snarkdoof have you got data to back this up? Running time marches on when the list of events to add to the queue is empty is not going to take many CPU clock cycles. Relative to the resource needed to play back video, I suspect it is tiny. But the advantage of doing it is huge when there is an event to add to the queue. It'd be good to move from speculation (on both our parts!) to measurement.

Daiz commented

@nigelmegitt Aligning subtitle cue changes with video frames is very much important too for high quality media playback purposes. Here's a quick demonstration of the subtitle scene bleeding that you can get if the cue changes are not properly aligned with frame changes - in this example the timing is a whole frame off (~42ms), but shorter bleeds are similarly noticeable and extremely ugly. It's true that frame alignment may not always be desirable (mostly if you're dealing with very low FPS video), but it should definitely be possible. As I mentioned earlier, desktop playback software does not have issues in this regard, and I'd really like for that to be the case for web video as well.

@nigelmegitt The transfer time between the player and JS execution can, as I probably worded badly, be compensated for by adding a timestamp for when the event was created (or the data "created" if you wish). A timestamp should in my opinion be added to ALL events, in particular media events - one can then see that 32ms ago the player was at position X, which means that you can now be quite certain that the player is at X+32ms right now. :)

I don't have any data to back up the resource claim, but logic dictates that any code looping at high frequency will necessarily keep the CPU awake. For many devices, video decoding is largely done in HW, and keeping CPUs running on lower power states is terribly important. My largest point really is that if we export the clock properly (e.g. performance now + currentTime), it's trivial to calculate the correct position at any time in JS. This makes it very easy to use timeouts to wake up directly from JS.

I've got a suspicion that you are not really looking for ways to use JS to cover your needs in regards to timed data but rather to have built-in support in the players and use Data Cues etc. Just to be clear - the JS sequencer @ingararntzen has mentioned typically wakes up and provides callbacks with well under 1ms error on most devices - if that doesn't cover all subtitle needs, I don't know what kind of eyesight you guys have. ;-) We have larger issues with CSS updates (e.g. style.opacity=1) taking longer on slower devices (e.g. Raspberry Pis) than we do with synchronizing the actual function call.

One comment on @nigelmegitt's #4 (comment) and @Snarkdoof's #4 (comment). There seems to be a slight confusion between the frequency at which the "time marches on" algorithm runs and the frequency at which that algorithm triggers timeupdate events.

The "time marches on" algorithm only triggers events when needed, and timeupdate events once in a while. Applications willing to act on cues within a particular text track should not rely on timeupdate events but rather on cuechange events of the TextTrack object (or on enter/exit events of individual cues), which are fired as needed whenever the "time marches on" algorithm runs.

The HTML spec requires the user agent to run the "time marches on" algorithm when the current playback position of a media element changes, and notes that this means that "these steps are run as often as possible". The spec also mandates that the "current playback position" be increased monotonically when the media element is playing. I'm not sure how to read that in terms of minimum/maximum frequency. Probably as giving full leeway to implementations. Running the algorithm at 50Hz seems doable though (and wouldn't trigger 50 events per second unless there are cues that need to switch to a different state). Implementations may optimize the algorithm as long as it produces the same visible behavior. In other words, they could use timeouts if that's more efficient than looping through cues each time.
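For completeness, the cue-based pattern described above uses only existing APIs; its accuracy is still bounded by how often the user agent actually runs "time marches on":

```js
const video = document.querySelector('video');
const track = video.addTextTrack('metadata');   // mode defaults to "hidden"

const cue = new VTTCue(12.345, 15.0, 'scene-1 subtitle payload');
track.addCue(cue);

// Fired from the "time marches on" algorithm when the playhead crosses the
// cue boundaries; no timeupdate polling needed.
cue.addEventListener('enter', () => { /* show the custom-rendered subtitle */ });
cue.addEventListener('exit',  () => { /* hide it */ });
```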

When you observe the .frameCounter for a <video> element, say in requestAnimationFrame, which frame would that correspond to?

For frame accuracy purposes, it should obviously correspond to the currently displayed frame on the screen.

@Daiz requestAnimationFrame typically runs at 50-60Hz, so once every 16-20ms, before the next repaint. You mentioned elsewhere that 15-30ms delays were noticeable for subtitles. Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

I'm not saying that it's easy to get from an implementation perspective, given the comment raised by @mfoltzgoogle #4 (comment). In particular, I suspect browser repaints are not necessarily synchronized with video repaints, but the problem seems to exist in any case.

Thanks @tidoust that is helpful. Last time I looked at this, a month or so back, I assured myself that time marches on itself could be run only every 250ms conformantly, but the spec text you pointed to suggests that timing constraint only applies to timeupdate events. Now I wonder if I misread it originally.

Nevertheless, time marches on frequency is dependent on some unspecified observation of the current playback position of the media element changing, which looks like it should be more often than 4Hz (every frame? every audio sample?).

In practice, I don't think browsers actually run time marches on whenever the current playback position advances by e.g. 1 frame or 1 audio sample. The real world behaviour seems to match the timing requirements for firing timeupdate events, at the less frequent end.

@tidoust It's a good point that the "internal" loop of Time Marches On does not trigger JS events every time, but increasing the loop speed of any loop (or doing more work in each pass) will use more resources. As I see it there are two main ways of timing things that are relevant to media on the web:
  1. Put the "sequencer" logic (what's being triggered when) inside the browser and trigger events, or
  2. Put the "sequencer" logic in JS and trigger events.

If 1) is chosen, a lot more complexity is moved to a very tight loop that's frankly busy with more important stuff. It is also less flexible as arbitrary code cannot be run in this way (nor would we want it to!). 2) depends solely on exporting a timestamp with the currentTime (and preferably other media events too), which would allow a JS Timing Object to accurately export the internal clock of the media. As such, a highly flexible solution can be made using fairly simple tools, like the open timingsrc implementation. Why would we not want to choose a solution that is easier, more flexible and if anything, saves CPU cycles?

cuechange also has a lot of other annoying issues, like not triggering when a skip event occurs (e.g. jumping "mid" subtitle), making it necessary to copy and paste several lines of code to check the active cues in order to behave as expected.

If 1) is chosen, a lot more complexity is moved to a very tight loop that's frankly busy with more important stuff.

@Snarkdoof is it really busy with more important stuff? Really?

2: Put the "sequencer" logic in JS and trigger events

Browsers only give a single thread for event handling and JS, right? So adding more code to run in that thread doesn't really help address contention issues.

cuechange also has a lot of other annoying issues, like not triggering when a skip event occurs

The spec is explicit that it is supposed to trigger in this circumstance. Is this a spec vs implementation-in-the-real-world issue?

I have the sense that we haven't got good data about how busy current CPUs are handling events during media playback in a browser, with subtitles alongside. The strongest requirements statement we can make is that we do want to achieve adequate synchronisation (whatever "adequate" is defined as) with minimal additional resource usage.

Daiz commented

@tidoust

Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

Good point. The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

Wouldn't it be more interesting to get the "frame that will be rendered on next repaint" instead, to avoid reacting up to 16ms late to a change of frame?

Good point. The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

I agree with that aim, but then you need to be very careful about definitions as there may be several frame-times' worth of delay between where graphics and video are composited and what the user is actually seeing. I suspect both are needed!

The key thing is indeed to be able to act essentially in sync with the frame changes, so that you can match external actions to frame changes in the video.

@Daiz as I just pointed out on the M&E call (minutes), this is only known to be true at the 25-30fps sort of rate, designed to be adequately free of flicker for video. It's unknown at high frame rates, and entirely inadequate at low frame rates, where synchronisation with audio is more important.

We should avoid generalising based on assumptions that the 25-30fps rate will continue to be prevalent, and gather data where we don't yet have it. We also need a model that works for other kinds of data than subtitles and captions, since they may have more or less stringent synchronisation requirements.

@Snarkdoof Like @nigelmegitt, I don't necessarily follow you on the performance penalties. Regardless, what I'm getting out of this discussion on subtitles is that there are possible different ways to improve the situation (they are not necessarily exclusive).

One possible way would be to have the user agent expose a frame number, or a rational number. This seems simple in theory, but apparently hard to implement. Good thing is that it would probably make it easy to act on frame boundaries, but these boundaries might be slightly artificial (because the user agent will interpolate these values in some cases).

Another way would be to make sure that an application can relate currentTime to the wall clock, possibly completed with some indication of the downstream latency. This is precisely what was done in the Web Audio API (see the definition of the AudioContext interface and notably the getOutputTimestamp() method and the outputLatency property). It seems easier to implement (it may be hard to compute the output latency, but adding a timestamp whenever currentTime changes seems easy). Now an app will still have some work to do to detect frame boundaries, but at least we don't ask the user agent to report possibly slightly incorrect values.
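For reference, the Web Audio precedent mentioned above looks like this; an analogous pair on HTMLMediaElement would let an application relate currentTime to the wall clock:

```js
const audioCtx = new AudioContext();

// Two readings of the same instant, taken together: seconds on the audio
// context's clock and the corresponding performance.now() value.
const { contextTime, performanceTime } = audioCtx.getOutputTimestamp();

// Estimated delay between the audio graph output and what the listener
// actually hears (where implemented).
console.log(contextTime, performanceTime, audioCtx.outputLatency);
```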

I note that this thread started with Non-Linear Editors. If someone can elaborate on scenarios there and why frame numbers are needed there, that would be great! Supposing they are, does the application need to know the exact frame being rendered during media playback or is it good enough if that number is only exact when the media is paused/seeked?

@nigelmegitt @tidoust - I guess I just never understood the whole time marches on algorithm, to be honest; it seems like a very strange way to wait for a timeout to happen, in particular when the time to wait can very reliably be calculated well in advance. The added benefit of doing this properly in JS is that the flexibility is excellent - there is no looping anywhere, there is an event after a setTimeout, re-calculated when some other event is triggered (skip, pause, play etc). We use it for all kinds of things - showing subtitles, switching between sources, altering CSS, preloading images at a fixed time, etc. Preloading is trivial if you give a sequencer a time-shifted timing object. Say you need up to 9 seconds to prepare an image - time shift it to 10 seconds more than the playback clock and do nothing else!

I might of course be absolutely in the dark on Time Marches On and text and data cues (I did test them, and found them horrible a couple of years ago). But the only thing I crave is the timestamp on the event - it will solve our every need (almost) and at barely any cost. :)

Daiz commented

@nigelmegitt As I also mentioned earlier, yes, I recognize that there are different things that are important too, but for the here and now (and I don't expect this to change anytime soon), I want the ability to act on exact frame boundaries so that I can get as close to replicating the desktop high quality media playback experience on the web as possible, and having subtitles align on frame boundaries for scene changes in order to avoid scene bleeding is a basic building block of that.

I'm not too concerned with the exact details of how we get there, so that's open for discussion and what we're here for, but the important thing is that we do get there eventually in a nice and performant fashion (ie. one shouldn't have to compile a video decoder with emscripten to do it etc).

In response to @Snarkdoof's post about the two approaches to synchronizing cue events and @nigelmegitt's response:

The strongest requirements statement we can make is that we do want to achieve adequate synchronisation (whatever "adequate" is defined as) with minimal additional resource usage.

I don't have any input on the question of resource consumption, but a point concerning maximization of precision:

It is an important principle to put the synchronization as close as possible to what is being synchronized. Another way to put it is to say that the final step matters.

In approach 1, with sequencing logic internally in the media element, the last step is transport of the cue events across threads to JS.

In approach 2, with sequencing logic in JS, the final step is the firing of a timeout in JS. This seems precise down to 1 or 2 ms. Additionally, the correctness of the timeout calculation depends on the accuracy with which currentTime can be estimated in JS, which is also very precise (and could easily be improved).

I don't know the relevant details of approach 1). I'm worried that the latency of the thread switch might be unknown or variable, and perhaps different across architectures. If so, this would detract from precision, but I don't know how much. Does anyone know?

Also, in my understanding a busy event loop in JS affects both approaches similarly.

I want the ability to act on exact frame boundaries so that I can get as close to replicating the desktop high quality media playback experience on the web

@Daiz OK, within the constraints of your use case, I share the requirement. Outside of those constraints, it gets more complicated. Seems from the thread as though that's something we can both agree to.

There's been some speculation here about thread switching and the impact that may have, and if indeed there are multiple threads executing the script and therefore processing the event queue. It's always been my understanding that the script is only executed in a single thread. Can anyone clarify this point, perhaps a browser implementer?

Throwing my hat in the ring here with a couple of alternative use cases. As background, my company manages a large amount of police body camera video. We support redaction of video via the browser, as well as playback of evidence via the browser.

For the redaction and evidence playback use cases our customers want the ability to step through a video frame by frame. If you assume a constant framerate and are able to determine that framerate out of band then you can get something that approximates frame-by-frame seeking. However there are many scenarios (be it rounding of the currentTime value, or encoder delay that renders a frame a few ms late) that can result in a frame being skipped (which is a big worry for our customers). There are hacks around this (rendering frames on the server and shipping down a frame-by-frame view) but all the info we need is already in the browser; it would be great if we had the ability to progress through a video frame by frame.

For redaction we have a use case that is similar to the subtitles sync issue. When users are in the editing phase of redaction we do a preview of what will be redacted where we need JS controlled objects to be synced with the video as tightly as we can. In this use case it's slightly easier than subtitles because when playing back at normal speed (or 2x or 4x) redaction users are usually ok with some slight de-sync. If they see something concerning they usually pause the video and then investigate it frame-by-frame.

Some of the suggested solutions, like currentFrameTime, could be extended to enable the frame-by-frame use case.

@boushley Thanks, that is useful! From a user experience perspective, how would the frame-by-frame stepping work in your case, ideally?

  1. The user activates frame-by-frame stepping. Video playback is paused. The user controls which frame to render and when a new frame needs to be rendered (e.g. with a button or arrow keys). Under the hood, the page seeks to the right frame, and video playback is effectively paused during the whole time.
  2. The user activates frame-by-frame stepping. The video moves from one frame to the other in slow motion without user interaction. Under the hood, the page does that by setting playbackRate to some low value such as 0.1, and the user agent is responsible for playing back the video at that speed.

In both cases, it seems indeed hard to do frame by frame stepping without exposing the current frame/presentation time, and allowing the app to set it to some value to account for cases where the video uses variable framerate.

It seems harder to guarantee precision in 2. as seen in this thread [1], but perhaps that's doable when video is played back at low speed?

[1] #4 (comment)

I note that this thread started with Non-Linear Editors. If someone can elaborate on scenarios there and why frame numbers are needed there, that would be great! Supposing they are, does the application need to know the exact frame being rendered during media playback or is it good enough if that number is only exact when the media is paused/seeked?

We also perform media manipulation server-side on the basis of users choosing points in the media timeline in a browser-based GUI. Knowing exactly what the user is seeing when the media is paused is critical.

Challenges that we've found with current in-browser capabilities include,

  • Allowing the user to reliably review the points in time previously selected
    • Expected behaviour - browser will seek to a previously selected time-point and the user will see the same content as when they made their selection
    • Actual behaviour - in some cases, in some browsers, the frame the user sees may be off by one
    • Another way of describing the above is just to observe that, with playback paused, sometimes executing video.currentTime = video.currentTime in the js console will change the displayed video frame!
  • Matching the results of server-side processing with what the user requested in the browser
    • Expected behaviour - the point on the media timeline 'chosen' by the user is reflected by back-end processing
    • Actual behaviour - it seems challenging to relate a currentTime value from the browser to a point on the media timeline within server-side components
    • To make the above more concrete, if you wanted to run ffmpeg on the server side and have it make a jpg of the video frame that the user is currently looking at, how would you transform the value of currentTime (or any other proposed mechanism) into a select video filter? (Substitute ffmpeg with your preferred media framework as desired :)

We currently do frame-stepping by giving the js knowledge (out of band) of the frame-rate and seeking in frame-sized steps.

Users also want to be able to step backwards and forwards by many frames at a time (e.g. hold 'shift' to skip faster). That's currently implemented by just seeking in larger steps.
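A sketch of the approach described above, assuming a constant frame rate known out of band; per the caveats in this thread, rounding of currentTime can still land on an adjacent frame, which is why the sketch seeks to frame midpoints:

```js
// Frame stepping while paused, with the frame rate known out of band.
const FPS = 25;                  // assumed constant, supplied out of band
const FRAME = 1 / FPS;

function stepFrames(video, n) {  // n can be negative, or e.g. +/-10 with shift held
  video.pause();
  // Small tolerance guards against currentTime sitting a hair below a boundary.
  const currentFrame = Math.floor(video.currentTime / FRAME + 0.001);
  const targetFrame = currentFrame + n;
  // Seek to the middle of the target frame rather than its start, to reduce
  // the chance of rounding displaying the previous frame instead.
  video.currentTime = (targetFrame + 0.5) * FRAME;
  return targetFrame;            // could also drive a server-side extract, e.g. an ffmpeg select filter
}
```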

@tidoust our current experience is that the user has a skip ahead / skip back X seconds control. When they pause, that changes to a frame forward / frame back control. So we're definitely looking at use case 1. And if you're going for playback at something like 1/10 of normal speed (or 3-6 fps) you can pretty easily pull that off in JS if you have a way of progressing to the next frame or previous frame. This use case feels like it should be easily doable, although I think it'll be interesting if we can do it in a way that enables other use cases as well.

@dholroyd we've definitely seen some of these off-by-a-single-frame issues in our redaction setup. It would be great if there was a better way of identifying and linking between a web client and a backend process manipulating the video. I believe the key for the editing-style use case is that while we want playback to be as accurate as possible, when paused it needs to be exactly accurate.

@Daiz I spoke with the TL of Chrome's video stack and they gave me a pointer to an implementation that you can play around with now.

First, behind --enable-experimental-canvas-feature, there are some additional attributes on HTMLVideoElement that contain metadata about frames uploaded as WebGL textures, including timestamp. [1]

The longer term plan is a WebGL extension to expose this data [2], and implementation has begun [3] but I am not sure of its status.

I agree there are use cases outside of WebGL upload for accurate frame timing data, and it should be possible to provide it on HTMLVideoElement's that are not uploaded as textures. However, if the canvas/WebGL solution works for you, then that makes a stronger case to expose it elsewhere.

Note that any solution may be racy with vsync depending on the implementation and it may be off by 16ms depending on where vsync happens in relation to the video frame rendering and the execution of rAF.

That's really all the help I can provide at this time. There are many other use cases and scenarios discussed here that I don't have time to address or investigate right now.

Thanks.

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=639174
[2] https://www.khronos.org/registry/webgl/extensions/proposals/WEBGL_video_texture/
[3] https://bugs.chromium.org/p/chromium/issues/detail?id=776222

This is a great discussion, identifying a number of different use cases. I suggest that the next step is to consolidate this into an explainer document that describes each use case and identifies any spec gaps or current implementation limitations. A simple markdown document in this repo would be fine. Would anyone like to start such a document?

One detail for such a document (I'm not volunteering to write) is video frame reordering. Widely deployed video codecs such as AVC reorder and often offset the presentation time of pictures relative to their order and timing in the compressed bitstream. For instance, frames 1, 2, 3, 4 in the compressed stream might be displayed in order e.g. 2, 1, 4, 3 and presentation time can be delayed several frames. Frame rate changes are not unusual in adaptively streamed video. Operations such as seeking, editing, and splicing of the compressed stream, e.g. in an MSE buffer, do not happen at the presentation times often assumed. Audio, TTML, HTML, events, etc. must take presentation reordering and delay into account for frame accurate synchronization at some "composition point" in the media pipeline.

@KilroyHughes I've always made the assumption that all those events are related to the post-decode (and therefore post-reordering) output. It would make no sense to address out of order frame counts from the compressed bitstream in specifications whose event time markers relate to generic video streams and for which video codecs are out of scope.

Certainly in TTML, the assumption is that there is a well defined media timeline against which times in the media timebase can be related; taking DASH/MP4 as an example, the track fragment decode time as modified by the presentation time offset provides that definition.

I'd push back quite strongly against any requirement to traverse the architectural layers and impose content changes on a resource like a subtitle document, whether it is provided in-band or out-of-band, just to take into account a specific set of video encoding characteristics.

There's a Chromium bug about synchronisation accuracy of Text Track Cue onenter() and onexit() events in the context of WebVTT at https://bugs.chromium.org/p/chromium/issues/detail?id=576310 and another (originally from me, via @beaufortfrancois) asking for developer input on the feasibility of reducing the accuracy threshold in the spec from the current 250ms, at https://bugs.chromium.org/p/chromium/issues/detail?id=907459 .

1c7 commented

Because this thread is way too long, I didn't read it all.
Let me provide one more use case.

Subtitle Editing software

I want to build subtitle editing software using Electron.js,
because Aegisub is not good enough (hotkeys, night mode, etc.).

The point is:

I want to build something simple that improves one part of the workflow,
not to replace Aegisub, because it has way too many features.

So

Frame-by-frame stepping and precise control to the millisecond (like 00:00:12:333) are important.

Here is my design (it's a screenshot from Invision Studio, not an actual desktop app)

image

I designed many versions because I want this to be beautiful

image

Here is the Electron app (an actual working app)

image

As you can see, the Electron app is still a work in progress, half-built.

And now I have found out there is no frame-by-frame or millisecond-precise control like 00:00:12:333,
image
which is very bad.

Conclusion

Either use some hack like <canvas>, or abandon web tech (HTML/CSS/JS) and Electron.js
and just build a native app (Objective-C & Swift in Xcode).

A couple of updates:

  1. The Media & Entertainment IG discussed the issue at TPAC. Also see the Frame accurate synchronization slides I presented to guide the discussion.

  2. Also, for frame-accurate rendering scenarios (during playback), note the recent proposal to extend HTMLVideoElement with a requestAnimationFrame function to allow web authors to identify when, and which, frame has been presented for composition.
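That proposal later became HTMLVideoElement.requestVideoFrameCallback(), now available in Chromium-based browsers among others. A minimal sketch, assuming a single <video> element on the page, of how it exposes the presentation time of each frame actually composited:

const video = document.querySelector('video');

function onFrame(now, metadata) {
  // metadata.mediaTime is the presentation timestamp of the frame just presented;
  // metadata.presentedFrames is a running count of frames submitted for composition.
  console.log('presented', metadata.mediaTime, 'count', metadata.presentedFrames);
  video.requestVideoFrameCallback(onFrame); // re-register for the next presented frame
}

if ('requestVideoFrameCallback' in HTMLVideoElement.prototype) {
  video.requestVideoFrameCallback(onFrame);
}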

This issue was originally raised for the general HTML media element, and the discussion has mainly been about video elements. I have just come upon another use case, for audio elements. Setting currentTime on an audio element whose resource is a WAV file works well. However, when the resource is an MP3 file, the accuracy is very poor (I checked in Chrome and Firefox).

I'm pretty sure the cause is something that occurs in general with compressed media, either audio or video: depending on the file format, it can be complex to work out where in the compressed media to seek to in order to get to an arbitrary desired point. I guess some kind of heuristic is generally used.

When there are no timestamps within the compressed media, that's even harder, and of course such timestamps would reduce the efficiency of the compression. Effectively the only way to do it reliably is to play back the entire media, which might be very long, and generate a map that connects audio sample count to file location.

Clearly doing that would be a costly operation, in general. Nevertheless, perhaps there is some processing that can be done to try to improve the heuristics, without doing a full decode? An API call to pre-process the media to generate such a map could provide an opt-in for applications that need it, without imposing it on those applications that do not need it.
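As a present-day workaround rather than a spec solution (and only practical when the whole resource can be fetched and decoded into memory), sample-accurate audio positioning can be approximated by decoding the file up front with the Web Audio API and starting playback from an exact offset. A rough sketch, with an illustrative URL:

const audioCtx = new AudioContext();

async function playFrom(url, offsetSeconds) {
  const response = await fetch(url);
  const encoded = await response.arrayBuffer();
  // decodeAudioData yields raw PCM, so offsets map directly onto samples
  // (beware of MP3 encoder delay/padding, which can shift things slightly).
  const buffer = await audioCtx.decodeAudioData(encoded);
  const source = audioCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(audioCtx.destination);
  source.start(audioCtx.currentTime, offsetSeconds); // start(when, offset)
  return source;
}

// e.g. playFrom('audio-description-clip.mp3', 12.333); // illustrative file name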

MDN doesn't really say anything about the seek accuracy of audio codecs at https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Audio_codecs and it looks like the HTMLMediaElement interface itself doesn't offer this kind of accurate seek preparation; there is perhaps an analogy with the preload attribute, which defines how much data to load, but it is clearly a different thing.

An example of this audio seeking accuracy issue can be observed at https://bbc.github.io/Adhere/ (in Firefox or Chrome, definitely) by loading the Adhere demo video and comparing the experience of loading and playing the demo TTML2 "Adhere demo with pre-recorded audio, single WAV file" with the single MP3 file version. The playback start times are very different in the two cases. There seems to be an end effect sometimes too, but that's something else.

I've seen this issue with MP3 before, and it is always with the VBR ones; CBR worked fine (but most MP3s are VBR).

mpck adhere_demo_audio.mp3
SUMMARY: adhere_demo_audio.mp3
    version                       MPEG v2.0
    layer                         3
    average bitrate               59527 bps (VBR)
    samplerate                    22050 Hz
    frames                        2417
    time                          1:03.137
    unidentified                  0 b (0%)
    errors                        none
    result                        Ok

Thanks for the extra analysis @Laurian . I suspect you're right that MP3 is a particular offender, but we should focus not on one format specifically but on the more general problem that some media encodings are difficult to seek accurately, and look for a solution that might work more widely.

Typically I think implementers have gone down the route of finding some detailed specifications of media types that work for their particular application. In the web context it seems to me that we need something that would work widely. The two approaches I can think of so far that might work are:

  1. Categorise the available media types as "accurately seekable" and "not accurately seekable" and have something help scripts discover which one they have at runtime, depending on UA capabilities, so they can take some appropriate action.
  2. Add a new interface that requests UAs to pre-process media in advance in preparation for accurate seeking, even if that is a costly operation. This seems better to me than an API for "no really, please do seek accurately to this time", because that would carry an arbitrary performance penalty that would be hard to predict, which is not great for editing applications where performance matters (a purely hypothetical shape for such an interface is sketched after this list).
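To make option 2 concrete, here is a purely hypothetical shape for such an interface - prepareAccurateSeeking() does not exist in any spec or browser, and is only meant to illustrate the opt-in, pay-the-cost-up-front idea:

const media = document.querySelector('video, audio');

async function seekAccurately(element, time) {
  // Hypothetical call: ask the UA to do the costly indexing/pre-processing once,
  // so that subsequent seeks on this resource can be accurate.
  if (typeof element.prepareAccurateSeeking === 'function') {
    await element.prepareAccurateSeeking();
  }
  element.currentTime = time; // would then be expected to land accurately
}

// e.g. seekAccurately(media, 12.333);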

Nigel, I'm not seeing the difference in the demo you shared. With either MP3 or WAV selected, playback starts at time zero. I must be doing something wrong..?

@chrisn listen to the audio description clips as they play back - the words you hear should match the text that shows under the video area, but they don't, especially for the MP3 version.

@nigelmegitt good work! I can't find the BBC Adhere repo anymore.. was it moved or removed?

@giuliogatto unfortunately the repo itself is still not open - we're tidying some bits up before making it open source, so please bear with us. It's taking us a while to get around to, alongside other priorities 😔

@nigelmegitt ok thanks! Keep up the good work!

1c7 commented

@Daiz I saw a new method here: https://stackoverflow.com/questions/60645390/nodejs-ffmpeg-play-video-at-specific-time-and-stream-it-to-client

How

  1. Use ffmpeg to live-stream a local video
  2. Use Electron.js to display the live-streamed video

Do you think it's possible to use this approach to achieve subtitle display (with near-perfect sync)?

I haven't experimented with this myself,
so I am not sure if it works.

I was thinking of building this project: https://github.com/1c7/Subtitle-Timeline-Editor/blob/master/README-in-English.md

in Swift, Objective-C & SwiftUI as a Mac-only desktop app,
but it seems an ffmpeg + Electron.js live stream is somewhat possible too.

1c7 commented

One more possible way to do it (for desktop):

If building a desktop app with Electron.js,

node-mpv can be used to control a local mpv instance,

so loading and displaying subtitles is doable (.ass is fine),
editing and then reloading the subtitles is also possible,
and frame-by-frame playback with the left and right arrow keys is also possible.

Node.js code

const mpvAPI = require('node-mpv');
const mpv = new mpvAPI({},
	[
		"--autofit=50%", // initial window size
	]);

mpv.start()
	.then(() => {
		// load the video
		return mpv.load('/Users/remote_edit/Documents/1111.mp4')
	})
	.then(() => {
		// load the subtitle (.ass) file
		return mpv.addSubtitles('/Users/remote_edit/Documents/1111.ass')
	})
	.then(() => {
		return mpv
	})
	// this catches every error from above
	.catch((error) => {
		console.log(error);
	});


// This will bind this function to the stopped event
mpv.on('stopped', () => {
	console.log("Your favorite song just finished, let's start it again!");
	// mpv.loadFile('/path/to/your/favorite/song.mp3');
});

package.json

{
  "name": "test-mpv-node",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "node-mpv": "^2.0.0-beta.0"
  }
}

Conclusion

  • Electron.js + ffmpeg live stream seems possible
  • Electron.js + mpv (using the node-mpv module) also seems possible

@nigelmegitt ok thanks! Keep up the good work!

Apologies, forgot to update this thread: the library part of the Adhere project was moved to https://github.com/bbc/adhere-lib/ so that we could open it up.

Just to make it clear: if I do video.currentTime = frame / framerate, do I have a guarantee that the video will indeed seek to the appropriate frame? I understand that reading from currentTime is not reliable, but I would expect that writing to currentTime is. From my experience, doing video.currentTime = frame / framerate + 0.0001 seems to work quite reliably (not sure if the 0.0001 is needed), but I'd like to be sure I'm not missing subtle edge cases.

As a next step, I suggest that we summarise this thread into a short document that covers the use cases and current limitations. It should take into account what can be achieved using new APIs such as WebCodecs and requestVideoFrameCallback, and be based on practical experience.

This thread includes discussion of frame accurate seeking and frame accurate rendering of content, so I suggest that the document includes both, for completeness.

Is anyone interested in helping to do this? Specifically, we'd be looking for someone who could edit such a document.

It would be really cool to have guarantees on how to reach a specific frame. For instance, I was thinking that:

this.video.currentTime = (frame / this.framerate) + 0.00001;

always reached the accurate frame... but it turns out it doesn't! (at least not in Chromium 95.0). Sometimes I need a larger value for the additional term; for at least one frame, I needed to do:

this.video.currentTime = (frame / this.framerate) + 0.001;

(This appears to fail for me when trying to reach, for instance, frame 1949 of a 24 fps video.)

Edit: similarly, reading this.video.currentTime (even when paused, using requestVideoFrameCallback) does not seem to be frame accurate.
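One heuristic that avoids sitting exactly on a frame boundary, and with it most of the epsilon guessing above, is to aim for the midpoint of the target frame instead. A minimal sketch, assuming a known and constant framerate - still a workaround, not a guarantee:

// Heuristic only: seek to the middle of the target frame so that internal
// rounding cannot land on the previous frame. Assumes constant fps.
function seekToFrame(video, frame, fps) {
  return new Promise((resolve) => {
    video.addEventListener('seeked', resolve, { once: true });
    video.currentTime = (frame + 0.5) / fps;
  });
}

// e.g. seekToFrame(video, 1949, 24).then(() => console.log(video.currentTime));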

It's way worse for me. I've made a 20 fps testing video, where seeking currentTime to 0.05 should display the 2nd frame, but I have to go all the way to 0.072 for it to finally flip.

This makes it impossible to implement frame-accurate video cutting/editing tools, as the time ffmpeg needs to seek to a frame is always quite different from what the video element needs to display it, and trying to add or subtract these arbitrary numbers just feels like a different kind of footgun.

What is the state of the art on this? Can this currently be achieved only with the WebCodecs API?

mzur commented

Here is an approach that uses requestVideoFrameCallback() as a workaround to seek to the next/previous frame: https://github.com/angrycoding/requestVideoFrameCallback-prev-next

Is that one really working?
Because https://web.dev/articles/requestvideoframecallback-rvfc says:

Note: Unfortunately, the video element does not guarantee frame-accurate seeking. This has been an ongoing subject of discussion. The WebCodecs API allows for frame accurate applications.

The technique linked by @mzur leads to better accuracy, but in our experience it doesn't always give perfect results either.
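Even without guaranteed accuracy, an application can at least verify which frame actually ended up on screen after a seek by reading back metadata.mediaTime from requestVideoFrameCallback. A sketch assuming a paused video, a known and constant framerate, and a browser that supports the callback - diagnostic only, it does not make the seek itself more accurate:

function seekAndVerify(video, targetFrame, fps) {
  return new Promise((resolve) => {
    // The callback fires once the seeked-to frame has been presented for composition.
    video.requestVideoFrameCallback((now, metadata) => {
      const displayed = Math.round(metadata.mediaTime * fps);
      resolve({ displayed, hit: displayed === targetFrame });
    });
    video.currentTime = (targetFrame + 0.5) / fps; // midpoint heuristic
  });
}

// e.g. seekAndVerify(video, 1949, 24).then((result) => console.log(result));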