Alternate serialization for media overlays
HadrienGardeur opened this issue · 27 comments
We've iceboxed the work on media overlays for some time but I'd like to re-start discussions on this by proposing a new serialization for them.
Instead of having a separate syntax for MO, I'd like to explore the ability to represent them using our existing model for RWPM, which means:
- each media overlay node is a full collection (with potentially metadata, links and subcollections)
- instead of specialized elements (textref and audioref), we would use the Link Object, which opens the door to a lot of things (text + audio + video, or two text references in different languages)
Here's an example in this proposed syntax where the text is paired with both audio and a video:
{
  "metadata": {
    "role": ["chapter"]
  },
  "links": [
    {
      "href": "chapter1.html",
      "type": "text/html"
    }
  ],
  "children": [
    {
      "links": [
        {
          "href": "chapter1.html#par1",
          "type": "text/html"
        },
        {
          "href": "chapter1.mp3#t=0,20",
          "type": "audio/mpeg"
        },
        {
          "href": "chapter1.webm#t=0,26",
          "type": "video/webm"
        }
      ]
    },
    {
      "links": [
        {
          "href": "chapter1.html#par2",
          "type": "text/html"
        },
        {
          "href": "chapter1.mp3#t=20,28",
          "type": "audio/mpeg"
        },
        {
          "href": "chapter1.webm#t=26,37",
          "type": "video/webm"
        }
      ]
    }
  ]
}
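To illustrate how a processing agent might consume this shape, here is a minimal Python sketch that walks the collection and groups each leaf's links by top-level media type. The helper names (iter_leaves, group_by_media) are hypothetical, and the grouping rule assumes the text/html, audio/* and video/* types used in this example; a resource typed application/xhtml+xml would need an explicit mapping.

```python
from collections import defaultdict


def iter_leaves(node):
    """Yield the nodes that have no sub-collections (the synchronized tuples)."""
    children = node.get("children", [])
    if not children:
        yield node
        return
    for child in children:
        yield from iter_leaves(child)


def group_by_media(node):
    """Group a node's link hrefs by top-level media type ('text', 'audio', ...)."""
    grouped = defaultdict(list)
    for link in node.get("links", []):
        major = link["type"].split("/", 1)[0]
        grouped[major].append(link["href"])
    return dict(grouped)


overlay = {
    "metadata": {"role": ["chapter"]},
    "links": [{"href": "chapter1.html", "type": "text/html"}],
    "children": [
        {"links": [
            {"href": "chapter1.html#par1", "type": "text/html"},
            {"href": "chapter1.mp3#t=0,20", "type": "audio/mpeg"},
            {"href": "chapter1.webm#t=0,26", "type": "video/webm"},
        ]},
        {"links": [
            {"href": "chapter1.html#par2", "type": "text/html"},
            {"href": "chapter1.mp3#t=20,28", "type": "audio/mpeg"},
            {"href": "chapter1.webm#t=26,37", "type": "video/webm"},
        ]},
    ],
}

tuples = [group_by_media(leaf) for leaf in iter_leaves(overlay)]
```

Each entry in tuples then holds the text, audio and video references that are meant to be rendered together.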
These media overlays could either be referenced directly at a publication level:
"links": [
  {
    "rel": "alternate",
    "href": "overlay.json",
    "type": "application/media-overlay+json"
  }
]
But they could also be referenced as alternate resources in the readingOrder:
{
  "href": "chapter1.html",
  "type": "text/html",
  "alternate": [
    {
      "href": "overlay1.json",
      "type": "application/media-overlay+json"
    }
  ]
}
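A client would have to look in both places to discover overlays. The following Python sketch shows one way to do that; the function name find_overlays and the sample manifest are hypothetical, and matching purely on the application/media-overlay+json type is an assumption rather than something settled in this proposal.

```python
OVERLAY_TYPE = "application/media-overlay+json"


def find_overlays(manifest):
    """Return (publication-level overlay hrefs, {resource href: overlay hrefs})."""
    publication_level = [
        link["href"]
        for link in manifest.get("links", [])
        if link.get("type") == OVERLAY_TYPE
    ]
    per_resource = {}
    for item in manifest.get("readingOrder", []):
        hrefs = [
            alt["href"]
            for alt in item.get("alternate", [])
            if alt.get("type") == OVERLAY_TYPE
        ]
        if hrefs:
            per_resource[item["href"]] = hrefs
    return publication_level, per_resource


# Hypothetical manifest that uses both referencing styles at once.
manifest = {
    "links": [
        {"rel": "alternate", "href": "overlay.json", "type": OVERLAY_TYPE},
    ],
    "readingOrder": [
        {
            "href": "chapter1.html",
            "type": "text/html",
            "alternate": [{"href": "overlay1.json", "type": OVERLAY_TYPE}],
        },
        {"href": "chapter2.html", "type": "text/html"},
    ],
}

publication_level, per_resource = find_overlays(manifest)
```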
Any thoughts on this? cc @danielweck @llemeurfr
Interesting. So you are proposing 2 solutions to point to a Media Overlay "node" from the publication manifest:
- specify each Media Overlay Node = JSON file as href in each item of the reading order, or
- choose a "primary" resource type (chapter1.html in your example) and specify it in each item of the reading order, plus add a Media Overlay Node = JSON file as an "alternate" property.
In the first solution, why would we select a given resource type (text here) as primary rather than another (e.g. audio)? publisher's choice?
The issue with this format is that Media Overlay nodes are split into N different, small JSON files.
It would be much more compact if the reading order was able to handle not only Link Objects, but also "composite" objects like ... a collection.
I don't have a strong opinion on publication level vs reading order mostly because I'm not enough of an expert on media overlays.
In the first solution, why would we select a given resource type (text here) as primary rather than another (e.g. audio)?
Is there really a primary resource though? I don't think so. If the text is displayed and I can hear it at the same time, it feels pretty equal to me.
It would be much more compact if the reading order was able to handle not only Link Objects, but also "composite" objects like ... a collection.
I really dislike that idea, sorry... It can already get pretty messy with the fact that a Link Object can contain arrays of Link Objects through alternate or children.
What you're proposing is much much worse IMO and completely disconnected from the concept of a Link Object (since a collection can represent pretty much anything).
In your example:
{
  "metadata": {
    "role": ["chapter"]
  }
}
What would role map to? A new JSON Schema type?
https://github.com/readium/webpub-manifest/tree/master/schema
I don't think there's anything in schema.org that would be a good fit for role in this context, so it would be mapped to a URL of our own.
In the JSON Schema, this would be a string + an enum with all known values for roles.
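As a rough sketch of what "a string + an enum" could look like, the Python snippet below expresses a schema fragment as a dict and checks role values against it by hand. The enum values are placeholders (the thread does not list the known roles), and wrapping the strings in an array is an assumption based on the "role": ["chapter"] example.

```python
# Hypothetical schema fragment: the real list of roles is not defined here.
ROLE_SCHEMA = {
    "type": "array",
    "items": {
        "type": "string",
        "enum": ["body", "chapter", "section", "figure"],
    },
}


def invalid_roles(metadata, schema=ROLE_SCHEMA):
    """Return the role values that are not strings from the enum."""
    allowed = set(schema["items"]["enum"])
    return [
        role
        for role in metadata.get("role", [])
        if not (isinstance(role, str) and role in allowed)
    ]


ok = invalid_roles({"role": ["chapter"]})
bad = invalid_roles({"role": ["chapter", "poem"]})
```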
In many commercial mainstream EPUB3 Media Overlays, the SMIL files are indeed tiny (typically: illustrated fixed layout children's "read aloud" books, with minimal amounts of synchronized text/audio).
However, reflowable publications converted from DAISY Digital Talking Books (or, nowadays, natively authored as EPUB3) usually consist of large sentence-level SMILs, with many hours of audio playback.
There are rare edge cases, but the vast majority of MO content is authored using the 1-to-1 mapping of HTML documents, SMIL and audio files ("one spine item in the reading order => one SMIL file => one audio file"). Occasionally "several SMIL files => a single audio file", but that is just an implementation detail that does not affect the model we are discussing here. In principle there may be "several contiguous HTML files in the reading order => a single SMIL" but frankly I have personally not come across this authoring practice in the real world.
In Readium1 these SMIL files are parsed eagerly (ahead of rendering time) into their JSON equivalent (there are C++ and JS parsers, depending on the target platforms). These generated payloads are used to populate an in-memory JavaScript data model that represents the state of the publication at runtime.
There is a concrete benefit in having the entirety of the timing tree (i.e. aggregated SMIL trees) loaded in memory. This is used in the Readium1 implementation to support a linear timeline bar that the user can drag from 0-100% (actually, zero-time to total-duration indicated by the top-level publication metadata, or alternatively by the sum of all reading-order SMILs). The MO engine then maps this linear time representation to a structural position inside the SMIL timing tree, simply by scanning the loaded MO model. This also makes it easier to handle skippability and escapability during playback.
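The timeline mapping described above can be sketched as a linear scan. This is not the Readium1 code: the flat list of (text reference, duration in seconds) pairs and the function name locate are assumptions made for illustration, standing in for the leaves of the aggregated timing tree in playback order.

```python
def locate(clips, fraction):
    """Map a 0.0-1.0 timeline position to (text href, offset inside that clip)."""
    total = sum(duration for _, duration in clips)
    target = min(max(fraction, 0.0), 1.0) * total
    elapsed = 0.0
    for href, duration in clips:
        if target < elapsed + duration:
            return href, target - elapsed
        elapsed += duration
    # A fraction of 1.0 lands exactly on the end of the last clip.
    href, duration = clips[-1]
    return href, duration


# Hypothetical leaves: text reference and clip duration in seconds.
clips = [
    ("chapter1.xhtml#text1", 10.0),
    ("chapter1.xhtml#text2", 30.0),
    ("chapter2.xhtml#text1", 20.0),
]

middle = locate(clips, 0.5)
start = locate(clips, 0.0)
end = locate(clips, 1.0)
```

The scan only works because every clip of the publication is already in memory, which is the point made above about loading the whole timing tree.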
The obvious drawback is some upfront parsing cost, and increased memory consumption (this latter point is not such a big deal on modern devices though ... MO is not really meant for low-end e-ink devices). In my opinion, the benefits far outweigh the drawbacks.
That being said, what differentiates Readium2 is the clear architectural facet of backend/server side state (where the MO models are populated just as in Readium1), combined with the client side runtime which may load SMIL timing trees just-in-time (i.e. the additional HTTP request to MO links for individual HTML spine items / chapters).
In that Readium2 case, a MO playback engine implementation will probably want to load the timing tree for the entire publication anyway, for the reasons I mentioned before. Therefore, the top-level MO link in the publication manifest will have to deliver a data model similar to the one generated by Readium1 (i.e. a simple array-like aggregation of contiguous parsed SMIL trees, no need to be smart by somehow merging timing trees).
@danielweck so you're suggesting that streamers and publication servers should merge all the SMIL together and only serve a publication-level link to the media overlay JSON document?
The initial discussion is slightly different since the focus was on syntax and (potentially) authoring.
No, I am quite happy continuing to serve individual SMIL "chapters" (i.e. mirroring exactly the logical organisation of an EPUB3 Media Overlays), in addition to the full-spine aggregated SMILs. I am however suggesting that the JSON syntax of the full-spine multi-SMIL consists simply of an array-like combination of each individual referenced SMIL in the original EPUB3 Media Overlays (no attempt to make a smart merge of the SMIL timing trees). The potential downside of this approach is that the edge-case mapping "multi HTML contiguous spine items => single SMIL" can produce redundant data, unless some smart trimming of the timing tree is performed beforehand at the parser level (but note that extracting SMIL timing containers from a SMIL tree is a bit like attempting to split CSS style definitions ... the context is important and easy to break, due to semantically/structurally-meaningful nested sequence containers).
As discussed at the conference call:
- base URL to resolve the link object href which are possibly relative paths (no need for self link, just the originating URL/path for the JSON resource?)
- mandatory media / content type for link objects in the links tuple, so that a typical Media Overlay processing agent can discover the "text" and "audio" pairs (equivalent to the leaves in the SMIL timing tree).
- example with typical deep nested media pairs/tuples (par SMIL node), descendants of seq SMIL nodes which represent the structural/semantics of targeted HTML documents. See: https://github.com/readium/webpub-manifest/blob/master/schema/link.schema.json#L57 (root children relates to the extensibility mechanism for sub-collections: https://github.com/readium/webpub-manifest/blob/master/schema/publication.schema.json#L70 )
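For the first point, resolving relative href values against the URL the JSON document was fetched from could look like the Python sketch below. The function name resolve_links and the example.org URL are hypothetical; urljoin from the standard library keeps the fragment (including temporal fragments) intact.

```python
from urllib.parse import urljoin


def resolve_links(node, base_url):
    """Return a copy of the node with every href resolved against base_url."""
    resolved = dict(node)
    if "links" in node:
        resolved["links"] = [
            {**link, "href": urljoin(base_url, link["href"])}
            for link in node["links"]
        ]
    if "children" in node:
        resolved["children"] = [
            resolve_links(child, base_url) for child in node["children"]
        ]
    return resolved


# Hypothetical originating URL of the overlay JSON resource.
base = "https://example.org/pub/overlays/overlay1.json"
node = {
    "links": [{"href": "../chapter1.xhtml", "type": "application/xhtml+xml"}],
    "children": [
        {"links": [
            {"href": "../chapter1.xhtml#text1", "type": "application/xhtml+xml"},
            {"href": "../audio/chapter1.mp3#t=0,20", "type": "audio/mpeg"},
        ]},
    ],
}

absolute = resolve_links(node, base)
```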
The SMIL body is a seq time container root, that can carry its own role and back-reference (textref) to the HTML document it maps to (structural/semantics). So the children array is problematic in that respect.
Let's try a deeper (more typical) SMIL timing tree, with intermediary seq containers (used to structurally and semantically map with the targeted HTML document) all the way down to the par media pairs / tree leaves (audio, text).
SORRY, POSTED TOO QUICKLY (WILL PUBLISH AGAIN)
Here is the example from the specification:
http://www.idpf.org/epub/31/spec/epub-mediaoverlays.html#sec-media-overlays-structure
chapter1.smil:
<smil xmlns="http://www.w3.org/ns/SMIL" xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">
  <body>
    <!-- a chapter -->
    <seq id="id1" epub:textref="chapter1.xhtml#s01" epub:type="chapter">
      <!-- the section title -->
      <par id="id2">
        <text src="chapter1.xhtml#section1_title"/>
        <audio src="chapter1_audio.mp3"
               clipBegin="0:23:23.84"
               clipEnd="0:23:34.221"/>
      </par>
      <!-- some sentences in the chapter -->
      <par id="id3">
        <text src="chapter1.xhtml#text1"/>
        <audio src="chapter1_audio.mp3"
               clipBegin="0:23:34.221"
               clipEnd="0:23:59.003"/>
      </par>
      <par id="id4">
        <text src="chapter1.xhtml#text2"/>
        <audio src="chapter1_audio.mp3"
               clipBegin="0:23:59.003"
               clipEnd="0:24:15.000"/>
      </par>
      <!-- a figure -->
      <seq id="id7" epub:textref="chapter1.xhtml#figure">
        <par id="id8">
          <text src="chapter1.xhtml#photo"/>
          <audio src="chapter1_audio.mp3"
                 clipBegin="0:24:18.123"
                 clipEnd="0:24:28.764"/>
        </par>
        <par id="id9">
          <text src="chapter1.xhtml#caption"/>
          <audio src="chapter1_audio.mp3"
                 clipBegin="0:24:28.764"
                 clipEnd="0:24:50.010"/>
        </par>
      </seq>
      <!-- more sentences in the chapter (outside the figure) -->
      <par id="id12">
        <text src="chapter1.xhtml#text3"/>
        <audio src="chapter1_audio.mp3"
               clipBegin="0:25:45.515"
               clipEnd="0:26:30.203"/>
      </par>
      <par id="id13">
        <text src="chapter1.xhtml#text4"/>
        <audio src="chapter1_audio.mp3"
               clipBegin="0:26:30.203"
               clipEnd="0:27:15.000"/>
      </par>
    </seq>
  </body>
</smil>
chapter1.xhtml:
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops"
      xml:lang="en"
      lang="en">
  <head>
    <title>Media Overlays Example of EPUB Content Document</title>
  </head>
  <body id="sec1">
    <section id="sectionstart" epub:type="chapter">
      <h1 id="section1_title">The Section Title</h1>
      <p id="text1">The first phrase of the main text body.</p>
      <p id="text2">The second phrase of the main text body.</p>
      <figure id="figure">
        <img id="photo"
             src="photo.png"
             alt="a photograph for which there is a caption" />
        <figcaption id="caption">The photo caption</figcaption>
      </figure>
      <p id="text3">The third phrase of the main text body.</p>
      <p id="text4">The fourth phrase of the main text body.</p>
    </section>
  </body>
</html>
This is the resulting JSON:
UPDATED to enclose role and id in metadata.
{
  "metadata": {
    "duration": "0:27:15.000",
    "role": [
      "body"
    ]
  },
  "links": [
    {
      "href": "chapter1.xhtml",
      "type": "application/xhtml+xml"
    }
  ],
  "children": [
    {
      "metadata": {
        "id": "id1",
        "role": [
          "chapter"
        ]
      },
      "links": [
        {
          "href": "chapter1.xhtml#s01",
          "type": "application/xhtml+xml"
        }
      ],
      "children": [
        {
          "metadata": {
            "id": "id2"
          },
          "links": [
            {
              "href": "chapter1.xhtml#section1_title",
              "type": "application/xhtml+xml"
            },
            {
              "href": "chapter1_audio.mp3#t=0:23:23.84,0:23:34.221",
              "type": "audio/mpeg"
            }
          ]
        },
        {
          "metadata": {
            "id": "id3"
          },
          "links": [
            {
              "href": "chapter1.xhtml#text1",
              "type": "application/xhtml+xml"
            },
            {
              "href": "chapter1_audio.mp3#t=0:23:34.221,0:23:59.003",
              "type": "audio/mpeg"
            }
          ]
        },
        {
          "metadata": {
            "id": "id4"
          },
          "links": [
            {
              "href": "chapter1.xhtml#text2",
              "type": "application/xhtml+xml"
            },
            {
              "href": "chapter1_audio.mp3#t=0:23:59.003,0:24:15.000",
              "type": "audio/mpeg"
            }
          ]
        },
        {
          "metadata": {
            "id": "id7"
          },
          "links": [
            {
              "href": "chapter1.xhtml#figure",
              "type": "application/xhtml+xml"
            }
          ],
          "children": [
            {
              "metadata": {
                "id": "id8"
              },
              "links": [
                {
                  "href": "chapter1.xhtml#photo",
                  "type": "application/xhtml+xml"
                },
                {
                  "href": "chapter1_audio.mp3#t=0:24:18.123,0:24:28.764",
                  "type": "audio/mpeg"
                }
              ]
            },
            {
              "metadata": {
                "id": "id9"
              },
              "links": [
                {
                  "href": "chapter1.xhtml#caption",
                  "type": "application/xhtml+xml"
                },
                {
                  "href": "chapter1_audio.mp3#t=0:24:28.764,0:24:50.010",
                  "type": "audio/mpeg"
                }
              ]
            }
          ]
        },
        {
          "metadata": {
            "id": "id12"
          },
          "links": [
            {
              "href": "chapter1.xhtml#text3",
              "type": "application/xhtml+xml"
            },
            {
              "href": "chapter1_audio.mp3#t=0:25:45.515,0:26:30.203",
              "type": "audio/mpeg"
            }
          ]
        },
        {
          "metadata": {
            "id": "id13"
          },
          "links": [
            {
              "href": "chapter1.xhtml#text4",
              "type": "application/xhtml+xml"
            },
            {
              "href": "chapter1_audio.mp3#t=0:26:30.203,0:27:15.000",
              "type": "audio/mpeg"
            }
          ]
        }
      ]
    }
  ]
}
Note that the above conversion does not translate the time/clock values to seconds (so we can easily cross-reference with the original EPUB3 SMIL).
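For reference, a conversion along these lines could be sketched in Python with the standard library XML parser. This is an illustration, not the Readium parser: the convert function and the TYPES table are hypothetical, the media types are hard-coded assumptions, every audio element is assumed to carry both clipBegin and clipEnd, and the root-level duration, role and links (which come from outside the SMIL) are not produced.

```python
import xml.etree.ElementTree as ET

SMIL = "{http://www.w3.org/ns/SMIL}"
EPUB = "{http://www.idpf.org/2007/ops}"
# Assumed media types; a real converter would look them up in the package.
TYPES = {"text": "application/xhtml+xml", "audio": "audio/mpeg"}


def convert(element):
    """Turn a SMIL body/seq/par element into a collection-style dict."""
    node = {}
    metadata = {}
    if element.get("id"):
        metadata["id"] = element.get("id")
    if element.get(EPUB + "type"):
        metadata["role"] = element.get(EPUB + "type").split()
    if metadata:
        node["metadata"] = metadata
    if element.tag == SMIL + "par":
        links = []
        for media in element:
            kind = media.tag[len(SMIL):]
            href = media.get("src")
            if kind == "audio":
                # Clock values are copied verbatim, not converted to seconds.
                href += "#t={},{}".format(
                    media.get("clipBegin"), media.get("clipEnd"))
            links.append({"href": href, "type": TYPES[kind]})
        node["links"] = links
    else:  # body or seq
        textref = element.get(EPUB + "textref")
        if textref:
            node["links"] = [{"href": textref, "type": TYPES["text"]}]
        node["children"] = [convert(child) for child in element]
    return node


SAMPLE = """<smil xmlns="http://www.w3.org/ns/SMIL"
      xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">
  <body>
    <seq id="id1" epub:textref="chapter1.xhtml#s01" epub:type="chapter">
      <par id="id2">
        <text src="chapter1.xhtml#section1_title"/>
        <audio src="chapter1_audio.mp3"
               clipBegin="0:23:23.84" clipEnd="0:23:34.221"/>
      </par>
    </seq>
  </body>
</smil>"""

overlay = convert(ET.fromstring(SAMPLE).find(SMIL + "body"))
```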
Note that the above JSON introduces the id property, to preserve the SMIL info.
Note that metadata duration is not present in the original SMIL, we derive it from the EPUB OPF package metadata and expose it here as first-class citizen to avoid levels of indirection when retrieving the Media Overlays JSONs.
Note that the children JSON keys/properties are not in the "link" object, instead they are part of the extensibility mechanism. I will publish an alternative proposal that leverages the "link" object's own children property. https://github.com/readium/webpub-manifest/blob/6930a12439d7b36f2302d1ef233a6ad41b4854d6/schema/link.schema.json#L57
Note that the initial children array contains only one child (i.e. the root of the SMIL tree). The role = body was added for illustration purposes, but the original SMIL in fact does not explicitly have this epub:type.
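Since the clock values are kept as-is in the #t= fragments, a consumer that prefers plain seconds would need a small conversion step. The Python sketch below handles only the colon-separated clock form used in this example (other SMIL timing forms such as "5s" or "30ms" are deliberately not covered); both function names are hypothetical.

```python
def clock_to_seconds(clock):
    """Convert 'h:mm:ss.fff' or 'mm:ss.fff' into seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def to_seconds_fragment(href):
    """Rewrite the '#t=begin,end' clock values of an href as seconds."""
    base, separator, fragment = href.partition("#t=")
    if not separator:
        return href  # no temporal fragment, nothing to rewrite
    begin, end = fragment.split(",")
    return "{}#t={},{}".format(
        base,
        round(clock_to_seconds(begin), 3),
        round(clock_to_seconds(end), 3),
    )


converted = to_seconds_fragment("chapter1_audio.mp3#t=0:23:23.84,0:23:34.221")
untouched = to_seconds_fragment("chapter1.xhtml#text1")
```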
This alternative syntax proposal purely relies on the "link" object's children property to express hierarchy, and maps directly to the SMIL tree of seq (with the initial body) and par leaves:
UPDATE: added empty href (#) to link objects, to pass JSON Schema validation.
{
  "metadata": {
    "duration": "0:27:15.000"
  },
  "links": [
    {
      "role": [
        "body"
      ],
      "href": "chapter1.xhtml",
      "type": "application/xhtml+xml",
      "children": [
        {
          "id": "id1",
          "role": [
            "chapter"
          ],
          "href": "chapter1.xhtml#s01",
          "type": "application/xhtml+xml",
          "children": [
            {
              "id": "id2",
              "href": "#",
              "children": [
                {
                  "href": "chapter1.xhtml#section1_title",
                  "type": "application/xhtml+xml"
                },
                {
                  "href": "chapter1_audio.mp3#t=0:23:23.84,0:23:34.221",
                  "type": "audio/mpeg"
                }
              ]
            },
            {
              "id": "id3",
              "href": "#",
              "children": [
                {
                  "href": "chapter1.xhtml#text1",
                  "type": "application/xhtml+xml"
                },
                {
                  "href": "chapter1_audio.mp3#t=0:23:34.221,0:23:59.003",
                  "type": "audio/mpeg"
                }
              ]
            },
            {
              "id": "id4",
              "href": "#",
              "children": [
                {
                  "href": "chapter1.xhtml#text2",
                  "type": "application/xhtml+xml"
                },
                {
                  "href": "chapter1_audio.mp3#t=0:23:59.003,0:24:15.000",
                  "type": "audio/mpeg"
                }
              ]
            },
            {
              "id": "id7",
              "href": "chapter1.xhtml#figure",
              "type": "application/xhtml+xml",
              "children": [
                {
                  "id": "id8",
                  "href": "#",
                  "children": [
                    {
                      "href": "chapter1.xhtml#photo",
                      "type": "application/xhtml+xml"
                    },
                    {
                      "href": "chapter1_audio.mp3#t=0:24:18.123,0:24:28.764",
                      "type": "audio/mpeg"
                    }
                  ]
                },
                {
                  "id": "id9",
                  "href": "#",
                  "children": [
                    {
                      "href": "chapter1.xhtml#caption",
                      "type": "application/xhtml+xml"
                    },
                    {
                      "href": "chapter1_audio.mp3#t=0:24:28.764,0:24:50.010",
                      "type": "audio/mpeg"
                    }
                  ]
                }
              ]
            },
            {
              "id": "id12",
              "href": "#",
              "children": [
                {
                  "href": "chapter1.xhtml#text3",
                  "type": "application/xhtml+xml"
                },
                {
                  "href": "chapter1_audio.mp3#t=0:25:45.515,0:26:30.203",
                  "type": "audio/mpeg"
                }
              ]
            },
            {
              "id": "id13",
              "href": "#",
              "children": [
                {
                  "href": "chapter1.xhtml#text4",
                  "type": "application/xhtml+xml"
                },
                {
                  "href": "chapter1_audio.mp3#t=0:26:30.203,0:27:15.000",
                  "type": "audio/mpeg"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
Thanks for these examples @danielweck.
- In your first example, both id and role should be in metadata, otherwise they'll be treated as subcollections by the schema and won't be valid.
- In your second example, you introduce id and role in the Link Object. This wouldn't be rejected by the schema but they would also be undefined under our current model. In general we try to minimize such extensions directly in the Link Object through the use of properties instead.
- There's another issue with the second example since many Link Objects do not have a href which would be invalid according to our schema.
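The last point can be checked mechanically without a full schema validator. Here is a small Python sketch that reports the path of every Link Object lacking href in a nested links/children tree; the function name and the sample data (modelled on the second example before any default href is added) are hypothetical.

```python
def links_missing_href(links, path="links"):
    """Return the paths of Link Objects (at any depth) that have no href."""
    missing = []
    for index, link in enumerate(links):
        here = "{}[{}]".format(path, index)
        if "href" not in link:
            missing.append(here)
        missing.extend(
            links_missing_href(link.get("children", []), here + ".children"))
    return missing


# Hypothetical excerpt: the intermediary "par"-like container has no href.
sample_links = [
    {
        "href": "chapter1.xhtml",
        "type": "application/xhtml+xml",
        "children": [
            {
                "id": "id2",
                "children": [
                    {"href": "chapter1.xhtml#section1_title",
                     "type": "application/xhtml+xml"},
                    {"href": "chapter1_audio.mp3#t=0:23:23.84,0:23:34.221",
                     "type": "audio/mpeg"},
                ],
            },
        ],
    },
]

missing = links_missing_href(sample_links)
```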
@HadrienGardeur
Yes, to illustrate the challenge of preserving information from the original SMIL timing containers (body, seq and par), I intentionally used the id and role (epub:type) properties in the link object ... which are unfortunately not supported "natively": https://github.com/readium/webpub-manifest/blob/master/schema/link.schema.json (and rel isn't semantically correct either)
There are existing publications that use a valid EPUB3 "design" / authoring pattern in NavDoc, in order to partition / categorize TOC links. For example, Children's Literature:
https://github.com/IDPF/epub3-samples/blob/master/30/childrens-literature/EPUB/nav.xhtml#L11
This results in a hierarchy of link objects with intermediary containers that lack the href property. This is handled correctly at rendering time in r2-testapp-js and readium-desktop (aka Thorium). Not sure about other platforms.
This results in a hierarchy of link objects with intermediary containers that lack the href property. This is handled correctly at rendering time in r2-testapp-js and readium-desktop (aka Thorium). Not sure about other platforms.
An alternative to that would be a default href, for example #.
I added empty href (#) to link objects, to pass JSON Schema validation.
I enclosed role and id in metadata.
Based on our recent discussions, I think we can close this issue. We need to keep an eye on the W3C CG and make sure that we align with it.
Can you open an issue specifically for that @danielweck ?
Follow-up: #109