Benefits of multi-attribute approach, comments on "Specification for Spoken Presentation in HTML"

Question

Opened this issue 4 years ago · 2 comments

With several years of experience in the voice space, and SSML in particular, I wanted to comment on the specification for spoken presentation in HTML.

In general, I agree that moving closer to HTML's semantic elements will help developers add spoken information. I have also heard of the Aural CSS spec in the past, but tying speech metadata more closely to individual HTML elements may make adoption easier. On the other hand, CSS is easier to apply across elements and to incorporate into the class structure of pre-existing content.

Regardless, I would argue that the multi-attribute approach is preferable to the single-attribute approach. As the document states, embedding JSON in an HTML attribute is non-standard and can easily introduce errors into the markup. Additionally, JSON supports arrays, which may clash with the existing nested structure of HTML.

Consider a non-standard tag like Google's media. It represents a media layer that may have one or multiple nested audio tags but with attributes that are generally applicable.

Consider the example:

<speak>
  <seq>
    <media repeatCount="3" soundLevel="+2.28dB"
      fadeInDur="2s" fadeOutDur="0.2s">
      <audio speed="200%"
        src="https://actions.google.com/.../cat_purr_close.ogg"/>
    </media>
  </seq>
</speak>

In this case, would this be represented as just the media attributes embedded in JSON on the parent? Would the integrator then need to keep these in memory for nested children?

<div data-ssml='{"media": {"repeatCount": 3, "soundLevel": "+2.28dB", "fadeInDur": "2s", "fadeOutDur": "0.2s"}}'>
    <span data-ssml='{"audio": {"speed": "200%", "src": ".."}}'>Audio description</span>
</div>

This seems prone to typos and may blend HTML and SSML metadata in a way that may be hard to follow.
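
To make the fragility concern concrete, here is a minimal sketch of what a consumer might see when a JSON-valued attribute contains a single slip, such as `=` in place of `:`. The attribute values and the `read_ssml_json` helper are hypothetical and only loosely modeled on the `data-ssml` example; a real integrator could recover differently.

```python
import json

# Hypothetical attribute values, loosely modeled on the data-ssml example.
valid = '{"media": {"repeatCount": 3, "soundLevel": "+2.28dB"}}'
typo = '{"media": {"repeatCount": 3, "soundLevel"="+2.28dB"}}'  # "=" instead of ":"


def read_ssml_json(value):
    """Return the parsed dict, or None when the attribute is not valid JSON."""
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        # One malformed pair discards every property in the attribute,
        # including the ones that were written correctly.
        return None


parsed_valid = read_ssml_json(valid)
parsed_typo = read_ssml_json(typo)
print(parsed_valid)
print(parsed_typo)
```

In this sketch the correctly written `repeatCount` is lost along with the mistyped `soundLevel`, because the value is parsed as a single unit.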

Doing this as separate attributes brings the entire set of metadata closer to HTML in a way that feels more consistent with other HTML capabilities.

<div data-ssml-media-repeat-count="3" data-ssml-media-sound-level="+2.28dB" data-ssml-media-fade-in-dur="2s" data-ssml-media-fade-out-dur="0.2s">
    <span data-ssml-audio-speed="200%" data-ssml-audio-src="...">Audio description</span>
</div>

This is a somewhat more complicated transition, but it may provide additional benefits that I detail below. I recognize that media is a special case and that the proposed attributes do not involve any nesting. However, I do wonder how this spec would need to change if a new SSML tag that nests children is ever added.
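
As a rough illustration of the nesting question, here is a minimal sketch of how an integrator might rebuild nested SSML from multi-attribute markup while keeping only a stack of open elements in memory. The `data-ssml-<element>-<property>` naming, with kebab-case mapped to camelCase, is an assumption taken from the example above and not something the spec defines; `SsmlBuilder` and `to_camel` are hypothetical names.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

PREFIX = "data-ssml-"  # assumed attribute prefix, taken from the example


def to_camel(kebab):
    """Convert "fade-in-dur" to "fadeInDur"."""
    head, *rest = kebab.split("-")
    return head + "".join(part.capitalize() for part in rest)


class SsmlBuilder(HTMLParser):
    """Rebuilds SSML from data-ssml-* attributes (well-formed input assumed)."""

    def __init__(self):
        super().__init__()
        self.out = []
        # One entry per open HTML element: its SSML tag name, or None.
        # Void elements such as <br> are not handled in this sketch.
        self.stack = []

    def handle_starttag(self, tag, attrs):
        ssml_tag, props = None, []
        for name, value in attrs:
            if name.startswith(PREFIX):
                element, _, prop = name[len(PREFIX):].partition("-")
                ssml_tag = ssml_tag or element  # first element name wins
                props.append(f"{to_camel(prop)}={quoteattr(value)}")
        if ssml_tag:
            self.out.append(f"<{' '.join([ssml_tag, *props])}>")
        self.stack.append(ssml_tag)

    def handle_data(self, data):
        if any(self.stack):  # keep text only inside an SSML context
            self.out.append(escape(data))

    def handle_endtag(self, tag):
        if self.stack and (ssml_tag := self.stack.pop()):
            self.out.append(f"</{ssml_tag}>")


markup = (
    '<div data-ssml-media-repeat-count="3" data-ssml-media-fade-in-dur="2s">'
    '<span data-ssml-audio-speed="200%">Audio description</span>'
    '</div>'
)
builder = SsmlBuilder()
builder.feed(markup)
ssml = "".join(builder.out)
print(ssml)
```

Under these assumptions the HTML nesting itself carries the SSML nesting, so a newly added SSML tag that nests children would need no extra bookkeeping beyond the stack.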

One benefit of the multi-attribute approach is better support for non-standard attributes: an integrator can ignore specific attributes it does not support, while other integrators can extend the SSML spec in a more adaptable way. Rather than requiring browsers to validate JSON against a known spec, splitting up the attributes lets them focus on only the attributes they need to know about, while a TTS engine can adopt spec improvements on its own.
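
A minimal sketch of that idea, assuming an imaginary integrator that understands only two attributes: it reads every `data-ssml-*` attribute, acts on the ones in its `SUPPORTED` set, and skips the rest without any schema validation. The `SUPPORTED` set and the `SsmlAttributeCollector` class are hypothetical, and the attribute names are borrowed from the example above.

```python
from html.parser import HTMLParser

# Hypothetical set of attributes that this imaginary integrator understands.
SUPPORTED = {"data-ssml-media-repeat-count", "data-ssml-audio-src"}


class SsmlAttributeCollector(HTMLParser):
    """Collects data-ssml-* attributes, split into supported and ignored."""

    def __init__(self):
        super().__init__()
        self.supported = []
        self.ignored = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not name.startswith("data-ssml-"):
                continue  # ordinary HTML attribute, not our concern
            target = self.supported if name in SUPPORTED else self.ignored
            target.append((tag, name, value))


markup = (
    '<div data-ssml-media-repeat-count="3" data-ssml-media-sound-level="+2.28dB">'
    '<span data-ssml-audio-speed="200%" data-ssml-audio-src="...">Audio description</span>'
    '</div>'
)

collector = SsmlAttributeCollector()
collector.feed(markup)
print(collector.supported)
print(collector.ignored)
```

Unknown attributes end up in `ignored` without affecting the supported ones, which is the per-attribute isolation described above.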

Another benefit concerns tooling. Separate attributes would make it easier for tool makers to provide browser extensions that extend HTML DevTools and let a developer make atomic changes to these attributes. The tooling can expose these attributes more easily, then update a single attribute of the element and simulate the change. One example is Angular DevTools, which makes Angular, built on top of HTML, easier to debug.

SSML today largely follows standards, but cloud speech synthesis platforms have various non-standard features, so it may be beneficial to adopt an approach that can accommodate this flexibility. Because adding SSML attributes can have a large impact on pages, making tooling easy to build will allow developers to better tune their SSML so that it sounds correct.

Answer 1 · 2021-06-23T14:44:29.000Z

Thank you for the comment. It's still being reviewed by the TF.

Answer 2 · 2021-06-23T14:45:39.000Z

+1 to @AutoSponge, there is a lot here to digest and review. Appreciate the time and effort put into the response!