w3c/adapt

Provide for language and direction metadata for content to be simplified

aphillips opened this issue · 25 comments

(From your I18N self-review #133)

We do define attribute names and values as tokens that are human readable but we don't expect those to be localized. However we do have content simplification that does have alternative text descriptions.

Also, might the language and base direction of the text being simplified affect the interpretation of the symbols? There doesn't appear to be language or direction metadata associated with the symbol regime.

Our concern here, in case it was unclear, is that language and direction metadata (such as the lang and dir attributes in HTML) can affect the processing of the assistive technology or might be useful to the assistive technology in doing language-related processes (such as selecting a dictionary or symbol set), so you should pay extra attention to providing access to and encouraging the use of metadata.

Thank you for your feedback.

This has started an interesting discussion. It is worth noting that our
use-cases are to map between different symbol sets in the same natural
language. We are not enabling translation for different symbol sets across
different natural languages at this time.

The reference numbers are from Bliss symbolics, which was designed to be an
international language similar to Esperanto. They already have a published
set of usage rules at:
http://www.blissymbolics.org/images/bliss-rules.pdf

Also, we have added a Hebrew example for symbols illustrating how to form a
conjugated term (see
https://raw.githack.com/w3c/personalization-semantics/edits07062020/content/index.html#symbol-explanation
).

We have also added a best practice page which may also help. See
https://github.com/w3c/personalization-semantics/wiki/Best-practices-for-symbol-values
.

Finally, we intend to look into adding another page-level metadata value
via schema.org (https://schema.org/accessibilityFeature) possibly in
coordination with digital publishing (
https://www.w3.org/wiki/WebSchemas/Accessibility#accessibilityFeature_in_detail
)

Please let us know if you have more questions at this time.

Closing issue since we have not been advised of any objections.

I was actioned by I18N to reopen this issue.

We rely on the hosting document to specify the language and direction. The user agent implementing our specifications is expected to adhere to the language and direction specified. Thus, we do not respecify that information.
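As a rough illustration of what relying on the hosting document could mean for a consumer of the markup, here is a minimal Python sketch that resolves the inherited `lang` and `dir` for each element carrying `data-symbol`. The class and function names are hypothetical, the `lang="he"` on the `div` is an assumption added for the example, and the default direction of `ltr` is a simplification; this is not the specification's algorithm.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class SymbolContext(HTMLParser):
    """Record (symbol ids, lang, dir) for each element with data-symbol."""

    def __init__(self):
        super().__init__()
        # Inherited (lang, dir); assumption: no language and ltr by default.
        self.stack = [("", "ltr")]
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        lang, direction = self.stack[-1]
        # An element's own lang/dir override what it inherits.
        lang = attrs.get("lang", lang)
        direction = attrs.get("dir", direction)
        if "data-symbol" in attrs:
            self.found.append((attrs["data-symbol"], lang, direction))
        if tag not in VOID_TAGS:
            self.stack.append((lang, direction))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()


def symbol_contexts(html):
    parser = SymbolContext()
    parser.feed(html)
    parser.close()
    return parser.found


sample = (
    '<html lang="en-us"><body>'
    '<h3><span data-symbol="22909">Ingredients</span></h3>'
    '<div dir="rtl" lang="he"><h3><span data-symbol="22909">רכיבים</span></h3></div>'
    '</body></html>'
)
print(symbol_contexts(sample))
```

The same symbol ID is reported once with the page-level language and direction and once with the values set on the enclosing `div`, without the `data-symbol` attribute restating either.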

The I18N WG wasn't clear on what the resolution of this issue was and whether we understand the reply (and hence actioned me with following up). We see that the symbols are used to replace natural language that appears or would otherwise appear in the page. Does this mean that the "natural language" of the symbols is expected/required to be the same as that of the language it is simplifying? Does the interpretation of symbols depend on the natural language (or its grammar) in some way? If the language can be established, various other assistive technologies might make use of that presentationally or during processing.

Direction also has us concerned. Depending on how the symbols are encoded, the Unicode bidirectional algorithm (UBA) might cause display problems for users. Consider this example:

<span data-symbol="13621 12324 17511">cup of Tea</span>

Let's replace "13621 12324 17511" with "A", "B", and "C" for illustration. In a left to right context, the symbols might render as: ABC. In a right to left context the same symbols (being, hopefully, neutrals) might display as CBA (because they are read from right to left). Is that your expectation (e.g. Arabic readers read symbols right-to-left)? How would direction and presentation be managed? Is the source direction inherited from the replaced/simplified text? Or is it fixed?
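The ABC versus CBA question can be sketched in a few lines of Python. This is a deliberate simplification and not an implementation of the UBA: it assumes every symbol is a single direction-neutral unit with no strong characters around it, so the whole run simply takes the base direction. The function name is hypothetical.

```python
def visual_order(symbols, base_direction):
    """Return the symbols in left-to-right screen order.

    Simplification: each symbol is treated as one neutral unit, so the
    run is laid out entirely according to the base direction.
    """
    if base_direction not in ("ltr", "rtl"):
        raise ValueError("base_direction must be 'ltr' or 'rtl'")
    ordered = list(symbols)
    # In an rtl context the first logical item sits at the right edge,
    # so the left-to-right screen order is the reverse of logical order.
    return ordered if base_direction == "ltr" else ordered[::-1]


print("".join(visual_order(["A", "B", "C"], "ltr")))  # ABC
print("".join(visual_order(["A", "B", "C"], "rtl")))  # CBA
```

The logical order in the attribute value is the same in both cases; only the on-screen arrangement differs.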

r12a commented

I didn't review this spec, but i took a closer look at it over the past week. Here are some of the additional questions i have:

[1] Rendering

How will the person reading the text actually see the Bliss symbols? It would be helpful, i think, to include some pictures in the spec to show an example of what the reader will actually see.

[2] Differing syntaxes

As i understand it, Bliss Symbols are written in a language with its own syntax, which generally uses Subject Verb Object (SVO) order, eg. John likes Jane indicates that John is doing the liking.

One of many syntactic differences is that other languages may have very different word orders, such as Subject (or Topic) Object Verb (eg. Japanese, Hindi, etc.) or Verb Subject Object (Arabic). Furthermore, the subject in many languages is not expressed separately as a pronoun, but is omitted altogether (Japanese, …), described by the verb conjugation (Arabic, Spanish, …), or expressed in particles added to the end of the word (as in agglutinative languages).

I mention this because it seems that the spec allows the Bliss Symbols to be substituted for just the key words in a sentence. Presumably, that means that individual words would be replaced by symbols, and other language words would be left visible? And this would suggest an arrangement of symbols which follows the syntax of the underlying language, rather than the SVO order in which one normally reads Bliss Symbols?

Is this correct? And have you tested it out with users who speak syntactically diverse languages to see whether it actually works?

[3] Grammatical compatibility

Also, if only key words are rendered using symbols i wonder whether the Bliss Symbols can carry sufficient morphological information, given that the sequencing of symbols won't follow the normal Bliss Symbols language syntax, and that language can express morphological information in very different ways from that used for normal sequences of Bliss Symbols.

For example, numerous languages have singular, dual, and more-than-two forms of words (eg. Russian, Arabic, etc). It's not clear to me that the syntax and morphology used for Bliss Symbols allows for such shades of meaning or for modifications of nouns/verbs that reflect such distinctions.

If only key words are represented by Bliss Symbols, how will that work, especially given that such distinctions can also be reflected in other words in a sentence, often in a way that helps clarify the meaning in flexible word orders?

[4] Spaces

The relationships between Bliss Symbols, and therefore the meaning of a sequence of symbols, normally rely on different types of space, including full/half/quarter spaces. The arrangement of symbols such as

<span data-symbol="13621 12324 17511">cup of Tea</span>

doesn't allow for such variations in spacing, and will therefore presumably make it difficult for users to understand the message.

[5] Combining marks

Verb tense and voice markers in normal Bliss Symbols are indicated using marks that combine with the symbols. How is that indicated when only numbers are used?

I may have missed something, but i don't recall that such markers have numeric identifiers(?). Even if they do, will it be possible to combine them correctly with the Bliss symbols that are displayed?

r12a commented

It looks like the 'indicators' do have numeric identifiers (in the early part of the list).

We've come to think that a joint telecon might be a more helpful way to resolve this issue. If possible, we'd prefer to do that before TPAC. Our regular telecon is 10:00 AM (Boston), in case that might work.
We agree with the concerns we understand I18N is raising here. We aren't transforming content at that level, e.g. if a page is in a right to left lang, we simply inherit that situation.
Perhaps our reference to multiple symbol sets is part of the confusion? These would all be within the same natural language group. The only reason for translation is that some people went to a school where they learned symbol set A, and others to a school where they learned set B. Without our transformations, they can't understand one another--but this is entirely in the realm of AAC symbol sets, and not at all about English, Arabic, or any other natural language.
It may also be helpful to see our brief personalization video. Thoughts?

Oops. I told you the time, but not the day of week. Very sorry. Our telecon is Mondays at 10:00 AM (Boston). We are, of course, open to other options.

r12a commented

So another question to help us prepare for the call:

this is entirely in the realm of AAC symbol sets, and not at all about English, Arabic, or any other natural language.

I don't think we're talking about translation here. We're talking about how these symbols will be used by Arabic, Japanese, Thai, Inuit, etc. content developers and readers to create text that is in their own language – languages with very different approaches to syntax and morphology.

Perhaps something that confuses me is my expectation of what you are using Bliss Symbols for. Bliss Symbols are normally used with a grammar and syntax of their own (subject verb object, 2 types of plural only, etc.). But perhaps what you are doing here is simply adding illustrations for key words in a sentence, rather than producing the content in the 'language' of Bliss Symbols? Are you simply drawing on the pictures in the BS inventory for those illustrations?

Such an approach (for which i'm not yet considering the merits/demerits) would presumably allow for interchanging the symbol sets as i saw in the video. Although, it seems that you are using BS ids in the markup, regardless of the symbol set in use – which is intriguing.

With the proviso that Augmentative and Alternative Communications (AAC) is not my specific area of accessibility expertise, I believe it's correct to say that users of symbol sets like Bliss use them because they're unsuccessful with usual language orthography. While this undoubtedly falls on a spectrum, i wouldn't expect we'd render a paper on particle physics using symbolics--an intriguing concept, nevertheless.

I can categorically say we're not setting rules for how AAC users of any particular language culture express themselves. Those are already defined in common usage. We simply facilitate them in web documents marked up using our overlay.

Which brings me to how we specifically use Bliss. The value add we're adopting from Bliss is their inter-symbolic index, which allows us to translate content into whatever set a particular user knows. This is where school A vs. school B comes in.

Perhaps a bit of unrelated parallel history from braille would help here? Before the American Foundation for the Blind (AFB), an organization I worked at for a decade, came into being, the braille user from Massachusetts couldn't read braille published in New York, and neither of these two could read braille published by the Overbrook School in Pennsylvania. These all used similar orthographic conventions, with different symbol definitions. AFB was expressly organized to harmonize a single braille alphabet (known these days as BANA). We need web technology to do similar things for AAC symbol users.

The degree of standard vs symbol markup is up to the page author--possibly the proxy server that applies our overlay. hth!

@r12a writes:

it seems that you are using BS ids in the markup, regardless of the symbol set in use – which is intriguing.

Exactly! Our intention is to use the Bliss IDs as our common 'taxonomy' - each of those IDs being mapped to a word/concept in the Bliss library.

However there are users who may have a preference (read: expectation) of using symbols that are familiar to them already (as opposed to the specific symbols found in Bliss) - they may for example already have the Mulberry Set (https://globalsymbols.com/symbolsets/mulberry?locale=en) installed in their user-agent configuration. The intent is that going forward those symbol sets can map their symbols against the numeric identifier that is the Bliss ID (the "Rosetta stone"). For example, when using Bliss ID # 9683 (Above)

<span data-symbol="9683">Above</span>

Bliss (Above) symbol:
image

OpenMoji (Above) symbol:
image

Mulberry Symbols (Above) symbol:
image
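To make the "Rosetta stone" idea concrete, here is a minimal Python sketch of a lookup keyed on the Bliss ID. The table contents, file paths, and function name are all hypothetical placeholders; real symbol sets would publish their own mappings against the Bliss IDs.

```python
# Hypothetical mapping: symbol set name -> {Bliss ID -> image path}.
# Only the Bliss ID 9683 ("Above") comes from the example above; the
# paths are made up for illustration.
SYMBOL_SETS = {
    "bliss": {"9683": "bliss/9683.svg"},
    "openmoji": {"9683": "openmoji/above.svg"},
    "mulberry": {"9683": "mulberry/above.svg"},
}


def resolve_symbol(bliss_id, preferred_set, fallback_set="bliss"):
    """Return (set name, image path) for a Bliss ID.

    Tries the user's preferred set first, then the fallback set.
    Returns None when neither set has a symbol for the ID.
    """
    for name in (preferred_set, fallback_set):
        path = SYMBOL_SETS.get(name, {}).get(bliss_id)
        if path is not None:
            return name, path
    return None


print(resolve_symbol("9683", "mulberry"))
print(resolve_symbol("9683", "unknown-set"))
```

The markup never changes: `data-symbol="9683"` stays the same, and only the user's configured set decides which image is shown.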


perhaps what you are doing here is simply adding illustrations for key words in a sentence, rather than producing the content in the 'language' of Bliss Symbols

Correct. Please note that at this time we do not expect a direct and 100% accurate 'translation' of the content, but rather a "close enough" conversion. (Conceptually, think of a situation similar to YouTube's auto-captioning function: often good, sometimes bad, but overall better than nothing...)

This does not preclude content authors from using Bliss symbolics as a native writing language, but it was not envisioned as our primary use-case: we anticipate content conversion as being the more likely scenario in the wild.

This issue is pretty much the only substantive issue keeping us from advancing to CR, so we'd dearly love to get a resolution soon. While it may still be helpful to hold a joint mtg during TPAC, we'd really appreciate resolving just this issue soonest. Any chance we could schedule a joint telecon in August? Perhaps on the 23rd or 30th at our usual 10:00 AM Boston? If we need to go into September, the first week isn't good, so would need to be 13th at the soonest. Please advise or propose an alternative--thanks!

Checking this thread, @r12a I'm wondering whether you can make any of the APA WG times that Janina is inviting you to? The intersection of I18n and accessibility is important for us to get right. Real-time discussion might help APA better understand and hopefully address your concerns. Thanks!

Hi all,
I created a calendar with the meeting call-information, please find it at:
https://www.w3.org/events/meetings/ecc75877-fca3-425a-b739-c4aa4c081ce6#joining
according to our joint meeting plan

I am not sure this helps with your discussions, but reading the recent minutes I thought I would attach an image of an English sentence 'I read your red book today' using ARASAAC symbols translated into Modern Standard Arabic (by a colleague) that is read from right to left. Bliss, as a semantic graphical language, allows characters to be adapted to suit any concept (https://www.blissonline.se/chart) and can therefore produce better interpretations of text. However, in the case of the Bliss lists available online, the symbols are being used in a similar way to pictographic symbols rather than as flexible grammatical entities. So there will always be some challenges when supporting literacy skills. But by using simplification techniques as suggested and machine learning, hopefully improved meaning can be achieved, so a word like 'spring' in English can be interpreted correctly. At the moment we depend on just the concepts for look ups across symbol sets in multiple languages but these need to be refined. https://globalsymbols.com/
Arabic English sentence structure ARASAAC

Personalization TF met with @aphillips regarding these concerns. In the discussion it seemed to us that the specification will be valid regardless of language direction. The reason is as follows: current Augmentative and Alternative Communication (AAC) practice is to show one symbol per referent. Since authors associate the symbols with words, the symbols will follow word order correctly.

i18n requested that we demonstrate this with a sample marked up with the data-symbol attribute as per the specification. We created an HTML document featuring a recipe in two languages, in RTL and LTR sections. While we do not yet have an implementation that renders the symbols, reading the HTML will demonstrate the consistency of meaning. Below is the ingredients heading, in both languages:

<html ... lang="en-us">
	...
	<h3 data-number="1.1.1" id="eh3a">
		<span data-symbol="22909">Ingredients</span>
	</h3>
	...
	<div dir="rtl">
		<h3 id="hh3a">
			<span data-symbol="22909">רכיבים</span>
		</h3>
	...

And below is a sentence, "Place the beans in a food processor (or blender, but food processor is easier), pulse the cooked chickpeas to mash them." The English LTR:

<span>Place the </span><span data-symbol="24021">beans</span>
<span> in a </span><span data-symbol="22393">food processor</span>
<span> (or </span><span data-symbol="22392">blender</span>
<span>, but </span><span data-symbol="22393">food processor</span>
<span> is easier), pulse the cooked </span><span data-symbol="24021">chickpeas</span>
<span> to </span><span data-symbol="25581">mash</span><span> them.</span>

Below find the same text in Hebrew, RTL.

<span>מניחים את </span><span data-symbol="24021">הגרגירי חומוס</span>
<span data-symbol="22393">במעבד מזון</span>
<span> (או </span><span data-symbol="22392">בבלנדר</span>
<span>, אבל </span><span data-symbol="22393">המעבד</span>
<span> קל יותר), דוחסים את </span><span data-symbol="24021">החומוס המבושל</span>
<span> כדי </span><span data-symbol="25581">למעוך</span> <span>אותם.</span>

Notice how symbol order is preserved: the IDs appear in the same sequence, but this time they are associated with words that will render RTL.
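The claim that the IDs appear in the same sequence can be checked mechanically. The following Python sketch, with hypothetical class and function names, extracts the `data-symbol` values in document order from markup adapted from the two samples above and compares them. It inspects only the markup's logical order and says nothing about how a user agent would render the symbols.

```python
from html.parser import HTMLParser


class SymbolIdCollector(HTMLParser):
    """Collect data-symbol IDs in the order they occur in the markup."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("data-symbol")
        if value:
            # An attribute may hold several space-separated IDs.
            self.ids.extend(value.split())


def symbol_ids(markup):
    collector = SymbolIdCollector()
    collector.feed(markup)
    collector.close()
    return collector.ids


ENGLISH = """
<span>Place the </span><span data-symbol="24021">beans</span>
<span> in a </span><span data-symbol="22393">food processor</span>
<span> (or </span><span data-symbol="22392">blender</span>
<span>, but </span><span data-symbol="22393">food processor</span>
<span> is easier), pulse the cooked </span><span data-symbol="24021">chickpeas</span>
<span> to </span><span data-symbol="25581">mash</span><span> them.</span>
"""

HEBREW = """
<span>מניחים את </span><span data-symbol="24021">הגרגירי חומוס</span>
<span data-symbol="22393">במעבד מזון</span>
<span> (או </span><span data-symbol="22392">בבלנדר</span>
<span>, אבל </span><span data-symbol="22393">המעבד</span>
<span> קל יותר), דוחסים את </span><span data-symbol="24021">החומוס המבושל</span>
<span> כדי </span><span data-symbol="25581">למעוך</span> <span>אותם.</span>
"""

print(symbol_ids(ENGLISH))
print(symbol_ids(ENGLISH) == symbol_ids(HEBREW))
```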

Attached is the complete file: Personalization-symbol-attribute-example-LTR-RTL.zip

E. A. Draffan, an expert in AAC who developed an Arabic Symbol Dictionary for AAC users, shows this same principle at work in her sample above.

Many thanks to Charles LaPierre for the recipe and Steve Lee for marking up the symbols.

We have an update on our rendering work relating to this issue. Here's an image of how the document @lwolberg posted above looks when rendered by an in-development Chrome extension that adds Bliss symbols:

The multilingual hummus recipe document being rendered with Bliss symbols

It would be possible to share the Chrome extension with some of you (there are some restrictions at present), so you could try this out in-browser on your own machines. However several set-up steps are required in order to get it working, so we were wondering if this output from the extension would suffice?

r12a commented

Thank you for the recipe example. Unfortunately, we still struggled a little. Recipe instructions are not the most useful items of text for showing some of the features we were looking for, such as how to handle syntactic differences across languages, since the short imperatives used tend to start with a verb and contain no subject – an unusual pattern for English (which is normally SVO), but not so for RTL languages like Arabic and Hebrew, which use VSO order in normal sentence patterns.

A sentence or two that show how such different syntaxes would be handled would be useful. We see that the images in #144 (comment) do show different orders for the images in the English and Arabic as they progress from the start of the line (left in English, right in Arabic) to the end. We were looking for confirmation of whether that matches your expectation by looking at the Hebrew examples, but we weren't able to get that confirmation. (Here we were also disadvantaged because we don't read Hebrew and couldn't even copy the text into a translator to figure out which words were which. Nor could we understand the meanings of the symbols shown, so as to easily comprehend what we were trying to compare.)

We noticed that the ordering of the symbols within a square box continues to be LTR when embedded in the RTL text. Can you confirm whether you checked that this is ok for RTL script readers?

We note also that the direction in which the symbols face, where there is an apparent directionality (see for example those that look like ( or contain /, etc.) also appears to remain LTR.

Thank you.

Personalization TF and APA-WG thank you for this response, and the details of SVO and VSO which we were not aware of. However, as you will see below, this concern does not seem critical. AAC is used for procedural texts, and the markup shared above--not the rendering, the markup--shows symbol order is preserved under LTR or RTL.

To be clear, I quote r12a and respond issue by issue:

[r12a wrote] Thank you for the recipe example. Unfortunately, we still struggled a little.... A sentence or two that show how such different syntaxes would be handled would be useful.

[TF Response] There is a critically important reason why we marked up a recipe and not a sentence. We did this on advice received from multiple AAC experts, as follows: AAC is most used on short texts, procedural texts or instructions. For lengthier discursive or narrative texts, AAC users nearly universally turn to audio and video. No AAC expert that we consulted with knew of an AAC user that would want AAC on every word of a story, article or web page: when they have a lot to read, they have it read to them by an assistive technology or turn to an audio or video alternative source.

[r12a wrote] We see that the images in... do show different orders for the images in the English and Arabic..... We were looking for confirmation of whether that matches your expectation....

[TF Response] The Content Module specification stipulates markup, not rendering. Rendering is at the discretion of the user-agent or other technologies downstream and is not in scope of the specification. Matatk shared a possible rendering, and r12a's comment addresses this -- but all of this was provided only as a convenience and is out of scope of the specification. The HTML markup associates a symbol to one or more words, at the discretion of the page's author. This association of symbol to text is unaffected by LTR or RTL of the marked up content.

[r12a wrote] Here we were also disadvantaged because we don't read Hebrew and couldn't even copy the text into a translator...

[TF Response] We sympathize! It was not easy for our TF to scrounge up AAC experts fluent in RTL languages as well as native speakers for an accurate translation. We did just that, at no small effort, to share the representative sample above, the HTML marked up recipe. We repeat that this HTML sample has been available to i18n since the beginning of December 2021.

We conclude by repeating the point made above: Notice how symbol order is preserved: the IDs appear in the same sequence, but this time they are associated with words that will render RTL.

r12a commented

Thank you for your response. The i18n WG discussed your comments during their telecon. We will now close our tracker for this issue.

However, we would like to note that we raised a couple of questions about the rendering which weren't replied to directly.

We noticed that the ordering of the symbols within a square box continues to be LTR when embedded in the RTL text. Can you confirm whether you checked that this is ok for RTL script readers?

We note also that the direction in which the symbols face, where there is an apparent directionality (see for example those that look like ( or contain /, etc.) also appears to remain LTR.

We hear your argument that you consider these questions to be outside the scope of your document, but we are concerned that the overall solution, for which your document provides a contributory part, will need to ascertain and meet user requirements for these points. We are wondering what can be done to ensure that these concerns are raised with the appropriate group?

We will test for these situations during CR with implementations. We thank you for putting emphasis on this; as a result, we now have experts that we can turn to as we do implementations.