QE for HTML input doesn't work
abhi-agg opened this issue · 28 comments
A quick experiment with wasm test page for html text translation shows weird byte ranges for words.
An example:
Sentence: "How are <b>you</b> doing?"
Translated sentence: "Wie geht <b>es dir</b>?"
"words":["Wie", "geht", "<b", "es d"]
"wordByteRanges":[{"begin":0,"end":3}, {"begin":4,"end":8}, {"begin":9,"end":11}, {"begin":12,"end":16}]
Looks like, QE ignores tags completely (treating them as if they are non-existent) in the translated sentences and compute the byte ranges of the words in the translated sentences. Is it something that needs to be fixed or am I doing something wrong at my end?
Attaching the image for detailed results.
Looks like, QE ignores tags completely (treating them as if they are non-existent) in the translated sentences and compute the byte ranges of the words in the translated sentences.
Because I translated How are you doing?
after removing all the tags from html sentence and I get the same byte ranges for both html and non-html versions.
Ah, I totally missed that. I was under the impression that QualityEstimation scores uses those same offsets that are stored in the AnnotatedText
class that HTML updates. It does… but it aggregates them into words and then stores its own byte offsets all before the HTML is inserted into the translation.
I'll see whether I can make the QualityEstimator
class work on token offsets instead. Then it automatically stays up to date with the byte offset updates that HTML makes. (And it would also benefit from changes like #298.)
I'll see whether I can make the QualityEstimator class work on token offsets instead.
QE uses words (not continuous subwords), I don't expect the above approach to fare well.
I have always wanted QE to use Annotation after relaxing the continuity constraints. Then we should be able to uniformly update QE "annotation" (based on words, no continuity) and plaintext vocab "annotation" (based on subwords, sentence piece guaranteed continuity).
This will break the negotiated API between QE and Mozilla, but I strongly believe is the proper way to go. Especially since #298 also points towards this. In effect, this would mean pushing down some of the "insert" tag (opening or closing) before token t
as a method on to Annotation, which automatically fixes the rest of the Annotation. While I do not know the full detail, I understand that in the current state of HTML you are editing Annotation
outside the class to accomplish what is currently happening in HTML - some of this would be moving in and contained within Annotation. This means you can fix up the QE "Annotation" the same way you fix up the plain-text annotations.
I do not expect the above to be trivial.
QE is slated to be implemented in W5 (March 7-11). HTML is expected to be implemented in W3 (February 21-25), which is next week. I guess it's reasonable to keep this soft at W4 here and prioritize based on how HTML completes or not by W3.
QE uses words (not continuous subwords), I don't expect the above approach to fare well.
My idea was to remove the subwordToWord
call here:
That call just turns the token ranges that QualityEstimator already uses internally into byte ranges:
bergamot-translator/src/translator/quality_estimator.cpp
Lines 270 to 285 in 2844ced
That step is something a client can easily do itself, it has access to the AnnotatedText.
Hell, I could even integrate it in a way similar to 83e869c at that point, and have quality estimation scores be part of the output HTML. Then the client wouldn't need to do any work at all. But that all depends on the scenario in which we need both HTML and quality scores at the same time.
QE is slated to be implemented in W5 (March 7-11). HTML is expected to be implemented in W3 (February 21-25), which is next week. I guess it's reasonable to keep this soft at W4 here and prioritize based on how HTML completes or not by W3.
The whole idea of filing this issue was based on @kpu's suggestion (btw @kpu thanks a lot for this suggestion 👍🏾) in last plenary to start integrating and find bugs quickly so that technical support from involving partners could be solicited asap.
QE feature is definitely going to be a part of the final delivery and so is the HTML. In my opinion, this fix doesn't have to wait till W4. We need to test QE again after this bug is fixed and that might resurface some new issues which further might take some time to fix.
All of this goes in general for anything that is becoming a part of the final delivery.
cc @andrenatal @lonnen
@jelmervdl is working on it.
My idea was to remove the
subwordToWord
call here:That call just turns the token ranges that QualityEstimator already uses internally into byte ranges:
That step is something a client can easily do itself, it has access to the AnnotatedText.
@jelmervdl Thanks for working on it. I would add one thing. If I remember correctly, the translation models (and perhaps supervised QE models as well) provide scores of the "subwords" and then some post-processing is done to convert the "subword" scores to "word" scores in the engine. We discussed in the beginning with the QE team and @jerinphilip that it is better for the client to receive the quality scores at "word" level so that client doesn't need to do this post-processing. It is more efficient to do it in the engine as it has greater control and access to more information, plus avoiding replicating the same logic for every client. The client was kind of made oblivious to the concept of "subword" when the html translation came inside the engine.
Please correct me if I am wrong somewhere. Open for discussion here if keeping it the same way for html text poses a technical challenge 👍🏾
@abhi-agg @andrenatal Please take a look at #357 and #358 to express an opinion about API.
#357: Byte spans expanded to cover HTML removal/insertion. (I note @abhi-agg is a reviewer!)
#358: explicit HTML tags inserted with QE scores as an attribute.
Of course QE words and HTML tags are unsynchronized i.e. a QE word can span over an HTML open or closing tag. So #358 repeats the QE score tag as necessary to cover the QE word.
We discussed in the beginning with the QE team and @jerinphilip that it is better for the client to receive the quality scores at "word" level so that client doesn't need to do this post-processing.
I agree with the reasoning here. In both #357 I effectively moved the subwordToWord
call to the WASM bindings part. In Javascript you'll still receive scores per "word" (where the byterange for that word can include some HTML tags as well, but never more text)
In #358 you get this behaviour for free since I assign the same <font x-bergamot-word-score="">
tag to all subwords in a word, and the HTML insertion part is smart enough to then group all subwords into the same font tag.
Open for discussion here if keeping it the same way for html text poses a technical challenge
#358 looks promising. To carry forward, we need to reach synchronization on the following:
- Is QE intended by the extension devs to be operational for Outbound translation? This would mean we will have to look into #357 with plaintext, no HTML.
- Is there a mockup for the UI / Interactions with QE from the extension dev side? If #358 is working with HTML render, it will need to be coordinated with how the JS or CSS at extension interacts for this purpose. There's no ask for anything concrete here, but a loose sketch of what is planned to fit the active pull request to the needs. @jelmervdl appears to have shown a demonstration of how 358 is intended to be used in the description.
-
QE has only delivered three models and they're all out of English, so there is no bidirectional QE. Which is sad because QE might have been useful in that context. However, text/plain translation remains a requirement.
-
There is a mockup. See https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf in the private repo.
There is a mockup. See https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf in the private repo.
I found the following underline based proposal, and a few variants in Section 5 relating to QE.
From meetings, I think it's important to discuss the following. I think @jelmervdl's render might have to do the thresholding (in C++) that was earlier meant for @abhi-agg to do to get the probability values to classes (Section 5: Major/Critical, Major/Minor/Critical, Poor/Ok). Right now we're exporting real values as data. I'm not sure if the decision rule can be applied in CSS (I've lost track of how capable it has become), @jelmervdl's currently using continuous colouring. There is the alternate option of picking the data from attributes and doing the attaching the class to the element in JS.
All mock-ups seem to assume plain text (e.g. a form input), not HTML. In that case, what is currently in main will be sufficient. #357 won't hurt, but the offsets weren't broken to begin with in this scenario.
To clarify a bit further: #357 just fixes the byte offsets in case the input is marked as HTML. If the input was text, there was no issue. So for form translation (which I assume are always plain text unless you're going to attempt auto translation in contenteditable
elements…) it is not necessary.
#358 adds the quality scores as HTML to HTML output, but this only works if the input was HTML to begin with. Of course, it could also be used with plain text if you'd encode all entities in the text before sending it to the translator (and mark it as HTML). The output can easily be rendered like the mock-ups. Thresholding can easily be done in the extension using Javascript or CSS. The added tags that don't meet the threshold won't affect rendering. You might need to be a bit creative with horizontal padding to fill in the gaps between consecutive words that do meet the threshold though. The way I insert the tags, it will leave the spaces that are not part of the word outside any of the score tags. But that's not insurmountable.
Written to @abhi-agg and @andrenatal seeking comment on #357 and #358 and am coming to the notion #358 is a better option. Blocked awaiting their feedback.
I have few queries here:
- Both #357 and #358 solve the same issue. i.e. returning QE scores for HTML text translation and not the plain text translation.
- For plain text translation, QE scores will continue to be returned as before (i.e. byte ranges representing the "words" and their corresponding scores). This means JS side will format the text for QE by processing these byte ranges and scores.
Am I right?
@abhi-agg Plain text remains as it was already working. However, what I would recommend as the easiest thing for you is to follow the suggestion in: #355 (comment) specifically:
HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)
Then you don't have to handle byte ranges in either case.
HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)
Using this approach means always setting ResponseOptions::HTML flag to true
. Right? And in that case wouldn't the engine crash for sentences like Hello < world
(e.g this issue).
HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)Using this approach means always setting HTML flag to
true
. Right? And in that case wouldn't the engine crash for sentences likeHello < world
?
HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)Using this approach means always setting ResponseOptions::HTML flag to
true
. Right? And in that case wouldn't the engine crash for sentences likeHello < world
(e.g this issue).
Hello < world
is a valid HTML snippet. Hello < world
is not. That's what encode refers to.
I suppose one could always have HTML mode on then implement text/plain translation by encode -> pass as HTML -> decode. That would be less efficient though.
For QE, to render the colors, you are presumably doing HTML rendering in all cases anyway, hence the suggestion of having us generate the HTML output even if the input is (escaped) text/plain.
Little Javascript example of how to encode plain text to HTML for when you want to submit plain text to the engine but want HTML as output:
function encodePlainTextAsHTML(text) {
const div = document.createElement('div');
div.appendChild(document.createTextNode(text));
return div.innerHTML;
}
And then also set responseOptions.html = true.
All mock-ups seem to assume plain text (e.g. a form input), not HTML.
If plaintext (textarea, document) is being translated, I don't think there's a way to annotate it with reds or colors (https://stackoverflow.com/a/12831155/4565794) without converting to HTML first.
Suggest taking a look at Grammarly for an example. It's essentially the same problem, highlight poor confidence tokens and leave the others be. The textarea or input field is kept hidden, a contentEditable
div takes over. The rich text annotations go on the div which support HTML. The textarea or form field can be populated with just the text content.
@jerinphilip I assume that people type in the translation input field, and the quality is shown in a separate translation output field (which isn't editable when the UI is in this state…). Only once the translation is accepted you'd need to convert the translation output field to plain text to be inserted in the actual form field.
Hello < world
is a valid HTML snippet.Hello < world
is not. That's what encode refers to.
Oh. Thanks for correcting 👍🏾
Little Javascript example of how to encode plain text to HTML for when you want to submit plain text to the engine but want HTML as output:
function encodePlainTextAsHTML(text) { const div = document.createElement('div'); div.appendChild(document.createTextNode(text)); return div.innerHTML; }And then also set responseOptions.html = true.
@jelmervdl With responseOptions.html = true
, the sentence Hello < world
if passed as it is will break the engine but the encoded version of it will not. Right?
@abhi-agg Yep! With responseOptions.html = true
:
Okay: Hello < world
Not okay: Hello < world
@jelmervdl Thanks for clarifying. Please correct me if I am wrong but with the approach #358 (comment), extension will still have to parse the html translation result to show color-coded words based on thresholds.
@abhi-agg How exactly are you showing the color-coded words? Wouldn't the natural way to do that be rendering the HTML with some CSS?
I've added an example of how the output of #358 can be used to the demo page in that branch. That example treats the input field as plain text, escapes it, and renders the translation output as HTML. Then with a little bit of Javascript for thresholding and CSS I render sentence and word level quality indicators.
(Based on thresholds that have no meaning, I just wanted some thresholds that are met often enough for screenshot purposes.)