keymanapp/lexical-models

[sil_kmhmu] LM only matches the last character being typed

Closed this issue · 4 comments

From the outset, I see that the predictions only match the last character typed rather than the continuous string preceding it. For example, when one types ເ, the model tries to match words beginning with that character, but when the next character (ຄ) is typed, the model tries to match words beginning with ຄ, not with the combination of the two (ເຄ). As a result, one may never get suggestions for words like ເຄືອນ. Let me know if the description is unclear.

The lexical model package: https://drive.google.com/file/d/1Gsz6U5Ww45AjWbfiilLdmz7qYeg0mnKz/view?usp=sharing

The associated keyboard: https://keyman.com/keyboards/sil_kmhmu

So, weirdly enough, the default word breaker simply does not work for Kmhmu: it follows Unicode's default word boundary specification, which splits at every single Lao character. It's as if the standard gave up on Lao, stating:

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification.

[Source]

Until we have an easy way to customize the word boundary specification, Kmhmu will need a custom word breaker. Here's one that I think might work:

/*
  ptwl1 1.0 generated from template.

  This is a minimal lexical model source that uses a tab delimited wordlist.
  See documentation online at https://help.keyman.com/developer/ for
  additional parameters.
*/

const source: LexicalModelSource = {
  format: 'trie-1.0',
  /**
   * A custom word breaker, because the unmodified Unicode Default Word Boundary
   * specification breaks at every single Lao character.
   *
   * We run the default word breaker, and then join contiguous spans that
   * contain Lao script text.
   */
  wordBreaker: function(text): Span[] {
    /* All assigned characters in this range: https://www.unicode.org/charts/PDF/U0E80.pdf */
    const LAO = /^[\u0e81\u0e82\u0e84\u0e86-\u0e8a\u0e8c-\u0ea3\u0ea5\u0ea7-\u0ebd\u0ec0-\u0ec4\u0ec6\u0ec8-\u0ecd\u0ed0-\u0ed9\u0edc-\u0edf]+$/;

    /* Split, as normal. */
    let originalSpans = wordBreakers['default'](text);

    /* The rest of the algorithm only works if there's at least one span, so exit early if there are no spans to join. */
    if (originalSpans.length < 1) {
      return [];
    }

    /* We're going to be amending the output spans, starting with the first one. */
    let spans = [];
    spans.push(originalSpans[0]);

    for (let i = 1; i < originalSpans.length; i++) {
      let previous = spans[spans.length - 1];
      let current = originalSpans[i];
      
      if (spansAreBackToBack(previous, current) && isLao(previous) && isLao(current)) {
        /* previous and current spans are contiguous Lao text. Join them! */
        spans[spans.length - 1] = concatenateSpans(previous, current);
      } else {
        spans.push(current);
      }
    }

    return spans;

    /* === Helper functions === */

    function isLao(span: Span) {
      let text = span.text;
      return LAO.test(text);
    }

    function spansAreBackToBack(former: Span, latter: Span): boolean {
      return former.end === latter.start;
    }

    function concatenateSpans(former: Span, latter: Span) {
      if (latter.start !== former.end) {
        throw new Error(`Cannot concatenate non-contiguous spans: ${JSON.stringify(former)}/${JSON.stringify(latter)}`);
      }

      return {
        start: former.start,
        end: latter.end,
        length: former.length + latter.length,
        text: former.text + latter.text
      };
    }
  },
  sources: ['wordlist.tsv'],
};
export default source;
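To make the joining step easier to follow outside the Keyman toolchain, here is a minimal standalone sketch. It makes several assumptions: the `Span` interface is a hypothetical stand-in for the real type, and `splitPerCharacter` is a made-up substitute for `wordBreakers['default']` that simply emits one span per non-space character, which is how the default breaker is described as behaving on Lao text in this thread. `joinLaoSpans` is an illustrative name for the same merge loop used in the model above.

```typescript
// Hypothetical stand-in for the Span type used by the model source.
interface Span {
  start: number;
  end: number;
  length: number;
  text: string;
}

// Same character class as the LAO pattern in the model source.
const LAO = /^[\u0e81\u0e82\u0e84\u0e86-\u0e8a\u0e8c-\u0ea3\u0ea5\u0ea7-\u0ebd\u0ec0-\u0ec4\u0ec6\u0ec8-\u0ecd\u0ed0-\u0ed9\u0edc-\u0edf]+$/;

// Assumption: imitates the default breaker on Lao text by emitting
// one span per character and dropping spaces. Not the real breaker.
function splitPerCharacter(text: string): Span[] {
  const spans: Span[] = [];
  for (let i = 0; i < text.length; i++) {
    if (text[i] !== ' ') {
      spans.push({ start: i, end: i + 1, length: 1, text: text[i] });
    }
  }
  return spans;
}

// Merge adjacent spans when both are Lao text and they touch end-to-start.
function joinLaoSpans(original: Span[]): Span[] {
  const joined: Span[] = [];
  for (const current of original) {
    const previous = joined[joined.length - 1];
    if (
      previous &&
      previous.end === current.start &&
      LAO.test(previous.text) &&
      LAO.test(current.text)
    ) {
      joined[joined.length - 1] = {
        start: previous.start,
        end: current.end,
        length: previous.length + current.length,
        text: previous.text + current.text,
      };
    } else {
      joined.push(current);
    }
  }
  return joined;
}

// Input is "ເຄ ເຄືອນ", written with escapes so the code points are unambiguous.
const input = '\u0ec0\u0e84 \u0ec0\u0e84\u0eb7\u0ead\u0e99';
const result = joinLaoSpans(splitPerCharacter(input));
console.log(result.map((span) => `${span.start}-${span.end}: ${span.text}`));
```

With per-character input spans, the two characters before the space are merged into one span and the five after it into another, so the prefix handed to the trie is the whole Lao run rather than its last character.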

EDIT: Testing this model on my own machine shows that it works for the example given, namely:

$ lmlayer-cli -f sil_international.kjg-laoo.ptwl1.model.js -p 'ເຄ'
What was typed First suggestion Second suggestion Third suggestion
ເຄ ເຄືອນ ເຄຣືອງ ເຄາະ

@MakaraSok can you update this issue with progress?

@MakaraSok Has this issue been resolved?

@DavidLRowe Yes, it was resolved a while ago.