Refactor shaping and encoding checks structure
We would like to move to a more general model for checks that would entail:
- encoding checks (e.g. are all base characters for English included?),
- shaping checks (e.g. does the joining behaviour in Arabic work?).
We prefer to run these checks automatically, derived from the language definitions, rather than list the checks for each language. The advantage is better scalability and resilience to human error (we welcome non-technical contributors).
My idea for this was to implement the different kinds of checks as Python classes with a common signature. The checks would have access to the Shaper so they can perform any shaping they need when passed a set of characters.
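To make the idea concrete, here is a minimal sketch of what such a common signature could look like. All names here (`Check`, `Shaper`, `run`, the stand-in base characters) are illustrative assumptions, not the actual API in the branch:

```python
class Shaper:
    """Stand-in for the real shaping engine wrapper."""
    def shape(self, text):
        # Placeholder: a real implementation would call a shaping engine
        # and return positioned glyphs; here we just pair codepoints with indices.
        return [(ord(c), i) for i, c in enumerate(text)]

class Check:
    """Common interface all checks would share."""
    name = "base check"

    def run(self, characters, shaper):
        raise NotImplementedError

class EncodingCheck(Check):
    """Example: verify that all required base characters are present."""
    name = "encoding"

    def run(self, characters, shaper):
        required = {"a", "b", "c"}  # stand-in for a language's base characters
        return required.issubset(characters)

check = EncodingCheck()
print(check.run(set("abcdef"), Shaper()))  # True
```

A shaping check would follow the same `run(characters, shaper)` signature but actually call `shaper.shape(...)` and inspect the result.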
To avoid overly explicit definitions of which checks should run for which languages, I was considering a matching mechanism where each check would define a set of conditions. When a check's conditions are met for a language, it is run. Such conditions could be:
- presence of an orthography attribute
- script
- presence of a particular Unicode codepoint
So, for example, a general encoding check would trigger on any presence of base characters. For Arabic shaping, the condition could be the script being Arabic. And for the yet-to-be-implemented shaping checks for Brahmic script conjuncts, the presence of the specific orthography attribute under which the conjuncts are stored would opt a language in to those checks.
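The matching mechanism above could be sketched roughly like this. The attribute names (`script`, `conjuncts`) mirror the examples in this thread but are assumptions about the data model, not the real schema:

```python
class Check:
    # Conditions this check requires of a language/orthography definition.
    # A value of True means "the attribute must be present"; any other value
    # means "the attribute must equal this value".
    conditions = {}

    @classmethod
    def applies_to(cls, orthography):
        for key, expected in cls.conditions.items():
            value = orthography.get(key)
            if expected is True:
                if value is None:
                    return False   # required attribute missing: skip this check
            elif value != expected:
                return False       # attribute absent or has the wrong value
        return True

class ArabicShapingCheck(Check):
    conditions = {"script": "Arabic"}

class ConjunctCheck(Check):
    conditions = {"conjuncts": True}   # opt-in via a dedicated attribute

arabic = {"script": "Arabic", "base": "ابت"}
print(ArabicShapingCheck.applies_to(arabic))  # True
print(ConjunctCheck.applies_to(arabic))       # False
```

The runner would then simply iterate over all registered check classes and run the ones whose `applies_to` returns True for a given language, so nothing per-language ever needs to be listed by hand.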
One additional thought: checks could return more nuanced verdicts than just pass/fail, or pass/fail plus textual information. E.g. for the conjunct checks a pass may be based on a threshold, so informing the user about those details might be necessary.
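A richer verdict could be a small result object instead of a bare boolean. This is only a sketch of the idea; the field names and the threshold value are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CheckResult:
    passed: bool
    message: str = ""
    score: Optional[float] = None  # e.g. fraction of conjuncts shaped correctly

def conjunct_verdict(shaped_ok, total, threshold=0.9):
    # Pass is based on a threshold, but the details are reported either way.
    score = shaped_ok / total
    return CheckResult(
        passed=score >= threshold,
        message=f"{shaped_ok}/{total} conjuncts shaped correctly",
        score=score,
    )

result = conjunct_verdict(18, 20)
print(result.passed, result.message)  # True 18/20 conjuncts shaped correctly
```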
This is implemented but WIP in this branch.