Support embedded directions
Hywan opened this issue · 16 comments
A string can contain both left-to-right and right-to-left text. We need a better algorithm to guess the current direction of a text :-).
Hey there, coming from reddit :) Some suggestions for an algorithm to solve this issue:
- Check whether the string contains LRM (0x200e) or RLM (0x200f), treating ALM (0x061c, Arabic letter mark) as an alias for RLM, since these marks exist specifically to indicate the direction in which the string should be interpreted.
  - If it contains marks of both directions, return BIDI (this should be added as a constant).
  - Else if it only contains LRM, return LTR.
  - Else if it only contains RLM and/or ALM, return RTL.
- Otherwise, set the default assumption from the first character.
- Check whether any markers (LRM, LRE, LRO, maybe LRI; RLM, RLE, RLO, maybe RLI; ALM) imply a direction change compared to the first character; if so, return BIDI.
- Check whether the string contains a character from the opposing direction; if so, return BIDI; if not, return the direction assumed from the first character.
Does this sound reasonable? As I cannot think of any sane way to detect that "私 - is a Japanese letter" "should" be LTR, the user has to decide for themselves what to do with BIDI text.
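The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the library's PHP code: the names `guess_direction` and `strong_direction`, the string constants, and the choice to skip leading characters without a strong direction are all assumptions made for the example. It relies on `unicodedata.bidirectional`, which returns the Unicode bidi class of a character.

```python
import unicodedata

# Hypothetical constants; the thread only proposes adding BIDI next to LTR/RTL.
LTR, RTL, BIDI = "ltr", "rtl", "bidi"

LRM, RLM, ALM = "\u200e", "\u200f", "\u061c"

# Bidi classes that imply a direction: strong characters plus the explicit
# embedding, override and isolate controls mentioned in the proposal.
LTR_CLASSES = {"L", "LRE", "LRO", "LRI"}
RTL_CLASSES = {"R", "AL", "RLE", "RLO", "RLI"}


def strong_direction(char):
    """Return LTR or RTL if the character implies a direction, else None."""
    bidi_class = unicodedata.bidirectional(char)
    if bidi_class in LTR_CLASSES:
        return LTR
    if bidi_class in RTL_CLASSES:
        return RTL
    return None


def guess_direction(text, default=LTR):
    # Step 1: explicit marks decide on their own (ALM is treated like RLM).
    has_ltr_mark = LRM in text
    has_rtl_mark = RLM in text or ALM in text
    if has_ltr_mark and has_rtl_mark:
        return BIDI
    if has_ltr_mark:
        return LTR
    if has_rtl_mark:
        return RTL

    # Step 2: assume the direction of the first directional character.
    # Assumption: leading digits, spaces and punctuation are skipped.
    directions = [d for d in map(strong_direction, text) if d is not None]
    if not directions:
        return default
    assumed = directions[0]

    # Steps 3 and 4: any marker or character of the other direction means BIDI.
    if any(d != assumed for d in directions):
        return BIDI
    return assumed
```

For instance, `guess_direction("hello")` gives `"ltr"`, while a string mixing Latin and Hebrew letters gives `"bidi"`, leaving the decision about mixed text to the caller as suggested above.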
@boast It sounds reasonable, yes. I didn't check how other implementations deal with it. Any PR :-)?
As for reference implementations: https://github.com/waiting-for-dev/string-direction
Or see http://en.wikipedia.org/wiki/Bi-directional_text on that topic (notice the table with the classifications). I'll work on it tonight 👍 However, I will probably need to refactor some methods into protected helper methods to make the checks more granular.
I tried my best to follow the coding style. No tests broken (or let's say: some tests already failed on my Ubuntu dev machine before I changed anything; it seems the collator and normalizer tests are broken, especially when they are not available?), and I added a new test that more or less follows the spec described above.
ping?
Hey there, thank you for the ping. I was busy for the past half year with my bachelor's degree in CS. ;) We should define our definitive approach to this problem together, and then I / we can work out the implementation. My knowledge about the problem comes specifically from these sources:
IMHO, we should first decide on the actual "goal" and "use case" of this method. Why and when is the information "which direction is this text going" needed? Because one can go crazy on the "strong", "weak" and "neutral" characters and contexts...
So far, we use getCharDirection to decide the behavior of append, prepend and other methods. This method only checks the first character. First, we must also check the last character. Second, it would be great to have a method to know whether we have bi-directional text. I don't really know why it can be useful yet, but I am sure it will be. We can also add methods to force a change of the text's direction (maybe we would like to write French in reverse order 😉). The most useful usages are:
- Iterate over direction portions. It can be particularly useful when transforming it into HTML for instance (or PDF, text etc.).
- Also, with the append and prepend methods for instance, we can say: $str->append('text', $str::RTL); to force appending something in the opposite direction (and thus get bi-directional text).
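One way such a forced-direction append could work is to wrap the appended text in Unicode directional isolates. The sketch below is in Python rather than the library's PHP, and the function name, the `direction` parameter and the use of isolates (RLI/LRI closed by PDI) are assumptions for illustration; nothing in the thread says this is how the method would be implemented.

```python
LTR, RTL = "ltr", "rtl"

# Directional isolate controls defined by Unicode.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def append(base, text, direction=None):
    """Append text, optionally forcing its direction with an isolate."""
    if direction is None:
        return base + text
    if direction not in (LTR, RTL):
        raise ValueError("direction must be 'ltr', 'rtl' or None")
    opener = RLI if direction == RTL else LRI
    # The isolate keeps the appended run from affecting the text around it.
    return base + opener + text + PDI
```

A prepend variant would be symmetric, placing the wrapped text in front of `base`.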
PS: How is your bachelor's going 😉?
Another use case:
- When comparing strings, we would compare direction portions, not the whole string at once.
These are some usages I can think of.
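Iterating over direction portions could look roughly like the Python sketch below. It is a deliberate simplification and not the Unicode Bidirectional Algorithm: digits, spaces and punctuation simply inherit the direction of the preceding strong character. The name `direction_runs` and the `default` parameter are hypothetical.

```python
import unicodedata
from itertools import groupby


def direction_runs(text, default="ltr"):
    """Split text into (direction, substring) runs."""
    current = default
    labelled = []
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            current = "ltr"
        elif bidi_class in ("R", "AL"):
            current = "rtl"
        # Any other class keeps the direction of the previous strong character.
        labelled.append((current, char))
    return [
        (direction, "".join(char for _, char in group))
        for direction, group in groupby(labelled, key=lambda pair: pair[0])
    ]
```

Each run could then be wrapped in its own HTML element with a `dir` attribute, or two strings could be compared run by run.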
Hey Ivan,
thanks for asking - my bachelor's is done now, so I think I will find some time to contribute.
I will try to implement the algorithm according to the UNICODE BIDIRECTIONAL ALGORITHM. Especially the table Bidirectional Character Types looks very interesting and is exactly what is lacking as of now ("weak" characters such as numbers and punctuation are not handled correctly by our algorithm).
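To make the strong / weak / neutral split concrete, here is a small Python sketch that groups the bidi classes returned by `unicodedata.bidirectional` into those categories. The function name and the category labels are illustrative assumptions; the grouping follows the Bidirectional Character Types table.

```python
import unicodedata

STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_category(char):
    """Map a character to 'strong', 'weak', 'neutral', 'explicit' or 'unknown'."""
    bidi_class = unicodedata.bidirectional(char)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    if bidi_class in EXPLICIT:
        return "explicit"
    # Unassigned code points yield an empty class string in Python.
    return "unknown"
```

With this, a Latin or Hebrew letter is strong, an ASCII digit or comma is weak, and a space or exclamation mark is neutral, which is why a check based only on the first character can be misled by leading numbers or punctuation.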
Excellent news!
Just a short update: I wrote a small script which parses the official bidi classes from the Unicode Consortium (http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt). It generates an optimized regex (not working atm, I am missing something XD). The regex gets quite large though, but maybe some more optimizations are possible. The script is a small console app (Bin folder) which allows easy regeneration if the spec should change.
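The general idea of such a generator can be sketched in Python as below. This is not the script mentioned above (which targets the PHP library); the function names are hypothetical, and the `SAMPLE` lines are hand-written in the file's `range ; class # comment` layout rather than copied from the real data file.

```python
import re

# Hand-written sample in the layout of DerivedBidiClass.txt (assumption).
SAMPLE = """\
# comment lines start with a hash
0041..005A    ; L # sample: basic Latin capitals
0061..007A    ; L # sample: basic Latin small letters
05D0..05EA    ; R # sample: Hebrew letters
0030..0039    ; EN # sample: ASCII digits
"""


def parse_bidi_classes(lines):
    """Return {bidi_class: [(first_code_point, last_code_point), ...]}."""
    classes = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        code_points, bidi_class = (part.strip() for part in data.split(";"))
        first, _, last = code_points.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        classes.setdefault(bidi_class, []).append((start, end))
    return classes


def merge_ranges(ranges):
    """Merge overlapping or adjacent ranges to keep the regex small."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def to_char_class(ranges):
    """Build a regex character class such as [\\U00000041-\\U0000005A]."""
    parts = []
    for start, end in merge_ranges(ranges):
        low, high = f"\\U{start:08X}", f"\\U{end:08X}"
        parts.append(low if start == end else f"{low}-{high}")
    return "[" + "".join(parts) + "]"


classes = parse_bidi_classes(SAMPLE.splitlines())
ltr_pattern = re.compile(to_char_class(classes["L"]))
```

Merging adjacent ranges is one of the simpler optimizations for keeping the generated expression from growing too large.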
After my regex works, I will implement the unicode bidi algorithm from http://www.unicode.org/reports/tr9/.
Why do we need such a regular expression?
We need to distinguish between the different types of bidirectional
characters. Especially as some characters "change" their directions
depending on context (read: surrounding characters). It's quite complex at
the start, but as soon as you have the groups and get the hang of it, you
can exclude a lot of cases very fast.
Ok :-).