Support embedded directions

Question

Hywan opened this issue 10 years ago · 16 comments

A string can contain both left-to-right and right-to-left text. We need a better algorithm to guess the current direction of a text :-).

Hywan commented 10 years ago

ping?

Hywan commented 9 years ago

Ok :-).

Answer 1 · 2015-01-26T14:49:44.000Z

Hey there, coming from reddit :) Some suggestions for an algorithm to solve this issue:

1. Check whether the string contains LRM (U+200E) or RLM (U+200F), treating ALM (U+061C, ARABIC LETTER MARK) as an alias for RLM. These marks exist specifically to indicate the direction in which the string should be interpreted.
   - If it contains marks of both directions, return BIDI (this should be added as a constant).
   - Else if it only contains LRM, return LTR.
   - Else if it only contains RLM and/or ALM, return RTL.
2. Set the default assumption from the first character.
3. Check for any markers (LRM, LRE, LRO, maybe LRI, and RLM, RLE, RLO, maybe RLI, ALM) that would imply a direction change compared to the first character. If one is found, return BIDI.
4. Check whether the string contains a character from the opposing direction. If so, return BIDI; if not, return the direction assumed from the first character.
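
The steps above can be sketched roughly as follows. This is a Python illustration only, not the library's code: the names `guess_direction`, `strong_direction` and the `LTR` / `RTL` / `BIDI` constants are hypothetical, and falling back to LTR when the first character has no strong direction is an assumption the steps do not spell out. Character classes come from Python's standard `unicodedata` module.

```python
import unicodedata

# Hypothetical constants; the proposal only says BIDI "should be added".
LTR, RTL, BIDI = "ltr", "rtl", "bidi"

LTR_MARKS = {"\u200e"}                         # LRM
RTL_MARKS = {"\u200f", "\u061c"}               # RLM, ARABIC LETTER MARK
LTR_CONTROLS = {"\u202a", "\u202d", "\u2066"}  # LRE, LRO, LRI
RTL_CONTROLS = {"\u202b", "\u202e", "\u2067"}  # RLE, RLO, RLI


def strong_direction(char):
    """Return LTR or RTL for a strongly directional character, else None."""
    bidi_class = unicodedata.bidirectional(char)
    if bidi_class == "L":
        return LTR
    if bidi_class in ("R", "AL"):
        return RTL
    return None


def guess_direction(text):
    # Step 1: explicit direction marks decide on their own.
    has_ltr_mark = any(c in LTR_MARKS for c in text)
    has_rtl_mark = any(c in RTL_MARKS for c in text)
    if has_ltr_mark and has_rtl_mark:
        return BIDI
    if has_ltr_mark:
        return LTR
    if has_rtl_mark:
        return RTL

    # Step 2: default assumption from the first character.
    # Assumption: fall back to LTR if it has no strong direction.
    base = (strong_direction(text[0]) if text else None) or LTR

    # Step 3: an explicit control of the opposite direction means BIDI.
    # (The marks themselves were already handled in step 1.)
    opposite_controls = RTL_CONTROLS if base == LTR else LTR_CONTROLS
    if any(c in opposite_controls for c in text):
        return BIDI

    # Step 4: any strong character of the opposite direction means BIDI.
    for c in text:
        direction = strong_direction(c)
        if direction is not None and direction != base:
            return BIDI
    return base
```

For example, `guess_direction("hello")` yields `LTR`, while a string mixing Latin and Hebrew letters yields `BIDI`.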

Does this sound reasonable? As I cannot think of any sane way to detect that "私 - is a Japanese letter" "should" be LTR, the user has to decide for themselves what to do with BIDI text.

Answer 2 · 2015-01-26T15:11:22.000Z

@boast Yes, it sounds reasonable. I didn't check how other implementations deal with it. Any PR :-)?

Answer 3 · 2015-01-26T15:39:41.000Z

As for reference implementations: https://github.com/waiting-for-dev/string-direction

Or http://en.wikipedia.org/wiki/Bi-directional_text on that topic (notice the table with the classifications). I'll work on it tonight 👍 However, I will probably need to refactor some methods into protected helper methods to make the checks more granular.

Answer 4 · 2015-01-27T08:17:23.000Z

@boast Thank you! :-)

Answer 5 · 2015-01-29T13:31:28.000Z

I tried my best to adapt to the coding style. No tests are broken (or let's say: some tests failed on my Ubuntu dev machine before I changed anything; it seems the collator and normalizer tests are broken, especially when those are not available), and I added a new one that more or less follows the spec described above.

Answer 6 · 2015-08-03T12:20:23.000Z

Hey there, thank you for the ping. I was busy this past half year with my bachelor's degree in CS. ;) We should agree on a definitive approach to this problem together, and then I / we can work out the implementation. My knowledge of the problem comes specifically from these sources:

http://unicode.org/reports/tr9/
https://en.wikipedia.org/wiki/Bi-directional_text (extensive list)

IMHO, we should first decide on the actual "goal" and "use case" of this method. Why and when is the information "which direction is this text going" needed? Because one can go crazy on the "strong", "weak" and "neutral" characters and contexts...

Answer 7 · 2015-08-03T12:48:28.000Z

So far, we use getCharDirection to decide the behavior of append, prepend and other methods. This method only checks the first character. First, we must check the last character. Second, it would be great to have a method to know whether we have bidirectional text. I don't really know yet why it would be useful, but I am sure it will be. We can also add methods to force a change of the text direction (maybe we would like to write French in reverse order 😉). And the most useful use case is:

Iterate over direction portions. This can be particularly useful when transforming the text into HTML, for instance (or PDF, plain text, etc.).
Also, with the append and prepend methods for instance, we can say: $str->append('text', $str::RTL); to force appending something in the opposite direction (and thus get bidirectional text).
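
To make "iterate over direction portions" concrete, here is a minimal Python sketch, not the library's API. The function name `direction_runs` is hypothetical, and the rule that characters without a strong direction simply join the preceding run is a deliberate simplification, not what the Unicode Bidirectional Algorithm prescribes.

```python
import unicodedata


def direction_runs(text, default="ltr"):
    """Split text into (direction, substring) portions.

    Simplification: characters with no strong direction (spaces, digits,
    punctuation) stay in the current run; leading ones use `default`.
    """
    runs = []
    current = default
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            current = "ltr"
        elif bidi_class in ("R", "AL"):
            current = "rtl"
        # Any other class keeps the direction of the preceding character.
        if runs and runs[-1][0] == current:
            runs[-1][1].append(char)
        else:
            runs.append((current, [char]))
    return [(direction, "".join(chars)) for direction, chars in runs]
```

Each portion could then be handled on its own, for example wrapped in an HTML element carrying a matching `dir` attribute.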

Answer 8 · 2015-08-03T12:48:48.000Z

PS: How is your bachelor's going 😉?

Answer 9 · 2015-08-03T12:59:39.000Z

Another use case:

When comparing strings, we would compare direction portions, not the whole string at once. These are some of the use cases I can think of.

Answer 10 · 2015-09-08T10:51:08.000Z

Hey Ivan,

thanks for asking - my bachelor's is done now, so I think I will find some time to contribute.

I will try to implement the algorithm according to the UNICODE BIDIRECTIONAL ALGORITHM. The table Bidirectional Character Types in particular looks very interesting and is exactly what is lacking as of now ("weak" characters such as numbers and punctuation are not handled correctly by our algorithm).
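
As a rough illustration of that table, the following Python sketch groups the bidi classes into the categories used by UAX #9 (strong, weak, neutral, explicit formatting). The helper name `bidi_category` is hypothetical; the class lookup relies on Python's standard `unicodedata.bidirectional`.

```python
import unicodedata

# Grouping of bidi classes by category, following the
# "Bidirectional Character Types" table in UAX #9.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}
EXPLICIT = {"LRE", "LRO", "RLE", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_category(char):
    """Return the category of a single character's bidi class."""
    bidi_class = unicodedata.bidirectional(char)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    if bidi_class in EXPLICIT:
        return "explicit"
    return "unknown"  # e.g. code points with no class in this Python build
```

Digits and separators such as `,` come out as weak, while spaces and characters such as `!` come out as neutral, which is why a check that only looks at strong characters handles them poorly.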

Answer 11 · 2015-09-08T11:01:55.000Z

Excellent news!

Answer 12 · 2015-10-14T09:08:39.000Z

Just a short update: I wrote a small script which parses the official bidi classes from the Unicode Consortium (http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt). It generates an optimized regex (not working at the moment, I am missing something XD). The regex gets quite large, though some further optimizations may be possible. The script is a small console app (Bin folder) which allows easy regeneration if the spec changes.

After my regex works, I will implement the unicode bidi algorithm from http://www.unicode.org/reports/tr9/.
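
A minimal sketch of what such a generator might look like, in Python rather than as the actual console script (which is not shown in this thread). The function names are hypothetical, and `SAMPLE` is a small hand-written excerpt in the `range ; class # comment` layout of DerivedBidiClass.txt, not the real file. Default classes declared in `# @missing` comment lines are ignored here.

```python
import re

# Hand-written excerpt in the file's layout (illustrative only).
SAMPLE = """\
# code point or range ; bidi class # comment
0041..005A    ; L # Latin capital letters
0061..007A    ; L # Latin small letters
05D0..05EA    ; R # Hebrew letters
200F          ; R # RIGHT-TO-LEFT MARK
0030..0039    ; EN # digits
"""


def parse_bidi_classes(lines):
    """Map each bidi class to a list of (start, end) code point ranges."""
    ranges = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        code_points, bidi_class = (field.strip() for field in data.split(";"))
        first, _, last = code_points.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.setdefault(bidi_class, []).append((start, end))
    return ranges


def merge_ranges(ranges):
    """Merge overlapping or adjacent ranges to keep the regex small."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def to_char_class(ranges):
    """Build a regex character class such as [\\U00000041-\\U0000005A]."""
    parts = []
    for start, end in merge_ranges(ranges):
        if start == end:
            parts.append("\\U{:08X}".format(start))
        else:
            parts.append("\\U{:08X}-\\U{:08X}".format(start, end))
    return "[" + "".join(parts) + "]"


classes = parse_bidi_classes(SAMPLE.splitlines())
rtl_pattern = re.compile(to_char_class(classes["R"]))
```

Merging adjacent ranges is one of the simpler optimizations available; with the full file, the per-class expressions still grow large.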

Answer 13 · 2015-10-14T11:39:12.000Z

Why do we need such a regular expression?

Answer 14 · 2015-10-14T11:43:50.000Z

We need to distinguish between the different types of bidirectional
characters. Especially as some characters "change" their directions
depending on context (read: surrounding characters). It's quite complex at
the start, but as soon as you have the groups and get the hang of it, you
can exclude a lot of cases very fast.
