BioJulia/Automa.jl

Multi-bytes characters trigger actions multiple times

Closed this issue · 3 comments

For multi-byte Unicode characters (e.g. ϕ), the actions are triggered once for each byte instead of once for each character. This is unexpected and rather hard to work with.

using Automa
import Automa.RegExp: @re_str

const re = Automa.RegExp

anything = re"."
anything.actions[:exit] = [:info]
machine = Automa.compile(re.rep(anything))

Automa.execute(machine, "ϕ")  # returns (0, [:info, :info])

Dear @Kolaru

Automa is byte-oriented. That is, Automa's machines are defined in terms of, and operate on, bytes. Therefore, anything = re"." really means "any byte".
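To make the byte/character distinction concrete, here is a small sketch in plain Julia (no Automa involved) that compares the number of characters in "ϕ" with the number of UTF-8 bytes it occupies. The variable names are only illustrative.

```julia
# Plain Julia, no Automa required: compare characters with UTF-8 bytes.
s = "ϕ"

n_chars = length(s)               # number of characters in the string
n_bytes = ncodeunits(s)           # number of UTF-8 code units (bytes)
bytes   = collect(codeunits(s))   # the raw bytes as a Vector{UInt8}

println("characters:  ", n_chars)
println("bytes:       ", n_bytes)
println("byte values: ", bytes)
```

Since "ϕ" is one character stored as two bytes, a machine that steps over the input byte by byte sees two inputs, which is consistent with the two :info actions reported above.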

It is of course possible to create an Automa-like package that operates on UTF-8 characters instead of bytes. But such a machine would be far slower and more complicated, emit more code, be harder to debug, and would not work on binary data.

Perhaps a solution would be to state more clearly in the README that Automa parses individual bytes?

Just to clarify: You can still work with UTF-8 data with Automa as it is now. It's just that e.g. re"." or re.any() or re.space() etc. refer to bytes, not chars.

If you really want to work on characters, you could perhaps do it by creating a codec for TranscodingStreams that converts UTF-8 to a stream of Chars, and then feed that into Automa. Then every char will be exactly 4 bytes.
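The fixed-width idea can be sketched without writing an actual TranscodingStreams codec. The helper below, `to_fixed_width`, is hypothetical (it is not part of Automa or TranscodingStreams) and simply re-encodes a whole string in memory so that each character becomes one 32-bit code point, i.e. exactly 4 bytes in host byte order. It assumes the input is valid UTF-8.

```julia
# Hypothetical helper (not part of Automa or TranscodingStreams):
# re-encode a string so that every character occupies exactly 4 bytes.
# Assumes valid UTF-8; UInt32(c) throws on malformed characters.
function to_fixed_width(s::AbstractString)
    # One 32-bit Unicode code point per character.
    codepoints = UInt32[UInt32(c) for c in s]
    # Reinterpret the code points as raw bytes (host byte order).
    return collect(reinterpret(UInt8, codepoints))
end

wide = to_fixed_width("aϕ")
println(length(wide), " bytes for ", length("aϕ"), " characters")
```

With such an encoding, a byte-oriented machine would presumably still step once per byte, so patterns would need to consume four bytes per character for an action to fire once per character.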

Thanks a lot for the clarification!

Therefore, anything = re"." really means "any byte".

This is definitely surprising to me, since I always considered regexes to act on characters and not on bytes. But I realize now that parsing re"ϕ" as a single character is equivalent to parsing it as a sequence of two bytes, so for the most part it does not matter.

I think this would be worth mentioning in the documentation, together with the description of p, since the issue is likely to be noticed first when trying to access a string at an invalid index (strings are indexed by byte).
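As an illustration of the invalid-index problem, the following sketch uses only Julia's built-in string functions. It checks which byte positions of a string start a character and shows one way to snap a byte position back to a valid index. The helper `char_at` is hypothetical and only stands in for whatever an action would do with a byte position such as p.

```julia
s = "aϕb"   # 'a' is 1 byte, 'ϕ' is 2 bytes, 'b' is 1 byte

# Which byte indices are the start of a character?
valid = [isvalid(s, i) for i in 1:ncodeunits(s)]

# Hypothetical helper: return the character that contains byte position p,
# by first moving p back to the start of that character.
char_at(s, p) = s[thisind(s, p)]

println(valid)
println(char_at(s, 3))    # byte 3 is the second byte of 'ϕ'
println(nextind(s, 2))    # index of the character following 'ϕ'
```

Indexing directly with `s[3]` would throw a StringIndexError here, so checking `isvalid(s, p)` or using `thisind` is one way to skip or repair positions that fall inside a multi-byte character.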

In my case I don't think it is really a problem, since I have to read all characters and look at them individually anyway. I think I should be able to make it work, either with your proposed solution or by skipping invalid string indices. I will look into contributing what works to the documentation once I have figured it out.