BioJulia/Automa.jl

Correctly finding matching double quotes when Unicode characters are present

Closed this issue · 3 comments

jagot commented

I know that Automa.jl is byte-oriented (#80), but I have so far been able to use it with some degree of success when parsing text with Unicode characters.

My problem is the following: I need to parse data like this such that matching double quotes are correctly identified:

"first string" some other tokens "second string"

The regular expression I use is very similar to the one in the "mini-Julia" example,

string = re.cat('"', re.rep(re"[ !#-~]" | re.cat("\\\"")), '"')

namely:

dquote = re.primitive('"')
esc_dquote = re.cat("\\\"")
not_quote = !dquote | esc_dquote
string = re.cat(dquote, re.rep(not_quote), dquote) => :(emit_string())

the idea being that between the two delimiting double quotes, anything is allowed, except double quotes, unless they are escaped.

The full MWE is

using Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp

machine = let
    space = re"[\n\t ]+" => :()
    dec      = re"[-+]?[0-9]+" => :(emit(:dec))
    # We also wish to support Unicode characters in identifier names
    identifier = re"[^ (){}0-9-+,;\"][^ (){},;\"]*" => :(emit(:identifier))

    dquote = re.primitive('"')
    esc_dquote = re.cat("\\\"")
    not_quote = !dquote | esc_dquote
    string = re.cat(dquote, re.rep(not_quote), dquote) => :(emit_string())

    Automa.compile(dec, identifier, string, space)
end

context = Automa.CodeGenContext()
@eval function tokenize(data)
    $(Automa.generate_init_code(context, machine))
    p_end = p_eof = lastindex(data)

    emit(kind) = println((kind, data[ts:prevind(data, te+1)]))
    emit_string() = println((:string, unescape_string(data[nextind(data, ts, 1):prevind(data, te)])))

    while p <= p_eof && cs > 0
        $(Automa.generate_exec_code(context, machine))
    end

    cs == 0 ? :ok : cs < 0 ? :error : :incomplete
end

and here is some test code:

buf = IOBuffer()

# These parse fine
println(buf, "\"Hello\"\nthere\n\"no here\"")
println(buf, "\"Hello\"\n5\n\"no here\"")
println(buf, "\"Hello\" 5")
println(buf, "\"🧶mystring🐉\" hello👒")
println(buf, "\"A string with escapes: \\\"ff\\\"\" hello👒")

# These do not parse as expected
println(buf, "\"Hello\" 5 \"no here\"")
println(buf, "\"Hello\" there \"no here\"")
println(buf, "\"Hello\" there \"no here\" where \"over here\"")
println(buf, "\"🧶mystring🐉\" hello👒 \"🧶mystring🐉\"")
println(buf, "\"Hello\" there \"αβγδέ\" where \"over here\"")

println(repeat("-", 100))
seek(buf, 0)
for l in readlines(buf)
    println(l)
    tokenize(strip(l))
    println()
end

which gives the following output

"Hello"
(:string, "Hello")

there
(:identifier, "there")

"no here"
(:string, "no here")

"Hello"
(:string, "Hello")

5
(:dec, "5")

"no here"
(:string, "no here")

"Hello" 5
(:string, "Hello")
(:dec, "5")

"🧶mystring🐉" hello👒
(:string, "🧶mystring🐉")
(:identifier, "hello👒")

"A string with escapes: \"ff\"" hello👒
(:string, "A string with escapes: \"ff\"")
(:identifier, "hello👒")

"Hello" 5 "no here"
(:string, "Hello\" 5 \"no here")

"Hello" there "no here"
(:string, "Hello\" there \"no here")

"Hello" there "no here" where "over here"
(:string, "Hello\" there \"no here\" where \"over here")

"🧶mystring🐉" hello👒 "🧶mystring🐉"
(:string, "🧶mystring🐉\" hello👒 \"🧶mystring🐉")

"Hello" there "αβγδέ" where "over here"
(:string, "Hello\" there \"αβγδέ\" where \"over here")

If I use re"[ !#-~]" instead of !re.primitive('"') in the definition of not_quote above, it correctly delimits the various strings, but of course does not work with Unicode characters.

How can I work around this?

Tested with Automa.jl v0.8.2 and latest master on

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 32 virtual cores

in a clean environment (no other packages installed).

I believe this is a slight misunderstanding of what !dquote means. It produces a regex that matches every string that does not match dquote. That includes, for example, "a\"". This string contains an unescaped quote, but it does not match dquote, because dquote only matches the string consisting of a single double quote.
Replace !dquote with re"[^\"]", and it should work.
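
To make the difference concrete, here is a minimal sketch in plain Julia that does not use Automa at all. It only models the two regexes as predicates over bytes; the names DQUOTE, matches_complement and matches_rep_nonquote are made up for this illustration and are not part of any API.

```julia
const DQUOTE = UInt8('"')

# Models !dquote: the complement of the language {"}, i.e. every byte
# string except the one-byte string consisting of a single double quote.
matches_complement(bytes) = !(length(bytes) == 1 && bytes[1] == DQUOTE)

# Models re.rep(re"[^\"]"): zero or more bytes, none of which is a double quote.
matches_rep_nonquote(bytes) = all(!=(DQUOTE), bytes)

# The text between the first and the last quote of: "Hello" 5 "no here"
inner = codeunits("Hello\" 5 \"no here")

println(matches_complement(inner))    # true: it is simply not the string "
println(matches_rep_nonquote(inner))  # false: it contains quote bytes
```

Since the whole inner text is a member of the complement language, re.rep(!dquote) is free to run from the first quote to the last one, which would explain the output in the report. The per-byte class cannot cross a quote.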

I also see that there is some trouble expressing dquote and esc_dquote. I'll see if I can improve Automa's regex parsing so that they can be written simply as re"\"" and re"\\\"".

(Apparently, you can write re"\\\\\""; something is up with this escaping.)
Edit: It's this: https://docs.julialang.org/en/v1/manual/strings/#man-raw-string-literals
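
The rule from that manual section can be checked with raw"..." literals, as in the sketch below. It assumes that re"..." receives its contents the same way raw"..." does, as the manual describes for non-standard string literals. It says nothing about what Automa's regex parser then does with the characters. The variable names are only for illustration.

```julia
# Inside a raw-style literal, backslashes are literal except directly before
# a quote: 2n+1 backslashes followed by a quote give n backslashes and a
# literal quote.
one_quote      = raw"\""      # 1 backslash,  then quote
escaped_quote  = raw"\\\""    # 3 backslashes, then quote
double_escaped = raw"\\\\\""  # 5 backslashes, then quote

println(one_quote)       # "
println(escaped_quote)   # \"
println(double_escaped)  # \\"
```

So the literal with five backslashes hands the macro three characters: two backslashes and a quote.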

jagot commented

Thanks! re"[^\"]" did the trick!
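
For later readers, a rough sketch of why the byte-level class [^\"] copes with Unicode where [ !#-~] does not. It again uses plain Julia predicates rather than Automa, and assumes that a character class in Automa is matched one byte at a time, as the byte-oriented design mentioned at the top suggests. The helper names are hypothetical.

```julia
# Models one byte of re"[ !#-~]": space, '!', or the range '#' through '~'.
in_ascii_class(b::UInt8) =
    b == UInt8(' ') || b == UInt8('!') || UInt8('#') <= b <= UInt8('~')

# Models one byte of re"[^\"]": any byte except the double quote (0x22).
in_nonquote_class(b::UInt8) = b != UInt8('"')

s = "🧶mystring🐉"
println(all(in_ascii_class, codeunits(s)))     # false: emoji bytes are >= 0x80
println(all(in_nonquote_class, codeunits(s)))  # true

# UTF-8 encodes non-ASCII characters with bytes >= 0x80 only, so the byte
# 0x22 shows up only where there is a real double quote.
println(count(==(UInt8('"')), codeunits("\"αβγδέ\"")))  # 2
```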