Correctly finding matching double quotes when Unicode characters are present
I know that Automa.jl is byte-oriented (#80), but I have so far been able to use it with some degree of success when parsing text with Unicode characters.
My problem is the following: I need to parse data like this such that matching double quotes are correctly identified:
"first string" some other tokens "second string"
The regular expression I use is very similar to the "mini-Julia" example:
Automa.jl/example/tokenizer.jl
Line 11 in 8dd2672
dquote = re.primitive('"')
esc_dquote = re.cat("\\\"")
not_quote = !dquote | esc_dquote
string = re.cat(dquote, re.rep(not_quote), dquote) => :(emit_string())
the idea being that between the two delimiting double quotes, anything is allowed, except double quotes, unless they are escaped.
The full MWE is
using Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp
machine = let
    space = re"[\n\t ]+" => :()
    dec = re"[-+]?[0-9]+" => :(emit(:dec))
    # We also wish to support Unicode characters in identifier names
    identifier = re"[^ (){}0-9-+,;\"][^ (){},;\"]*" => :(emit(:identifier))
    dquote = re.primitive('"')
    esc_dquote = re.cat("\\\"")
    not_quote = !dquote | esc_dquote
    string = re.cat(dquote, re.rep(not_quote), dquote) => :(emit_string())
    Automa.compile(dec, identifier, string, space)
end
context = Automa.CodeGenContext()
@eval function tokenize(data)
    $(Automa.generate_init_code(context, machine))
    p_end = p_eof = lastindex(data)
    emit(kind) = println((kind, data[ts:prevind(data, te+1)]))
    emit_string() = println((:string, unescape_string(data[nextind(data, ts, 1):prevind(data, te)])))
    while p ≤ p_eof && cs > 0
        $(Automa.generate_exec_code(context, machine))
    end
    cs == 0 ? :ok : cs < 0 ? :error : :incomplete
end
and here is some test code:
buf = IOBuffer()
# These parse fine
println(buf, "\"Hello\"\nthere\n\"no here\"")
println(buf, "\"Hello\"\n5\n\"no here\"")
println(buf, "\"Hello\" 5")
println(buf, "\"🧶mystring🐉\" hello👒")
println(buf, "\"A string with escapes: \\\"ff\\\"\" hello👒")
# These do not parse as expected
println(buf, "\"Hello\" 5 \"no here\"")
println(buf, "\"Hello\" there \"no here\"")
println(buf, "\"Hello\" there \"no here\" where \"over here\"")
println(buf, "\"🧶mystring🐉\" hello👒 \"🧶mystring🐉\"")
println(buf, "\"Hello\" there \"αβγδέ\" where \"over here\"")
println(repeat("-", 100))
seek(buf, 0)
for l in readlines(buf)
    println(l)
    tokenize(strip(l))
    println()
end
which gives the following output
"Hello"
(:string, "Hello")
there
(:identifier, "there")
"no here"
(:string, "no here")
"Hello"
(:string, "Hello")
5
(:dec, "5")
"no here"
(:string, "no here")
"Hello" 5
(:string, "Hello")
(:dec, "5")
"🧶mystring🐉" hello👒
(:string, "🧶mystring🐉")
(:identifier, "hello👒")
"A string with escapes: \"ff\"" hello👒
(:string, "A string with escapes: \"ff\"")
(:identifier, "hello👒")
"Hello" 5 "no here"
(:string, "Hello\" 5 \"no here")
"Hello" there "no here"
(:string, "Hello\" there \"no here")
"Hello" there "no here" where "over here"
(:string, "Hello\" there \"no here\" where \"over here")
"🧶mystring🐉" hello👒 "🧶mystring🐉"
(:string, "🧶mystring🐉\" hello👒 \"🧶mystring🐉")
"Hello" there "αβγδέ" where "over here"
(:string, "Hello\" there \"αβγδέ\" where \"over here")
If I use re"[ !#-~]" instead of !re.primitive('"') in the definition of not_quote above, it correctly delimits the various strings, but of course does not work with Unicode characters.
How can I work around this?
Tested with Automa.jl v0.8.2 and latest master on
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × AMD Ryzen 9 3950X 16-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
Threads: 1 on 32 virtual cores
in a clean environment (no other packages installed).
I believe this is a slight misunderstanding of what !dquote means. It produces a regex that matches every string that does not match dquote, not every single character other than a quote. That includes, for example, "a\"". This string contains an unescaped quote, but it does not match dquote.
Replace !dquote with re"[^\"]", and it should work.
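The difference between negating a whole regex and using a negated character class can be illustrated without Automa. The sketch below uses Julia's built-in Regex (PCRE, not Automa's byte-level engine) purely as an analogy: the names loose and strict are made up for this example, and loose only approximates what dquote, re.rep(!dquote | esc_dquote), dquote accepts (a quote, almost anything, a quote).

```julia
# Analogy only: Base Julia's PCRE-backed Regex, not Automa's byte-level machine.
line = "\"Hello\" there \"αβγδέ\" where \"over here\""

# Approximates dquote, rep(!dquote | esc_dquote), dquote: because !dquote accepts
# almost any string, the body may contain unescaped quotes.
loose = r"\".*\""

# Body restricted to escaped quotes or single non-quote characters.
strict = r"\"(?:\\\\\"|[^\"])*\""

loose_tokens = [m.match for m in eachmatch(loose, line)]
strict_tokens = [m.match for m in eachmatch(strict, line)]

println(loose_tokens)
println(strict_tokens)
```

The loose pattern yields a single token running from the first quote to the last one, which mirrors the output reported above; the strict pattern yields the three quoted strings separately.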
I also see that there is some trouble expressing dquote and esc_dquote. I'll see if I can improve Automa's regex parsing so that they can be expressed simply as re"\"" and re"\\\"" (apparently, you can do re"\\\\\""; something is up with this escaping).
Edit: It's this: https://docs.julialang.org/en/v1/manual/strings/#man-raw-string-literals
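The linked manual section describes raw string literals, and string macros such as re"..." receive their contents under the same rules. Here is a small sketch of those rules in Base Julia. The @lit_str macro is a hypothetical helper defined only for this illustration; it is not part of Automa, and whatever Automa's own regex parser does with the text afterwards is not modeled here.

```julia
# Hypothetical helper (not part of Automa): returns the text a string macro receives.
macro lit_str(s)
    s
end

# In a raw literal, a backslash is only special when it precedes a quote.
@assert raw"\"" == "\""                 # one character: "
@assert raw"\\\"" == "\\\""             # two characters: \"
@assert raw"\\ \\\"" == "\\\\ \\\""     # example from the manual: \\ \"

# String macros see the same text that raw"..." produces.
@assert lit"\\\"" == raw"\\\""
@assert length(lit"\\\\\"") == 3        # two backslashes and a quote

println(lit"\\\\\"")
```

So re"\\\"" hands the macro a backslash followed by a quote, and re"\\\\\"" hands it two backslashes followed by a quote, before any regex-level escaping is applied.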
Thanks! re"[^\"]" did the trick!
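A likely reason the byte-level class [^"] copes with Unicode input: in UTF-8, every byte of a multi-byte character has its high bit set, so the byte 0x22 can only ever be an actual double quote. The check below is a sketch in Base Julia, independent of Automa; the sample string is assembled from the test lines above, and the variable names are chosen for this example.

```julia
s = "\"🧶mystring🐉\" hello👒 \"αβγδέ\""

# Byte offsets at which the quote byte 0x22 occurs.
quote_bytes = findall(==(UInt8('"')), codeunits(s))

# Every such offset is the start of a character, and that character is '"'.
@assert all(i -> isvalid(s, i) && s[i] == '"', quote_bytes)

# Every byte of a non-ASCII character is >= 0x80, so it can never be 0x22.
@assert all(c -> isascii(c) || all(>=(0x80), codeunits(string(c))), s)

println(length(quote_bytes), " quote bytes found")
```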