BioJulia/Automa.jl

Suboptimal DFA construction when epsilon loop in NFA

Closed this issue · 1 comments

Construct an NFA with:

import Automa
const re = Automa.RegExp
nfa = Automa.re2nfa(re.rep(re.rep(re.any()) * re.opt(re.parse("A"))))

Notice that this regex matches literally any sequence of bytes.
View the constructed NFA:

open("/tmp/nfa.dot", "w") do file
    println(file, Automa.nfa2dot(nfa))
end
# in shell, do "dot -Tsvg /tmp/nfa.dot > /tmp/nfa.svg

Note that it has a loop of epsilon edges. Now construct a DFA (or Machine, whatever) from that NFA and visualise it:

machine = Automa.dfa2machine(Automa.nfa2dfa(nfa))
open("/tmp/machine.dot", "w") do file
    println(file, Automa.machine2dot(machine))
end

It looks like this:
image

That's wrong. It doesn't even make any sense. Those three nodes are completely indistinguishable. It should look like this:
image

The mistake is somewhere in nfa2dfa where it mistakenly believes the DFA should have 3 nodes instead of 1.

Nevermind, it turns out that instantiating a Machine using dfa2machine does not optimize it. If you use reduce_nodes, it will be correctly optimized. This happens when using Automa.compile by default.