BioJulia/Automa.jl

DFAs can take wrong actions due to ambiguities

Closed this issue · 3 comments

Consider the following, highly simplified FASTA format:

import Automa
import Automa.RegExp: @re_str
const re = Automa.RegExp

machine = (function ()
    header = re">[A-Z]+"
    seqline = re"[A-Z]+"
    sequence = seqline * re.rep(re"\n" * seqline)
    record = header * re"\n" * sequence
    fasta = re.rep1(record * re"\n")

    header.actions[:exit] = [:header]
    seqline.actions[:exit] = [:seqline]
    record.actions[:exit] = [:record]

    Automa.compile(fasta)
end)()

By definition of record, a record MUST contain a header, which contains a >. However see the flowchart produced for this machine by:

write("fasta.dot", Automa.machine2dot(machine))
run(`dot -Tsvg -o fasta.svg fasta.dot`)

The state 5 corresponds to seqline, and a following newline brings it to state 6. In this state transition, the action record is executed. But clearly, from the defition, we did not necessarily exit record! This creates a loop between state 5 and 6, such that record is executed at every newline.

This means that record can be repeatedly exited, without it being entered, and allows for e.g. a record only containing seqline, against the definition above.

Simplified to

machine = let
    C = re.rep(re"A" * re"B")
    D = C * re"A"
    C.actions[:enter] = [:enter]
    C.actions[:exit] = [:exit]
    Automa.compile(D)
end

Note in particular that the C pattern is entered once, and then can be exited an arbitrary amount of times with an input like "ABABAB".

Right, so I think I've figured out the problem. Let's have a simple regex C:

A = re"XZ"
B = re"XY"
C = A | B
A.actions[:enter] = [:entering_A]
B.actions[:enter] = [:entering_B]
A.actions[:exit] = [:exiting_A]
B.actions[:exit] = [:exiting_B]

Now, create an NFA using Automa.re2nfa(C) and visualize it. This NFA represents those regexes and their actions perfectly. Note that this NFA is truly nondeterministic; when encountering an "X" byte, it is impossible to know if we are entering the A regex or the B regex. In the NFA, this is visualized by two different epsilon edges, one that leads to A and one that leads to B. Only the one that leads to A have the action entering_A associated with it, and vice versa for B.

Now convert the NFA to a DFA. The resulting DFA is equivalent to the NFA only in the sense that it recognizes the same inputs. The edges are NOT the same, and consequently, the two epsilon edges, labeled with entering_A and entering_B are collapsed, and important information is lost. The result is that when we encounter an "X" byte, the regex enters BOTH A and B. Notably, at end of file, only ONE of these regexes are exited (which is clearly absurd)

This is impossible to solve, I think, but it would really be nice if there was some sort of warning or even an error if attempting to convert an NFA that had actions associated to ambiguous transitions to a DFA.

Edit: This can be done by modifying the epsilon_closure function. If you can go multiple paths from one node to another node only moving through epsilon edges, and not all these paths have the same labeled edges, then the machine is ambiguous and should not be compiled.

Solved by #49