BioJulia/Automa.jl

Recursive finite state machines?

Closed this issue · 4 comments

I'm probably being thick, but I can't work out how to write a recursive FSM, because the tokens are variables, and are not defined when they are needed... the simplest example I can come up with is to parse a nested series of names, so things like able, able(archer), able(archer(ate)), able(archer(ate(an))), able(archer(ate(an(apple)))).

The code would look something like this I think:

julia> import Automa

julia> import Automa.RegExp: @re_str

julia> const re = Automa.RegExp
Automa.RegExp

julia> name = re"[a-zA-Z]+"
Automa.RegExp.RE(:rep1, Automa.RegExp.RE[Automa.RegExp.RE(:class, Any[0x61:0x7a, 0x41:0x5a], DataStructures.DefaultDict{Symbol,Array{Symbol,1},Automa.RegExp.#gen_empty_names}(), #NULL)], DataStructures.DefaultDict{Symbol,Array{Symbol,1},Automa.RegExp.#gen_empty_names}(), Nullable{Symbol}())

julia> nested = name * re.opt(re"\(" * nested * re"\)")
ERROR: UndefVarError: nested not defined

But doesn't work because nested is used before it is described. It's not a case where rep would work because of the nesting.

The way the output looks, it seems like this may be a fundamental problem as previous tokens are included wholesale into new ones, so that would make nested infinitely long from the recursion. Hopefully I'm wrong though!

The only thing I can think of is tokenising using the FSM and then writing my own FSM basically, which seems not to be quite what I want to do...

If this is for newick format, the only option as far as I know, is to use Automa to tokenise, and build the parser from there, for the same reason as when we used Ragel - newick is not a regular format.

Oh, well...

It's the nested parentheses that scupper it. AFAIK Automa does finite machine, but not recursive ones, so it needs to be decided if a word is in the language with a machine with constant memory by examining all symbols in the word one after another. This is not possible with newick, because the automaton would have to track how many open and closed parens have occurred. I wonder if we could extend Automa to include recursive automa in the future, it seems they can be composed of sub FSA's as "modules" that can be called, and some way for remembering state.

Rather than separating tokenization and parsing, couldn’t you use a pushdown automaton approach, and have your registered FSM actions manipulate a stack?