pest-parser/pest

Grammar-based fuzzer based on pest grammars

jstnlef opened this issue · 14 comments

It would be really neat to have a library to do this. Here's an example of another project that does a similar thing. https://github.com/d0c-s4vage/gramfuzz

This is undecidable in general, I suspect. Along these lines:

Use the alphabet {0, 1}. Take two indexed families of nonempty strings {a₁, ⋯ , aₙ} and {b₁, ⋯ , bₙ}, each of length n. Write cᵢ = rev(bᵢ), the string reverse of each bᵢ. Then with the grammar

correspondence = { a_1 ~ correspondence ~ c_1
                 | a_2 ~ correspondence ~ c_2
                            // ...
                 | a_n ~ correspondence ~ c_n
                 | "2" }

palindrome = { "0" ~ palindrome ~ "0"
             | "1" ~ palindrome ~ "1"
             | "2" }

post = { &(correspondence ~ eoi) ~ (palindrome ~ eoi) }

the language generated by post solves the Post correspondence problem, yet is a perfectly legal pest grammar. Unless I've missed something.

@wirelyre Intuitively, I would have said that the undecidability of language disjointness also applies to PEGs. Your proof looks fine. But I don't see what this has to do with fuzzing. The way I understand it, the fuzzer should construct an ever-growing set of strings from the language defined by a pest grammar, which sounds feasible.

@dragostis The implication is that, for any fuzzer, there exist grammars which will cause it to run forever but never make a single matching string. And it is not possible to identify which grammars will do that.

Maybe that's okay behavior.

I'm bringing it up because I hunch that there is a simple and practical restriction on lookahead that makes it trivial to detect such cases. Then a search through the language space might construct new words in bounded time.

bd82 commented

Constructing a string matching a grammar is the other side of the coin of recognizing a grammar.
You generally write a simple recursive interpreter that traverses your grammar structure and generates the text.

  • For an Alternation you randomly pick one of the alternatives to generate randomly.
  • For an optional production ("?") you randomly skip it.
  • For a repetition you choose a random number of repetitions
  • ...

Obviously some kind of safe guard to avoid generating too large texts (run forever) would be needed So:

  • Once the generated text reaches size X, all optional will be skipped and all repetitions would 0.
  • This gets a little trickier with recursion as "stopping" a recursion means selecting the alternative
    which is the "Stopping Condition", However at least with direct recursion it is pretty easy to identify
    the recursive alternatives and avoid choosing those once we want to "stop".

@bd82 I think the question might be a bit more complicated than that. These rules have negative lookaheads and use the stack to match different things from the grammar.

bd82 commented

I think the question might be a bit more complicated than that. These rules have negative lookaheads and use the stack to match different things from the grammar.

Yeah that would make it more complicated... I was thinking in terms of pure EBNF...

I was wrong. A 2004 article shows how to convert any well-formed PEG (which doesn't admit the empty string) into another one that doesn't use the "predicate operations" & or !. :/

Ford, Bryan. "Parsing expression grammars: a recognition-based syntactic foundation." ACM SIGPLAN Notices. Vol. 39. No. 1. ACM, 2004. (PDF)


@bd82 Even without lookaheads and the stack, ordered choice is already really hard.

Suppose I want to exercise the second branch of rule = { first / second }. So I recurse into second, right? But now I have to make sure to refute first as well. And if both first and second have choices themselves, it all just blows up exponentially. Or worse. What do you do if first and second are mutually recursive with rule?

It seems obvious that there should be an algorithm that works well if you give it a sane grammar. There probably is!

So my question is: what is a "sane grammar"?

bd82 commented

Even without lookaheads and the stack, ordered choice is already really hard.

I normally work with LL(K) grammars not PEG, maybe that makes a difference to the complexity?

  • lets assume to generate alternative second I need to refute first .
  • In LL(K) terms this means there is a sequence of exactly N tokens that can match both first and second
  • In the parsing library I've authored that would be detected as an ambiguity so the grammar would be invalid.

In PEG if two alternatives can match exactly the same thing, the first one is accepted and the grammar is still valid?

bd82 commented

Lets assume that due to edge cases the text generator cannot always guarantee the generated text is valid.
As long as the generated text is valid some of the time we can just keep generating inputs and feed them to the parser until we have X valid inputs.

So maybe to provide 100 valid inputs we need to generate 120 possible inputs.
Even if 50% of the generated inputs are invalid we only need to generate twice as many inputs on average.

Maybe if we reduce the constraint for generating valid inputs 50% of the time, the definition of a "sane"
grammar would include 99.9% of actual end users grammars?

Earlier this year I released a project at DEFCON called synfuzz https://www.github.com/jrozner/synfuzz (written in Rust) that may help with accomplishing this once there is a more firm plan in place. From what I can tell it takes a similar approach to the project listed above. The goal is to provide a common platform that various frontends for different parser generator definition languages can target. I’d potentially be interested in helping on this.

The purpose of this kind of fuzzer would be to automatically generate "passing" cases? That is, examples of strings that match the grammar given as input. Is that right?

I ask because if we assume Pest's parsing to be correct (which it should more or less be, after a good amount of time and people using it have passed), then any non-matching string for a given grammar would be automatically rejected.

Therefore, I think the most useful fuzzer would generate matching strings, which would of course pass Pest's filtering and generate a Parse Tree. These Parse Trees would then be valid input for your AST builder, allowing you to fuzz-test it.

I think fuzz testing the AST building step is a worthwhile goal. The more complex your grammar is, the more paths there are to test, and maybe one of them makes your parser crash. I'd love this feature <3

Also: maybe this can be implemented with a PEG -> BNF translator, and then using the BNF representation to call the generators from bnf

CAD97 commented

That's an interesting idea, though potentially problematic: PEG is deterministic while pure BNF is ambiguous in some cases.

Clarification on what that means

The PEG grammar:

Production ← Keyword / Identifier
Keyword ← 'keyword'
Identifier ← [a-zA-Z]+

versus the "equivalent" EBNF:

production = keyword | identifier ;
keyword = "keyword" ;
identifier = ?letter?, { ?letter? } ;

(I didn't want to write out the EBNF for [a-zA-Z] ok)

The input string keyword is always unambiguously Production[Keyword(0..7)] in the PEG grammar, but is ambiguous in the pure EBNF with production[keyword(0..7)|identifier(0..7)].


For the most part, programming languages intend to be fairly free from ambiguity, and PEG accomplishes this by just always doing the first choice. It'd be interesting if generating based on the (e)bnf generates inputs that aren't accepted by the PEG parser or are ambiguous in the (e)bnf, and the user of the tool could see them, determine if they should parse, and which direction they should be going. It could help eliminate the biggest "surprise" potential of using PEG.

But yeah, fuzzing based off of grammars is about fuzzing the recipient of the parsed tree more-so than the parser itself (though that does help increase confidence in the correctness of the parser under more exotic optimization).

You raise a fair point. Is there literature on PEG -> BNF transformations? I'm thinking that maybe the BNF can be built while preserving the unambiguity property of the source PEG.

(though that does help increase confidence in the correctness of the parser under more exotic optimization)

Wait, optimization? I'm thinking on post-(semantic analysis) optimization, which is probably unrelated to what you meant. You mean optimizations in the sense that we want to know how correct is the output of building Pest with cargo build --release with regards to the output of cargo build?