Grammar-based fuzzer based on pest grammars
jstnlef opened this issue · 14 comments
It would be really neat to have a library to do this. Here's an example of another project that does a similar thing. https://github.com/d0c-s4vage/gramfuzz
This is undecidable in general, I suspect. Along these lines:
Use the alphabet {0, 1}. Take two indexed families of nonempty strings {a₁, ⋯, aₙ} and {b₁, ⋯, bₙ}, each containing n strings. Write cᵢ = rev(bᵢ), the string reverse of each bᵢ. Then with the grammar
correspondence = { a_1 ~ correspondence ~ c_1
| a_2 ~ correspondence ~ c_2
// ...
| a_n ~ correspondence ~ c_n
| "2" }
palindrome = { "0" ~ palindrome ~ "0"
| "1" ~ palindrome ~ "1"
| "2" }
post = { &(correspondence ~ eoi) ~ (palindrome ~ eoi) }
the language generated by `post` solves the Post correspondence problem: it is nonempty exactly when the instance {aᵢ}, {bᵢ} has a solution. Yet `post` is a perfectly legal pest grammar. Unless I've missed something.
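For concreteness, one small instance with a solution (checked below with a few lines of plain Rust, independent of pest): take a₁ = 1, a₂ = 10, a₃ = 011 and b₁ = 101, b₂ = 00, b₃ = 11, so c₁ = 101, c₂ = 00, c₃ = 11. The index sequence 1, 3, 2, 3 is a solution (both sides concatenate to 101110011), and the word 1011100112110011101 matches `correspondence`, `palindrome`, and therefore `post`.

```rust
// Illustrative check of the small PCP instance above (no pest involved).
fn main() {
    let a = ["1", "10", "011"];
    let b = ["101", "00", "11"];
    let solution = [0usize, 2, 1, 2]; // index sequence 1, 3, 2, 3 (0-based here)

    let top: String = solution.iter().map(|&i| a[i]).collect();
    let bottom: String = solution.iter().map(|&i| b[i]).collect();
    assert_eq!(top, bottom); // both sides concatenate to "101110011"

    // The word matched by `post` is top ++ "2" ++ reverse(bottom).
    let word: String = top
        .chars()
        .chain("2".chars())
        .chain(bottom.chars().rev())
        .collect();
    assert_eq!(word, "1011100112110011101"); // a palindrome around the "2"
}
```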
@wirelyre Intuitively, I would have said that the undecidability of language disjointness also applies to PEGs. Your proof looks fine. But I don't see what this has to do with fuzzing. The way I understand it, the fuzzer should construct an ever-growing set of strings from the language defined by a pest grammar, which sounds feasible.
@dragostis The implication is that, for any fuzzer, there exist grammars which will cause it to run forever but never make a single matching string. And it is not possible to identify which grammars will do that.
Maybe that's okay behavior.
I'm bringing it up because I have a hunch that there is a simple and practical restriction on lookahead that makes it trivial to detect such cases. Then a search through the language space might construct new words in bounded time.
Constructing a string matching a grammar is the other side of the coin of recognizing one.
You generally write a simple recursive interpreter that traverses your grammar structure and generates the text.
- For an Alternation you randomly pick one of the alternatives to generate.
- For an optional production ("?") you randomly skip it.
- For a repetition you choose a random number of repetitions.
- ...
Obviously some kind of safeguard would be needed to avoid generating overly large texts (or running forever). So:
- Once the generated text reaches size X, all optionals will be skipped and all repetitions will be 0.
- This gets a little trickier with recursion, as "stopping" a recursion means selecting the alternative which is the "stopping condition". However, at least with direct recursion it is pretty easy to identify the recursive alternatives and avoid choosing those once we want to "stop". (A sketch along these lines follows below.)
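A minimal sketch of that kind of generator, assuming a hand-rolled expression tree and the `rand` crate (0.8-style API); the `Expr` type and its variants are illustrative and not pest's actual grammar representation:

```rust
use rand::{rngs::ThreadRng, Rng};

// Hypothetical grammar representation for the sketch; pest's real AST differs.
enum Expr {
    Literal(&'static str),
    Seq(Vec<Expr>),
    Choice(Vec<Expr>),   // plain alternation; PEG's *ordered* choice is the hard part
    Optional(Box<Expr>), // "?"
    Repeat(Box<Expr>),   // "*"
}

// Size X: once the output is this long, skip optionals and stop repeating.
const MAX_LEN: usize = 1_000;

fn generate(expr: &Expr, out: &mut String, rng: &mut ThreadRng) {
    match expr {
        Expr::Literal(s) => out.push_str(s),
        Expr::Seq(parts) => {
            for part in parts {
                generate(part, out, rng);
            }
        }
        Expr::Choice(alts) => {
            // Randomly pick one alternative. Stopping a recursion cleanly would
            // also require preferring non-recursive alternatives past MAX_LEN.
            let i = rng.gen_range(0..alts.len());
            generate(&alts[i], out, rng);
        }
        Expr::Optional(inner) => {
            // Randomly skip; always skip once the size budget is used up.
            if out.len() < MAX_LEN && rng.gen_bool(0.5) {
                generate(inner, out, rng);
            }
        }
        Expr::Repeat(inner) => {
            // Random repetition count; zero once the size budget is used up.
            let reps = if out.len() < MAX_LEN { rng.gen_range(0..4) } else { 0 };
            for _ in 0..reps {
                generate(inner, out, rng);
            }
        }
    }
}

fn main() {
    // Roughly: digits = { ("0" | "1") ~ ("0" | "1")* }
    let bit = || Expr::Choice(vec![Expr::Literal("0"), Expr::Literal("1")]);
    let digits = Expr::Seq(vec![bit(), Expr::Repeat(Box::new(bit()))]);

    let mut out = String::new();
    generate(&digits, &mut out, &mut rand::thread_rng());
    println!("{}", out);
}
```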
@bd82 I think the question might be a bit more complicated than that. These rules have negative lookaheads and use the stack to match different things from the grammar.
> I think the question might be a bit more complicated than that. These rules have negative lookaheads and use the stack to match different things from the grammar.
Yeah that would make it more complicated... I was thinking in terms of pure EBNF...
I was wrong. A 2004 article shows how to convert any well-formed PEG (which doesn't admit the empty string) into another one that doesn't use the "predicate operations" `&` or `!`. :/
Ford, Bryan. "Parsing expression grammars: a recognition-based syntactic foundation." ACM SIGPLAN Notices, Vol. 39, No. 1, 2004.
@bd82 Even without lookaheads and the stack, ordered choice is already really hard.
Suppose I want to exercise the second branch of `rule = { first / second }`. So I recurse into `second`, right? But now I have to make sure to refute `first` as well. And if both `first` and `second` have choices themselves, it all just blows up exponentially. Or worse. What do you do if `first` and `second` are mutually recursive with `rule`?
It seems obvious that there should be an algorithm that works well if you give it a sane grammar. There probably is!
So my question is: what is a "sane grammar"?
> Even without lookaheads and the stack, ordered choice is already really hard.
I normally work with LL(k) grammars, not PEG; maybe that makes a difference to the complexity?
- Let's assume that to generate alternative `second` I need to refute `first`.
- In LL(k) terms this means there is a sequence of exactly N tokens that can match both `first` and `second`.
- In the parsing library I've authored, that would be detected as an ambiguity, so the grammar would be invalid.
In PEG if two alternatives can match exactly the same thing, the first one is accepted and the grammar is still valid?
Let's assume that, due to edge cases, the text generator cannot always guarantee the generated text is valid.
As long as the generated text is valid some of the time, we can just keep generating inputs and feeding them to the parser until we have X valid inputs.
So maybe to provide 100 valid inputs we need to generate 120 candidate inputs.
Even if 50% of the generated inputs are invalid, we only need to generate twice as many inputs on average.
Maybe if we relax the constraint to generating valid inputs only 50% of the time, the definition of a "sane" grammar would cover 99.9% of actual end users' grammars?
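A sketch of that over-generate-and-filter loop; `generate_candidate` and `parses_ok` are hypothetical stand-ins here, and with a derive-based pest parser `parses_ok` would be roughly `|s| MyParser::parse(Rule::main, s).is_ok()`:

```rust
// Rejection sampling over generated candidates: keep only inputs that the
// real parser accepts, until we have the number we asked for.
fn collect_valid_inputs(
    wanted: usize,
    mut generate_candidate: impl FnMut() -> String,
    parses_ok: impl Fn(&str) -> bool,
) -> Vec<String> {
    let mut valid = Vec::with_capacity(wanted);
    let mut attempts = 0usize;
    while valid.len() < wanted {
        attempts += 1;
        let candidate = generate_candidate();
        if parses_ok(&candidate) {
            valid.push(candidate);
        }
    }
    // e.g. ~120 attempts for 100 valid inputs if ~83% of candidates parse
    eprintln!("{} attempts for {} valid inputs", attempts, wanted);
    valid
}

fn main() {
    // Toy stand-ins: every sixth candidate is "invalid" (empty).
    let mut n = 0usize;
    let inputs = collect_valid_inputs(
        100,
        || {
            n += 1;
            if n % 6 == 0 { String::new() } else { format!("input{}", n) }
        },
        |s: &str| !s.is_empty(),
    );
    assert_eq!(inputs.len(), 100);
}
```

For a "sane" grammar this loop terminates quickly; for the pathological grammars from the Post-correspondence argument above it may never find a valid input, which is the caveat raised earlier in the thread.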
Earlier this year I released a project at DEFCON called synfuzz https://www.github.com/jrozner/synfuzz (written in Rust) that may help with accomplishing this once there is a more firm plan in place. From what I can tell it takes a similar approach to the project listed above. The goal is to provide a common platform that various frontends for different parser generator definition languages can target. I’d potentially be interested in helping on this.
The purpose of this kind of fuzzer would be to automatically generate "passing" cases? That is, examples of strings that match the grammar given as input. Is that right?
I ask because if we assume Pest's parsing to be correct (which it should more or less be, after a good amount of time has passed and plenty of people have used it), then any non-matching string for a given grammar would be automatically rejected.
Therefore, I think the most useful fuzzer would generate matching strings, which would of course pass Pest's filtering and generate a Parse Tree. These Parse Trees would then be valid input for your AST builder, allowing you to fuzz-test it.
I think fuzz testing the AST building step is a worthwhile goal. The more complex your grammar is, the more paths there are to test, and maybe one of them makes your parser crash. I'd love this feature <3
Also: maybe this can be implemented with a PEG -> BNF translator, and then using the BNF representation to call the generators from `bnf`.
That's an interesting idea, though potentially problematic: PEG is deterministic while pure BNF is ambiguous in some cases.
Clarification on what that means
The PEG grammar:
Production ← Keyword / Identifier
Keyword ← 'keyword'
Identifier ← [a-zA-Z]+
versus the "equivalent" EBNF:
production = keyword | identifier ;
keyword = "keyword" ;
identifier = ?letter?, { ?letter? } ;
(I didn't want to write out the EBNF for `[a-zA-Z]`, ok)
The input string `keyword` is always unambiguously `Production[Keyword(0..7)]` in the PEG grammar, but is ambiguous in the pure EBNF with `production[keyword(0..7)|identifier(0..7)]`.
For the most part, programming languages intend to be fairly free from ambiguity, and PEG accomplishes this by just always taking the first choice. It'd be interesting if generating based on the (E)BNF produced inputs that aren't accepted by the PEG parser, or that are ambiguous in the (E)BNF; the user of the tool could then see them and decide whether they should parse, and in which direction. It could help eliminate the biggest "surprise" potential of using PEG.
But yeah, fuzzing based off of grammars is about fuzzing the recipient of the parsed tree more so than the parser itself (though that does help increase confidence in the correctness of the parser under more exotic optimization).
You raise a fair point. Is there literature on PEG -> BNF transformations? I'm thinking that maybe the BNF can be built while preserving the unambiguity property of the source PEG.
> (though that does help increase confidence in the correctness of the parser under more exotic optimization)
Wait, optimization? I'm thinking of post-(semantic analysis) optimization, which is probably unrelated to what you meant. You mean optimizations in the sense that we want to know how correct the output of building Pest with `cargo build --release` is with regards to the output of `cargo build`?