oasis-open/cti-stix2-json-schemas

Pattern that includes a backslash doesn't parse properly

radder5 opened this issue · 5 comments

Hi,
I have been trying to create a pattern to match a very simple indicator viz
[ url:value LIKE '..\\' ]
but using the Java classes generated by ANTLR 4.7 from the latest STIXPattern.g4 file, I get an InputMismatchException for the string literal in the enterPropTestLike(STIXPatternParser.PropTestLikeContext ctx) method (and in the exit method too). If I don't escape the backslash with another backslash, it swallows the closing single quote, because the backslash is taken to be escaping the quote.

If I edit the STIXPattern.g4 file and change the StringLiteral token definition to be
StringLiteral : QUOTE ( ~['\\] | '\\\'' | '\\\\' | '\\' )* QUOTE ;
instead, it parses correctly. I have only just started looking at ANTLR4 and STIX patterns, so I am wondering whether this is a bug that is OK to fix by adding '\\' to the token definition, or whether that will have unforeseen consequences?

Thanks
Conrad

Hi, @radder5. Backslashes in StringLiterals cause a lot of problems/confusion with patterns. I was able to "validate" your pattern above using the Python-based cti-pattern-validator. It's possible there's a difference between the ANTLR Java and Python bindings, but that wouldn't be my first guess.

In Python, we ran into issues with the backslash-escaping required for Python strings. For instance, if you want to match a single backslash in a pattern, you need to escape it once for the pattern string literal, then escape BOTH OF THOSE for a Python string literal. Thus, validate_pattern("[ url:value LIKE '..\\\\' ]") would validate your pattern above ([ url:value LIKE '..\\' ]), which looks for a URL value containing ..\. I wonder if something similar is happening in Java.
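The same layering applies to Java string literals. As a minimal sketch (the class name EscapeLayers is made up for illustration and is not part of any STIX library), the following shows how many backslashes actually reach the lexer when four are typed in Java source:

```java
// Hypothetical demo class; it only inspects a plain Java string.
final class EscapeLayers {
    // Counts the backslash characters present in the runtime string.
    static long countBackslashes(String s) {
        return s.chars().filter(c -> c == '\\').count();
    }

    public static void main(String[] args) {
        // Four backslashes in Java source become two in the runtime string,
        // which the pattern language then reads as one escaped backslash.
        String pattern = "[ url:value LIKE '..\\\\' ]";
        System.out.println(pattern);                   // prints [ url:value LIKE '..\\' ]
        System.out.println(countBackslashes(pattern)); // prints 2
    }
}
```

So the lexer should see the two-backslash form, provided the Java source literal doubles each of them.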

I think your grammar change would allow a single \ anywhere in a string literal. That would not be spec-compliant. The spec says only a single quote or another backslash may follow a backslash.
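To make that rule concrete, here is a rough sketch that approximates it with a java.util.regex pattern rather than the ANTLR grammar itself. The class and method names are hypothetical, and the regex is only an approximation of the rule as described above (a backslash must be followed by a single quote or another backslash):

```java
import java.util.regex.Pattern;

// Hypothetical helper; approximates the spec rule with a regex, not the real grammar.
final class StringLiteralCheck {
    // Opening quote, then any run of: a char that is neither quote nor backslash,
    // a backslash-quote pair, or a backslash-backslash pair; then a closing quote.
    private static final Pattern STRICT =
        Pattern.compile("'(?:[^'\\\\]|\\\\'|\\\\\\\\)*'");

    static boolean isValidLiteral(String text) {
        return STRICT.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Runtime text '..\\' : escaped backslash, then the closing quote.
        System.out.println(isValidLiteral("'..\\\\'")); // prints true
        // Runtime text '..\' : the backslash escapes the quote, so nothing closes it.
        System.out.println(isValidLiteral("'..\\'"));   // prints false
    }
}
```

Adding a lone-backslash alternative, as in the proposed grammar change, would make the second case match as well, which is the non-compliant behaviour described above.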

I suspect the same problem as @gtback: insufficient escaping. If additional escaping doesn't solve the problem, perhaps posting a snip of code that reproduces the problem would help.

Hi,
Thanks for the comments.
The way I am trying to validate this is with some standard ANTLR4 Java, viz:

ANTLRInputStream inputStream = new ANTLRInputStream("[ url:value LIKE '..\\\\' ] ");
STIXPatternLexer stixLexer = new STIXPatternLexer(inputStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(stixLexer);
STIXPatternParser stixParser = new STIXPatternParser(commonTokenStream);
stixParser.setBuildParseTree(true);
STIXPatternParser.PatternContext ctx = stixParser.pattern();
STIXPatternProcessor processor = new STIXPatternProcessor();
ParseTreeWalker.DEFAULT.walk(processor, ctx);

and then as a test to inspect the expression in my overridden method for enterPropTestLike

@Override
public void enterPropTestLike(STIXPatternParser.PropTestLikeContext ctx) {
    super.enterPropTestLike(ctx);
    System.out.println("proplike " + ctx.getText());
}

So this outputs '..\', whereas I would expect it to output '..', since when Java prints an escaped \ it just outputs a single \, e.g. System.out.println("..\") would output ..\

I guess it comes down to how I finally implement the processing of the string literal '..\' as output above.

Thanks for the help.

Ok, your expectations are wrong then. With respect to the pattern language, '..\' is not a valid string literal. Yeah, a single quote follows the backslash, but the point of escaping is to disambiguate certain characters inside the string. In this case, it is to enable you to have a single quote inside the string which is not interpreted as terminating the literal. Your expected string literal is escaping the last single quote, which means it isn't terminating the literal, which means there is no terminating quote, so it is invalid.

Java doesn't print string literals as defined in the Java language. It's not printing out snips of source code. Similarly, the StringLiteral token in the pattern grammar is intended to match valid string literals as defined in the STIX pattern language. The result of parsing is a structure used to guide your interpretation of the program/pattern. It won't do any other processing for you. What you do with that literal, as you correctly surmised, is up to you, including dealing with any embedded escapes.
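As one possible way to handle those embedded escapes, here is a small sketch of an unescaping helper. The class and method names are hypothetical, and it assumes the text passed in is the raw literal including its surrounding single quotes and has already been accepted by the lexer:

```java
// Hypothetical helper; not part of the generated ANTLR classes.
final class StixLiteralUnescape {
    static String unescape(String tokenText) {
        // Drop the surrounding quotes; assumes the lexer already validated the literal.
        String body = tokenText.substring(1, tokenText.length() - 1);
        StringBuilder out = new StringBuilder(body.length());
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                // A backslash escapes the next character (a quote or a backslash).
                out.append(body.charAt(++i));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Runtime token text '..\\' unescapes to ..\ (dot, dot, one backslash).
        System.out.println(unescape("'..\\\\'"));
        // Runtime token text 'it\'s' unescapes to it's.
        System.out.println(unescape("'it\\'s'"));
    }
}
```

A helper like this would be called on the string literal's text inside a listener method such as enterPropTestLike, after the parse has succeeded.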

This could go round and round! Having just re-read my comment, I notice that GitHub has in fact unescaped half of what I was saying, so it doesn't actually read as I meant it to (a number of backslashes are missing)!
No matter, with your helpful comments and the previous ones I accept that there is no bug, and I will deal with escaping as I need to.
Thanks