y-scope/clp

CLP query parsing mistakenly treats static text with wildcards as int and float

Opened this issue · 0 comments

Bug

When CLP handles an ambigious token with wildcard attached, it seems to treat it as a float and int even if the token doesn't contain any numeric value.

After a quick investigation, it seems that the culprit is at

m_possible_types.push_back(Type::IntVar);

A fix that seems to work is to remove IntVar and FloatVar from possible type and set
m_cannot_convert_to_non_dict_var = true.
However, it would be nice to think carefully through the parsing logic to ensure there's no new corner cases introduced by the fix.

Some corner cases that we might need to check if they are properly handled

  1. float scientific notation
  2. hex (should have already been handled by CLP)
  3. -* and .*

Update 1:
CLP doesn't handle Scientific notation, so we don't need to consider this case.

is_var being false means that, the token doesn't contain any decimal number, nor it is a hex.
I would actually suggest renaming it to be is_definite_var

Here are the following corner case I can think of:

  1. "*text*.*"
  2. "*text*-*" (- can be a part of negative number)

Referring to the unit test, it looks like - and . are the only non-decimal chars that can be encoded.

Note, the . and - must be between two *. otherwise, they will be directly connected to a byte that is non-decimal(without a delimiter in between) and can only be treated as dictionary var.

In addition, CLP doesn't treat "-" + hex as encodedable int, so "text*-deadbeef" is not considered to contain an int.

In my opinion, to add a quick fix for this bug, we can add an extra check to see if the token contains any . or - surrounded by two *. if no, then it can be treated as a logtype or vardict only.

Update2:

The statement made in update 1 is not entirely correct, because we seem to not consider if a token containing a wildcard shall be splitted into two tokens. CLP always assume that there's only 1 token and uses a fallback mechanism (

} else if (query_token.has_greedy_wildcard_in_middle()) {
).

So perhaps the only corner case we need to consider is
"*-*.*"

CLP version

628ba3d

Environment

ubuntu 22.04

Reproduction steps

In CLP, instrument clg so it prints out all log types in its subqueries.

Compress any reasonable large logs.
Search for "FlowMonitor", or some random string (without any numeric value) surrounded by wildcard

Regardless of the matching results, you will notice CLP prints out every log type with a \INT in it.