y-scope/clp

clp-s: Several issues searching for logs that contain escaped characters.

Bug

Search does not work as expected against JSON values that contain escaped characters. This is likely an issue with how string predicates are un-escaped, both for CLP-style search and for wildcard matching.

Importantly, clp-s does not un-escape raw JSON values before ingesting them, which creates edge cases that search does not currently handle.

For example, the value in {"key": "a: \"bcde\""} gets ingested verbatim as a: \"bcde\". However, the search *: "a: \"bcde\"" fails to return the matching result.
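
To make the distinction between the raw and un-escaped forms concrete, here is a minimal Python sketch. It is not clp-s code: the variable names are illustrative, and the raw value is sliced out by hand only to mimic the verbatim ingestion described above.

```python
import json

# The JSON document from the example, exactly as it appears in the input.
raw_line = r'{"key": "a: \"bcde\""}'

# What a standard JSON parser produces: escape sequences are decoded.
unescaped_value = json.loads(raw_line)["key"]

# What the issue describes clp-s as storing: the characters between the
# value's outer quotes, left untouched (sliced by hand for illustration).
start = raw_line.index('"', raw_line.index(":")) + 1
end = raw_line.rindex('"')
raw_value = raw_line[start:end]

print(unescaped_value)  # the decoded string, without backslashes
print(raw_value)        # the verbatim string, backslashes included
```

The two strings differ by the two backslashes, so a query that is un-escaped one way and compared against a value stored the other way cannot match exactly.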

CLP version

0.1.2

Environment

clp-json package.

Reproduction steps

Ingest {"key": "a: \"bcde\""}
Perform the query *: "a: \"bcde\""

A longer example, reproduced with permission from Zulip:

{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4000, "log": "Logging message 4000: \"AumdjUCipW45\""}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4001, "log": "Logging message 4001: 'NcOBPgoyAMIz'"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4002, "log": "Logging message 4002: \\\"pn0b6GI4imwT\\\""}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4003, "log": "Logging message 4003: \\'PHXzcoLwF6E5\\'"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4004, "log": {"empty_dict": {}, "empty_string": "", "empty_list": [], "null": null, "message": "Logging message 4004: WwoTKSzXqKr4"}}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4005, "log": "Logging message 4005: \nIIj7lxPM2MQu"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4006, "log": "Logging message 4006: \\Bx03VwDor4ex"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4007, "log": "Logging message 4007: \rXtmxle8HOCD2"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4008, "log": "Logging message 4008: \tka5J5WdLyAJY"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4009, "log": "Logging message 4009: \\\rXQzF5AhEzzSt"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4010, "log": "Logging message 4010: \\\nGXD1u75wJrJV"}

Double quotes (4000): search on the log field doesn't work, even when escaping or double-escaping the quotes. The workaround is to use wildcards.
Single quotes (4001): search works as expected from the UI, but runs into issues with escaping on the command line.
Escaped double quotes (4002): again, not searchable unless worked around using wildcards.
Escaped single quotes (4003): not searchable at all without wildcards, in both the UI and the command line.
Whitespace and newlines (4005): \n cannot be properly escaped or double-escaped in a query, though bizarrely log: "Logging message 4005: *\nIIj7lxPM2MQu" works.
Backslashes (4006): the backslash part of the log is not searchable without a wildcard.
Additional whitespace issues (4007-4010): again, the whitespace is not searchable except with very specific wildcard usage.

The draft PR fixes most of these issues, except for 4002 and 4003. Those two cases seem to run into another issue in Grep::process_raw_query, where seemingly correct query strings generate no relevant subqueries. Interestingly, this appears to be related to the last quote -- e.g. the query '"Logging message 4002: \\\"pn0b6GI4imwT\\\""' will not work, but the query '"Logging message 4002: \\\"pn0b6GI4imwT\\*"' will.

The issue with 4001 is actually just a bash issue -- bash does not provide any mechanism to escape single quotes (') inside a single-quoted string. Instead, the query needs to do one of the following: surround the string with double quotes, represent the single quotes using the new Unicode escape sequence support (e.g. 'log: "Logging message 4001: \u0027NcOBPgoyAMIz\u0027"'), or glue strings together in the terminal (e.g. 'log: "Logging message ...'"'"'NCo...'"'"'"').
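
The string-gluing approach can be generated mechanically. As a hedged sketch, Python's standard shlex.quote produces this form for POSIX shells; the snippet only builds the quoted argument and does not invoke any CLP command.

```python
import shlex

# The query for record 4001, containing literal single quotes.
query = "log: \"Logging message 4001: 'NcOBPgoyAMIz'\""

# shlex.quote wraps the string in single quotes and rewrites each embedded
# single quote as '"'"' (close quote, double-quoted quote, reopen quote).
quoted = shlex.quote(query)
print(quoted)

# Round trip: splitting the quoted form with POSIX shell rules yields the
# original query back as a single argument.
assert shlex.split(quoted) == [query]

# The Unicode escape \u0027 denotes the same character as a single quote.
assert "\u0027" == "'"
```

The printed string can be pasted after the search command as one argument. The \u0027 form avoids the shell problem entirely because no single quote character ever appears on the command line.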