clp-s: Several issues searching for logs that contain escaped characters. #590

Open
gibber9809 opened this issue Nov 13, 2024 · 4 comments

gibber9809 commented Nov 13, 2024

Bug

Search does not work as expected when searching against JSON values that contain escaped characters. This is likely an issue with how string predicates are un-escaped, both for clp-style search and for wildcard matching.

Importantly, clp-s makes the decision to not un-escape raw JSON values before ingesting them, which causes some edge cases we are not currently considering during search.

For example, the value in {"key": "a: \"bcde\""} gets ingested verbatim as a: \"bcde\". However, the search *: "a: \"bcde\"" fails to return the matching result.
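To make "ingested verbatim" concrete, here is a minimal Python sketch of the difference between the raw text of the JSON value and what a standard parser returns. It does not call clp-s; the string slicing is only an illustrative stand-in for keeping the value un-unescaped, and the variable names are made up for this example.

```python
import json

# The record from the issue, written as a raw string so the
# backslashes are exactly what would appear in the log file.
record = r'{"key": "a: \"bcde\""}'

# A standard JSON parser resolves the escapes.
unescaped = json.loads(record)["key"]

# The raw text between the value's outer quotes, escapes left in place.
# (Illustrative slicing only; this is not how clp-s parses JSON.)
start = record.index(': "') + len(': "')
raw = record[start:record.rindex('"')]

print(unescaped)  # a: "bcde"
print(raw)        # a: \"bcde\"
```

A search predicate therefore has to be compared against the second form, with the backslashes still present, rather than the first.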

CLP version

0.1.2

Environment

clp-json package.

Reproduction steps

Ingest {"key": "a: \"bcde\""}
Perform the query *: "a: \"bcde\""

@gibber9809 gibber9809 added the bug Something isn't working label Nov 13, 2024
@gibber9809 gibber9809 self-assigned this Nov 13, 2024
@gibber9809
Contributor Author

Here is a longer example, reproduced with permission from Zulip:

{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4000, "log": "Logging message 4000: \"AumdjUCipW45\""}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4001, "log": "Logging message 4001: 'NcOBPgoyAMIz'"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4002, "log": "Logging message 4002: \\\"pn0b6GI4imwT\\\""}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4003, "log": "Logging message 4003: \\'PHXzcoLwF6E5\\'"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4004, "log": {"empty_dict": {}, "empty_string": "", "empty_list": [], "null": null, "message": "Logging message 4004: WwoTKSzXqKr4"}}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4005, "log": "Logging message 4005: \nIIj7lxPM2MQu"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4006, "log": "Logging message 4006: \\Bx03VwDor4ex"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4007, "log": "Logging message 4007: \rXtmxle8HOCD2"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4008, "log": "Logging message 4008: \tka5J5WdLyAJY"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4009, "log": "Logging message 4009: \\\rXQzF5AhEzzSt"}
{"@timestamp": "2024-11-05 20:21:48.140964", "id": 4010, "log": "Logging message 4010: \\\nGXD1u75wJrJV"}

Double quotes (4000): searching on the log field doesn't work, even when escaping or double-escaping the quotes. The workaround is to use wildcards.
Single quotes (4001): search works as expected from the UI, but runs into escaping issues on the command line.
Escaped double quotes (4002): again, not searchable except by working around it with wildcards.
Escaped single quotes (4003): not searchable at all without wildcards, in both the UI and the command line.
Whitespace and newlines (4005): \n cannot be properly escaped or double-escaped, though bizarrely log: "Logging message 4005: *\nIIj7lxPM2MQu" works.
Backslashes (4006): the backslash part of the log is not searchable without a wildcard.
Additional whitespace issues (4007-4010): again, whitespace is not searchable except with very specific wildcard usage.
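The cases above share one pattern: the stored value keeps its JSON escapes, so a search term only lines up with it once the term is in the same escaped form. The sketch below illustrates that idea in plain Python on abbreviated copies of three records (the @timestamp field is dropped for brevity). `to_raw_json_form` is a hypothetical helper, not part of clp-s, and it assumes the log producer escaped the text the same way `json.dumps` does, which JSON does not guarantee (for example, `'` may also be written as `\u0027`).

```python
import json


def to_raw_json_form(text: str) -> str:
    """Hypothetical helper: re-escape decoded text as json.dumps would,
    then drop the surrounding quotes."""
    return json.dumps(text, ensure_ascii=False)[1:-1]


# Abbreviated records 4000, 4002, and 4005, as raw file text.
raw_lines = [
    r'{"id": 4000, "log": "Logging message 4000: \"AumdjUCipW45\""}',
    r'{"id": 4002, "log": "Logging message 4002: \\\"pn0b6GI4imwT\\\""}',
    r'{"id": 4005, "log": "Logging message 4005: \nIIj7lxPM2MQu"}',
]

matches = []
for line in raw_lines:
    decoded = json.loads(line)["log"]   # what a user sees and types
    needle = to_raw_json_form(decoded)  # escaped form, as stored verbatim
    matches.append(needle in line)

print(matches)  # [True, True, True]
```

Searching for the decoded text directly would miss all three records, since none of them contains a bare `"` or a literal newline character inside the value.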

@gibber9809
Contributor Author

gibber9809 commented Dec 4, 2024

The draft PR fixes most of these issues, except for 4002 and 4003. Those two cases seem to run into another issue in Grep::process_raw_query where seemingly correct query strings generate no relevant subqueries. Interestingly this appears to be related to the last quote -- e.g. the query '"Logging message 4002: \\\"pn0b6GI4imwT\\\""' will not work, but the query '"Logging message 4002: \\\"pn0b6GI4imwT\\*"' will.

The issue with 4001 is actually just a bash issue: bash does not provide any mechanism to escape single quotes (') inside a single-quoted string. Instead, the query needs to either surround the string with double quotes, represent the single quotes using the new Unicode escape sequence support (e.g. 'log: "Logging message 4001: \u0027NcOBPgoyAMIz\u0027"'), or glue strings together in the terminal (e.g. 'log: "Logging message ...'"'"'NCo...'"'"'"').
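The glued-string form is tedious to write by hand. As a small sketch, Python's standard `shlex` module can generate it and confirm that a POSIX shell would parse it back into the intended query; the query text is taken from record 4001, and nothing here invokes clp-s itself.

```python
import shlex

# The query we want the search tool to receive as a single argument.
query = 'log: "Logging message 4001: \'NcOBPgoyAMIz\'"'

# shlex.quote wraps the string in single quotes and replaces each
# embedded ' with '"'"' (close quote, double-quoted ', reopen quote).
quoted = shlex.quote(query)
print(quoted)

# shlex.split applies POSIX shell quoting rules, so this checks that
# the shell would hand the original query back as one argument.
round_tripped = shlex.split(quoted)
```

The printed value can be pasted after the search command in a bash terminal as a single argument.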

@gibber9809
Contributor Author

Removing the link to #622 since it doesn't fix the 4002/4003 cases. Those may be fixed by #428, but there also may be bugs in the heuristic CLP query generation code that need to be fixed.

@gibber9809
Contributor Author

Upon closer examination it seems like 4002/4003 are caused by a bug in generate_logtypes_and_vars_for_subquery where we don't first unescape parts of the query string before adding it to the logtype. This issue was fixed in clp at the end of 2023, but the fix was never ported to clp-s.
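For readers unfamiliar with the step being described, the following is a generic sketch of removing one level of backslash escaping from a query token before it is used for matching. It is not the clp or clp-s implementation: the function name, the handling of a trailing lone backslash, and the choice of Python are all assumptions made for illustration.

```python
def unescape(token: str, escape_char: str = "\\") -> str:
    """Remove one level of escaping: each escape_char is dropped and
    the character after it is kept literally."""
    out = []
    escaped = False
    for ch in token:
        if escaped:
            out.append(ch)      # keep the escaped character as-is
            escaped = False
        elif ch == escape_char:
            escaped = True      # drop the escape, remember its effect
        else:
            out.append(ch)
    if escaped:
        out.append(escape_char)  # assumption: keep a trailing lone escape
    return "".join(out)


print(unescape(r'\\\"pn0b6GI4imwT\\\"'))  # \"pn0b6GI4imwT\"
print(unescape(r'a\*b'))                  # a*b
```

If a step like this is skipped, the escape characters from the query end up in the pattern being matched, so it no longer lines up with the text it is compared against.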

At the least we'll port the fix to clp-s, and we will likely also start deleting copied code from Grep:: to avoid this sort of issue in the future.
