Azure/azure-sdk-for-python

[engsys] Batch Sanitizers have the potential to mangle JSON responses and make them invalid

kdestin opened this issue · 2 comments

Describe the bug

#35196 introduced a collection of "global" sanitizers that scrub secrets from recordings as they are written to disk.

The sanitization is Regex based and not "file format aware". Some of the sanitizers as defined have the potential to leave responses in a corrupt state (e.g. as invalid json).

The sanitizers for query strings were the source of my issue, but this might be an issue with other sanitizers too:

{"regex": "(client_id=)[^&]+", "value": "$1sanitized"},
{"regex": "(client_secret=)[^&]+", "value": "$1sanitized"},
{"regex": "(client_assertion=)[^&]+", "value": "$1sanitized"},
{"regex": "(?:[\\?&](sv|sig|se|srt|ss|sp)=)(?<secret>(([^&\\s]*)))", "value": SANITIZED},
{"regex": "refresh_token=(?<group>.*?)(?=&|$)", "group_for_replace": "group", "value": SANITIZED},
{"regex": "access_token=(?<group>.*?)(?=&|$)", "group_for_replace": "group", "value": SANITIZED},
{"regex": "token=(?<token>[^\\u0026]+)($|\\u0026)", "group_for_replace": "token", "value": SANITIZED},

To Reproduce
Steps to reproduce the behavior:

  1. Write a test that receives an HTTP response which'll get sanitized. Example (&sig= query param)
{
    "secretsType": "Sas",
    "sasToken": "sv=2021-10-04&si=azureml-system-datastore-policy&sr=c&sig=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
  1. Observe that your test passes when run in live mode (with the following environment variable set AZURE_TEST_RUN_LIVE=true).

  2. Try to replay the test in recording mode (with the following environment variable set AZURE_TEST_RUN_LIVE=false).

Expected behavior

The test should pass, using the sanitized responses from the record.

Actual Behavior

This sanitizer mangles the JSON response:

{"regex": "(?:[\\?&](sv|sig|se|srt|ss|sp)=)(?<secret>(([^&\\s]*)))", "value": SANITIZED},

  • It "eats" the trailing double quote (") that terminates the JSON string, which leaves the response as invalid JSON
  • It "eats" the leading ampersand (&), which corrupts the preceding query parameter.
            except ValueError as err:
>               raise DecodeError(
                    message="JSON is invalid: {}".format(err),
                    response=response,
                    error=err,
E                   azure.core.exceptions.DecodeError: JSON is invalid: Invalid control character at: line 3 column 78 (char 103)
E                   Content: {
E                     "secretsType": "Sas",
E                     "sasToken": "sv=2021-10-04&si=azureml-system-datastore-policy&sr=cSanitized
E                   }

../../../../venv/lib/python3.8/site-packages/azure/core/pipeline/policies/_universal.py:616: DecodeError

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context

ML team has a regex for query parameters that might be helpful in finding a fix

def _query_param_regex(name, *, only_value=True) -> str:
"""Builds a regex that matches against a query parameter of the form
(?|&)name=value
:param: name - The name of the query parameter to match against
:param: only_value (Optional) - Whether the regex match should just
match the value of the query param (instead of the name and value)
"""
# Character that marks the end of a query string value
QUERY_STRING_DELIMETER = "&#"
value_regex = rf'[^{QUERY_STRING_DELIMETER}"\s]*'
name_regex = rf"(?<=[?&]){name}="
if only_value:
name_regex = rf"(?<={name_regex})"
return rf'{name_regex}{value_regex}(?=[{QUERY_STRING_DELIMETER}"\s]|$)'

@mccoyp can you take a look?

Resolved by #35419