WASM: Some builtins don't work with strings >64 characters as inputs

Question

WASM: Some builtins don't work with strings >64 characters as inputs

sandhose opened this issue 7 months ago · 4 comments

Short description

Some builtins, at least regex.find_all_string_submatch_n, don't work for input strings longer than 64 characters

Steps To Reproduce

`repro.rego`:

package repro

const_c64 := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
const_c65 := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

test.input_no_dollar_c64 { match_no_dollar(input.c64) }
test.input_no_dollar_c65 { match_no_dollar(input.c65) }
test.input_dollar_c64 { match_dollar(input.c64) }
test.input_dollar_c65 { match_dollar(input.c65) }
test.const_no_dollar_c64 { match_no_dollar(const_c64) }
test.const_no_dollar_c65 { match_no_dollar(const_c65) }
test.const_dollar_c64 { match_dollar(const_c64) }
test.const_dollar_c65 { match_dollar(const_c65) }

match_no_dollar(s) = m {
	[m] = regex.find_all_string_submatch_n("^.*", s, 1)
}

match_dollar(s) = m {
	[m] = regex.find_all_string_submatch_n("^.*$", s, 1)
}

`input.json`:

{
  "c64": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
  "c65": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}

Running with `-t rego`

They all evaluate truthy:

$ opa eval --format pretty -t rego -d ./repro.rego 'data.repro.test' -i input.json 
[
  "const_dollar_c64",
  "const_dollar_c65",
  "const_no_dollar_c64",
  "const_no_dollar_c65",
  "input_dollar_c64",
  "input_dollar_c65",
  "input_no_dollar_c64",
  "input_no_dollar_c65"
]

Running with `-t wasm`

They all evaluate truthy:

$ opa eval --format pretty -t wasm -d ./repro.rego 'data.repro.test' -i input.json 
[
  "const_dollar_c64",
  "const_dollar_c65",
  "const_no_dollar_c64",
  "const_no_dollar_c65",
  "input_dollar_c64",
  "input_no_dollar_c64",
  "input_no_dollar_c65"
]

Notice how input_dollar_c65 is the only test evaluating as undefined.
So this issue is only when:

the string comes from the input
it is more than 64 characters long
the regex matches until the end of the string

Expected behavior

That it correctly evaluates, both in wasm and rego mode

Additional context

Tested both on the latest version and old version, as far back as v0.36.0

Answer 1 · 2023-11-03T09:24:56.000Z

I tried to see if it affected any natively implemented builtin or is specific to the regexp one. I did this is a simplified repro rego:

package repro

c65 := count(input.c65)
c64 := count(input.c65)

and it turns out that the result of this is the same for -t wasm and -t rego:

{
  "c64": 63,
  "c65": 64
}

Surprise! The input data is one character short, it seems. But that only slightly varies the bug report, not in any material way.

Also, I guess you've already ruled out that it's a more general problem, since without the trailing $ the outcomes are as expected.

So I'll keep digging.

Answer 2 · 2023-11-03T09:36:30.000Z

OK at this point, I'm not sure what works and what doesn't and what shouldn't 😅 -- We're passing UNANCHORED to re2 here and I guess that makes sense for submatches, but it might not square with anchored regexps fed into find_all_string_submatch_n.

So, despite this being somewhat buggy -- Where did the need arise for you to use an anchored RE with find_all_string_submatch_n?

Answer 3 · 2023-11-09T10:10:19.000Z

Coming back to this! 🔍

The long input is the only one of the four data points that hits this break: https://github.com/open-policy-agent/opa/blob/main/wasm/src/regex.cc#L126.
Digging in, I've found that text.size() for the input.c65 input is 65, while its s->len is 64, causing it to break here
text.size() comes from StringPiece, which calls strlen() on a char*; which is 65. ⚡

Answer 4 · 2023-11-09T11:05:27.000Z

☝️ PR is ready. Thanks again for helping uncover this ✨

Short description

Steps To Reproduce

repro.rego:

input.json:

Running with -t rego