WASM: Some builtins don't work with strings >64 characters as inputs
sandhose opened this issue · 4 comments
Short description
Some builtins, at least regex.find_all_string_submatch_n
, don't work for input
strings longer than 64 characters
Steps To Reproduce
repro.rego
:
package repro
const_c64 := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
const_c65 := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
test.input_no_dollar_c64 { match_no_dollar(input.c64) }
test.input_no_dollar_c65 { match_no_dollar(input.c65) }
test.input_dollar_c64 { match_dollar(input.c64) }
test.input_dollar_c65 { match_dollar(input.c65) }
test.const_no_dollar_c64 { match_no_dollar(const_c64) }
test.const_no_dollar_c65 { match_no_dollar(const_c65) }
test.const_dollar_c64 { match_dollar(const_c64) }
test.const_dollar_c65 { match_dollar(const_c65) }
match_no_dollar(s) = m {
[m] = regex.find_all_string_submatch_n("^.*", s, 1)
}
match_dollar(s) = m {
[m] = regex.find_all_string_submatch_n("^.*$", s, 1)
}
input.json
:
{
"c64": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"c65": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
Running with -t rego
They all evaluate truthy:
$ opa eval --format pretty -t rego -d ./repro.rego 'data.repro.test' -i input.json
[
"const_dollar_c64",
"const_dollar_c65",
"const_no_dollar_c64",
"const_no_dollar_c65",
"input_dollar_c64",
"input_dollar_c65",
"input_no_dollar_c64",
"input_no_dollar_c65"
]
Running with -t wasm
They all evaluate truthy:
$ opa eval --format pretty -t wasm -d ./repro.rego 'data.repro.test' -i input.json
[
"const_dollar_c64",
"const_dollar_c65",
"const_no_dollar_c64",
"const_no_dollar_c65",
"input_dollar_c64",
"input_no_dollar_c64",
"input_no_dollar_c65"
]
Notice how input_dollar_c65
is the only test evaluating as undefined
.
So this issue is only when:
- the string comes from the input
- it is more than 64 characters long
- the regex matches until the end of the string
Expected behavior
That it correctly evaluates, both in wasm
and rego
mode
Additional context
Tested both on the latest version and old version, as far back as v0.36.0
I tried to see if it affected any natively implemented builtin or is specific to the regexp one. I did this is a simplified repro rego:
package repro
c65 := count(input.c65)
c64 := count(input.c65)
and it turns out that the result of this is the same for -t wasm
and -t rego
:
{
"c64": 63,
"c65": 64
}
Surprise! The input data is one character short, it seems. But that only slightly varies the bug report, not in any material way.
Also, I guess you've already ruled out that it's a more general problem, since without the trailing $
the outcomes are as expected.
So I'll keep digging.
OK at this point, I'm not sure what works and what doesn't and what shouldn't 😅 -- We're passing UNANCHORED to re2 here and I guess that makes sense for submatches, but it might not square with anchored regexps fed into find_all_string_submatch_n
.
So, despite this being somewhat buggy -- Where did the need arise for you to use an anchored RE with find_all_string_submatch_n?
Coming back to this! 🔍
- The long input is the only one of the four data points that hits this break: https://github.com/open-policy-agent/opa/blob/main/wasm/src/regex.cc#L126.
- Digging in, I've found that
text.size()
for the input.c65 input is 65, while itss->len
is 64, causing it to break here text.size()
comes from StringPiece, which callsstrlen()
on achar*
; which is 65. ⚡
☝️ PR is ready. Thanks again for helping uncover this ✨