Infinite loop when using a blank regular expression
Closed this issue · 14 comments
Here's a simple test case which causes an infinite loop:
regex = RE2::Regexp.new('')
regex.scan('test').map { |match| match }
This also fails:
regex = RE2::Regexp.new('()')
regex.scan('test').map { |match| match }
In Ruby, this works as follows:
'test'.scan('').map { |match| match }
=> ["", "", "", "", ""]
Related issue: https://stackoverflow.com/a/30047809/1992201
Notes from the re2.h file about this: https://github.com/google/re2/blob/master/re2/re2.h#L133-L138
Relates to these lines: https://github.com/mudge/re2/blob/v1.0.0/ext/re2/re2.cc#L224-L225
It looks like this is a problem with the implementation of Scanner
: it loops indefinitely as long as there are matches even if the input wasn't advanced.
Correct form:
r = RE2::Regexp.new('')
scanner = r.scan('test')
scanner.scan.to_a
Incorrect form:
r = RE2::Regexp.new('')
r.scan('test').map { |x| x }
Great catch, @stanhu: the comments from re2.h
are particularly useful. What do you think the right behaviour should be re:
// If the regular expression being used might match
// an empty string, the loop body must check for this case and either
// advance the string or break out of the loop.
Should we advance the input ourselves or terminate the loop?
Looks like advancing the input ourselves using remove_prefix
whenever we encounter an empty match or an empty capturing match will solve the infinite loop but it's a little finicky getting it to return nil
when the end of the input is encountered.
Yeah, I attempted the same with this, but it did not quite work:
--- a/ext/re2/re2.cc
+++ b/ext/re2/re2.cc
@@ -204,12 +204,15 @@ static VALUE re2_scanner_rewind(VALUE self) {
*/
static VALUE re2_scanner_scan(VALUE self) {
int i;
+ int found_match = 0;
+ size_t original_size;
re2_pattern *p;
re2_scanner *c;
VALUE result;
Data_Get_Struct(self, re2_scanner, c);
Data_Get_Struct(c->regexp, re2_pattern, p);
+ original_size = c->input->size();
vector<RE2::Arg> argv(c->number_of_capturing_groups);
vector<RE2::Arg*> args(c->number_of_capturing_groups);
@@ -228,6 +231,7 @@ static VALUE re2_scanner_scan(VALUE self) {
if (matches[i].empty()) {
rb_ary_push(result, Qnil);
} else {
+ found_match = 1;
rb_ary_push(result, ENCODED_STR_NEW(matches[i].data(),
matches[i].size(),
p->pattern->options().utf8() ? "UTF-8" : "ISO-8859-1"));
@@ -237,6 +241,17 @@ static VALUE re2_scanner_scan(VALUE self) {
result = Qnil;
}
+ /* In case we matched a null string, advance the pointer to avoid an infinite loop */
+ if (!found_match &&
+ c->number_of_capturing_groups &&
+ c->input->size() &&
+ (original_size == c->input->size())) {
+ c->input->remove_prefix(1);
+ }
+
return result;
}
At GitLab, we're working around it by changing the Ruby code by iterating over match
. I'll send the patch later.
I had something working by removing a 1 byte prefix from input (if the input size is not zero) in two cases:
- When the match returned in a capturing group is empty;
- When there is a match (viz. FindAndConsumeN returns true) but the number of capturing groups is zero.
I'm out for the evening but will take a look at this as soon as possible as the main challenge is keeping the logic clean.
The other thing this makes me wonder is whether we should distinguish between an empty capturing group (e.g. by returning an empty string) and a match without a capturing group at all. That would likely be backward incompatible though so we can defer that for now.
FYI, this is what we did in GitLab: https://gitlab.com/nick.thomas/gitlab-ce/commit/dabc1fa388143808bab792448504dac4bae8992b
Ruby's Regexp distinguishes like so:
"foo".scan(//) # ["", "", "", ""]
"foo".scan(/()/) # [[""], [""], [""], [""]]
"foo".scan(/foo()/) # [[""]]
"foo".scan(/(foo())/) [["foo", ""]]
With the change in #35, these would be the corresponding results:
[9] pry(main)> RE2::Regexp.new('').scan("foo").to_a
=> [[], [], []]
[10] pry(main)> RE2::Regexp.new('()').scan("foo").to_a
=> [[nil], [nil], [nil]]
[11] pry(main)> RE2::Regexp.new('foo()').scan("foo").to_a
=> [[nil]]
[12] pry(main)> RE2::Regexp.new('(foo())').scan("foo").to_a
=> [["foo", nil]]
Aside from the empty string/nil
difference, re2 results one less match with an empty pattern which makes intuitive sense to me (it matches each character and then stops) but perhaps I'm missing something?
Mmm, the base case in Ruby of an empty string and empty pattern is making me wonder if I should preserve this in re2 as well:
> ''.scan(//)
=> [""]
In order to do this, I might need to rework scanning to store if a scanner is exhausted: that is, where matching has advanced beyond the end of the input string, not just when the input size is 0.
Updated #35 to match Ruby's behaviour (while keeping compatibility re returning nil
versus an empty string):
[1] pry(main)> RE2::Regexp.new('').scan('foo').to_a
=> [[], [], [], []]
[2] pry(main)> RE2::Regexp.new('()').scan('foo').to_a
=> [[nil], [nil], [nil], [nil]]
[3] pry(main)> RE2::Regexp.new('foo()').scan('foo').to_a
=> [[nil]]
[4] pry(main)> RE2::Regexp.new('(foo())').scan('foo').to_a
=> [["foo", nil]]
Does this now match your expectations?
I've released v1.1.0 which should hopefully fix this problem: please give it a whirl and let me know if it solves your issue.
Thanks for the quick turnaround on this @mudge ! https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/13036 updates gitlab to use re2 v1.1.0 and allows me to delete my homegrown scan :)
I tested the nil and empty match cases on a 1MiB build trace and they were both very fast. Great work!
You're very welcome! (Deleting code makes me very happy indeed.)