mudge/re2

Infinite loop when using a blank regular expression

Closed this issue · 14 comments

Here's a simple test case which causes an infinite loop:

regex = RE2::Regexp.new('')
regex.scan('test').map { |match| match }

This also fails:

regex = RE2::Regexp.new('()')
regex.scan('test').map { |match| match }

In Ruby, this works as follows:

'test'.scan('').map { |match| match }
=> ["", "", "", "", ""]

Related issue: https://stackoverflow.com/a/30047809/1992201

Notes from the re2.h file about this: https://github.com/google/re2/blob/master/re2/re2.h#L133-L138

Relates to these lines: https://github.com/mudge/re2/blob/v1.0.0/ext/re2/re2.cc#L224-L225

It looks like this is a problem with the implementation of Scanner: it loops indefinitely as long as there are matches even if the input wasn't advanced.

Correct form:

r = RE2::Regexp.new('')
scanner = r.scan('test')
scanner.scan.to_a

Incorrect form:

r = RE2::Regexp.new('')
r.scan('test').map { |x| x }
mudge commented

Great catch, @stanhu: the comments from re2.h are particularly useful. What do you think the right behaviour should be re:

// If the regular expression being used might match
// an empty string, the loop body must check for this case and either
// advance the string or break out of the loop.

Should we advance the input ourselves or terminate the loop?

mudge commented

Looks like advancing the input ourselves using remove_prefix whenever we encounter an empty match or an empty capturing match will solve the infinite loop but it's a little finicky getting it to return nil when the end of the input is encountered.

Yeah, I attempted the same with this, but it did not quite work:

--- a/ext/re2/re2.cc
+++ b/ext/re2/re2.cc
@@ -204,12 +204,15 @@ static VALUE re2_scanner_rewind(VALUE self) {
  */
 static VALUE re2_scanner_scan(VALUE self) {
   int i;
+  int found_match = 0;
+  size_t original_size;
   re2_pattern *p;
   re2_scanner *c;
   VALUE result;
 
   Data_Get_Struct(self, re2_scanner, c);
   Data_Get_Struct(c->regexp, re2_pattern, p);
+  original_size = c->input->size();

   vector<RE2::Arg> argv(c->number_of_capturing_groups);
   vector<RE2::Arg*> args(c->number_of_capturing_groups);
@@ -228,6 +231,7 @@ static VALUE re2_scanner_scan(VALUE self) {
       if (matches[i].empty()) {
         rb_ary_push(result, Qnil);
       } else {
+        found_match = 1;
         rb_ary_push(result, ENCODED_STR_NEW(matches[i].data(),
               matches[i].size(),
               p->pattern->options().utf8() ? "UTF-8" : "ISO-8859-1"));
@@ -237,6 +241,17 @@ static VALUE re2_scanner_scan(VALUE self) {
     result = Qnil;
   }

+  /* In case we matched a null string, advance the pointer to avoid an infinite loop */
+  if (!found_match &&
+      c->number_of_capturing_groups &&
+      c->input->size() &&
+      (original_size == c->input->size())) {
+      c->input->remove_prefix(1);
+  }
+
   return result;
 }

At GitLab, we're working around it by changing the Ruby code by iterating over match. I'll send the patch later.

mudge commented

I had something working by removing a 1 byte prefix from input (if the input size is not zero) in two cases:

  1. When the match returned in a capturing group is empty;
  2. When there is a match (viz. FindAndConsumeN returns true) but the number of capturing groups is zero.

I'm out for the evening but will take a look at this as soon as possible as the main challenge is keeping the logic clean.

The other thing this makes me wonder is whether we should distinguish between an empty capturing group (e.g. by returning an empty string) and a match without a capturing group at all. That would likely be backward incompatible though so we can defer that for now.

Ruby's Regexp distinguishes like so:

"foo".scan(//) # ["", "", "", ""]
"foo".scan(/()/) # [[""], [""], [""], [""]]
"foo".scan(/foo()/) # [[""]]
"foo".scan(/(foo())/) [["foo", ""]]
mudge commented

With the change in #35, these would be the corresponding results:

[9] pry(main)> RE2::Regexp.new('').scan("foo").to_a
=> [[], [], []]
[10] pry(main)> RE2::Regexp.new('()').scan("foo").to_a
=> [[nil], [nil], [nil]]
[11] pry(main)> RE2::Regexp.new('foo()').scan("foo").to_a
=> [[nil]]
[12] pry(main)> RE2::Regexp.new('(foo())').scan("foo").to_a
=> [["foo", nil]]

Aside from the empty string/nil difference, re2 results one less match with an empty pattern which makes intuitive sense to me (it matches each character and then stops) but perhaps I'm missing something?

mudge commented

Mmm, the base case in Ruby of an empty string and empty pattern is making me wonder if I should preserve this in re2 as well:

> ''.scan(//)
=> [""]

In order to do this, I might need to rework scanning to store if a scanner is exhausted: that is, where matching has advanced beyond the end of the input string, not just when the input size is 0.

mudge commented

Updated #35 to match Ruby's behaviour (while keeping compatibility re returning nil versus an empty string):

[1] pry(main)> RE2::Regexp.new('').scan('foo').to_a
=> [[], [], [], []]
[2] pry(main)> RE2::Regexp.new('()').scan('foo').to_a
=> [[nil], [nil], [nil], [nil]]
[3] pry(main)> RE2::Regexp.new('foo()').scan('foo').to_a
=> [[nil]]
[4] pry(main)> RE2::Regexp.new('(foo())').scan('foo').to_a
=> [["foo", nil]]

Does this now match your expectations?

mudge commented

I've released v1.1.0 which should hopefully fix this problem: please give it a whirl and let me know if it solves your issue.

Thanks for the quick turnaround on this @mudge ! https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/13036 updates gitlab to use re2 v1.1.0 and allows me to delete my homegrown scan :)

I tested the nil and empty match cases on a 1MiB build trace and they were both very fast. Great work!

mudge commented

You're very welcome! (Deleting code makes me very happy indeed.)

@mudge Thank you!