Infinite loop when using a blank regular expression

Question

Infinite loop when using a blank regular expression

Closed this issue 7 years ago · 14 comments

Here's a simple test case which causes an infinite loop:

regex = RE2::Regexp.new('')
regex.scan('test').map { |match| match }

This also fails:

regex = RE2::Regexp.new('()')
regex.scan('test').map { |match| match }

In Ruby, this works as follows:

'test'.scan('').map { |match| match }
=> ["", "", "", "", ""]

Related issue: https://stackoverflow.com/a/30047809/1992201

Notes from the re2.h file about this: https://github.com/google/re2/blob/master/re2/re2.h#L133-L138

Relates to these lines: https://github.com/mudge/re2/blob/v1.0.0/ext/re2/re2.cc#L224-L225

Answer 1 · 2017-07-20T08:55:09.000Z

It looks like this is a problem with the implementation of Scanner: it loops indefinitely as long as there are matches even if the input wasn't advanced.

Correct form:

r = RE2::Regexp.new('')
scanner = r.scan('test')
scanner.scan.to_a

Incorrect form:

r = RE2::Regexp.new('')
r.scan('test').map { |x| x }

Answer 2 · 2017-07-20T13:21:47.000Z

Great catch, @stanhu: the comments from re2.h are particularly useful. What do you think the right behaviour should be re:

// If the regular expression being used might match
// an empty string, the loop body must check for this case and either
// advance the string or break out of the loop.

Should we advance the input ourselves or terminate the loop?

Answer 3 · 2017-07-20T15:24:06.000Z

Looks like advancing the input ourselves using remove_prefix whenever we encounter an empty match or an empty capturing match will solve the infinite loop but it's a little finicky getting it to return nil when the end of the input is encountered.

Answer 4 · 2017-07-20T15:53:56.000Z

Yeah, I attempted the same with this, but it did not quite work:

--- a/ext/re2/re2.cc
+++ b/ext/re2/re2.cc
@@ -204,12 +204,15 @@ static VALUE re2_scanner_rewind(VALUE self) {
  */
 static VALUE re2_scanner_scan(VALUE self) {
   int i;
+  int found_match = 0;
+  size_t original_size;
   re2_pattern *p;
   re2_scanner *c;
   VALUE result;
 
   Data_Get_Struct(self, re2_scanner, c);
   Data_Get_Struct(c->regexp, re2_pattern, p);
+  original_size = c->input->size();

   vector<RE2::Arg> argv(c->number_of_capturing_groups);
   vector<RE2::Arg*> args(c->number_of_capturing_groups);
@@ -228,6 +231,7 @@ static VALUE re2_scanner_scan(VALUE self) {
       if (matches[i].empty()) {
         rb_ary_push(result, Qnil);
       } else {
+        found_match = 1;
         rb_ary_push(result, ENCODED_STR_NEW(matches[i].data(),
               matches[i].size(),
               p->pattern->options().utf8() ? "UTF-8" : "ISO-8859-1"));
@@ -237,6 +241,17 @@ static VALUE re2_scanner_scan(VALUE self) {
     result = Qnil;
   }

+  /* In case we matched a null string, advance the pointer to avoid an infinite loop */
+  if (!found_match &&
+      c->number_of_capturing_groups &&
+      c->input->size() &&
+      (original_size == c->input->size())) {
+      c->input->remove_prefix(1);
+  }
+
   return result;
 }

At GitLab, we're working around it by changing the Ruby code by iterating over match. I'll send the patch later.

Answer 5 · 2017-07-20T16:36:45.000Z

I had something working by removing a 1 byte prefix from input (if the input size is not zero) in two cases:

When the match returned in a capturing group is empty;
When there is a match (viz. FindAndConsumeN returns true) but the number of capturing groups is zero.

I'm out for the evening but will take a look at this as soon as possible as the main challenge is keeping the logic clean.

The other thing this makes me wonder is whether we should distinguish between an empty capturing group (e.g. by returning an empty string) and a match without a capturing group at all. That would likely be backward incompatible though so we can defer that for now.

Answer 6 · 2017-07-20T16:59:34.000Z

FYI, this is what we did in GitLab: https://gitlab.com/nick.thomas/gitlab-ce/commit/dabc1fa388143808bab792448504dac4bae8992b

Answer 7 · 2017-07-22T17:46:05.000Z

Ruby's Regexp distinguishes like so:

"foo".scan(//) # ["", "", "", ""]
"foo".scan(/()/) # [[""], [""], [""], [""]]
"foo".scan(/foo()/) # [[""]]
"foo".scan(/(foo())/) [["foo", ""]]

Answer 8 · 2017-07-22T17:54:55.000Z

With the change in #35, these would be the corresponding results:

[9] pry(main)> RE2::Regexp.new('').scan("foo").to_a
=> [[], [], []]
[10] pry(main)> RE2::Regexp.new('()').scan("foo").to_a
=> [[nil], [nil], [nil]]
[11] pry(main)> RE2::Regexp.new('foo()').scan("foo").to_a
=> [[nil]]
[12] pry(main)> RE2::Regexp.new('(foo())').scan("foo").to_a
=> [["foo", nil]]

Aside from the empty string/nil difference, re2 results one less match with an empty pattern which makes intuitive sense to me (it matches each character and then stops) but perhaps I'm missing something?

Answer 9 · 2017-07-22T18:04:01.000Z

Mmm, the base case in Ruby of an empty string and empty pattern is making me wonder if I should preserve this in re2 as well:

> ''.scan(//)
=> [""]

In order to do this, I might need to rework scanning to store if a scanner is exhausted: that is, where matching has advanced beyond the end of the input string, not just when the input size is 0.

Answer 10 · 2017-07-23T10:20:18.000Z

Updated #35 to match Ruby's behaviour (while keeping compatibility re returning nil versus an empty string):

[1] pry(main)> RE2::Regexp.new('').scan('foo').to_a
=> [[], [], [], []]
[2] pry(main)> RE2::Regexp.new('()').scan('foo').to_a
=> [[nil], [nil], [nil], [nil]]
[3] pry(main)> RE2::Regexp.new('foo()').scan('foo').to_a
=> [[nil]]
[4] pry(main)> RE2::Regexp.new('(foo())').scan('foo').to_a
=> [["foo", nil]]

Does this now match your expectations?

Answer 11 · 2017-07-23T21:01:26.000Z

I've released v1.1.0 which should hopefully fix this problem: please give it a whirl and let me know if it solves your issue.

Answer 12 · 2017-07-24T09:58:34.000Z

Thanks for the quick turnaround on this @mudge ! https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/13036 updates gitlab to use re2 v1.1.0 and allows me to delete my homegrown scan :)

I tested the nil and empty match cases on a 1MiB build trace and they were both very fast. Great work!

Answer 13 · 2017-07-24T13:12:23.000Z

You're very welcome! (Deleting code makes me very happy indeed.)

Answer 14 · 2017-07-24T15:05:51.000Z

@mudge Thank you!