mudge/re2

something like ruby scan

mattes opened this issue · 9 comments

Something like this ...
http://ruby-doc.org/core-2.0.0/String.html#method-i-scan

a = "cruel world"
a.scan(/\w+/)        #=> ["cruel", "world"]

This does not seem to be possible with re2 at the moment:

r = RE2::Regexp.new('(\w+)')
r.match('cruel world')   #=> #<RE2::MatchData "cruel" 1:"cruel">

I asked @mudge about this. He added:

While the underlying re2 library does seem to support this via its
"Consume" and "FindAndConsume" operations (cf.
https://code.google.com/p/re2/source/browse/re2/re2.h#116), I haven't
added support for this in the gem.

https://code.google.com/p/re2/source/browse/re2/re2.h#116

// SCANNING TEXT INCREMENTALLY
//
// The "Consume" operation may be useful if you want to repeatedly
// match regular expressions at the front of a string and skip over
// them as they match.  This requires use of the "StringPiece" type,
// which represents a sub-range of a real string.
//
// Example: read lines of the form "var = value" from a string.
//      string contents = ...;          // Fill string somehow
//      StringPiece input(contents);    // Wrap a StringPiece around it
//
//      string var;
//      int value;
//      while (RE2::Consume(&input, "(\\w+) = (\\d+)\n", &var, &value)) {
//        ...;
//      }
//
// Each successful call to "Consume" will set "var/value", and also
// advance "input" so it points past the matched text.  Note that if the
// regular expression matches an empty string, input will advance
// by 0 bytes.  If the regular expression being used might match
// an empty string, the loop body must check for this case and either
// advance the string or break out of the loop.
//
// The "FindAndConsume" operation is similar to "Consume" but does not
// anchor your match at the beginning of the string.  For example, you
// could extract all words from a string by repeatedly calling
//     RE2::FindAndConsume(&input, "(\\w+)", &word)
//
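To make the empty-match caveat above concrete, here is a minimal sketch of the same "find and consume" loop in plain Ruby. It is an illustration only: it uses Ruby's built-in Regexp and the standard library's StringScanner as stand-ins for RE2 and StringPiece, and find_and_consume_all is a hypothetical helper name, not part of the re2 gem.

```ruby
require "strscan"

# Hypothetical helper (not part of the re2 gem): repeatedly search for
# `pattern` from the current position and consume the input up to the
# end of each match, much like looping over RE2::FindAndConsume.
def find_and_consume_all(input, pattern)
  scanner = StringScanner.new(input)
  found = []
  until scanner.eos?
    # scan_until moves the scan pointer past the match, or returns nil
    # when the pattern does not match in the remaining input.
    break unless scanner.scan_until(pattern)

    if scanner.matched.empty?
      # Empty match: the pointer did not move, so advance one character
      # by hand to avoid looping forever.
      scanner.getch
      next
    end
    found << scanner.matched
  end
  found
end

find_and_consume_all("cruel world", /\w+/) #=> ["cruel", "world"]
```

Skipping empty matches is one possible policy; breaking out of the loop instead, as the header comment also suggests, would work too.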

I was thinking that perhaps we should have an interface that returns a lazy enumerator: that way, matches would be produced one at a time, so consumption would stay memory-efficient even for potentially huge strings.

e.g.

# Would enumerate over every single match found, executing the block as it goes.
RE2('<li>(.*?)</li>').match_all(some_huge_string) do |match|
  puts match
end

# Would return an enumerator with only the first match populated;
# calling next would move it along.
matches = RE2('<li>(.*?)</li>').match_all(some_huge_string)
matches.next #=> First match
matches.next #=> Second match
matches.next #=> raises StopIteration when finished.
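For what it's worth, the proposed interface can be prototyped in plain Ruby to check that the semantics feel right. This is only a sketch under stated assumptions: it uses Ruby's built-in Regexp rather than RE2, and match_all here is a hypothetical top-level method taking the pattern as an argument, not something the gem provides.

```ruby
# Hypothetical prototype of the proposed match_all (not part of the re2 gem).
# With a block it yields each MatchData in turn; without one it returns an
# Enumerator that only searches for the next match when asked.
def match_all(pattern, text, &block)
  enum = Enumerator.new do |yielder|
    position = 0
    while position <= text.length && (match = pattern.match(text, position))
      yielder << match
      # Continue after the match; step one extra character after an empty
      # match so the loop cannot stall.
      position = match.end(0) == match.begin(0) ? match.end(0) + 1 : match.end(0)
    end
  end
  block ? enum.each(&block) : enum
end

matches = match_all(/<li>(.*?)<\/li>/, "<li>one</li><li>two</li>")
matches.next[1] #=> "one"
matches.next[1] #=> "two"
```

A further call to matches.next at this point raises StopIteration, as in the proposal above.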

+1 for lazy enumerators

I've pushed a new prerelease version of the gem with encoding awareness and a new method, consume. Install it with gem install re2 --pre.

You can use it like so:

require "re2"
re = RE2('(\w+)')
consumer = re.consume("Foo bar baz quux")
consumer.each do |matches|
  # ...
end

# Alternatively...
consumer.rewind # to reset the state of the consumer

enum = consumer.to_enum
enum.next #=> ["Foo"]
enum.next #=> ["bar"]

Please give it a go and let me know if that meets your needs.

awesome awesome. will try it asap.

Just revisiting this: did you get a chance to try out the prerelease gem?

Fixed as of v0.6.0 with RE2::Regexp#scan, which returns an RE2::Scanner:

scanner = RE2('(\w+)').scan("Some long list of words")
scanner.scan #=> ["Some"]
scanner.scan #=> ["long"]
scanner.scan #=> ["list"]