something like ruby scan

Question

mattes opened this issue 11 years ago · 9 comments

I'd like something like Ruby's String#scan:
http://ruby-doc.org/core-2.0.0/String.html#method-i-scan

a = "cruel world"
a.scan(/\w+/)        #=> ["cruel", "world"]

This doesn't seem to be possible with re2 at the moment; match only returns the first match:

r = RE2::Regexp.new('(\w+)')
r.match('cruel world')   #=> #<RE2::MatchData "cruel" 1:"cruel">

I asked @mudge about this. He added:

While the underlying re2 library does seem to support this via its
"Consume" and "FindAndConsume" operations (cf.
https://code.google.com/p/re2/source/browse/re2/re2.h#116), I haven't
added support for this in the gem

Answer 1 · 2013-09-11T13:21:04.000Z

https://code.google.com/p/re2/source/browse/re2/re2.h#116

// SCANNING TEXT INCREMENTALLY
//
// The "Consume" operation may be useful if you want to repeatedly
// match regular expressions at the front of a string and skip over
// them as they match.  This requires use of the "StringPiece" type,
// which represents a sub-range of a real string.
//
// Example: read lines of the form "var = value" from a string.
//      string contents = ...;          // Fill string somehow
//      StringPiece input(contents);    // Wrap a StringPiece around it
//
//      string var;
//      int value;
//      while (RE2::Consume(&input, "(\\w+) = (\\d+)\n", &var, &value)) {
//        ...;
//      }
//
// Each successful call to "Consume" will set "var/value", and also
// advance "input" so it points past the matched text.  Note that if the
// regular expression matches an empty string, input will advance
// by 0 bytes.  If the regular expression being used might match
// an empty string, the loop body must check for this case and either
// advance the string or break out of the loop.
//
// The "FindAndConsume" operation is similar to "Consume" but does not
// anchor your match at the beginning of the string.  For example, you
// could extract all words from a string by repeatedly calling
//     RE2::FindAndConsume(&input, "(\\w+)", &word)
//
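
For readers who think in Ruby rather than C++, the two operations described in the header comment above have rough analogues in the standard library's StringScanner: scan is anchored at the current position (like "Consume") and scan_until searches forward (like "FindAndConsume"). The following is only a sketch of the idea using StringScanner, not the re2 gem or the RE2 C++ API. The method names read_assignments, extract_words and extract_digit_runs are hypothetical. The last method shows the empty-match guard the comment warns about.

```ruby
require "strscan"

# Anchored, like RE2::Consume: every match must begin exactly where the
# previous one ended, so scanning stops at the first non-matching line.
def read_assignments(contents)
  scanner = StringScanner.new(contents)
  pairs = []
  while scanner.scan(/(\w+) = (\d+)\n/)
    pairs << [scanner[1], scanner[2].to_i]
  end
  pairs
end

# Unanchored, like RE2::FindAndConsume: skip ahead to the next match.
def extract_words(contents)
  scanner = StringScanner.new(contents)
  words = []
  while scanner.scan_until(/(\w+)/)
    words << scanner[1]
  end
  words
end

# A pattern that can match the empty string needs an explicit guard,
# otherwise the position never advances and the loop never ends.
def extract_digit_runs(contents)
  scanner = StringScanner.new(contents)
  runs = []
  until scanner.eos?
    run = scanner.scan(/\d*/)
    if run.empty?
      scanner.getch # zero-width match: step forward one character manually
    else
      runs << run
    end
  end
  runs
end

read_assignments("x = 1\ny = 22\n") #=> [["x", 1], ["y", 22]]
extract_words("cruel world")        #=> ["cruel", "world"]
extract_digit_runs("a12b345")       #=> ["12", "345"]
```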

Answer 2 · 2013-09-11T15:20:53.000Z

See here for an example that uses Consume: https://github.com/yunabe/practice/blob/master/google/re2_example.cc#L32

Answer 3 · 2013-09-11T17:28:08.000Z

I added some tests: https://github.com/mattes/re2/blob/master/spec/re2/regexp_spec.rb#L250 and tried to implement it: https://github.com/mattes/re2/blob/master/ext/re2/re2.cc#L906 (it's not working yet).

Answer 4 · 2013-09-12T08:04:55.000Z

I was thinking that perhaps we should have an interface that returns a lazy enumerator: in that way, consumption would be memory-efficient and it would be possible to return matches for potentially huge strings.

e.g.

# Would enumerate over every single match found, executing the block as it goes.
RE2('<li>(.*?)</li>').match_all(some_huge_string) do |match|
  puts match
end

# Would return an enumerator with only the first match populated; calling
# next would move it along.
matches = RE2('<li>(.*?)</li>').match_all(some_huge_string)
matches.next #=> First match
matches.next #=> Second match
matches.next #=> raises StopIteration when finished.
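
To make the proposal concrete, here is a minimal sketch of the same lazy behaviour using Ruby's built-in Regexp and Enumerator rather than re2. The helper name lazy_matches is hypothetical and is not part of the gem. Each call to next searches only as far as the following match, so no array of all matches is ever built.

```ruby
# Returns an Enumerator that finds one match per call to #next.
# Assumption: uses Ruby's own Regexp engine, not re2.
def lazy_matches(pattern, string)
  Enumerator.new do |yielder|
    position = 0
    while position <= string.length && (match = pattern.match(string, position))
      yielder << match
      # Step past zero-width matches so the search always advances.
      position = match.end(0) > match.begin(0) ? match.end(0) : match.end(0) + 1
    end
  end
end

matches = lazy_matches(/<li>(.*?)<\/li>/, "<li>a</li><li>b</li>")
matches.next[1] #=> "a"
matches.next[1] #=> "b"
# A third call to matches.next raises StopIteration.
```

Because the result is an ordinary Enumerator, it also works with each, map and the other Enumerable methods when a block-style loop is preferred.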

Answer 5 · 2013-09-12T17:06:20.000Z

+1 for lazy enumerators

Answer 6 · 2013-09-15T13:36:57.000Z

I've pushed a new prerelease version of the gem with encoding-awareness and a new method, consume. Install it with gem install re2 --pre.

You can use it like so:

require "re2"
re = RE2('(\w+)')
consumer = re.consume("Foo bar baz quux")
consumer.each do |matches|
  # ...
end

# Alternatively...
consumer.rewind # to reset the state of the consumer

enum = consumer.to_enum
enum.next #=> ["Foo"]
enum.next #=> ["bar"]

Please give it a go and let me know if that meets your needs.

Answer 7 · 2013-09-16T10:40:31.000Z

awesome awesome. will try it asap.

Answer 8 · 2014-02-01T21:13:10.000Z

Just revisiting this: did you get a chance to try out the prerelease gem?

Answer 9 · 2014-02-01T23:04:50.000Z

Fixed as of v0.6.0 with RE2::Regexp#scan which returns a RE2::Scanner:

scanner = RE2('(\w+)').scan("Some long list of words")
scanner.scan #=> ["Some"]
scanner.scan #=> ["long"]
scanner.scan #=> ["list"]