something like ruby scan
mattes opened this issue · 9 comments
Something like this ...
http://ruby-doc.org/core-2.0.0/String.html#method-i-scan
a = "cruel world"
a.scan(/\w+/) #=> ["cruel", "world"]
seems not to be possible with re2 at the moment:
r = RE2::Regexp.new('(\w+)')
r.match('cruel world') #=> #<RE2::MatchData "cruel" 1:"cruel">
I asked @mudge about this. He added:
While the underlying re2 library does seem to support this via its
"Consume" and "FindAndConsume" operations (c.f.
https://code.google.com/p/re2/source/browse/re2/re2.h#116), I haven't
added support for this in the gem
https://code.google.com/p/re2/source/browse/re2/re2.h#116
// SCANNING TEXT INCREMENTALLY
//
// The "Consume" operation may be useful if you want to repeatedly
// match regular expressions at the front of a string and skip over
// them as they match. This requires use of the "StringPiece" type,
// which represents a sub-range of a real string.
//
// Example: read lines of the form "var = value" from a string.
// string contents = ...; // Fill string somehow
// StringPiece input(contents); // Wrap a StringPiece around it
//
// string var;
// int value;
// while (RE2::Consume(&input, "(\\w+) = (\\d+)\n", &var, &value)) {
// ...;
// }
//
// Each successful call to "Consume" will set "var/value", and also
// advance "input" so it points past the matched text. Note that if the
// regular expression matches an empty string, input will advance
// by 0 bytes. If the regular expression being used might match
// an empty string, the loop body must check for this case and either
// advance the string or break out of the loop.
//
// The "FindAndConsume" operation is similar to "Consume" but does not
// anchor your match at the beginning of the string. For example, you
// could extract all words from a string by repeatedly calling
// RE2::FindAndConsume(&input, "(\\w+)", &word)
//
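For readers who think in Ruby, the anchored versus unanchored distinction described in that header comment can be sketched with the standard library's StringScanner. This is only an analogy and does not use re2 or the gem: `StringScanner#scan` matches at the current position (like "Consume") and `StringScanner#scan_until` searches forward (like "FindAndConsume"). The helper names `consume_pairs` and `find_and_consume` are hypothetical, made up for this illustration.

```ruby
require "strscan"

# Anchored scanning, analogous to "Consume": every match must begin
# exactly where the previous one ended.
def consume_pairs(text)
  scanner = StringScanner.new(text)
  pairs = []
  while scanner.scan(/(\w+) = (\d+)\n/)
    pairs << [scanner[1], scanner[2].to_i]
  end
  pairs
end

# Unanchored scanning, analogous to "FindAndConsume": skip ahead to the
# next match wherever it starts.
def find_and_consume(text, pattern)
  scanner = StringScanner.new(text)
  found = []
  while scanner.scan_until(pattern)
    found << scanner.matched
    # An empty match does not advance the scanner, so step forward by
    # hand (or stop at the end) to avoid looping forever.
    if scanner.matched_size.zero?
      break if scanner.eos?
      scanner.getch
    end
  end
  found
end

p consume_pairs("a = 1\nb = 2\nrest")
p find_and_consume("cruel world", /\w+/)
```

The empty-match guard mirrors the warning in the comment above: the loop body has to advance the input itself when the pattern matches zero characters.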
See here for a Consume implementation: https://github.com/yunabe/practice/blob/master/google/re2_example.cc#L32
I added some tests: https://github.com/mattes/re2/blob/master/spec/re2/regexp_spec.rb#L250 and tried to implement it: https://github.com/mattes/re2/blob/master/ext/re2/re2.cc#L906 (it's not working yet).
I was thinking that perhaps we should have an interface that returns a lazy enumerator: in that way, consumption would be memory-efficient and it would be possible to return matches for potentially huge strings.
e.g.
# Would enumerate over every single match found, executing the block as it goes.
RE2('<li>(.*?)</li>').match_all(some_huge_string) do |match|
  puts match
end
# Would return an enumerator with only the first match populated, using
# next would move it along.
matches = RE2('<li>(.*?)</li>').match_all(some_huge_string)
matches.next #=> First match
matches.next #=> Second match
matches.next #=> raises StopIteration when finished.
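To make the proposed behaviour concrete, here is a rough sketch of the same block-or-enumerator shape built on Ruby's core Regexp rather than re2. `match_all` above is only a proposal, and `each_match` below is a hypothetical stand-in written for this illustration, not part of the gem.

```ruby
# Hypothetical helper: yields one MatchData at a time, or returns an
# Enumerator when no block is given, so matches are produced on demand.
def each_match(pattern, text)
  return enum_for(:each_match, pattern, text) unless block_given?

  position = 0
  while position <= text.length && (match = pattern.match(text, position))
    yield match
    # Step past an empty match so the loop always makes progress.
    position = match.end(0) == match.begin(0) ? match.end(0) + 1 : match.end(0)
  end
end

matches = each_match(/<li>(.*?)<\/li>/, "<li>one</li><li>two</li>")
puts matches.next[1]
puts matches.next[1]
```

Because the enumerator only searches for the next match when `next` is called, nothing beyond the current match needs to be held in memory.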
+1 for lazy enumerators
I've pushed a new, prerelease version of the gem with encoding-awareness and a new method, consume. Install it with gem install re2 --pre.
You can use it like so:
require "re2"
re = RE2('(\w+)')
consumer = re.consume("Foo bar baz quux")
consumer.each do |matches|
  # ...
end
# Alternatively...
consumer.rewind # to reset the state of the consumer
enum = consumer.to_enum
enum.next #=> ["Foo"]
enum.next #=> ["bar"]
Please give it a go and let me know if that meets your needs.
awesome awesome. will try it asap.
Just revisiting this: did you get a chance to try out the prerelease gem?
Fixed as of v0.6.0 with RE2::Regexp#scan, which returns a RE2::Scanner:
scanner = RE2('(\w+)').scan("Some long list of words")
scanner.scan #=> ["Some"]
scanner.scan #=> ["long"]
scanner.scan #=> ["list"]