Proposal: Stream compatible RegExp implementation

Question

jamestalmage opened this issue 9 years ago · 1 comment

I think it would be cool to take the parser and AST you have created and use them to build a RegExp implementation that is compatible with Node.js Streams.

API would go something like this:

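// Proposed API: createInputStream and StreamRegex do not exist yet.
var input = createInputStream('hello how are you');
var streamRegex = new StreamRegex('\\w+');
var arr = [];
input.pipe(streamRegex.match())
  .on('data', function (chunk) {
    // Each chunk is one match.
    arr.push(chunk.toString('utf8'));
  })
  .on('end', function () {
    // 'data' events fire asynchronously, so log only after the stream ends.
    console.log(arr);
    // ['hello', 'how', 'are', 'you']
  });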
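
One way the proposed match() could be wired up is sketched below. This is only an illustration under assumptions: ChunkMatcher and createMatchStream are hypothetical names, the code uses the built-in RegExp rather than this project's parser and AST, and it decodes chunks to strings and keeps unmatched text in memory, so it does not satisfy the memory or no-copy goals of this proposal. It treats any match that touches the end of the buffered text as possibly incomplete and holds it back until more data arrives, which is a heuristic that works for patterns like \w+ but not for every pattern.

```javascript
const { Transform } = require('stream');
const { StringDecoder } = require('string_decoder');

// Hypothetical helper, not part of any existing library.
class ChunkMatcher {
  constructor(source) {
    this.re = new RegExp(source, 'g');
    this.tail = ''; // text after the last emitted match, carried to the next chunk
  }

  // Feed one decoded chunk; return the matches that are known to be complete.
  push(text) {
    const data = this.tail + text;
    const out = [];
    let consumed = 0;
    let m;
    this.re.lastIndex = 0;
    while ((m = this.re.exec(data)) !== null) {
      if (m[0] === '') { this.re.lastIndex++; continue; } // skip empty matches
      // A match that reaches the end of the data may continue in the next chunk.
      if (m.index + m[0].length === data.length) break;
      out.push(m[0]);
      consumed = m.index + m[0].length;
    }
    this.tail = data.slice(consumed);
    return out;
  }

  // Call once the input has ended: whatever is left can now be matched safely.
  flush() {
    const data = this.tail;
    this.tail = '';
    return (data.match(this.re) || []).filter((s) => s !== '');
  }
}

// Hypothetical stand-in for streamRegex.match(): emits one match per 'data' event.
function createMatchStream(source) {
  const matcher = new ChunkMatcher(source);
  const decoder = new StringDecoder('utf8'); // handles multi-byte chars split across chunks
  return new Transform({
    readableObjectMode: true, // keep each match as a separate chunk
    transform(chunk, _encoding, callback) {
      for (const match of matcher.push(decoder.write(chunk))) this.push(match);
      callback();
    },
    flush(callback) {
      const rest = matcher.push(decoder.end()).concat(matcher.flush());
      for (const match of rest) this.push(match);
      callback();
    },
  });
}
```

With this sketch, input.pipe(createMatchStream('\\w+')) would play the role of input.pipe(streamRegex.match()) in the example above.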

Goals / Ideas:

1. Equivalents for match, test, split, and replace.
2. Work with very large inputs (i.e. larger than available memory). This would be the key advantage of using a Stream-based version over the default.
3. No copying of buffer data; use Buffer.concat() and buf.slice().
4. Work with multiple encodings.
5. Be fast.
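
To make the large-input and no-copy goals more concrete, here is a rough sketch of a scanner that works directly on Buffer chunks. Everything in it is an assumption for illustration: WordScanner and isWordByte are hypothetical names, and the pattern is hard-coded to an ASCII-only \w+ rather than compiled from a RegExp AST. It returns views into the incoming chunks via buf.subarray() (which, like buf.slice(), shares memory with the original Buffer) and only falls back to Buffer.concat(), which does copy, when a match spans a chunk boundary.

```javascript
// Hypothetical sketch: byte-level scanner for ASCII \w+ over Buffer chunks.
function isWordByte(b) {
  return (b >= 0x30 && b <= 0x39) || // 0-9
         (b >= 0x41 && b <= 0x5a) || // A-Z
         (b >= 0x61 && b <= 0x7a) || // a-z
         b === 0x5f;                 // _
}

class WordScanner {
  constructor() {
    this.pending = []; // pieces of a word that may continue in the next chunk
  }

  // Scan one chunk and return the completed words as Buffers.
  write(chunk) {
    const out = [];
    let start = -1;
    for (let i = 0; i < chunk.length; i++) {
      if (isWordByte(chunk[i])) {
        if (start === -1) start = i;
      } else {
        if (start !== -1) {
          this.pending.push(chunk.subarray(start, i)); // a view, not a copy
          start = -1;
        }
        if (this.pending.length > 0) out.push(this._take());
      }
    }
    // A word that reaches the end of the chunk is held back for now.
    if (start !== -1) this.pending.push(chunk.subarray(start));
    return out;
  }

  // Call at end of input to release a word that was still pending.
  end() {
    return this.pending.length > 0 ? [this._take()] : [];
  }

  _take() {
    const parts = this.pending;
    this.pending = [];
    // Copy only when the word was split across chunks.
    return parts.length === 1 ? parts[0] : Buffer.concat(parts);
  }
}
```

Memory use in this sketch is bounded by the longest single match plus the current chunk, not by the total input size. A general engine would need the same property for arbitrary patterns, which is the hard part.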

I've searched but have not found anything that operates this way. I did find this, but it converts the buffers to strings and concatenates them (violating goals 2 and 3 above).

Obviously this would be a separate project from this one, but it could certainly share the parser and AST at a minimum (and likely more). I may try implementing it myself, but it would be nice to have buy-in and input from the contributors here, especially if I end up wanting to refactor some of the code here to facilitate reuse in my project. Help from experts on the problem domain would certainly be welcome.

I think it could be pretty powerful. Thoughts?

Answer 1 · 2015-08-05T12:59:54.000Z

> Obviously this would be a separate project from this one,

I agree: creating a stream-based RegExp engine is out of scope for this project, so I am going to close this issue.

> Work with very large inputs (i.e. larger than available memory)

This sounds like a good idea at first, but note that it is trivial to write a RegExp whose match might span the entire stream input, such as /.+TheEnd$/. Designing a streaming RegExp engine might therefore require restricting the expressiveness of the RegExp language.
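
A small illustration of the problem, using the built-in RegExp and made-up sample chunks: no individual chunk matches /.+TheEnd$/, yet the concatenated input matches in full, so an engine cannot emit or discard anything until the stream has ended.

```javascript
const re = /.+TheEnd$/;

// Sample chunks, invented for this illustration.
const chunks = ['some data ', 'more data ', 'TheEnd'];

// Testing each chunk in isolation finds nothing, not even in the last chunk,
// because .+ needs at least one character before "TheEnd".
const perChunk = chunks.map((chunk) => re.test(chunk));

// Only the fully buffered input matches, and the match covers all of it.
const whole = chunks.join('');
const match = re.exec(whole);
const coversEverything = match !== null && match[0].length === whole.length;

console.log(perChunk, coversEverything);
```

Because .+ is greedy and $ anchors to the end of the input, every byte seen so far has to stay available until the last chunk arrives.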