BurntSushi/aho-corasick

Async support

artem-tim opened this issue · 3 comments

I would very much like to see this exceptionally well made crate to provide low level building blocks to allow consumers to use AhoCorasick automaton in the async context.

I do wonder if this is something that is out of scope of the goals of the current crate. However this feature is something that we currently need, and surely there are others.

It is currently not easy to use the publicly available automatons without writing chunk handling logic in order to use with async processes, and this logic needs to be normalized somewhere - it could be this crate, or it could be an associate crate. I do think that it is low level enough to make it available directly here - it is async glue, after all.

The current de facto standard for Async Rust is given foundation by the futures-rs crate. As such, in order to define an interface large enough to be usable in many cases, I figured an implementation of AsyncRead and AsyncWrite is in a good spot - it seems easily usable in various scenarios, and low level enough to enable working with user-defined buffers of custom size and custom implementations of polling loops.
Of course, the interface is up to debate.

As such, I've opened a tentative PR implementing such constructs behind an optional feature : #132

The PR has a more detailed description of motivation and unit testing, but I'll cross post the most important interface part here :

If async feature is explicitly enabled, 3 new public methods will become available to AhoCorasick struct (documented with examples) :

  • async_reader : Produces an AhoCorasickAsyncReader which implements AsyncRead trait, wrapping user-provided AsyncRead source, and yielding chunks with replaced bytes when polling from it
  • async_writer : Produces an AhoCorasickAsyncWriter which implements AsyncWrite trait, wrapping user-provided AsyncWrite sink, and similarily writing to it will write replaced chunks to the original sink.
  • try_async_stream_replace_all : A mostly standalone helper function which is somewhat trivial to implement using either async_reader or async_writer, however I though it would be convenient to have as an async alternative to the existing try_stream_replace_all method.

Thanks for putting this together. Clearly a lot of work went into it. As I said on the PR, I don't see myself accepting any change like this in the immediate future. Let me try to lay out why. It's actually not necessarily about scope, but the reasons are varied.

Firstly and most primarily, the executor independent async ecosystem is just simply not mature enough to feature in the public API of this crate. The giant red flag that should stand out to you here is that you've added a public dependency that's at version 0.3.x to a crate that's at version 1.x.y. If and when futures releases a semver breaking change, I'll be forced to do it as well, optional feature or not. I spent many many many years getting this crate to 1.x.y and I absolutely do not want to be beholden to someone else as to when I have to release a 2.x.y. A way around this would be to switch to something more mature, such as tokio (which is at 1.x.y), but then you have a low level library that's fixed to a specific executor. That's also bad and not something I'm interested in. So bottom line here is that the executor independent async ecosystem really needs to evolve first before I'd consider any kind of async support in a library like this. If this library were still in its own experimental stages, my tune might change.

Secondly, I quickly reviewed the async logic in your PR and it's incredibly thorny. I've spent enough time on the stream implementations for std::io::Read to know that the bugs in this area of the code are quite subtle and difficult to debug and fix. I've been bitten by numerous bugs in the stream code and I'm still not happy with its current state. The async code looks even more subtle. This brings us to something you said in your PR that I really disagree with at a very fundamental level (emphasis mine):

you would not be opposed to the contribution hidden behind an entirely optional feature so it would do no harm.

The problem here is that your patch is not just some simple async adapter. As you say, there is complicated glue logic here to figure out. You did the initial work to make it, but accepting the PR means I have to assume the maintenance burden of the code and fix bugs when people report them. If the async streaming code is anything like the synchronous streaming code, I anticipate long hours of debugging would be ahead of me to support your use case. I really just do not have that kind of time and therefore cannot accept the PR in its current state. (This hurdle could in theory be cleared by coming up with better abstractions that make reasoning about stream searching easier. And that's on my longer term plan, no time soon.)

The third reason is that I am very unsure of the API design in this space. With this crate at 1.x.y and the executor independent async ecosystem in a still very nascent state, there is an impedance mismatch. Basically, accepting this change opens me up to significant risk that the API is wrong in some subtle way that is difficult to know because of the ecosystem immaturity. The impedance mismatch comes from inserting this into a crate that I believe has a mature API.

And lastly, I am unclear on the motivation for something like this. You mention that it is required, but it's not clear to me that it is. For example, if this is a CPU bound task, then aren't such things typically handled by dividing that work up in a thread pool where synchronous I/O can happen?

I tried to order the above reasons roughly in order of importance. So for example, I'd expect you might be able to clear the hurdles laid down by the third and fourth reasons. The second reason might also be clearable, but challenging. However, the first reason I think is just not something that can be cleared right now.

Your points make a whole lot of sense, and if I'd thought of this I wouldn't have opened this PR. Especially the pre-1.0 dependencies version compatibility that I understand may become hell for the big crates maintainers.

Just to clarify why we require this : our team is working on a WASM project using wasm-bindgen crate, which is to be used in a web context (browser or web workers) - an environment where everything is async by default, from the entrypoint to the application (and single-threaded on top of that) - pretty rigid. So we are unable to use sync version even if we wanted to.

In conclusion, I will keep my fork with this implementation (improving on it based on your feedback and our testing), aside from your project waiting for async ecosystem to mature, and anyone reading this is of course also welcome to use it/improve on it - until there is a more official solution. Until then, it has been very informative, again thank you so much for your time and your review !

Great, thanks for your understanding. And good luck. :-)