AhoCorasick::stream_find_iter may fail if the needle is split between buffer.
ogoffart opened this issue · 3 comments
This is a regression from aho_corasick 0.7.4 to 0.7.5. And the bug still happen in the last tested version 0.7.13
Here is a test-case to reproduce:
use std::io::Read;
// NOTE: The test only fails if this ends with j
pub const MAGIC: [u8; 5] = *b"1234j";
// NOTE: The test fails for value in 8188..=8191
// These value put the string to search accross two call to read because the buffer size is 8192 by default
pub const BEGIN: usize = 8191;
#[derive(Default)]
/// This is just a structure that implements Reader.
/// The reader implementation will simulate a file filled with 0, except for the MAGIC string at offset BEGIN
struct R {
read: usize,
}
impl Read for R {
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
//dbg!(buf.len());
if self.read > 100000 {
return Ok(0);
}
let mut from = 0;
if self.read < BEGIN {
from = buf.len().min(BEGIN - self.read);
for x in 0..from {
buf[x] = 0;
}
self.read += from;
}
if self.read >= BEGIN && self.read <= BEGIN + MAGIC.len() {
let to = buf.len().min(BEGIN + MAGIC.len() - self.read + from);
if to > from {
buf[from..to]
.copy_from_slice(&MAGIC[self.read - BEGIN..self.read - BEGIN + to - from]);
self.read += to - from;
from = to;
}
}
for x in from..buf.len() {
buf[x] = 0;
self.read += 1;
}
Ok(buf.len())
}
}
fn main() -> std::io::Result<()> {
let aut = aho_corasick::AhoCorasick::new(&[&MAGIC]);
// While reading from a vector, it works:
let mut buf = vec![];
R::default().read_to_end(&mut buf)?;
let from_whole = aut.find_iter(&buf).next().unwrap().start();
//But using stream_find_iter fails!
let mut file = R::default();
let begin = aut
.stream_find_iter(&mut file)
.next()
.expect("NOT FOUND!!!!")? // Panic here
.start();
assert_eq!(from_whole, begin);
Ok(())
}
Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=45d7da0d4705ecafd8fd7804676bbfc8
(Was found because the cpp_macro
crate use stream_find_iter to find the metadata in a binary file, and this randomly fails)
I've tried to investigate the problem. It appear that for the specific pattern in the testcase, the prefilter used is RareBytesOne { byte1: 106, offset: RareByteOffset { max: 4 } })
What happens is that calling earliest_find_at
from ahocorasick::StreamChunkIter::next
will not find that byte in the buffer and will therefore skip everything. Then buf.roll
will be called to rotate the buffer and read from the Reader. But the next call to earliest_find_at
will only search from the newly read data (despite the match is still in the buffer)
RareBytesOne::next_candidate
will find the rare byte, but will then discard the match because its start position is earlier than the at
argument.
I've tried replacing this line
Line 654 in 39b029b
with
pos.saturating_sub(self.offset.max as usize)
so that RareBytesOne::next_candidate can return a match smaller than at
, also adjusting the call to update_skipped_bytes
to handle negative values. And this solve the testcase.
However it causes some other test to run into infinite loops, so that's not quite right.
@ogoffart Thanks so much for that investigation! That was enough to tell me where the problem was and helped me craft a fix. See #67 for a bit more detail, but the short story is that I've disabled the prefilter in this case unfortunately. Fixing stream searching to work with the rare byte prefilter is delicate and I don't trust my test coverage enough to attempt that.
A fix for this is on crates.io in aho-corasick 0.7.14
.