BurntSushi/aho-corasick

AhoCorasick::stream_find_iter may fail if the needle is split between buffer.

ogoffart opened this issue · 3 comments

This is a regression from aho_corasick 0.7.4 to 0.7.5. And the bug still happen in the last tested version 0.7.13

Here is a test-case to reproduce:

use std::io::Read;

// NOTE: The test only fails if this ends with j
pub const MAGIC: [u8; 5] = *b"1234j";


// NOTE: The test fails for value in 8188..=8191
// These value put the string to search accross two call to read because the buffer size is 8192 by default
pub const BEGIN: usize = 8191;


#[derive(Default)]
/// This is just a structure that implements Reader.
/// The reader implementation will simulate a file filled with 0, except for the MAGIC string at offset BEGIN
struct R {
    read: usize,
}

impl Read for R {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        //dbg!(buf.len());
        if self.read > 100000 {
            return Ok(0);
        }
        let mut from = 0;
        if self.read < BEGIN {
            from = buf.len().min(BEGIN - self.read);
            for x in 0..from {
                buf[x] = 0;
            }
            self.read += from;
        }
        if self.read >= BEGIN && self.read <= BEGIN + MAGIC.len() {
            let to = buf.len().min(BEGIN + MAGIC.len() - self.read + from);
            if to > from {
                buf[from..to]
                    .copy_from_slice(&MAGIC[self.read - BEGIN..self.read - BEGIN + to - from]);
                self.read += to - from;
                from = to;
            }
        }
        for x in from..buf.len() {
            buf[x] = 0;
            self.read += 1;
        }
        Ok(buf.len())
    }
}

fn main() -> std::io::Result<()> {
    let aut = aho_corasick::AhoCorasick::new(&[&MAGIC]);

    // While reading from a vector, it works:
    let mut buf = vec![];
    R::default().read_to_end(&mut buf)?;
    let from_whole = aut.find_iter(&buf).next().unwrap().start();

    //But using stream_find_iter fails!
    let mut file = R::default();
    let begin = aut
        .stream_find_iter(&mut file)
        .next()
        .expect("NOT FOUND!!!!")? // Panic here
        .start();
    assert_eq!(from_whole, begin);
    Ok(())
}

Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=45d7da0d4705ecafd8fd7804676bbfc8

(Was found because the cpp_macro crate use stream_find_iter to find the metadata in a binary file, and this randomly fails)

I've tried to investigate the problem. It appear that for the specific pattern in the testcase, the prefilter used is RareBytesOne { byte1: 106, offset: RareByteOffset { max: 4 } })
What happens is that calling earliest_find_at from ahocorasick::StreamChunkIter::next will not find that byte in the buffer and will therefore skip everything. Then buf.roll will be called to rotate the buffer and read from the Reader. But the next call to earliest_find_at will only search from the newly read data (despite the match is still in the buffer)
RareBytesOne::next_candidate will find the rare byte, but will then discard the match because its start position is earlier than the at argument.

I've tried replacing this line

cmp::max(at, pos.saturating_sub(self.offset.max as usize))

with

pos.saturating_sub(self.offset.max as usize)

so that RareBytesOne::next_candidate can return a match smaller than at, also adjusting the call to update_skipped_bytes to handle negative values. And this solve the testcase.
However it causes some other test to run into infinite loops, so that's not quite right.

@ogoffart Thanks so much for that investigation! That was enough to tell me where the problem was and helped me craft a fix. See #67 for a bit more detail, but the short story is that I've disabled the prefilter in this case unfortunately. Fixing stream searching to work with the rare byte prefilter is delicate and I don't trust my test coverage enough to attempt that.

A fix for this is on crates.io in aho-corasick 0.7.14.