This is a test of various methods of buffering input when reading files into a lexical analyzer.
Methods tested:
- Direct read from stream, one byte at a time
- Buffered read from stream, 1kb at a time
- Buffered read from stream, 1mb at a time
- Memory mapped read
- Read all to string (utf-8 validated)
- Read all to byte vector
Each method was tested on a ~300kb source file, and a ~3mb source file.
Two test files were generated using regex generators. The content of these files looks like this:
jvvxrTLUJNA7r_3k2r3VhYqrKcj9hwcKaZgI9uS19w7nE8J0kdh0HVjTv2XwUC5Zx1hkaQfEW_92DjFvQCpbtEz1u
S2BQBBcCT4MM_ghekImB8
8894
I_1i5U9TqOA7Jea7l1TvajwB4OuykRWtotUSGlFTwovHOKfaigrDC_R
668579675
Variably repeating for 10k and 110k lines, respectively. A generic lexical analyzer is constructed using a single byte of lookahead:
pub struct Lexer<I: Iterator<Item = u8>> {
inner: Peekable<I>
}
The lexer discards whitespace, and produces a simple token sum type:
#[derive(Debug)]
pub enum Token {
Identifier(String),
Number(i64)
}
Tokens are generated via an Iterator implementation:
impl<I: Iterator<Item = u8>> Iterator for Lexer<I> {
type Item = Token;
fn next (&mut self) -> Option<Token> {
if let Some(&ch) = self.inner.peek() {
match ch {
x if x.is_ascii_alphabetic() || x == b'_' => {
let mut s = vec![x];
self.inner.next();
while let Some(&ch) = self.inner.peek() {
if ch.is_ascii_alphanumeric()
|| ch == b'_' {
s.push(ch);
self.inner.next();
} else {
break
}
}
Some(Token::Identifier(unsafe { String::from_utf8_unchecked(s) }))
}
x if x.is_ascii_digit() => {
let mut s = vec![x];
self.inner.next();
while let Some(&ch) = self.inner.peek() {
if ch.is_ascii_digit() {
s.push(ch);
self.inner.next();
} else {
break
}
}
Some(Token::Number(unsafe { String::from_utf8_unchecked(s) }.parse().unwrap()))
}
x if x.is_ascii_whitespace() => {
self.inner.next();
self.next()
}
_ => None
}
} else {
None
}
}
}
For each kind of test, where necessary an intermediate iterator is constructed:
This is simply a wrapper to convert an io::Read
into an Iterator
pub struct ByteReader<R: Read> {
inner: R
}
impl<R: Read> ByteReader<R> {
pub fn new (inner: R) -> Self {
Self {
inner
}
}
}
impl<R: Read> Iterator for ByteReader<R> {
type Item = u8;
fn next (&mut self) -> Option<u8> {
let mut buf = [0u8; 1];
match self.inner.read_exact(&mut buf) {
Ok(_) => Some(unsafe { *buf.get_unchecked(0) }),
Err(_) => None
}
}
}
This takes a reference to a mutable slice, and refills it with bytes on demand
pub struct BufferedReader<'b, R: Read> {
inner: R,
buffer: &'b mut [u8],
offset: usize,
remainder: usize,
}
impl<'b, R: Read> BufferedReader<'b, R> {
pub fn new (inner: R, buffer: &'b mut [u8]) -> Self {
Self {
inner,
buffer,
offset: 0,
remainder: 0,
}
}
fn refill_buffer (&mut self) -> Option<()> {
match self.inner.read(self.buffer.as_mut()) {
Err(_) => { None }
Ok(n) if n == 0 => { None }
Ok(n) => { self.offset = 0; self.remainder = n; Some(()) }
}
}
}
impl<'b, R: Read> Iterator for BufferedReader<'b, R> {
type Item = u8;
fn next (&mut self) -> Option<u8> {
if self.offset == self.remainder {
self.refill_buffer()?;
}
let offset = self.offset;
self.offset += 1;
Some(unsafe { *self.buffer.get_unchecked(offset) })
}
}
This calls mmap
on a file to generate a simple memory mapped view of it, then simply increments the base pointer. This will work on Unix systems, but an alternative implementation would be required for Windows
pub struct MMapReader<'f> {
base: *const u8,
len: usize,
ptr: *const u8,
end: *const u8,
f: PhantomData<&'f mut File>
}
impl<'f> MMapReader<'f> {
pub fn new (file: &'f mut File) -> Self {
use std::os::unix::io::AsRawFd;
let fd = file.as_raw_fd();
let len = file.metadata().unwrap().len() as usize;
unsafe {
let ptr = libc::mmap(
ptr::null_mut(),
len,
libc::PROT_READ,
libc::MAP_SHARED,
fd,
0
) as *const u8;
let end = ptr.add(len);
Self {
base: ptr, len,
ptr, end,
f: PhantomData
}
}
}
}
impl<'f> Drop for MMapReader<'f> {
fn drop (&mut self) {
unsafe { libc::munmap(self.base as *mut _, self.len); }
}
}
impl<'f> Iterator for MMapReader<'f> {
type Item = u8;
fn next (&mut self) -> Option<u8> {
if self.ptr < self.end {
unsafe {
let out = *self.ptr;
self.ptr = self.ptr.add(1);
Some(out)
}
} else {
None
}
}
}
Results are sorted from slowest to fastest
Method | Average time (ns) | Max deviation |
---|---|---|
byte_reader | 81,016,095 | 3,666,607 |
buffered_reader_1kb | 3,048,091 | 196,842 |
buffered_reader_1mb | 2,885,936 | 277,377 |
read_to_vec | 2,664,660 | 82,730 |
read_to_string | 2,663,513 | 28,455 |
mmap | 2,577,370 | 36,680 |
Method | Average time (ns) | Max deviation |
---|---|---|
byte_reader | 884,473,603 | 24,798,573 |
buffered_reader_1kb | 36,473,766 | 1,860,908 |
buffered_reader_1mb | 35,053,802 | 1,157,109 |
read_to_string | 32,273,945 | 465,271 |
read_to_vec | 32,116,772 | 603,392 |
mmap | 30,200,108 | 569,450 |
As with any benchmark, you should take all of this with a grain of salt and perform more focused, realistic tests when you are able
-
As long as you are using some kind of buffering for your input stream, you're doing okay. Reading one byte a time simply isn't feasible and the kernel won't save you here
-
The best bang-for-buck here is reading the entire input into a string or vector, before lexing. This has the least complexity of any solution, and second best performance. Better performance could be extracted by reusing the buffer for each subsequent reading operation
-
mmap
has great performance and stability, but it will require building a multi-platform abstraction -
Comparing
read_to_string
andread_to_vec
demonstrates that performing utf-8 validation on the data once it is in memory is extremely trivial, and has less impact on the final numbers than basic noise and initial machine state