A Rust library to read from S3 objects as if they were files on a local filesystem (almost). The `S3Reader` implements both the `Read` and `Seek` traits, letting you place the cursor anywhere within an S3 object and read from any byte offset. This enables random access to the bytes of S3 objects.
Add this to your `Cargo.toml`:

```toml
[dependencies]
s3reader = "1.0.0"
```
Reading line by line with Rust's standard `BufReader`:

```rust
use std::io::{BufRead, BufReader};

use s3reader::S3Reader;
use s3reader::S3ObjectUri;

fn read_lines_manually() -> std::io::Result<()> {
    let uri = S3ObjectUri::new("s3://my-bucket/path/to/huge/file").unwrap();
    let s3obj = S3Reader::open(uri).unwrap();
    let mut reader = BufReader::new(s3obj);

    let mut line = String::new();
    let len = reader.read_line(&mut line).unwrap();
    println!("The first line >>{line}<< is {len} bytes long");

    let mut line2 = String::new();
    let len = reader.read_line(&mut line2).unwrap();
    println!("The next line >>{line2}<< is {len} bytes long");

    Ok(())
}

fn use_line_iterator() -> std::io::Result<()> {
    let uri = S3ObjectUri::new("s3://my-bucket/path/to/huge/file").unwrap();
    let s3obj = S3Reader::open(uri).unwrap();
    let reader = BufReader::new(s3obj);

    let mut count = 0;
    for line in reader.lines() {
        println!("{}", line.unwrap());
        count += 1;
    }
    println!("The file has {count} lines");

    Ok(())
}
```
Jumping to arbitrary byte offsets:

```rust
use std::io::{Read, Seek, SeekFrom};

use s3reader::S3Reader;
use s3reader::S3ObjectUri;

fn jump_within_file() -> std::io::Result<()> {
    let uri = S3ObjectUri::new("s3://my-bucket/path/to/huge/file").unwrap();
    let mut reader = S3Reader::open(uri).unwrap();

    // Seeking to the object's length and seeking to its end are equivalent.
    let len = reader.len();
    let cursor_1 = reader.seek(SeekFrom::Start(len as u64)).unwrap();
    let cursor_2 = reader.seek(SeekFrom::End(0)).unwrap();
    assert_eq!(cursor_1, cursor_2);

    // Jump to byte 10 and read the next 100 bytes.
    reader.seek(SeekFrom::Start(10)).unwrap();
    let mut buf = [0; 100];
    let bytes = reader.read(&mut buf).unwrap();
    assert_eq!(buf.len(), 100);
    assert_eq!(bytes, 100);

    Ok(())
}
```
**Does this library really provide random access to S3 objects?**

According to this StackOverflow answer, yes: S3 serves ranged `GetObject` requests, so only the requested bytes are transferred.
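To illustrate why this works, here is a ranged `GetObject` request made directly against S3. This is a hypothetical sketch, not this library's code; it assumes the `aws-config` and `aws-sdk-s3` crates plus tokio with the `macros` and `rt-multi-thread` features, and the bucket and key are placeholders.

```rust
use aws_sdk_s3::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load credentials and region from the environment.
    let config = aws_config::load_from_env().await;
    let client = Client::new(&config);

    // Request only bytes 10..=109 of the object via the HTTP `Range` header;
    // the rest of the object never travels over the wire.
    let resp = client
        .get_object()
        .bucket("my-bucket")
        .key("path/to/huge/file")
        .range("bytes=10-109")
        .send()
        .await?;

    let data = resp.body.collect().await?.into_bytes();
    println!("fetched {} bytes", data.len());
    Ok(())
}
```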
**Are the reads sync or async?**

The AWS S3 SDK uses mostly async operations, but the `Read` and `Seek` traits require sync methods. Because of this, I'm using a blocking tokio runtime to wrap the async calls. This might not be the best solution, but it works well for me. Any improvement suggestions are very welcome.
**Why is this useful?**

That depends on your use case. If you need to access random bytes in the middle of large S3 objects, this library is useful. For example, you can use it to stream MP4 files. It's also quite useful for some bioinformatics applications, where you might have a huge, multi-gigabyte reference genome but only need the data for a few genes, amounting to only a few MB.
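For the genome use case, the pattern looks like the sketch below, using only the API shown above; the bucket, key, offset, and window size are made-up placeholders:

```rust
use std::io::{Read, Seek, SeekFrom};

use s3reader::S3Reader;
use s3reader::S3ObjectUri;

fn read_gene_region() -> std::io::Result<()> {
    let uri = S3ObjectUri::new("s3://genomics-bucket/reference.fa").unwrap();
    let mut reader = S3Reader::open(uri).unwrap();

    // Jump straight to the region of interest instead of
    // downloading the preceding gigabytes.
    reader.seek(SeekFrom::Start(1_234_567_890))?;

    // Read a 4 MB window covering the genes we care about.
    let mut region = vec![0u8; 4 * 1024 * 1024];
    let n = reader.read(&mut region)?;
    println!("fetched {n} bytes from the middle of the object");
    Ok(())
}
```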