markschl/seq_io

Reading gz files?

xyl012 opened this issue · 3 comments

Hi Mark and thanks for the great package,
I chose seq_io for its speed and was trying to read a gzipped FASTQ file when I found the pull request and the notes. Originally, I had:
let fq = File::open(fastq_path).expect("Could not open Fastq");
let fq = flate2::read::GzDecoder::new(fq).into_inner();
seq_io::fastq::Reader::new(fq)
to read a fastq.gz. This compiled and I thought great! I tested the application, and it keeps giving me: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidStart { found: 203, pos: ErrorPosition { line: 1, id: None } }', src/main.rs:60:2

I unzipped the same test file and the same code runs fine. The only thing I added is a simple gz check, basically: if matches!(ext.unwrap(), "gz") {}. I'm not sure what's going on; I even made sure I added lto=true. It seems to be reading the file, but incorrectly, so I tried using buf_redux like in the pull request. My head hurts so I'll come back to it later, but this is a really exciting prospect! This is the fastest reader, so reading from gz would be phenomenal. Any ideas on what the problem could be? Cheers

Hi! You should be able to pass the GzDecoder to fastq::Reader directly, without calling into_inner after new. into_inner consumes (destroys) the GzDecoder you just created and returns the original File. If you supply that File to fastq::Reader, it will try to parse the compressed gzip bytes as FASTQ, which of course will fail.
I'd like to add a GzDecoder example to the documentation, where not only the file extension is checked, but also whether the file starts with 1F 8B (the gzip magic number). But right now I'm too busy...
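One way the magic number check could look, as a rough sketch using only safe standard-library calls. The name starts_with_gzip_magic is made up for illustration and is not part of seq_io or flate2:

```rust
use std::io::{self, BufRead};

/// The two-byte gzip magic number (RFC 1952): 0x1F followed by 0x8B.
const GZIP_MAGIC: [u8; 2] = [0x1f, 0x8b];

/// Hypothetical helper: peeks at the buffered input and reports whether it
/// begins with the gzip magic number. Nothing is consumed, so the same
/// reader can be handed on to a decoder or parser afterwards.
fn starts_with_gzip_magic<R: BufRead>(reader: &mut R) -> io::Result<bool> {
    // fill_buf returns the currently buffered bytes without advancing.
    // Assumption: at least two bytes arrive in the first fill, which holds
    // for files and in-memory data but not necessarily for every stream.
    let buf = reader.fill_buf()?;
    Ok(buf.starts_with(&GZIP_MAGIC))
}
```

With a file, this could be called on a BufReader wrapping the File; depending on the result, that BufReader would then be wrapped in a GzDecoder or passed to fastq::Reader as-is.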

Thanks for the quick reply, it's compiiiiiled! That makes complete sense, it was a d'oh moment for me after I looked over it again. I think I can add some documentation since I'm going through it anyway, and I'll see what I can do about the extension/magic number checker.

Thanks :) However if you want to do a PR, it would make sense to use the master branch (which is a rewrite of v0.3, the current stable version), since I reordered the whole module structure and documentation. You can also paste some code here.

This code from the fastq-rs crate could help as a reference. I'm not sure whether it could also be done without unsafe code nowadays.