EESI/quikr

Error parsing sensing matrix, could not read header

audy opened this issue · 12 comments

audy commented

Hi, I'm getting this error: Error parsing sensing matrix, could not read header.

Commands:

quikr_train --input taxcollector.fa --output taxcollector.quikr

quikr --input gg_13_5_otus/rep_set/97_otus.fasta --sensing-matrix taxcollector.quikr.gz --output gg_97_otus_taxcollector.txt --verbose

Header of the sensing matrix:

quikr
0
194192
6
>[0]Bacteria;[1]Actinobacteria;[2]Actinobacteria;[3]Acidimicrobiales;[4]Acidimicrobiaceae;[5]Acidimicrobium;[6]Acidimicrobium_ferrooxidans;[7]Acidimicrobium_ferrooxidans;[8]Acidimicrobium_ferrooxidans_(T)|genus|3
0
1
0
0
0

Is my sensing matrix malformed? Thanks. I can send you any files you need.

Hi Audy,

That message is printed when the line the parser expects to be a header does not actually start with a '>'. If I could inspect the sensing matrix, it would be clear what the issue is.

One quick question: did you train your sensing matrix with the same version of Quikr that you ran Quikr with? (quikr -v and quikr_train -v should tell you this.)

Incompatible versions could potentially slip through the error checking without being caught. I don't think that is the case here, though, since the header looks correct to me.

Calvin

Without knowing what it looks like, I can't rule out that your input training matrix also triggered the error (though I can't think of why).

Can you test on one of the GreenGenes FASTA files provided in the data/ folder to check?

Calvin

audy commented

Works with GreenGenes file in data/

quikr --input gg_91_otus_4feb2011.fasta --sensing-matrix gg_91_otus_4feb2011.matrix.gz --output test

I'm using the same version of quikr and quikr_train (v1.0.4 for both).

Edit: Can I send you my sensing matrix? I generated it directly from quikr_train.

Okay, could you upload your FASTA file or point to a source?

audy commented

I emailed it to you @mutantturkey. It was ~15 MB, so hopefully it went through.

I think the issue is a naive approach to headers. We were statically allocating only 256 characters for the header instead of properly reading in the whole line.

I need to confirm that and then push a fix

Calvin
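
One way to check the 256-character hypothesis against a FASTA file is to list the headers that would not fit in such a buffer. This is a minimal sketch and not part of Quikr: the function name `long_headers` and the sample data are made up for illustration, and the default limit of 256 assumes a buffer that holds at most 255 characters plus a terminator.

```python
import io


def long_headers(handle, limit=256):
    """Yield (line_number, length) for header lines of `limit` or more characters.

    Assumption: a fixed buffer of `limit` bytes holds at most limit - 1
    characters plus a terminator, so a header of `limit` characters overflows.
    """
    for number, line in enumerate(handle, start=1):
        if line.startswith(">"):
            header = line.rstrip("\r\n")
            if len(header) >= limit:
                yield number, len(header)


# Hypothetical sample: one short header and one 301-character header.
sample = io.StringIO(">short\nACGT\n>" + "x" * 300 + "\nGGCC\n")
print(list(long_headers(sample)))  # [(3, 301)]
```

For a real file, pass an open file handle in place of the `io.StringIO` sample.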

audy commented

The oldest problem in bioinformatics :D :D

I wonder if there’s a good lib-fasta.

Thanks!


Yeah, quikr_train is tripping on the other '>' characters in the header. I hadn't even thought of that whilst counting up mers.

I'll need to write a smarter parser. Until then, a crappy workaround is removing all of those from the headers except for the one at the start of the header.
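
The workaround described above could be scripted roughly as follows. This is only a sketch under the assumption that every header line starts with '>' and that sequence lines should pass through untouched; `clean_header` and `clean_fasta` are illustrative names, not Quikr functions.

```python
def clean_header(line):
    """Remove every '>' after the first character of a FASTA header line."""
    if not line.startswith(">"):
        return line  # sequence line: leave as-is
    return ">" + line[1:].replace(">", "")


def clean_fasta(lines):
    """Apply clean_header to each line of a FASTA file."""
    return [clean_header(line) for line in lines]


# Hypothetical record with an extra '>' inside the header.
records = [">seq1 A->B\n", "ACGT\n"]
print(clean_fasta(records))  # ['>seq1 A-B\n', 'ACGT\n']
```

To apply it to a file, read the lines from the training FASTA, pass them through `clean_fasta`, and write the result to a new file before running quikr_train on it.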

audy commented

Will do, thanks!

Okay, I just modified the getdelim function to operate on a \n> instead of just a >

Seems to fix the problem for me.

Calvin
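
The difference between the two delimiters can be illustrated outside of C. The sketch below is not the Quikr code; it only shows, on made-up data, why splitting on every '>' breaks a record whose header contains a '>' while splitting on a newline followed by '>' does not.

```python
def split_records_naive(text):
    """Split on every '>', including ones inside a header."""
    return [record for record in text.split(">") if record]


def split_records(text):
    """Split only where '>' begins a line (delimiter is newline + '>')."""
    body = text[1:] if text.startswith(">") else text
    return [record for record in body.split("\n>") if record]


# Hypothetical two-record FASTA; the first header contains an extra '>'.
fasta = ">seq1 A->B\nACGT\n>seq2\nGGCC\n"
print(len(split_records_naive(fasta)))  # 3
print(len(split_records(fasta)))        # 2
```

A '>' that happens to start a continuation line of a header would still be treated as a record boundary, but headers are normally a single line.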

audy commented

Thanks!

Hey,

I pushed a fix to the latest master branch and tested it with your data set. It seems to work fine now.