Error parsing sensing matrix, could not read header
audy opened this issue · 12 comments
Hi, I'm getting this error: Error parsing sensing matrix, could not read header.
Commands:
quikr_train --input taxcollector.fa --output taxcollector.quikr
quikr --input gg_13_5_otus/rep_set/97_otus.fasta --sensing-matrix taxcollector.quikr.gz --output gg_97_otus_taxcollector.txt --verbose
Header of the sensing matrix:
quikr
0
194192
6
>[0]Bacteria;[1]Actinobacteria;[2]Actinobacteria;[3]Acidimicrobiales;[4]Acidimicrobiaceae;[5]Acidimicrobium;[6]Acidimicrobium_ferrooxidans;[7]Acidimicrobium_ferrooxidans;[8]Acidimicrobium_ferrooxidans_(T)|genus|3
0
1
0
0
0
Is my sensing matrix malformed? Thanks. I can send you any files you need.
Hi Audy,
That message gets printed when a line that should be a header (starting with a '>') doesn't actually start with a '>'. If I could inspect the sensing matrix, it would be clearer what the issue is.
One quick question: did you train your sensing matrix with the same version of Quikr that you ran Quikr with? (quikr -v and quikr_train -v should tell you this.)
A version mismatch could potentially have slipped through the error checking, though I don't think that is the case here since the header looks correct to me.
Calvin
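To make the failure condition concrete, here is a minimal Python sketch of the kind of check described above. It is not Quikr's actual code (Quikr's parser is not shown in this thread); the function name `read_header` is hypothetical, and only the error text is taken from the report.

```python
def read_header(line):
    """Return the header text if `line` looks like a header, else raise.

    Hypothetical helper: a line counts as a header only if it starts with '>'.
    """
    if not line.startswith(">"):
        # Same message as in the report; the real check in Quikr may differ.
        raise ValueError("Error parsing sensing matrix, could not read header")
    return line[1:].rstrip("\n")


# A well-formed header line passes the check.
print(read_header(">[0]Bacteria;[1]Actinobacteria|genus|3\n"))
```

Any line handed to this check that begins with something other than '>' (for example a count such as `0`) would raise the error.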
Your training input could also have triggered the error, though without knowing what it looks like I can't think of why.
Can you test on one of the GreenGenes FASTA files provided in the data/ folder to check?
Calvin
Works with GreenGenes file in data/
quikr --input gg_91_otus_4feb2011.fasta --sensing-matrix gg_91_otus_4feb2011.matrix.gz --output test
I'm using the same version of quikr and quikr_train (v1.0.4 for both).
Edit: Can I send you my sensing matrix? I generated it directly from quikr_train.
Okay, could you upload your FASTA file or point me to a source?
On Apr 25, 2014 2:03 PM, "Austin Richardson" notifications@github.com wrote:
> Works with GreenGenes file in data/
> quikr --input gg_91_otus_4feb2011.fasta --sensing-matrix gg_91_otus_4feb2011.matrix.gz --output test
> I'm using the same version of quikr and quikr_train (v1.0.4 for both).
I emailed it to you, @mutantturkey. It was ~15 MB, so hopefully it went through.
I think the issue is a naive approach to headers. We were only statically allocating 256 characters for the header instead of properly reading in the whole line.
I need to confirm that and then push a fix.
Calvin
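As an illustration of that hypothesis only (it is unconfirmed at this point in the thread), the Python sketch below mimics a 256-character buffer with `readline(256)`. The header is made up for the example and is not from the real data set; Quikr's own reading code is not shown here.

```python
import io

# Hypothetical header, well over 256 characters long (not from the real data).
long_header = ">" + ";".join(f"[{i}]Taxon_{i}" for i in range(40)) + "|genus|3"
matrix_text = long_header + "\n0\n1\n"


def read_line_fixed(stream, limit=256):
    """Read at most `limit` characters of a line, like a fixed-size buffer."""
    return stream.readline(limit)


stream = io.StringIO(matrix_text)
first_chunk = read_line_fixed(stream)   # truncated header, no newline yet
second_chunk = read_line_fixed(stream)  # leftover header text, not a new line

# Reading without a limit returns the complete header line.
whole_line = io.StringIO(matrix_text).readline()

print(len(first_chunk), second_chunk.startswith(">"), len(whole_line))
```

With a fixed limit, the tail of a long header comes back on the next read and does not start with '>', so a parser that treats it as a fresh line would misread it. Reading the whole line avoids that.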
The oldest problem in bioinformatics :D :D
I wonder if there’s a good lib-fasta.
Thanks!
On Apr 28, 2014, at 11:25 AM, Calvin Morrison notifications@github.com wrote:
> I think the issue is a naive approach to headers. We were only statically allocating 256 characters for the header instead of properly reading in the whole line.
> I need to confirm that and then push a fix.
> Calvin
Yeah, quikr_train is tripping on the other '>' characters in the header. I hadn't even thought of that whilst counting up mers.
I'll need to write a smarter parser. Until then, a crappy workaround is removing all of those from the headers except for the one at the start of the header.
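One way to apply that workaround before training is sketched below in Python. The function name `strip_inner_gt` and the sample records are hypothetical; it assumes a plain FASTA file in which header lines start with '>' and sequence lines contain no '>'.

```python
def strip_inner_gt(line):
    """Drop every '>' in a header line except the leading one.

    Lines that do not start with '>' (sequence lines) are returned unchanged.
    """
    if line.startswith(">"):
        return ">" + line[1:].replace(">", "")
    return line


# Made-up records for illustration; real headers will look different.
fasta_lines = [
    ">seq1 Bacteria>Firmicutes>Bacilli\n",
    "ACGTACGT\n",
    ">seq2 Archaea\n",
    "TTGACCAA\n",
]

cleaned = [strip_inner_gt(line) for line in fasta_lines]
print("".join(cleaned), end="")
```

This removes the characters outright, so `Bacteria>Firmicutes` becomes `BacteriaFirmicutes`; replacing them with another separator would be an equally reasonable choice.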
Will do, thanks!
Okay, I just modified the getdelim function to split on a \n> instead of just a >, which seems to fix the problem for me.
Calvin
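The difference between the two delimiters can be sketched in Python as follows. This is an illustration of the idea only, not the patched C code; both function names and the sample text are made up.

```python
def split_on_gt(text):
    """Naive split: every '>' starts a new record, even inside a header."""
    return [record for record in text.split(">") if record]


def split_on_newline_gt(text):
    """Split only where '>' begins a line.

    A newline is prepended so the first record is handled like the rest.
    """
    return [record for record in ("\n" + text).split("\n>") if record]


# Hypothetical input: the first header contains an extra '>'.
text = ">seq1 A>B\nACGT\n>seq2\nTTGA\n"

print(len(split_on_gt(text)))          # one record too many
print(len(split_on_newline_gt(text)))  # one record per header line
```

Splitting on a bare '>' breaks the first header in two, while splitting on a newline followed by '>' keeps `seq1 A>B` intact.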
Thanks!
Hey,
I pushed a fix to the latest master branch and tested it with your data set. It seems to work fine now.