Error parsing sensing matrix, could not read header

Question

audy opened this issue 10 years ago · 12 comments

Hi, I'm getting this error: Error parsing sensing matrix, could not read header.

Commands:

quikr_train --input taxcollector.fa --output taxcollector.quikr

quikr --input gg_13_5_otus/rep_set/97_otus.fasta --sensing-matrix taxcollector.quikr.gz --output gg_97_otus_taxcollector.txt --verbose

Header of the sensing matrix:

quikr
0
194192
6
>[0]Bacteria;[1]Actinobacteria;[2]Actinobacteria;[3]Acidimicrobiales;[4]Acidimicrobiaceae;[5]Acidimicrobium;[6]Acidimicrobium_ferrooxidans;[7]Acidimicrobium_ferrooxidans;[8]Acidimicrobium_ferrooxidans_(T)|genus|3
0
1
0
0
0

Is my sensing matrix malformed? Thanks. I can send you any files you need.

audy commented 10 years ago

Thanks!

Answer 1 · 2014-04-25T14:15:00.000Z

Hi Audy,

That message is printed when the parser expects a header line (one starting with '>') but the line it reads doesn't actually start with '>'. If I could inspect the sensing matrix, it would be clear what the issue is.

One quick question - did you train your sensing matrix with the same version of Quikr that you ran Quikr with? (quikr -v and quikr_train -v should tell you this.)

A version mismatch could potentially have slipped through the error checking without being caught. I don't think that is the case here, though, since the header looks correct to me.

Calvin
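
For readers who want to look at the top of a gzipped sensing matrix themselves, here is a minimal Python sketch. It only prints the first few lines of a gzip-compressed text file and assumes nothing about the Quikr matrix format beyond it being gzipped text; the function name head_gz is hypothetical and not part of Quikr.

```python
import gzip
from itertools import islice

def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file, newlines stripped."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Example usage (the file name is taken from the question above):
# for line in head_gz("taxcollector.quikr.gz"):
#     print(line)
```

Reading lazily with islice avoids decompressing the whole matrix just to check where the first '>' line appears.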

Answer 2 · 2014-04-25T14:20:19.000Z

Your training input, which I haven't seen, could also have triggered the error (though I can't think of why).

Can you test on one of the GreenGenes FASTA files provided in the data/ folder to check?

Calvin

Answer 3 · 2014-04-25T18:03:03.000Z

Works with GreenGenes file in data/

quikr --input gg_91_otus_4feb2011.fasta --sensing-matrix gg_91_otus_4feb2011.matrix.gz --output test

I'm using the same version of quikr and quikr_train (v1.0.4 for both).

Edit: Can I send you my sensing matrix? I generated it directly from quikr_train.

Answer 4 · 2014-04-25T18:08:00.000Z

Okay, could you upload your FASTA file or point to a source?

Answer 5 · 2014-04-25T20:22:16.000Z

I emailed it to you @mutantturkey. It was ~15 MB, so hopefully it went through.

Answer 6 · 2014-04-28T15:25:07.000Z

I think the issue is a naive approach to headers. We were statically allocating only 256 characters for the header instead of properly reading in the whole line.

I need to confirm that and then push a fix.

Calvin

Answer 7 · 2014-04-28T15:27:54.000Z

The oldest problem in bioinformatics :D :D

I wonder if there’s a good lib-fasta.

Thanks!

Answer 8 · 2014-04-29T16:38:31.000Z

Yeah, quikr_train is tripping on the other '>' characters in the header. I hadn't even thought of that whilst counting up mers.

I'll need to write a smarter parser. Until then, a crappy workaround is removing all of the '>' characters from each header except the one at the start of the header.
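
The workaround described above could be sketched in Python roughly as follows. This is an illustration, not a tool shipped with Quikr: the function names clean_header and clean_fasta are hypothetical, and the sketch assumes a plain-text FASTA file in which every header line begins with '>'.

```python
def clean_header(line, replacement=""):
    """Drop (or replace) every '>' in a FASTA header line except the leading one."""
    if not line.startswith(">"):
        return line  # sequence line: leave untouched
    return ">" + line[1:].replace(">", replacement)

def clean_fasta(src_path, dst_path, replacement=""):
    """Copy a FASTA file, cleaning extra '>' characters out of header lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(clean_header(line, replacement))

# Example usage (the output file name is hypothetical):
# clean_fasta("taxcollector.fa", "taxcollector.clean.fa")
```

Passing a replacement such as "_" instead of the empty string keeps header lengths unchanged, which may make the cleaned headers easier to map back to the originals.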

Answer 9 · 2014-04-29T16:39:48.000Z

Will do, thanks!

Answer 10 · 2014-04-29T17:53:23.000Z

Okay, I just modified the getdelim function to operate on a \n> instead of just a >

Seems to fix the problem for me.

Calvin
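
To see why the delimiter matters, here is a small Python illustration of the idea. It is not Quikr's C code; the helper split_records and the sample FASTA text are made up for the example. Splitting on a bare '>' breaks a header that contains '>' into two pieces, while splitting on a newline followed by '>' only breaks at real record starts.

```python
def split_records(text, delimiter):
    """Split FASTA text into records on the given delimiter, dropping empty chunks."""
    return [chunk for chunk in text.split(delimiter) if chunk.strip()]

# Made-up sample: the first header contains an extra '>' character.
fasta = ">seq1 A->B transporter\nACGT\n>seq2\nGGCC\n"

naive = split_records(fasta, ">")
# Prepend a newline so the very first record also begins with "\n>".
fixed = split_records("\n" + fasta, "\n>")

print(len(naive))  # 3: the first header was cut in two at its inner '>'
print(len(fixed))  # 2: one chunk per real record
```

A '>' that happens to sit at the very start of a line inside a record would still confuse this approach, but in FASTA a line starting with '>' is by definition a new header.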

Answer 11 · 2014-04-30T01:27:27.000Z

Hey,

I pushed a fix to the latest master branch and tested it with your data set. It seems to work fine now.