ERROR: input record is badly formatted. Unexpected character 'M' found.

Question

Closed this issue 2 years ago · 2 comments

Hi,
I have cloned the source code and obtained the human genome reference in the form of the "analysis set",

-rw-rw---- 1 rahmann 835M Jun 28 14:07 GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
-rw-rw---- 1 rahmann 3,0G Jun 28 14:07 GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta

I am using the uncompressed (.fasta) file and running

bcmap index GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta

using the default values. On the first run it builds the .fai file (which looks fine to me). But then, I get the following error:

reference        	GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta
kmer_index_name  	Index
k                	31
minimizer_window 	61

Loading reference genome...ERROR: input record is badly formatted. Unexpected character 'M' found. 
..done.
Bucket number set to: 0
Loading ref.fai............failed.
Building ref.fai.............done.
Preparing index..............done.
Filling index initially......done. 
Calculating cumulated sum....done.
Writing positions to index...done. 
Writing index to file........done.
Index finished!

I do not know what is badly formatted. I have uploaded the .fai file for you (renamed to .txt to be able to upload it on github) and pasted the first lines of the FASTA below:

GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta.fai.txt

>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Indeed, grep finds a few lines in the genomic sequence that contain the character 'M'. Is this the problem?

TCACCCCCCACACACACCAAACAMCCCACACAACACACACACACCACACCACACAAACACAAACACACCA
ACACAACACAGATGCACACAMCACACATCACACCTACATACAACATACACACATACACCTACATACATTA
GGAAACACAGCTTTTGTCCATTCTGTGAAAGGACATTTCGGAGCTCTTTGGTACCAATGGTGMAAAAGCA
TTCTACTTTTCATCTGAAGATGTTTCCTTTTTTCTCATGGGCCTCAATACAMTCCCAAATATCCCTTGGC
AACTGCTCCATCAAAAGAAAATTTTAACTCTTTGAGATGAATGCACACATCAMAAAGCAGTTTCTCAGAG
CACAACCCACACACTGTACATACACAMCCCACACACATACACTGCATATACTCCATATACACATCCCATA
CCACAMAAATGTACTCATACCACACACACGAGCCCCACATAAATGCACTATACCACATACACACACATTC
ATACCCCCAACACAAATACACACATTCAACACAAACCACAATCACACCACACACTCAAMACACACCACAC
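
As an alternative to grep, the check can be sketched in Python. This is only an illustration: the function name `count_ambiguity_codes` and the demo records are made up here, and it assumes that A, C, G, T and N (in either case) are the only symbols the parser accepts.

```python
from collections import Counter

def count_ambiguity_codes(lines):
    """Tally sequence characters other than A, C, G, T, N (case-insensitive).

    Header lines (starting with '>') are skipped, so an 'M' in a header
    such as 'M5:...' is not counted.
    """
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue
        counts.update(line.strip().upper())
    # Drop the symbols assumed to be accepted by the parser.
    for base in "ACGTN":
        counts.pop(base, None)
    return counts

# Hypothetical demo records, not taken from the real reference.
demo = [">chr_demo M5:abc", "ACAMCCCA", "NNNNRYAC"]
print(count_ambiguity_codes(demo))
```

For a real file, pass an open file handle instead of the list, for example `count_ambiguity_codes(open(path))`.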

Kindly let me know which reference to use.

Answer 1 · 2022-06-28T12:49:46.000Z

Thank you for the feedback! The 'M' character is indeed the problem: it is an IUPAC ambiguity code (A or C), and I have seen this error before in code that uses the SeqAn library to parse FASTA files.

We will fix this as soon as possible. In the meantime, you can replace every character other than A, C, G, T in the sequence lines of your reference genome FASTA file (not in the '>' header lines) with 'N', and the program should run.
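
A minimal sketch of that workaround in Python, in case it helps. The names `mask_line` and `mask_fasta` and the output file name are hypothetical and not part of bcmap. The sketch assumes existing 'N' characters and lowercase a/c/g/t can stay as they are; if lowercase bases also cause trouble, upper-case the line before substituting.

```python
import re

# Anything that is not A, C, G, T or N (either case) is replaced by 'N'.
_NON_ACGTN = re.compile(r"[^ACGTNacgtn]")

def mask_line(line):
    """Return a FASTA line with ambiguity codes replaced by 'N'.

    Header lines (starting with '>') are returned unchanged.
    """
    if line.startswith(">"):
        return line
    body = line.rstrip("\r\n")
    return _NON_ACGTN.sub("N", body) + "\n"

def mask_fasta(src_path, dst_path):
    """Stream src_path to dst_path line by line, masking sequence lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(mask_line(line))

# Example call (the output file name is made up):
# mask_fasta("GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta",
#            "analysis_set.masked.fasta")
```

Because the file is streamed line by line, the 3 GB reference never has to fit in memory. The masked file needs a fresh .fai, so do not reuse the index built from the original.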

Answer 2 · 2022-06-29T06:15:22.000Z

We added support for references containing IUPAC characters, so this issue should be fixed in the newest release.

Thank you for your feedback and please let us know if other problems occur.