ERROR: input record is badly formatted. Unexpected character 'M' found.
Hi,
I have cloned the source code and obtained the human genome reference in the form of the "analysis set",
-rw-rw---- 1 rahmann 835M Jun 28 14:07 GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz
-rw-rw---- 1 rahmann 3,0G Jun 28 14:07 GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta
I am using the uncompressed (.fasta) file and running
bcmap index GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta
using the default values. On the first run it builds the .fai
file (which looks fine to me). But then, I get the following error:
reference GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta
kmer_index_name Index
k 31
minimizer_window 61
Loading reference genome...ERROR: input record is badly formatted. Unexpected character 'M' found.
..done.
Bucket number set to: 0
Loading ref.fai............failed.
Building ref.fai.............done.
Preparing index..............done.
Filling index initially......done.
Calculating cumulated sum....done.
Writing positions to index...done.
Writing index to file........done.
Index finished!
I do not know what is badly formatted. I have uploaded the .fai file for you (renamed to .txt so that it can be uploaded to GitHub) and pasted the first lines of the FASTA below:
GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fasta.fai.txt
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Indeed, grep finds a few lines in the genomic sequence that contain the character 'M'; is this the problem?
TCACCCCCCACACACACCAAACAMCCCACACAACACACACACACCACACCACACAAACACAAACACACCA
ACACAACACAGATGCACACAMCACACATCACACCTACATACAACATACACACATACACCTACATACATTA
GGAAACACAGCTTTTGTCCATTCTGTGAAAGGACATTTCGGAGCTCTTTGGTACCAATGGTGMAAAAGCA
TTCTACTTTTCATCTGAAGATGTTTCCTTTTTTCTCATGGGCCTCAATACAMTCCCAAATATCCCTTGGC
AACTGCTCCATCAAAAGAAAATTTTAACTCTTTGAGATGAATGCACACATCAMAAAGCAGTTTCTCAGAG
CACAACCCACACACTGTACATACACAMCCCACACACATACACTGCATATACTCCATATACACATCCCATA
CCACAMAAATGTACTCATACCACACACACGAGCCCCACATAAATGCACTATACCACATACACACACATTC
ATACCCCCAACACAAATACACACATTCAACACAAACCACAATCACACCACACACTCAAMACACACCACAC
Kindly let me know which reference to use.
Thank you for the feedback! The 'M' character is indeed the problem - I have seen this before in code using the SeqAn library for parsing FASTA files.
We will fix this as soon as possible. In the meantime, you can replace every character other than A, C, G, T in the sequence lines of your reference FASTA file (not in the '>' header lines) with 'N', and the program should run.
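One way to apply this workaround is a small script along the following lines. It is only a sketch, not part of bcmap: the function names and the output file name are made up for illustration, and it assumes that lowercase (soft-masked) a/c/g/t/n may stay as they are and that lines starting with '>' are headers that must not be touched.

```python
import re
import sys

# Assumption: anything other than A, C, G, T or N (either case) counts as
# an ambiguity code that the parser may reject.
_NON_ACGTN = re.compile(r"[^ACGTNacgtn]")


def mask_sequence_line(line: str) -> str:
    """Replace ambiguity codes in a sequence line with 'N'; keep headers as-is."""
    if line.startswith(">"):
        return line
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # keep the original line ending
    return _NON_ACGTN.sub("N", body) + ending


def mask_fasta(src_path: str, dst_path: str) -> None:
    """Stream the FASTA line by line so a 3 GB reference never sits in memory."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(mask_sequence_line(line))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python mask_fasta.py input.fasta masked.fasta
    mask_fasta(sys.argv[1], sys.argv[2])
```

The masked copy is written to a new file, so the original reference stays untouched. Since the sequence content changes, the index should be built from the masked file rather than reusing anything derived from the original.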
We have added support for references containing IUPAC characters, so this issue should be fixed in the newest release.
Thank you for your feedback and please let us know if other problems occur.