josuebarrera/GenEra

Maybe an Ensembl protein format bug

yxj17173 opened this issue · 2 comments

Dear Josué,
Hello, another bug occurred. When I use an Ensembl genome, such as the mouse genome GRCm38.92, genEra doesn't work. I had used the same GRCm38.92 file to run Diamond and it worked, so this may be a genEra bug. This is the output I got (tax10090.txt):

Error: The sequences are expected to be proteins but only contain DNA letters. Use the option --ignore-warnings to proceed.
I tried adding '--ignore-warnings' to the end of the command:
nohup time genEra -q /mnt/data4/disk/yxj/EmbroGenesis/Ref/Sequences/Mm.fa -t 10090 -b /mnt/data4/disk/yxj/diamond_nr/nr -d /mnt/data4/disk/yxj/diamond_nr/taxdump --ignore-warnings > /mnt/data4/disk/yxj/result/tax10090_2.txt &
and I got: ERROR: One or more invalid arguments.
Then I tried another release, GRCm38.86, and got the same error.
By the way, the NCBI RefSeq format works well.
Best regards,
Xujiang

Dear Xujiang,
I don't think that the Ensembl format is the one to blame here. The FASTA file that you're trying to use has some very short sequences that do not look like genes:

>ENSMUSP00000142546.1 pep chromosome:GRCm38:14:54113468:54113476:1 gene:ENSMUSG00000096749.2 transcript:ENSMUST00000196221.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd1 description:T cell receptor delta diversity 1 [Source:MGI Symbol;Acc:MGI:4439547]
MAY
>ENSMUSP00000142955.1 pep chromosome:GRCm38:14:54122226:54122241:1 gene:ENSMUSG00000096176.1 transcript:ENSMUST00000177564.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trdd2 description:T cell receptor delta diversity 2 [Source:MGI Symbol;Acc:MGI:4439546]
IGGIR
>ENSMUSP00000141764.1 pep chromosome:GRCm38:6:41533201:41533212:1 gene:ENSMUSG00000095668.1 transcript:ENSMUST00000178537.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trbd1 description:T cell receptor beta, D region 1 [Source:MGI Symbol;Acc:MGI:4439571]
GTGG
>ENSMUSP00000141312.1 pep chromosome:GRCm38:6:41542163:41542176:1 gene:ENSMUSG00000094569.1 transcript:ENSMUST00000178862.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:Trbd2 description:T cell receptor beta, D region 2 [Source:MGI Symbol;Acc:MGI:4439727]
GTGG
>ENSMUSP00000142153.1 pep chromosome:GRCm38:12:113430528:113430538:-1 gene:ENSMUSG00000094028.1 transcript:ENSMUST00000179520.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd4-1 description:immunoglobulin heavy diversity 4-1 [Source:MGI Symbol;Acc:MGI:4439801]
LTG
>ENSMUSP00000141970.1 pep chromosome:GRCm38:12:113448214:113448229:-1 gene:ENSMUSG00000094552.1 transcript:ENSMUST00000179883.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd3-2 description:immunoglobulin heavy diversity 3-2 [Source:MGI Symbol;Acc:MGI:4439707]
RQLRL
>ENSMUSP00000142162.1 pep chromosome:GRCm38:12:113449588:113449597:-1 gene:ENSMUSG00000096420.2 transcript:ENSMUST00000195858.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd5-6 description:immunoglobulin heavy diversity 5-6 [Source:MGI Symbol;Acc:MGI:4937234]
EYL
>ENSMUSP00000141399.1 pep chromosome:GRCm38:12:113450851:113450867:-1 gene:ENSMUSG00000095656.1 transcript:ENSMUST00000180001.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd2-8 description:immunoglobulin heavy diversity 2-8 [Source:MGI Symbol;Acc:MGI:4439706]
STMVT
>ENSMUSP00000141414.1 pep chromosome:GRCm38:12:113454942:113454951:-1 gene:ENSMUSG00000094957.1 transcript:ENSMUST00000178815.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd5-5 description:immunoglobulin heavy diversity 5-5 [Source:MGI Symbol;Acc:MGI:4937334]
DYL
>ENSMUSP00000141374.1 pep chromosome:GRCm38:12:113456720:113456736:-1 gene:ENSMUSG00000094057.1 transcript:ENSMUST00000177965.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd2-7 description:immunoglobulin heavy diversity 2-7 [Source:MGI Symbol;Acc:MGI:4439866]
STMVT
>ENSMUSP00000141376.1 pep chromosome:GRCm38:12:113459864:113459892:-1 gene:ENSMUSG00000094268.1 transcript:ENSMUST00000178909.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd5-8
RQLASAVPQ
>ENSMUSP00000141852.1 pep chromosome:GRCm38:12:113460101:113460110:-1 gene:ENSMUSG00000096884.1 transcript:ENSMUST00000177646.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd5-4 description:immunoglobulin heavy diversity 5-4 [Source:MGI Symbol;Acc:MGI:4937058]
EYL
>ENSMUSP00000141615.1 pep chromosome:GRCm38:12:113461369:113461385:-1 gene:ENSMUSG00000096250.1 transcript:ENSMUST00000178230.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd2-6 description:immunoglobulin heavy diversity 2-6 [Source:MGI Symbol;Acc:MGI:4439865]
PTIVT
>ENSMUSP00000141202.1 pep chromosome:GRCm38:12:113464524:113464552:-1 gene:ENSMUSG00000095592.1 transcript:ENSMUST00000178483.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd5-7
RQLASAVPQ
>ENSMUSP00000141703.1 pep chromosome:GRCm38:12:113464761:113464770:-1 gene:ENSMUSG00000093876.1 transcript:ENSMUST00000179262.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd5-3 description:immunoglobulin heavy diversity 5-3 [Source:MGI Symbol;Acc:MGI:4937297]
EYL
>ENSMUSP00000141697.1 pep chromosome:GRCm38:12:113466027:113466043:-1 gene:ENSMUSG00000095897.1 transcript:ENSMUST00000178549.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd2-5 description:immunoglobulin heavy diversity 2-5 [Source:MGI Symbol;Acc:MGI:4439705]
PTIVT
>ENSMUSP00000142226.1 pep chromosome:GRCm38:12:113469189:113469217:-1 gene:ENSMUSG00000103203.1 transcript:ENSMUST00000193012.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Gm37327
RQLASAVPQ
>ENSMUSP00000142199.1 pep chromosome:GRCm38:12:113469426:113469435:-1 gene:ENSMUSG00000096396.1 transcript:ENSMUST00000179166.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd5-2 description:immunoglobulin heavy diversity 5-2 [Source:MGI Symbol;Acc:MGI:4936898]
EYL
>ENSMUSP00000141415.1 pep chromosome:GRCm38:12:113470694:113470710:-1 gene:ENSMUSG00000095444.1 transcript:ENSMUST00000179560.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd2-4 description:immunoglobulin heavy diversity 2-4 [Source:MGI Symbol;Acc:MGI:4439709]
STMIT
>ENSMUSP00000142229.1 pep chromosome:GRCm38:12:113475400:113475416:-1 gene:ENSMUSG00000096568.1 transcript:ENSMUST00000177839.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd2-3 description:immunoglobulin heavy diversity 2-3 [Source:MGI Symbol;Acc:MGI:4439708]
SMMVT
>ENSMUSP00000141687.1 pep chromosome:GRCm38:12:113482170:113482192:-1 gene:ENSMUSG00000076630.1 transcript:ENSMUST00000103439.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd1-1
FITTVVA
>ENSMUSP00000141755.1 pep chromosome:GRCm38:12:113525313:113525329:-1 gene:ENSMUSG00000093818.1 transcript:ENSMUST00000180266.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Ighd3-1 description:immunoglobulin heavy diversity 3-1 [Source:MGI Symbol;Acc:MGI:4439891]
GTARA
>ENSMUSP00000141206.1 pep chromosome:GRCm38:12:113528032:113528054:-1 gene:ENSMUSG00000076632.1 transcript:ENSMUST00000103441.1 gene_biotype:IG_D_gene transcript_biotype:IG_D_gene gene_symbol:Gm16968
YITKVVA

These sequences are too short for Diamond to handle, so they are causing the pipeline to crash. Remove those sequences and try to run the pipeline again.

tail -561183 Mus_musculus.GRCm38.pep.all.fa > Mus_musculus.GRCm38.pep.clean.fa
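
The line count in the tail command only fits a file where all of the short records sit at the very top. If that isn't the case for your download, a length-based filter may be more robust. What follows is a minimal sketch rather than part of genEra: the function name filter_short_proteins is hypothetical, and the threshold of 10 residues is an assumption chosen only because every record quoted above is shorter than that. It is not a documented Diamond or genEra limit.

```shell
# Keep only FASTA records whose sequence has at least MIN_LEN residues.
# Usage (file names taken from the command above; 10 is an assumed threshold):
#   filter_short_proteins 10 < Mus_musculus.GRCm38.pep.all.fa > Mus_musculus.GRCm38.pep.clean.fa
# Note: wrapped sequences are written back as a single line per record.
filter_short_proteins() {
    awk -v min_len="$1" '
        # Print the buffered record if its sequence is long enough.
        function emit() {
            if (header != "" && length(seq) >= min_len) {
                print header
                print seq
            }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Any other line is sequence: strip whitespace and append it.
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        # Do not forget the last record in the file.
        END { emit() }
    '
}
```

Headers are passed through unchanged, so the Ensembl identifiers stay intact. Counting the lines that start with ">" before and after filtering shows how many records were dropped.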

I'm sure it will work properly this time.

Cheers,
Josué.

Dear Josué,
Thanks for answering my doubts!
Xujiang