CDS does not work
Closed this issue · 5 comments
Hi, I am running the following command
gbseqextractor -types CDS -rv -f ~/data/genomes/NCBI/gbk/genomic.gbff -prefix test
and the produced fasta sequences are not CDS but look more like whole gene (including introns). Am I doing something wrong here?
Not sure what happened to you. would you please upload your gbff file?
Sure! The genbank file I used was https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/143/395/GCF_000143395.1_Attacep1.0/GCF_000143395.1_Attacep1.0_genomic.gbff.gz
Here's my command:
gbseqextractor -types CDS -rv -f GCF_000143395.1/genomic.gbff -prefix test
Here's an example of the "CDS" that gbseqextractor produces:
>NW_012130065.1;LOC105617575
ATACAAGTGCGAGAAAAATGACGTGCCCATTACGGTCACGAATGGCTCGACCTTGCCGCTCGTGGAGCGACCGACCGATGAGGAGGCCGCCCTTTATGACCCATTCGAGCATCGCAAACTCGCCCATCCCACGTCGGATCTCGACACGCTCATACATCTACTCAAGGGGAGCCTGGGCACGGGCATACTCGCAATGCCGATGGCCTTCCGTAACGCCGGACTCGCCTTCGGCCTCTTCGCCACCTTCTTCATCGGTGCAGTCTGCACCTATTGCGTTCACATCTTGGTAAAATCCGCCCATGTCCTGTGCAGGCGGTTGCAGACACCTAGTCTGGGCTTCGCTGACGTGGCCGAAGCGGCCTTCCTAGTGGGGCCCGAACCCGTTCAGAAATACGCCAGACTTGCCAAGTGAGTTTATTCATGGGAGACGTTACGTTTGCGTTTTATTACCAGAGTCATTTAGAAAATATATAACACGACAAATTTCGTTTTCACAACTTTTTATATGTATGAAGATATCCTCGATGTATACGTTGCGTTGTTATCCCTGTTTCTTTTTATTATTATTATTTTATATTATTCAATATATACACCGTATATTTAATATCGTTTCAGGGCAACAATTAATTCGTTCTTGGTGATCGATCTGGTTGGTTGTTGCTGCGTGTACATCGTCTTTATTTCAACTAATCTTAAAGAGGTCGTGGATTATTATACGGAAACGGATAAGGATCTGCGAGTGTACATGGCCGCGTTGCTGCCATTACTCATTATCTTCTCTCTTGTGAGGAATCTCAAATATCTCGCGCCGTTCTCAATGGTGGCAAATGTATTAATCGCCACCGGTATGGGTATCACCTTCTATTATATTTTCTACGATCTGCCTTCTATCAACGACGTACCAAATTTCTCTAGCTGGTCTCAGTTGCCCCTTTTCTTCGGCACTGCCATTTTTGCCCTCGAGGGCATCGGTGTTGTAAGTATACTCATTTTTTTTTTTATATCGCTATATCACAACAAATTTTACGTAACATCAAACTCAAATGAAATATTAAAACTTTATAATTTATAGCAAAACTTTTTTAACTTTTCCAACCAATATGATGTTTATAACTTAAAAAAAATTGTTGACAGAAGTTAAAGTACATTGATTTTGTATTTAATTTTGATTAGAATTATTGCAGTACAATTACATGATGAAGATATTACGACACTGCGCGTCAGAAATAACTTTGCATTCAGTTGCTAGACACGTTCGGACACTTAACGGTCGCGTTCGATTGTTTCAAAGTATCGACACGCGTAAATCAATCAAATGGCACGCCTTGGTCTGTGCGTCATGGAATATTTCGTACATTTGTAACACGTACCAACTTGATGAGCAAAAAATTGAATTAACATTTTATAATATCTGTCTCTGAATAGAAATTAAATCATCCAAGTATAGAAATTCTTAAAATAAACGTTATTTTTAATAAAATTTAATTTTCAATAAAATGCAAAATAATTAAAAAATACTATTTAAATCATATTTTATATACTTTTTTTGTTTTGAAATCTAAAAAAAATATTTAAATGTGTTCTATATTTTTTACTTATCTTATATACATATATCTTATAAAATAGAAACTATTTACATATGAACTTTAAATTTTGTGCCATCTTTTGTTTTCAATTCCATATGTATTTCATTATGTATCATTAAGAGTTAATTTGTTTCTAGGTAATGCCTTTGGAGAATAATATGAAAACACCAGCTCATTTCGTCGGTTGTCCGGGTGTCCTCAATACCGGCATGTTCTTCGTCGTATTGTTGTACAGTACCGTCGGTTTCTTCGGCTTCTGGAAGTATGGCGACAGCACAAGGGCCTCCATTACGCTTAATCTTCCGCAAAGTCAGGTGCTGGCACAATCCACCAAGGTAATGATTGCTATTGCGATCTTCCTCACTTACGGACTTCAGTTCTACGTGCCCATGGAGATTATTTGGAAGAACGCTAAGCAGTATTTTGGCTCCCGGCGATTACTCGGCGAATATTTCCTCCGCATCCTATTGGTGATCTTTACCGTATGCGTGGCGATCGCGATCCCCAATTTGGGTCCATTTATCTCCCTCGTCGGGGCTGTATGTTTGTCCACGTTGGGTCTCATGTTCCCATCGGTGATCGAATTGGTAACCGTCTGGGAACAGGAGAACGGCCTTGGCAAATGGAATTGGCGGCTATGGAAGAATATCGCCATCATCACTTTTGGTGTGTTGGGCTTCCTCACTGGCACGTATGTCAGTATCCAAGAGATCTTGGAAGGCAAATGA
When I blast this produced "CDS" sequence against the proteins annotated for that genome I get the following:
sorry, this script doesn't work when a gene has multiple CDS.
Actually, in the https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/143/395/GCF_000143395.1_Attacep1.0/ directory, NCBI already generated CDS and protein sequences for each gene.
Oh. Thanks for clarifying! I will keep lookinf for a decent tool to retrieve CDS from genbank files then. Thanks for pointing out the NCBI-made files!
Hi schraderL ,
Now the latest version can handle your problem.
Note: the script won't combine mutliple CDS of the same gene into ONE sequence.