huangnengCSU/compleasm

compleasm output also cds nucleotide sequences for the final set of genes

Opened this issue · 3 comments

Hi

Thank for this great software.

not a bug but a question/enhancement request:
Is there a way to also output the coding sequence for each final BUSCO gene? This would be really useful in certain phylogenetic analyses. Currently I managed to get it from the miniprot gff output file, but have to cross-check the gene marker fasta output first for the variant of the respective BUSCO gene used and then extract the cds from there.

i just noticed another detail which make it a bit difficult to extract the single copy gene CDS from the GFF files. out of >5300 genes in the Hymenoptera ODB10 dataset, 11 were found twice, i.e. with the same BUSCO-ID_X (see below). The other cases have incremental numbers and are unique, and then one of those is picked for the final result apparently. For the identical IDs in the GFF this is the case too (only one of them is picked), but I would need to check each case to know which one is the correct one. Not sure if this an intended behavior

double entry for 6888at7399_5
WINE01003074.1 miniprot mRNA 8969 12011 2184 + . ID=MP049007;Rank=1;Identity=0.5306;Positive=0.7011;Target=6888at7399_5 638
1430
WINE01001833.1 miniprot mRNA 26038 29230 2075 + . ID=MP049008;Rank=2;Identity=0.6583;Positive=0.7722;Target=6888at7399_5 1 622

Hi @estolle,

The output file full_table.tsv may give the information you need which includes the region in the genome hits the single copy gene.

thanks for the pointer. Indeed the full table has some info I can use and I think I can get the info I need now. Yet, if compleasm internally would know the cds and could output a multi-fasta (according to the gene marker fasta, that would be very helpful.

One issue I still see is this duplication of a BUSCO genes entry which is used for two different gene IDs in 2 different places, both marked ass "single". Yet the BUSCO Gene ID is identical. The difference is one has "Rank=1" and one "Rank=2" and the Rank1-verson is the final one in gene marker fasta and full table. So at the moment I am filtering out entries not matching Rank=1.

Do you have an idea how this duplication can be explained?

cat miniprot_output.single.frag.gff | grep "6888at7399_5"
WINE01003074.1 miniprot mRNA 8969 12011 2184 + . ID=MP049007;Rank=1;Identity=0.5306;Positive=0.7011;Target=6888at7399_5 638 1430
WINE01003074.1 miniprot CDS 8969 9234 230 + 0 Parent=MP049007;Rank=1;Identity=0.5667;Target=6888at7399_5 638 726
WINE01003074.1 miniprot CDS 9314 9942 353 + 1 Parent=MP049007;Rank=1;Identity=0.4047;Target=6888at7399_5 727 924
WINE01003074.1 miniprot CDS 10009 10343 350 + 2 Parent=MP049007;Rank=1;Identity=0.5676;Target=6888at7399_5 925 1036
WINE01003074.1 miniprot CDS 10421 11025 633 + 0 Parent=MP049007;Rank=1;Identity=0.6337;Target=6888at7399_5 1037 1214
WINE01003074.1 miniprot CDS 11094 11451 397 + 1 Parent=MP049007;Rank=1;Identity=0.5966;Target=6888at7399_5 1215 1334
WINE01003074.1 miniprot CDS 11532 11629 99 + 0 Parent=MP049007;Rank=1;Identity=0.5455;Target=6888at7399_5 1335 1366
WINE01003074.1 miniprot CDS 11825 12011 122 + 1 Parent=MP049007;Rank=1;Identity=0.3810;Target=6888at7399_5 1367 1430

WINE01001833.1 miniprot mRNA 26038 29230 2075 + . ID=MP049008;Rank=2;Identity=0.6583;Positive=0.7722;Target=6888at7399_5 1 622
WINE01001833.1 miniprot CDS 26038 26122 147 + 0 Parent=MP049008;Rank=2;Identity=0.9655;Target=6888at7399_5 1 28
WINE01001833.1 miniprot CDS 26353 26491 212 + 2 Parent=MP049008;Rank=2;Identity=0.8478;Target=6888at7399_5 29 74
WINE01001833.1 miniprot CDS 27069 27319 231 + 1 Parent=MP049008;Rank=2;Identity=0.5833;Target=6888at7399_5 75 149
WINE01001833.1 miniprot CDS 27392 27653 335 + 2 Parent=MP049008;Rank=2;Identity=0.7241;Target=6888at7399_5 150 236
WINE01001833.1 miniprot CDS 27730 27910 232 + 1 Parent=MP049008;Rank=2;Identity=0.7500;Target=6888at7399_5 237 297
WINE01001833.1 miniprot CDS 27995 28309 448 + 0 Parent=MP049008;Rank=2;Identity=0.8381;Target=6888at7399_5 298 402
WINE01001833.1 miniprot CDS 28389 28743 94 + 0 Parent=MP049008;Rank=2;Identity=0.2857;Target=6888at7399_5 403 510
WINE01001833.1 miniprot CDS 28827 29071 243 + 2 Parent=MP049008;Rank=2;Identity=0.6098;Target=6888at7399_5 511 593
WINE01001833.1 miniprot CDS 29144 29230 133 + 0 Parent=MP049008;Rank=2;Identity=0.8966;Target=6888at7399_5 594 622