--busco_lineage helps with busco scores but not with protein number
aureliendejode opened this issue · 4 comments
Hello,
I used BRAKER3 with default parameters to annotate 3 anemone genomes. The BUSCO scores of the annotations were lower than those of the genome assemblies, so I ran it again with the --busco_lineage option, which solved that issue.
However, there is still a big difference in the number of proteins among the braker.aa, genemark.aa and augustus.hints.aa files.
Is this something that needs to be fixed? (I started running OMArk on braker.aa and the results look fine to me.)
If so, it seems to me this might come from TSEBRA. Is there a way to run TSEBRA differently?
Here are the stats for the 2 braker runs:
####Before using --busco_lineage
# BUSCO version is: 5.6.1
# The lineage dataset is: eukaryota_odb10 (Creation date: 2024-01-08, number of genomes: 70, number of BUSCOs: 255)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker/braker.aa
# BUSCO was run in mode: proteins
***** Results: *****
C:92.5%[S:83.1%,D:9.4%],F:1.6%,M:5.9%,n:255
236 Complete BUSCOs (C)
212 Complete and single-copy BUSCOs (S)
24 Complete and duplicated BUSCOs (D)
4 Fragmented BUSCOs (F)
15 Missing BUSCOs (M)
255 Total BUSCO groups searched
-rw-r--r-- 1 adejode bmtitus 18M 14 oct. 16:33 Augustus/augustus.hints.aa
-rw-r--r-- 1 adejode bmtitus 10M 14 oct. 16:35 braker.aa
-rw-r--r-- 1 adejode bmtitus 19M 15 oct. 14:19 GeneMark-ETP/genemark.aa
grep -c ">" braker.aa Augustus/augustus.hints.aa GeneMark-ETP/genemark.aa
braker.aa:19761
Augustus/augustus.hints.aa:38767
GeneMark-ETP/genemark.aa:36201
# BUSCO version is: 5.6.1
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker/braker.aa
# BUSCO was run in mode: proteins
***** Results: *****
C:91.7%[S:83.6%,D:8.1%],F:1.3%,M:7.0%,n:954
875 Complete BUSCOs (C)
798 Complete and single-copy BUSCOs (S)
77 Complete and duplicated BUSCOs (D)
12 Fragmented BUSCOs (F)
67 Missing BUSCOs (M)
954 Total BUSCO groups searched
####After using --busco_lineage
# BUSCO version is: 5.6.1
# The lineage dataset is: eukaryota_odb10 (Creation date: 2024-01-08, number of genomes: 70, number of BUSCOs: 255)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker_busco_lineage/braker/braker.aa
# BUSCO was run in mode: proteins
***** Results: *****
C:97.3%[S:72.2%,D:25.1%],F:0.8%,M:1.9%,n:255
248 Complete BUSCOs (C)
184 Complete and single-copy BUSCOs (S)
64 Complete and duplicated BUSCOs (D)
2 Fragmented BUSCOs (F)
5 Missing BUSCOs (M)
255 Total BUSCO groups searched
Dependencies and versions:
hmmsearch: 3.1
busco: 5.6.1
# BUSCO version is: 5.6.1
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker_busco_lineage/braker/braker.aa
# BUSCO was run in mode: proteins
***** Results: *****
C:97.5%[S:69.5%,D:28.0%],F:0.6%,M:1.9%,n:954
930 Complete BUSCOs (C)
663 Complete and single-copy BUSCOs (S)
267 Complete and duplicated BUSCOs (D)
6 Fragmented BUSCOs (F)
18 Missing BUSCOs (M)
954 Total BUSCO groups searched
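To make the before/after comparison easier to read, here is a minimal sketch that parses the one-line BUSCO notation and prints the change per category. The `parse_busco` helper is hypothetical (it is not part of BUSCO or BRAKER), and the two strings are the metazoa_odb10 summary lines copied from the reports above. It makes the rise in duplicated BUSCOs visible (8.1% to 28.0%), which in protein mode may simply reflect several isoforms per gene rather than duplicated genes.

```python
import re

def parse_busco(summary):
    """Parse a one-line BUSCO summary into percentages plus the integer n.

    Hypothetical helper for illustration; not part of BUSCO itself.
    """
    values = {key: float(pct) for key, pct in re.findall(r"([CSDFM]):([\d.]+)%", summary)}
    n_match = re.search(r"n:(\d+)", summary)
    if n_match is None:
        raise ValueError("no 'n:<count>' field found in summary line")
    values["n"] = int(n_match.group(1))
    return values

# Summary lines copied from the metazoa_odb10 reports in this issue.
before = parse_busco("C:91.7%[S:83.6%,D:8.1%],F:1.3%,M:7.0%,n:954")
after = parse_busco("C:97.5%[S:69.5%,D:28.0%],F:0.6%,M:1.9%,n:954")

for key in "CSDFM":
    delta = after[key] - before[key]
    print(f"{key}: {before[key]:5.1f}% -> {after[key]:5.1f}% ({delta:+.1f})")
```

The same function should work on the eukaryota_odb10 lines, since they use the same notation.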
-rw-r--r-- 1 adejode bmtitus 18M 16 oct. 10:04 Augustus/augustus.hints.aa
-rw-r--r-- 1 adejode bmtitus 11M 16 oct. 10:07 braker.aa
-rw-r--r-- 1 adejode bmtitus 19M 16 oct. 10:52 GeneMark-ETP/genemark.aa
grep -c ">" braker.aa Augustus/augustus.hints.aa GeneMark-ETP/genemark.aa
braker.aa:20454
Augustus/augustus.hints.aa:38756
GeneMark-ETP/genemark.aa:36206
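Note that `grep -c ">"` counts protein records, that is, one per transcript, not one per gene. As a rough sketch, the snippet below counts both records and distinct gene IDs. It assumes transcript IDs of the form `g1.t1` (gene ID followed by `.t<N>`); that naming is an assumption and should be checked against the actual headers, since the GeneMark file may use a different scheme, in which case the gene count here would just equal the record count. The function name `count_proteins_and_genes` is made up for this example.

```python
import re

def count_proteins_and_genes(fasta_text):
    """Return (number of protein records, number of distinct gene IDs).

    Assumption: header IDs look like 'g12.t2'; the trailing '.t<N>' is
    stripped to obtain the gene ID. Other naming schemes are left untouched.
    """
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    genes = {re.sub(r"\.t\d+$", "", seq_id) for seq_id in ids}
    return len(ids), len(genes)

# Tiny made-up example: two isoforms of g1 and one isoform of g2.
example = ">g1.t1\nMKT\n>g1.t2\nMKS\n>g2.t1\nMAA\n"
print(count_proteins_and_genes(example))
```

On a real file, one would pass in the file contents, for example `count_proteins_and_genes(open("braker.aa").read())`, and compare genes rather than records across the three outputs.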
There are not many genomes from closely related species, but as an example, Nematostella vectensis has ~19 000 protein-coding genes and ~38 000 genes, and its annotation was produced with the NCBI Eukaryotic Genome Annotation Pipeline.
I am actually not sure the number of genes is too low. I was just wondering whether the difference in the number of sequences among braker.aa, genemark.aa and augustus.hints.aa is something to be concerned about, especially since the BRAKER GTF file is considerably smaller and braker.aa contains far fewer proteins than the AUGUSTUS and GeneMark files.
-rw-r--r-- 1 adejode bmtitus 89M 14 oct. 11:31 GeneMark-ETP/genemark.gtf
-rw-r--r-- 1 adejode bmtitus 67M 14 oct. 16:33 Augustus/augustus.hints.gtf
-rw-r--r-- 1 adejode bmtitus 46M 14 oct. 16:34 braker.gtf
Great, thanks for your insights on this!