Gaius-Augustus/BRAKER

--busco_lineage helps with busco scores but not with protein number

aureliendejode opened this issue · 4 comments

Hello,
I used BRAKER3 with default parameters to annotate 3 anemone genomes. The BUSCO scores of the predicted proteins were lower than those of the genome assemblies, so I ran BRAKER3 again with the --busco_lineage option, which solved that issue.
However, there is still a big difference in the number of proteins among the braker.aa, genemark.aa and augustus.hints.aa files.
Is this something that needs to be fixed? (I have started running OMArk on braker.aa and the results look fine to me.)
If so, it seems to me this might come from TSEBRA. Is there perhaps a way to run TSEBRA differently?

Here are the stats for the 2 braker runs:

#### Before using --busco_lineage

# BUSCO version is: 5.6.1 
# The lineage dataset is: eukaryota_odb10 (Creation date: 2024-01-08, number of genomes: 70, number of BUSCOs: 255)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker/braker.aa
# BUSCO was run in mode: proteins

	***** Results: *****

	C:92.5%[S:83.1%,D:9.4%],F:1.6%,M:5.9%,n:255	   
	236	Complete BUSCOs (C)			   
	212	Complete and single-copy BUSCOs (S)	   
	24	Complete and duplicated BUSCOs (D)	   
	4	Fragmented BUSCOs (F)			   
	15	Missing BUSCOs (M)			   
	255	Total BUSCO groups searched		


-rw-r--r-- 1 adejode bmtitus 18M 14 oct.  16:33 Augustus/augustus.hints.aa
-rw-r--r-- 1 adejode bmtitus 10M 14 oct.  16:35 braker.aa
-rw-r--r-- 1 adejode bmtitus 19M 15 oct.  14:19 GeneMark-ETP/genemark.aa


grep -c ">" braker.aa Augustus/augustus.hints.aa GeneMark-ETP/genemark.aa 
braker.aa:19761
Augustus/augustus.hints.aa:38767
GeneMark-ETP/genemark.aa:36201

# BUSCO version is: 5.6.1 
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker/braker.aa
# BUSCO was run in mode: proteins

	***** Results: *****

	C:91.7%[S:83.6%,D:8.1%],F:1.3%,M:7.0%,n:954	   
	875	Complete BUSCOs (C)			   
	798	Complete and single-copy BUSCOs (S)	   
	77	Complete and duplicated BUSCOs (D)	   
	12	Fragmented BUSCOs (F)			   
	67	Missing BUSCOs (M)			   
	954	Total BUSCO groups searched		   


   

#### After using --busco_lineage

# BUSCO version is: 5.6.1 
# The lineage dataset is: eukaryota_odb10 (Creation date: 2024-01-08, number of genomes: 70, number of BUSCOs: 255)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker_busco_lineage/braker/braker.aa
# BUSCO was run in mode: proteins

	***** Results: *****

	C:97.3%[S:72.2%,D:25.1%],F:0.8%,M:1.9%,n:255	   
	248	Complete BUSCOs (C)			   
	184	Complete and single-copy BUSCOs (S)	   
	64	Complete and duplicated BUSCOs (D)	   
	2	Fragmented BUSCOs (F)			   
	5	Missing BUSCOs (M)			   
	255	Total BUSCO groups searched		   

Dependencies and versions:
	hmmsearch: 3.1
	busco: 5.6.1


# BUSCO version is: 5.6.1 
# The lineage dataset is: metazoa_odb10 (Creation date: 2024-01-08, number of genomes: 65, number of BUSCOs: 954)
# Summarized benchmarking in BUSCO notation for file /grps2/bmtitus/analysis/Comparative_Genomic/Genome_assemblies/Entacmea_quesricolor/annotation_BRAKER3/braker_busco_lineage/braker/braker.aa
# BUSCO was run in mode: proteins

	***** Results: *****

	C:97.5%[S:69.5%,D:28.0%],F:0.6%,M:1.9%,n:954	   
	930	Complete BUSCOs (C)			   
	663	Complete and single-copy BUSCOs (S)	   
	267	Complete and duplicated BUSCOs (D)	   
	6	Fragmented BUSCOs (F)			   
	18	Missing BUSCOs (M)			   
	954	Total BUSCO groups searched		   


-rw-r--r-- 1 adejode bmtitus 18M 16 oct.  10:04 Augustus/augustus.hints.aa
-rw-r--r-- 1 adejode bmtitus 11M 16 oct.  10:07 braker.aa
-rw-r--r-- 1 adejode bmtitus 19M 16 oct.  10:52 GeneMark-ETP/genemark.aa


grep -c ">" braker.aa Augustus/augustus.hints.aa GeneMark-ETP/genemark.aa 
braker.aa:20454
Augustus/augustus.hints.aa:38756
GeneMark-ETP/genemark.aa:36206
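
Since `grep -c ">"` counts one record per protein (so one per transcript), alternative isoforms can inflate the totals relative to the gene count. A minimal sketch for separating the two follows. It assumes AUGUSTUS-style headers such as `>g1.t1`, where stripping the `.t<number>` suffix yields the gene ID; that may not hold for every file (genemark.aa headers in particular may differ). The function name `count_proteins_and_genes` is hypothetical.

```shell
# Count protein records and distinct gene IDs in a protein FASTA file.
# Assumption: headers look like ">g1.t1", so removing ".t<number>" gives the gene ID.
count_proteins_and_genes() {
    awk -v name="$1" '
        /^>/ {
            n++
            id = substr($1, 2)          # drop ">" and any description after the ID
            sub(/\.t[0-9]+$/, "", id)   # strip the assumed isoform suffix
            genes[id] = 1
        }
        END {
            ng = 0; for (g in genes) ng++
            printf "%s\tproteins=%d\tgenes=%d\n", name, n, ng
        }
    ' "$1"
}

# Example usage:
#   for f in braker.aa Augustus/augustus.hints.aa GeneMark-ETP/genemark.aa; do
#       count_proteins_and_genes "$f"
#   done
```

If the gene counts turn out to be much closer to each other than the protein counts, part of the gap would be explained by isoforms rather than by missing genes.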

There are not many genomes from closely related species, but as an example, Nematostella vectensis has ~19,000 protein-coding genes and ~38,000 genes, and that annotation was produced with the NCBI Eukaryotic Genome Annotation Pipeline.

I am actually not sure the number of genes is too low. I was just wondering whether the difference in the number of sequences among braker.aa, genemark.aa and augustus.hints.aa is something to be concerned about, especially since the BRAKER GTF file is considerably smaller than the AUGUSTUS and GeneMark ones and braker.aa contains far fewer proteins.

-rw-r--r-- 1 adejode bmtitus 89M 14 oct.  11:31 GeneMark-ETP/genemark.gtf
-rw-r--r-- 1 adejode bmtitus 67M 14 oct.  16:33 Augustus/augustus.hints.gtf
-rw-r--r-- 1 adejode bmtitus 46M 14 oct.  16:34 braker.gtf
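
File size alone says little about gene content, so one way to compare the three GTF files directly is to count distinct genes and transcripts in each. The sketch below is only an illustration: it assumes every CDS line carries `gene_id "..."` and `transcript_id "..."` attributes in column 9, and the function name `count_gtf_genes_transcripts` is hypothetical.

```shell
# Count distinct genes and transcripts in a GTF file from its CDS lines.
# Assumption: column 9 of each CDS line holds gene_id "..." and transcript_id "...".
count_gtf_genes_transcripts() {
    awk -F'\t' -v name="$1" '
        $3 == "CDS" {
            # 9 = length of the prefix gene_id " ; 10 also drops the closing quote
            if (match($9, /gene_id "[^"]+"/))
                genes[substr($9, RSTART + 9, RLENGTH - 10)] = 1
            # 15 = length of the prefix transcript_id " ; 16 also drops the closing quote
            if (match($9, /transcript_id "[^"]+"/))
                tx[substr($9, RSTART + 15, RLENGTH - 16)] = 1
        }
        END {
            ng = 0; for (g in genes) ng++
            nt = 0; for (t in tx) nt++
            printf "%s\tgenes=%d\ttranscripts=%d\n", name, ng, nt
        }
    ' "$1"
}

# Example usage:
#   for f in braker.gtf Augustus/augustus.hints.gtf GeneMark-ETP/genemark.gtf; do
#       count_gtf_genes_transcripts "$f"
#   done
```

Only CDS lines are read, so files that lack explicit `gene` or `transcript` feature lines are counted the same way as files that have them.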

Great, thanks for your insights on this!