Documentation for the pipeline used to annotate genes for long-read genomes assembled as part of the NSF Nitfix project. I used two different annotation pipelines: MAKER (Cantarel et al. 2008 Genome Research) and BRAKER2 (Bruna et al. 2021 NAR Genomics and Bioinformatics).
I relied heavily on the following resources:
- Daren Card's annotation pipeline for the boa constrictor genome
- The MAKER tutorial for WGS Assembly and Annotation Winter School 2018
- BRAKER2 documentation
The following taxa were extracted and sequenced on Nanopore PromethION (long read) and Illumina (short read) platforms by Chris Dervinish and assembled in HASLR by Neeka Sewnath before being passed to me for annotation.
- Ceanothus americanus
- Coriaria nepalensis
- Corynocarpus laevigatus
- Euonymus americanus
- Frangula americana
- Gonopterodendron arborea (labeled as Bulnesia arborea in files)
- Myrica cerifera
- Physocarpus capitata
- Physocarpus opulifolius
MAKER uses transcript and protein data; its configuration allows the user to provide transcripts (ESTs) and proteins from the genome taxon and from related taxa. The mode of BRAKER2 I ran uses only protein data; it accepts proteins from taxa that are distantly related to the genome taxon.
Transcript assemblies from the 1000 Plant Transcriptomes Project (One Thousand Plant Transcriptomes Initiative 2019, Carpenter et al. 2019) were used as input RNA data for MAKER. Transcriptomes were chosen based on taxonomic distance to the genome taxon and ploidy level. If a transcriptome for the genome taxon was available, that was chosen. If one wasn't available, the most closely related taxon whose n value in the Chromosome Counts Database (Rice et al. 2014) was similar to that of the genome taxon was selected. In addition, for a few genome taxa, publicly available genomes or transcriptomes were taxonomically closer than the 1KP data. In these cases, both the 1KP and the other published transcript data were used.
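The selection rule described above can be sketched in Python. This is only an illustration of the logic, not the procedure actually used: the `Candidate` fields, the integer `distance` rank, and the `max_n_diff` threshold for a "similar" n value are all assumptions, since the text does not define how similarity was judged.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    taxon: str
    distance: int  # hypothetical rank: smaller = more closely related
    n: int         # chromosome number, e.g. from the Chromosome Counts Database


def pick_transcriptome(genome_taxon, genome_n, candidates, max_n_diff=2):
    """Return the preferred transcriptome candidate, or None if none qualifies."""
    # Rule 1: a transcriptome from the genome taxon itself always wins.
    for cand in candidates:
        if cand.taxon == genome_taxon:
            return cand
    # Rule 2: otherwise keep candidates with a similar n value
    # (max_n_diff is an assumed threshold) and take the closest relative.
    similar = [c for c in candidates if abs(c.n - genome_n) <= max_n_diff]
    if not similar:
        return None
    return min(similar, key=lambda c: c.distance)
```

With placeholder taxa, `pick_transcriptome("Species A", 12, candidates)` skips a closer relative whose n value differs strongly and returns the nearest candidate with a comparable n.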
Genome Taxon | Transcriptome Taxon | 1KP Code | Other transcript source |
---|---|---|---|
Ceanothus americanus | Frangula caroliniana, Ceanothus thyrsiflorus | WVEF | Salgado et al. 2018 |
Coriaria nepalensis | Coriaria nepalensis | NNGU | NA |
Corynocarpus laevigatus | Coriaria nepalensis, Datisca glomerata | NNGU | Salgado et al. 2018 |
Euonymus americanus | Crossopetalum rhacoma, Tripterygium wilfordii | IHCQ | Tu et al. 2020 |
Frangula americana | Frangula caroliniana | WVEF | NA |
Gonopterodendron arborea | Tribulus eichleriana | KVAY | NA |
Myrica cerifera | Myrica cerifera | INSP | NA |
Physocarpus capitata | Physocarpus opulifolius | SXCE | NA |
Physocarpus opulifolius | Physocarpus opulifolius | SXCE | NA |
The same protein data was used as input for all annotation runs. These were the translated CDS from the Medicago truncatula, Arachis hypogaea, and Glycine max genome assemblies, with the longest isoform selected for each gene. The dataset was received from Sara Knaack, who curated it for use in a parallel project.
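For readers unfamiliar with longest-isoform filtering, here is a minimal Python sketch. It is not the code used to curate the dataset. It assumes, hypothetically, that FASTA headers look like `geneID.isoformNumber` (for example `g1.2`); real headers from these assemblies may follow a different scheme and would need a different gene-ID rule.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            # Keep only the first whitespace-delimited token of the header.
            name, seq = line[1:].split()[0], []
        else:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)


def longest_isoforms(records):
    """Map each gene ID to its longest (name, sequence) record."""
    best = {}
    for name, seq in records:
        # Assumed convention: everything before the last "." is the gene ID.
        gene = name.rsplit(".", 1)[0]
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return best
```

Ties are resolved in favor of the isoform that appears first in the file.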
BUSCO scores of annotated gene models for both MAKER and BRAKER are listed in the file sample_annotation_busco.xlsx. Genomes and annotations have been uploaded to CoGe for browsing. For now they are private, but I can grant access to individuals if you contact me at kasey.pham@ufl.edu.
More detail on running this pipeline can be found in the pipeline directory README.
This step, repeat identification and masking, must be done first, before running MAKER or BRAKER.
- 01-assembstats.sh: Get statistics on genome assembly.
- 02-repeat01.job: Model repeats de novo using RepeatModeler.
- 03-repeat02.job: Identify plant repeats from Repbase.
- 04-repeat03.job: Mask de novo and plant repeats using RepeatMasker.
- 05-consreps.sh: Consolidate de novo and plant repeats.
- 06-procreps.job: Process repeats using RepeatMasker tools.
- 07-compreps.job: Generate GFF files with repeats for annotation tools.
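As an illustration of the kind of summary the first script in this list might produce, the sketch below computes contig count, total length, longest contig, and N50 from a list of contig lengths. Which statistics 01-assembstats.sh actually reports is not stated here, so the choice of metrics and the function name `assembly_stats` are assumptions.

```python
def assembly_stats(lengths):
    """Summarize an assembly from its contig lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    n50 = 0
    running = 0
    for length in ordered:
        running += length
        # N50: length of the contig at which the cumulative sum,
        # taken from longest to shortest, reaches half the assembly.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "longest": ordered[0] if ordered else 0,
        "n50": n50,
    }
```

For contigs of 2, 3, 4, 5, and 6 bp, the total is 20 bp and the N50 is 5, because the two longest contigs (6 + 5) already cover more than half the assembly.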
This step, the MAKER annotation, can be run before, after, or concurrently with BRAKER.
- 01-maker01.job: Run MAKER round 1 using annotated genome, transcriptomes, proteomes, and repeat library as input.
- 02-process_maker.job: Process and format MAKER round 1 output to prepare for running SNAP.
- 03-snap.job: Get SNAP models.
- 04-train_snap.job: Generate training sets from SNAP models.
- 05-export_fasta.sh: Process MAKER output for running AUGUSTUS.
- 06-busco_aug.job: Run AUGUSTUS.
- 07-process_busco.sh: Add AUGUSTUS output to reference species in config folder.
- 08-maker02_prep.sh: Process AUGUSTUS output for MAKER round 2.
- 09-maker02.job: Run MAKER round 2.
- 10-process_maker02.job: Reformat MAKER round 2 output and get annotation statistics.
- 11-busco.job: Run BUSCO on annotated gene models.
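The annotation statistics step can be pictured with a small Python sketch that counts `gene` features in a GFF3 file and reports their mean length. This is an assumed example of such a statistic, not the contents of 10-process_maker02.job; the function name `gene_stats` is hypothetical.

```python
def gene_stats(gff_lines):
    """Return (gene_count, mean_gene_length) from GFF3 lines."""
    lengths = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # GFF3 feature lines have nine tab-separated columns; anything else
        # (such as embedded sequence lines) is skipped.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        start, end = int(fields[3]), int(fields[4])
        lengths.append(end - start + 1)  # GFF3 coordinates are 1-based, inclusive
    count = len(lengths)
    mean = sum(lengths) / count if count else 0.0
    return count, mean
```

Passing an open file handle, for example `gene_stats(open(path))` with `path` pointing at a GFF3 file, works because the function only iterates over lines.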
This step, the BRAKER annotation, can be run before, after, or concurrently with MAKER.
- 01-mask_genome.sh: Mask repeats in genome based on repeat library generated.
- 02-genemark_es.job: Run GeneMark-ES to identify gene models.
- 03-prothint.job: Run ProtHint to generate hints file based on GeneMark output to provide to BRAKER.
- 04-braker_prep.sh: Prepare for running BRAKER by creating symlinks to the configuration directory and BRAKER scripts in local directory.
- 05-braker.job: Run BRAKER.
- 06-get_braker_fas.sh: Process BRAKER output and get FASTA file of gene models.
- 07-busco.job: Run BUSCO on BRAKER output gene models.
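To make the masking step in this list concrete, the following Python sketch soft-masks a sequence by lowercasing repeat intervals. The list does not say whether 01-mask_genome.sh soft-masks or hard-masks, so soft-masking is an assumption here, as are the function name `soft_mask` and the 1-based inclusive interval convention (chosen to match GFF coordinates).

```python
def soft_mask(seq, intervals):
    """Lowercase the bases of seq covered by 1-based, inclusive intervals."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp each interval to the sequence bounds before masking.
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)
```

For example, `soft_mask("ACGTACGTAC", [(3, 5), (9, 20)])` lowercases positions 3 to 5 and 9 to 10, with the second interval clipped at the end of the sequence.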