Nitfix Project Annotation

Documentation for the pipeline used to annotate genes in long-read genomes assembled as part of the NSF Nitfix project. I used two annotation pipelines: MAKER (Cantarel et al. 2008, Genome Research) and BRAKER2 (Bruna et al. 2021, NAR Genomics and Bioinformatics).

Documentation and Tutorials

I relied heavily on the following resources:

Genome Assemblies

DNA from the following taxa was extracted and sequenced on Nanopore PromethION (long-read) and Illumina (short-read) platforms by Chris Dervinish. The genomes were assembled with HASLR by Neeka Sewnath before being passed to me for annotation.

Ceanothus americanus
Coriaria nepalensis
Corynocarpus laevigatus
Euonymus americanus
Frangula americana
Gonopterodendron arborea (labeled as Bulnesia arborea in files)
Myrica cerifera
Physocarpus capitata
Physocarpus opulifolius

MAKER uses transcript and protein data; its configuration lets the user provide transcripts (ESTs) and proteins from the genome taxon and from related taxa. The mode of BRAKER2 I ran uses only protein data, which can come from taxa that are distantly related to the genome taxon.

Reference Sequences

Transcript assemblies from the 1000 Plant Transcriptomes Project (1KP; One Thousand Plant Transcriptomes Initiative 2019, Carpenter et al. 2019) were used as input RNA data for MAKER. Transcriptomes were chosen based on taxonomic distance to the genome taxon and ploidy level. If a transcriptome for the genome taxon was available, it was chosen. Otherwise, I selected the most closely related taxon with an n value similar to that of the genome taxon, based on the Chromosome Counts Database (Rice et al. 2014). In addition, for a few genome taxa, publicly available genomes or transcriptomes existed that were taxonomically closer than the 1KP data. In those cases, both the 1KP and the other published transcript data were used.

Genome Taxon	Transcriptome Taxon	1KP Code	Other transcript source
Ceanothus americanus	Frangula caroliniana, Ceanothus thyrsiflorus	WVEF	Salgado et al. 2018
Coriaria nepalensis	Coriaria nepalensis	NNGU	NA
Corynocarpus laevigatus	Coriaria nepalensis, Datisca glomerata	NNGU	Salgado et al. 2018
Euonymus americanus	Crossopetalum rhacoma, Tripterygium wilfordii	IHCQ	Tu et al. 2020
Frangula americana	Frangula caroliniana	WVEF	NA
Gonopterodendron arborea	Tribulus eichleriana	KVAY	NA
Myrica cerifera	Myrica cerifera	INSP	NA
Physocarpus capitata	Physocarpus opulifolius	SXCE	NA
Physocarpus opulifolius	Physocarpus opulifolius	SXCE	NA
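
The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: the `Candidate` record, the integer `distance` rank, and the `max_n_diff` tolerance for "similar n" are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    taxon: str
    distance: int        # hypothetical rank: smaller = more closely related
    n: Optional[int]     # chromosome count, None if not recorded


def choose_transcriptome(
    genome_taxon: str,
    genome_n: int,
    candidates: List[Candidate],
    max_n_diff: int = 2,  # assumed tolerance for a "similar" n value
) -> Optional[Candidate]:
    # Rule 1: a transcriptome from the genome taxon itself always wins.
    for cand in candidates:
        if cand.taxon == genome_taxon:
            return cand
    # Rule 2: otherwise keep candidates with a similar n value ...
    similar = [
        c for c in candidates
        if c.n is not None and abs(c.n - genome_n) <= max_n_diff
    ]
    if not similar:
        return None
    # ... and pick the most closely related one (ties broken by n difference).
    return min(similar, key=lambda c: (c.distance, abs(c.n - genome_n)))
```

In practice the choice was made by hand from the 1KP sample list and the Chromosome Counts Database; the sketch only makes the order of the two criteria explicit.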

The same protein data were used as input for all annotation runs. These were the translated CDS from the Medicago truncatula, Arachis hypogaea, and Glycine max genome assemblies, with the longest isoform selected for each gene. The dataset was received from Sara Knaack, who curated it for use in a parallel project.
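
For readers who want to build a comparable protein set, longest-isoform selection can be sketched as below. This is not the code used to curate the dataset. It assumes, hypothetically, that isoforms of one gene share an identifier up to the last `.` (for example `gene1.1` and `gene1.2`); real genome releases use different naming schemes, so `gene_of` would need to be adapted.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence ID, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            header, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoforms(
    records: Iterable[Tuple[str, str]],
    # Assumed naming scheme: "<gene>.<isoform number>".
    gene_of: Callable[[str], str] = lambda seq_id: seq_id.rsplit(".", 1)[0],
) -> Dict[str, Tuple[str, str]]:
    """Map each gene to the (ID, sequence) of its longest isoform."""
    best: Dict[str, Tuple[str, str]] = {}
    for seq_id, seq in records:
        gene = gene_of(seq_id)
        # The first isoform seen wins ties in length.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (seq_id, seq)
    return best
```

Passing an open file handle to `read_fasta` and the result to `longest_isoforms` gives one protein per gene, which can then be written back out as FASTA.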

Results

BUSCO scores of annotated gene models for both MAKER and BRAKER are listed in the file sample_annotation_busco.xlsx. Genomes and annotations have been uploaded to CoGe for browsing. For now they are private, but I can grant access on request; contact me at kasey.pham@ufl.edu.
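
To collect scores like these from individual runs, the one-line notation that BUSCO prints in its short summary file can be parsed with a regular expression. The sketch below assumes the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout; the function name is hypothetical, and this is not how the spreadsheet was necessarily produced.

```python
import re
from typing import Dict

# Matches the compact BUSCO notation, e.g. "C:..%[S:..%,D:..%],F:..%,M:..%,n:..".
_BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text: str) -> Dict[str, float]:
    """Return complete (C), single (S), duplicated (D), fragmented (F),
    and missing (M) percentages plus the number of BUSCOs searched (n)."""
    match = _BUSCO_RE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    scores: Dict[str, float] = {
        key: float(value) for key, value in match.groupdict().items()
    }
    scores["n"] = int(match.group("n"))
    return scores
```

Calling `parse_busco_summary(open(path).read())` on each run's short summary yields one row per annotation that can be pasted into a table.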

Pipeline Overview

More detail on running this pipeline can be found in the pipeline directory README.

Repeat Library Construction

This step must be done first, before running MAKER or BRAKER.

01-assembstats.sh: Get statistics on genome assembly.
02-repeat01.job: Model repeats de novo using RepeatModeler.
03-repeat02.job: Identify plant repeats from Repbase.
04-repeat03.job: Mask de novo and plant repeats using RepeatMasker.
05-consreps.sh: Consolidate de novo and plant repeats.
06-procreps.job: Process repeats using RepeatMasker tools.
07-compreps.job: Generate GFF files with repeats for annotation tools.
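
As an illustration of the kind of summary the first step above produces, here is a minimal Python sketch that computes contig count, total length, and N50 from a list of contig lengths. It is an assumption that these are the statistics of interest; 01-assembstats.sh may report different ones or use a dedicated tool.

```python
from typing import Dict, List


def assembly_stats(lengths: List[int]) -> Dict[str, int]:
    """Basic assembly statistics from contig lengths in base pairs."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    # N50: length of the contig at which the running sum of lengths,
    # taken from longest to shortest, first reaches half the total.
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "longest": ordered[0],
        "n50": n50,
    }
```

Contig lengths can be taken from any FASTA index or parser; the function itself does not read files.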

Run `MAKER`

This step can be run before, after, or concurrently with BRAKER.

01-maker01.job: Run MAKER round 1 using annotated genome, transcriptomes, proteomes, and repeat library as input.
02-process_maker.job: Process and format MAKER round 1 output to prepare for running SNAP.
03-snap.job: Get SNAP models.
04-train_snap.job: Generate training sets from SNAP models.
05-export_fasta.sh: Process MAKER output for running AUGUSTUS.
06-busco_aug.job: Run AUGUSTUS.
07-process_busco.sh: Add AUGUSTUS output to reference species in config folder.
08-maker02_prep.sh: Process AUGUSTUS output for MAKER round 2.
09-maker02.job: Run MAKER round 2.
10-process_maker02.job: Reformat MAKER round 2 output and get annotation statistics.
11-busco.job: Run BUSCO on annotated gene models.
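
A quick sanity check on the GFF3 produced by either MAKER round is to count features by type. The sketch below is a generic GFF3 reader, not part of the scripts listed above, and the function name is hypothetical. It stops at a `##FASTA` directive because GFF3 files may embed sequences after it.

```python
from collections import Counter
from typing import Iterable


def count_features(gff_lines: Iterable[str]) -> Counter:
    """Count GFF3 records by feature type (column 3)."""
    counts: Counter = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and other directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GFF3 record
        counts[cols[2]] += 1
    return counts
```

Comparing `count_features(...)["gene"]` between round 1 and round 2 gives a rough view of how retraining SNAP and AUGUSTUS changed the number of gene models.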

Run `BRAKER`

This step can be run before, after, or concurrently with MAKER.

01-mask_genome.sh: Mask repeats in genome based on repeat library generated.
02-genemark_es.job: Run GeneMark-ES to identify gene models.
03-prothint.job: Run ProtHint to generate hints file based on GeneMark output to provide to BRAKER.
04-braker_prep.sh: Prepare for running BRAKER by creating symlinks to the configuration directory and BRAKER scripts in local directory.
05-braker.job: Run BRAKER.
06-get_braker_fas.sh: Process BRAKER output and get FASTA file of gene models.
07-busco.job: Run BUSCO on BRAKER output gene models.
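
The masking in the first step above can be pictured with a small Python sketch. It assumes soft-masking (repeat bases written in lowercase) and 1-based, inclusive repeat coordinates as in GFF. The actual 01-mask_genome.sh most likely relies on RepeatMasker or a similar tool rather than code like this.

```python
from typing import Iterable, Tuple


def soft_mask(seq: str, repeats: Iterable[Tuple[int, int]]) -> str:
    """Lowercase the bases of `seq` that fall inside repeat intervals.

    Intervals are 1-based and inclusive; coordinates beyond the sequence
    ends are clipped. Bases outside the intervals are left unchanged.
    """
    bases = list(seq)
    for start, end in repeats:
        lo = max(start, 1) - 1      # convert to 0-based, clip at the start
        hi = min(end, len(bases))   # clip at the sequence end
        for i in range(lo, hi):
            bases[i] = bases[i].lower()
    return "".join(bases)
```

For example, `soft_mask("ACGTACGTAC", [(3, 5)])` lowercases the third through fifth bases and leaves the rest untouched.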

kaseykhanhpham/nitfix-annotation