/nitfix-annotation

Documentation on the annotation pipeline for NSF Nitfix project long-read genome assemblies


Nitfix Project Annotation

Documentation for the pipeline used to annotate genes in the long-read genomes assembled as part of the NSF Nitfix project. I used two different annotation pipelines: MAKER (Cantarel et al. 2008 Genome Research) and BRAKER2 (Bruna et al. 2021 NAR Genomics and Bioinformatics).

Documentation and Tutorials

I relied heavily on the following resources:

Genome Assemblies

The following taxa were extracted and sequenced on Nanopore PromethION (long read) and Illumina (short read) platforms by Chris Dervinish and assembled in HASLR by Neeka Sewnath before being passed to me for annotation.

  • Ceanothus americanus
  • Coriaria nepalensis
  • Corynocarpus laevigatus
  • Euonymus americanus
  • Frangula americana
  • Gonopterodendron arborea (labeled as Bulnesia arborea in files)
  • Myrica cerifera
  • Physocarpus capitata
  • Physocarpus opulifolius

MAKER uses transcript and protein data; its configuration lets the user provide transcripts (ESTs) and proteins from the genome taxon and from related taxa. The BRAKER2 mode I ran uses only protein data, and it accepts proteins from taxa that are distantly related to the genome taxon.
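To make the MAKER evidence inputs concrete, the sketch below renders the relevant lines of a `maker_opts.ctl` file. The keys `genome`, `est`, `altest`, and `protein` are standard MAKER options (`est` for transcripts from the genome taxon, `altest` for transcripts from related taxa); the function name and all file names are hypothetical and are not taken from this project's actual configuration.

```python
def maker_evidence_lines(genome, same_taxon_ests=(), related_taxon_ests=(), proteins=()):
    """Render the evidence lines of a maker_opts.ctl file.

    Each option takes a comma-separated list of FASTA paths. All paths passed
    in are placeholders chosen by the caller.
    """
    return [
        f"genome={genome}",
        # Transcripts from the genome taxon itself.
        f"est={','.join(same_taxon_ests)}",
        # Transcripts from related taxa.
        f"altest={','.join(related_taxon_ests)}",
        # Protein evidence; may come from several organisms.
        f"protein={','.join(proteins)}",
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for line in maker_evidence_lines(
        "assembly.fasta",
        related_taxon_ests=["related_transcripts.fasta"],
        proteins=["proteins.fasta"],
    ):
        print(line)
```

When no transcriptome exists for the genome taxon, `est` is simply left empty and the related-taxon transcripts go under `altest`.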

Reference Sequences

Transcript assemblies from the 1000 Plant Transcriptomes Project (One Thousand Plant Transcriptomes Initiative 2019, Carpenter et al. 2019) were used as input RNA data for MAKER. Transcriptomes were chosen based on taxonomic distance to the genome taxon and ploidy level. If a transcriptome for the genome taxon was available, that was chosen. If one wasn't available, the most closely related taxon with an n value similar to that of the genome taxon, based on the Chromosome Counts Database (Rice et al. 2014), was selected. In addition, for a few genome taxa, publicly available genomes/transcriptomes existed that were taxonomically closer than the 1KP data. In these cases, both the 1KP and the other published transcript data were used.
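The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not code from the pipeline: the selection was done by hand, and the `Candidate` fields, the `distance` ranking, and the `max_n_diff` threshold for what counts as a "similar" n value are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate transcriptome; all field names are hypothetical."""
    taxon: str
    distance: int  # taxonomic distance rank to the genome taxon (smaller = closer)
    n: int         # haploid chromosome number, e.g. from the Chromosome Counts Database


def pick_transcriptome(genome_taxon, genome_n, candidates, max_n_diff=2):
    """Return the candidate to use as transcript evidence, or None if none qualify."""
    # Rule 1: a transcriptome from the genome taxon itself always wins.
    for cand in candidates:
        if cand.taxon == genome_taxon:
            return cand
    # Rule 2: otherwise keep only candidates with a similar n value
    # (max_n_diff is an assumed threshold, not one stated in the source).
    similar = [c for c in candidates if abs(c.n - genome_n) <= max_n_diff]
    # Take the most closely related one; ties go to the smaller n difference.
    return min(similar, key=lambda c: (c.distance, abs(c.n - genome_n)), default=None)
```

Taxa that also have a closer published genome or transcriptome would simply contribute a second evidence file alongside the one returned here.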

| Genome Taxon | Transcriptome Taxon | 1KP Code | Other transcript source |
| --- | --- | --- | --- |
| Ceanothus americanus | Frangula caroliniana, Ceanothus thyrsiflorus | WVEF | Salgado et al. 2018 |
| Coriaria nepalensis | Coriaria nepalensis | NNGU | NA |
| Corynocarpus laevigatus | Coriaria nepalensis, Datisca glomerata | NNGU | Salgado et al. 2018 |
| Euonymus americanus | Crossopetalum rhacoma, Tripterygium wilfordii | IHCQ | Tu et al. 2020 |
| Frangula americana | Frangula caroliniana | WVEF | NA |
| Gonopterodendron arborea | Tribulus eichleriana | KVAY | NA |
| Myrica cerifera | Myrica cerifera | INSP | NA |
| Physocarpus capitata | Physocarpus opulifolius | SXCE | NA |
| Physocarpus opulifolius | Physocarpus opulifolius | SXCE | NA |

The same protein data was used as input for all annotation runs. These were the translated CDS from the Medicago truncatula, Arachis hypogaea, and Glycine max genome assemblies, with the longest isoform selected for each gene. The dataset was received from Sara Knaack, who curated it for use in a parallel project.
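For readers unfamiliar with longest-isoform filtering, here is a minimal sketch of the idea. The protein dataset was curated elsewhere, so this is not the method actually used; in particular, the rule that a gene ID is the transcript ID with its final `.N` suffix removed is an assumption, and real annotation releases often need a GFF lookup instead.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Hypothetical ID rule: 'gene1.2 some description' -> 'gene1'."""
    transcript_id = header.split()[0]
    return transcript_id.rsplit(".", 1)[0]


def longest_isoforms(records):
    """Map each gene ID to its longest (header, sequence) record."""
    best = {}
    for header, seq in records:
        gid = gene_id(header)
        # Strict '>' keeps the first record seen on ties, so output is deterministic.
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (header, seq)
    return best
```

Typical use would be `longest_isoforms(read_fasta(open(path)))`, where `path` points at a protein FASTA file.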

Results

BUSCO scores of annotated gene models for both MAKER and BRAKER are listed in the file sample_annotation_busco.xlsx. Genomes and annotations have been uploaded to CoGe for browsing. For now they are private, but I can grant access on request; contact me at kasey.pham@ufl.edu.

Pipeline Overview

More detail on running this pipeline can be found in the pipeline directory README.

Repeat Library Construction

This step must be done first, before running MAKER or BRAKER.

  1. 01-assembstats.sh: Get statistics on genome assembly.
  2. 02-repeat01.job: Model repeats de novo using RepeatModeler.
  3. 03-repeat02.job: Identify plant repeats from Repbase.
  4. 04-repeat03.job: Mask de novo and plant repeats using RepeatMasker.
  5. 05-consreps.sh: Consolidate de novo and plant repeats.
  6. 06-procreps.job: Process repeats using RepeatMasker tools.
  7. 07-compreps.job: Generate GFF files with repeats for annotation tools.
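As an example of the kind of statistic step 1 reports, the sketch below computes N50 from a list of contig lengths. Which statistics `01-assembstats.sh` actually produces is not stated here, so treat this as an illustration of a common assembly metric rather than a description of that script.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    lengths = list(lengths)
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # N50 is the length of the contig at which the running sum,
        # taken largest first, reaches at least half the assembly size.
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(lengths):
    """Summarize an assembly from its contig lengths."""
    lengths = list(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

For instance, `n50([2, 3, 4, 5, 6])` returns 5, because the two largest contigs (6 + 5 = 11) already cover half of the 20 bp total.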

Run MAKER

This step can be run before, after, or concurrently with BRAKER.

  1. 01-maker01.job: Run MAKER round 1 using the genome assembly, transcriptomes, proteomes, and repeat library as input.
  2. 02-process_maker.job: Process and format MAKER round 1 output to prepare for running SNAP.
  3. 03-snap.job: Get SNAP models.
  4. 04-train_snap.job: Generate training sets from SNAP models.
  5. 05-export_fasta.sh: Process MAKER output for running AUGUSTUS.
  6. 06-busco_aug.job: Run AUGUSTUS.
  7. 07-process_busco.sh: Add AUGUSTUS output to reference species in config folder.
  8. 08-maker02_prep.sh: Process AUGUSTUS output for MAKER round 2.
  9. 09-maker02.job: Run MAKER round 2.
  10. 10-process_maker02.job: Reformat MAKER round 2 output and get annotation statistics.
  11. 11-busco.job: Run BUSCO on annotated gene models.

Run BRAKER

This step can be run before, after, or concurrently with MAKER.

  1. 01-mask_genome.sh: Mask repeats in genome based on repeat library generated.
  2. 02-genemark_es.job: Run GeneMark-ES to identify gene models.
  3. 03-prothint.job: Run ProtHint to generate hints file based on GeneMark output to provide to BRAKER.
  4. 04-braker_prep.sh: Prepare for running BRAKER by creating symlinks to the configuration directory and BRAKER scripts in local directory.
  5. 05-braker.job: Run BRAKER.
  6. 06-get_braker_fas.sh: Process BRAKER output and get FASTA file of gene models.
  7. 07-busco.job: Run BUSCO on BRAKER output gene models.
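Both the MAKER and BRAKER branches end with a BUSCO run, so comparing them comes down to reading the one-line score string in each BUSCO short summary. The sketch below parses that string. It assumes the common `C:..%[S:..%,D:..%],F:..%,M:..%,n:...` layout; the function name is hypothetical, and this is not the code used to build sample_annotation_busco.xlsx.

```python
import re

# Matches a BUSCO score string such as "C:90.5%[S:85.0%,D:5.5%],F:4.0%,M:5.5%,n:1614".
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Return BUSCO percentages (C, S, D, F, M) as floats and n as an int."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError(f"no BUSCO score string found in: {line!r}")
    scores = {key: float(val) for key, val in match.groupdict().items() if key != "n"}
    scores["n"] = int(match["n"])
    return scores
```

Parsing the summaries from `11-busco.job` and `07-busco.job` for the same genome gives two dictionaries whose `C` values can be compared directly.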