TOGA: exon-aware liftover

In cases where there is a reasonably high-quality annotation for a species closely related to the genome you wish to annotate, transferring or "lifting over" gene models from that high quality annotation may be a useful approach. Such methods begin with a whole-genome alignment (WGA), which serves as the basis of mapping the genomic coordinates of the source annotation to analogous coordinates in the target genome, i.e. the one you wish to annotate. TOGA is an exon-aware liftover approach that is also able to classify genes in the target species as orthologs, paralogs or processed pseudogenes.

In our TOGA workflow, we begin with a WGA created with Cactus, following our recommended best practice documented here. Running TOGA requires that a genome with a high quality anotation is part of the WGA, and that a number of files be generated. A potentially confusing part of the workflow is naming conventions for the reference annotation species and the species to which annotations are being tranferred ... the confusion being that they vary among the tools in the workflow! Thus, we are explicit about what gets called what in various steps in the workflow.

1. Create bed file of target genome chromosome lengths

In this step, the "target" genome is that to which we are lifting over annotations from a reference (source). We use a a python script WriteChromLengthBedFromFasta.py to create this file. This script uses a fasta parser from the Biopython python package. We highly recommend using an Anaconda or comparable python distribution that allows you to create conda environments. Assuming this version of python is in your path, you can create a biopython conda environment as follows:

conda create -n biopython -c anaconda biopython

Then to activate the environment:

source activate biopython

From here, you can then run the script simply by supplying the species name and full path to the genome fasta as a command line argument:

python WriteChromLengthBedFromFasta.py bigfoot /PATH/TO/bigfoot_v1_genome.fasta

2. Create CDS-only gff3 file for the reference genome

Create a conda environment for gffread and extract CDS entries

conda create -n gffread -c bioconda gffread
source activate gffread
genome=human_genomic.fasta
gff3=human.gff3
gffread $gff3 -g $genome -F -C -o human_CDSonly.gff3
conda deactivate

3. Create CDS isoforms table for reference

One has the option of providing isoform information to TOGA. If one doesn't, TOGA treats every isoform as a separate gene. To generate the isoform table, with gene in first column and isoform in the second, one can use a simple awk cmd:

awk -F"\t" '$3=="mRNA"{print $9}' human_CDSonly.gff3 |awk -F";" '{print $3"\t"$1}' |sed 's/gene_name=//g' | sed 's/ID=//g' > human_CDS_isoforms.tsv

This awk command works for CDS gff3 files extracted from NCBI gff3 files. Slight changes may be required if the reference annotation is generated by a different repository or workflow.

4. Create CDS annotation bed file

This CDS file is used by TOGA to only consider protein-coding genes and isoforms. This requires three steps. One needs to convert the original genome gff3 to bed before filtering out non-CDS entries, as the conversion of gff3 to genepred format will fail with a gffread-generated CDS-only gff file as input. While we demonstrate how to create separate conda environments for the bioinformatics tool required for each step, in principle all tools can be installed to a common conda environment that you use for all three steps.

4a. Convert genomic gff3 to GenePred format

conda create -n gff3ToGenePred -c bioconda ucsc-gff3togenepred
source activate gff3ToGenePred
gff3ToGenePred human.gff3 human.GenePred
conda deactivate

4b. Convert GenePred to bed

conda create -n genePredToBed -c bioconda ucsc-genepredtobed
source activate genePredToBed
genePredToBed human.GenePred human.bed
conda deactivate

4c. Create CDS-only annotation bed file

To do this, we have created a python script that takes the CDS isoform table and the reference annotation bed file as the first and second command line arguments. FilterAnnotationBedForCDS.py can be run as follows:

python FilterAnnotationBedForCDS.py human_CDS_isoforms.tsv human.bed

5. Create 2bit files for both genomes

In our example, we will be lifting over annotations from a high quality human reference to bigfoot, presumably a reasonably close relative.

conda create -n faToTwoBit -c bioconda ucsc-fatotwobit
source activate faToTwoBit
faToTwoBit human_genomic.fasta human.2bit
faToTwoBit bigfoot_v1_genome.fasta bigfoot.2bit
conda deactivate

6. Extract psl file from cactus hal (WGA) file

In order to create a chain file that TOGA requires, We first need to extract a psl format file describing the overlaps of the genome we wish to annotate to the reference genome. Following UCSC's terminology regarding chains and nets, the annotated reference genome is considered the "target" and the genome to be aligned to it (and eventually annotated) is the query. We use halLiftover to extract the pairwise alignment of query to target genome. Once we perform the lift over, TOGA will use the chain file to transfer annotations back to the query. We execute halLiftover in a singularity container built for cactus. Details on how to build your own cactus image will be available soon here. The commands below are written as if they were embedded in a shell script with numbered command line arguments, i.e. $1 is the first cmd line argument, $2 is the second, etc.

#!/bin/bash

CACTUS_IMAGE=/PATH/TO/cactus.sif

halfile=$1
querygenome=$2 # name of source genome in hal file
queryebed=$3 # chrom length bed file of source genome
targetgenome=$4 # target is the reference genome to which one is lifting
pslout=$5

singularity exec --cleanenv ${CACTUS_IMAGE} halLiftover --outPSL $halfile $querygenome $querybed $targetgenome $pslout

7. Force alignments in psl file to positive strand

This step is implemented with pslPosTarget, and can be run in a script, after creating the conda environment:

conda create -n pslPosTarget -c bioconda ucsc-pslpostarget

Then, run the script, the contents of which would look like this:

#!/bin/bash
source activate pslPosTarget
inpsl=$1
outpsl=$2
pslPosTarget $1 $2
conda deactivate

8. Create chain file

The final step before running TOGA is to create a chain file using axtChain. As with halLiftover, the "target" is the reference genome from which one wants ot transfer annotations to the "source", the genome to be annotated. Create a conda environment:

conda create -n axtChain -c ucsc-axtchain axtChain

Then run a script, the contents of which are as follows:

#!/bin/bash

source activate axtChain
pos_target_psl=$1
target2bit=$2
query2bit=$3
chainout=$4
axtChain -psl -verbose=0 -linearGap=loose $pos_target_psl $target2bit $source2bit $chainout
conda deactivate

The linearGap setting relates to penalties for gap opening and extension, with "loose" being preferred, as it leads to longer chains, with which TOGA performs better.

9. Run TOGA

TOGA is built with python code. Furthermore, TOGA uses a Nextflow as a workflow management system that submits jobs to your HPC cluster and tracks their status, so you will need to make sure the Nextflow config files are formatted properly. It does not seem that TOGA can be run without HPC infrastructure. To setup the TOGA conda environment

clone the github repository so you all have all the relevant TOGA scripts and functions
build a clean python conda environment using the latest version TOGA was tested with (3.7.3), then
install the python modules TOGA requires following instructions on the TOGA github repsitory, then
run the configure.sh script, and finally
test the install with run_test.sh

Noe that TOGA uses a Nextflow as a workflow management system that submits jobs to your HPC cluster and tracks their status, so you will need to make sure the Nextflow config files are formatted properly. It does not seem that TOGA can be run without HPC infrastructure. To build the conda environment, cd to your TOGA directory, then do:

conda create -n TOGA python=3.7.3
source activate TOGA
conda install -c bioconda nextflow
python3 -m pip install -r requirements.txt --user
# run configure script
./configure.sh
# once you have all the nextflow config files set up properly, run the test
./run_test micro

Be sure to consult the TOGA github repository for further details regarding how to confirm the test was successful.

Finally, to run TOGA, a script should be run that looks like this

#!/bin/bash chainfile=$1 targetCDSbed=$2 # remember, the target is the annotated reference genome from which annotations will be transferred to your query genome target2bit=$3 query2bit=$4 outname=$5 target_isoforms=$6 # tsv file with CDS gene and transcript id as columns

/PATH/TO/TOGA/toga.py $chainfile $targetCDSbed $target2bit $query2bit --kt --pn $outname -i $targert_isoforms --nc /PATH/TO/TOGA/NEXTFLOW_CONFIG_FILES_DIRECTORY --cb 10,100 --cjn 750

yuzhenpeng/GenomeAnnotation-TOGA