Reference Processing
These scripts and workflows are used to process NCBI annotations and create GTF, BED12, and annotation tables for use in RNA-Seq and other workflows. While they are intended to be general enough for use on various organisms, care should be taken to examine the output and make modifications when necessary.
NCBI Pipeline - NCBISnakefile
NCBISnakefile is a Snakemake file containing rules for generating GTF and BED annotation files from NCBI GFF files.
NCBI GFF and FASTA files can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/genomes/
Input:
- {organism}.gff - GFF downloaded from NCBI
- NCBIconfig.json - Configuration file specifying attributes to be used, etc.
Output:
- {organism}.gtf - GTF converted using gffread from the Cufflinks suite. The gene_id attribute is set to the gene_id from the GFF file, and missing gene_id attributes are added with empty values.
- {organism}_gffread.log - Log of warnings produced when running gffread.
- {organism}_entrezid.gtf - GTF with gene_id replaced by Entrez GeneID.
- {organism}.bed - BED12 formatted file generated from the GTF by UCSC Kent Utils. The name field is set to the transcript_id from the GFF/GTF files.
- {organism}_gff_attributes.txt - Tab-delimited file of top-level GFF features, with selected attributes converted to columns and child attribute values flattened (i.e. rolled up) into the parent features. This is useful for generating lookup tables that map a Gene ID in a GFF file to some other annotation, such as product.
- {organism}_entrezid.bed - (Not yet implemented)
Usage:
- Make a copy of the config and update it as needed
- Run snakemake in the directory containing the input files:
  snakemake --snakefile /path/to/NCBISnakefile
NCBI Bacterial Pipeline - AnnotationSnakefile
AnnotationSnakefile is a Snakemake file containing rules for generating bacterial annotation files from NCBI GFF files.
extract_ids_from_gff.py requires gffutils.
organism_parsers/ contains parsers for specific organisms.
NCBI GFF downloads: ftp://ftp.ncbi.nlm.nih.gov/genomes/
Input:
- {organism}.gff - GFF downloaded from NCBI
- {organism}.fasta - FASTA sequence
- ORGANISMS - List of organism names
Output:
- {organism}.gtf - GTF file generated by 'gffread' from tophat package
- {organism}.bed - BED12 file generated by converting the GTF file
- {organism}_parsed.gff - GFF file generated by parsing the NCBI GFF through 'gffread'. Note: GFF parsing sometimes fails with a segmentation fault.
- {organism}_gene_ids_unique_coverage.txt - Gene IDs along with the amount of each gene that is unambiguous with respect to other genes on the same strand.
- {organism}_gene_annotations.txt - Gene annotations for each gene ID
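As an illustration of what "unique coverage" could mean, the sketch below counts, for each gene, the bases that do not overlap any other gene on the same sequence and strand. This is only one possible reading of the description above, not the pipeline's actual code: the function name `unique_coverage`, the tuple layout, and the 1-based inclusive coordinates are all assumptions made for the example.

```python
from collections import defaultdict


def unique_coverage(genes):
    """Hypothetical helper: genes is a list of
    (gene_id, seqid, strand, start, end) with 1-based inclusive coordinates.
    Returns {gene_id: (unique_bases, gene_length)}."""
    # Only genes on the same sequence and strand can make each other ambiguous.
    by_strand = defaultdict(list)
    for gene in genes:
        by_strand[(gene[1], gene[2])].append(gene)

    result = {}
    for group in by_strand.values():
        for gene_id, _, _, start, end in group:
            # Pieces of other genes that overlap this gene, clipped to it.
            overlaps = sorted(
                (max(start, s), min(end, e))
                for other_id, _, _, s, e in group
                if other_id != gene_id and s <= end and e >= start
            )
            covered = 0
            cursor = start - 1  # last base already counted as overlapped
            for o_start, o_end in overlaps:
                if o_end > cursor:
                    covered += o_end - max(o_start, cursor + 1) + 1
                    cursor = o_end
            length = end - start + 1
            result[gene_id] = (length - covered, length)
    return result
```

The quadratic comparison within each strand keeps the sketch short; for large genomes a sorted sweep or an interval tree would be the more usual choice.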
Usage:
- Create a Snakefile using the following template, and modify it as needed.
- Run snakemake in the directory containing the template and the input files:
  snakemake --snakefile TEMPLATE_SNAKEFILE
Template
ORGANISMS = ["escherichia_coli_k12_nc_000913_3",
             "pseudomonas_aeruginosa_pao1_nc_002516_2",
             "staphylococcus_aureus_subsp__aureus_str__newman_nc_009641_1"]

rule all:
    input: expand("{organism}.bed", organism=ORGANISMS),
           # GFF parsing sometimes fails with segmentation fault
           # expand("{organism}_parsed.gff", organism=ORGANISMS),
           # Don't always need gene names file
           # expand("{organism}_gene_names.txt", organism=ORGANISMS),
           expand("{organism}_gene_ids_unique_coverage.txt", organism=ORGANISMS),
           expand("{organism}_gene_annotations.txt", organism=ORGANISMS)

REFPROCESSING_DIR = "/Users/lparsons/Documents/projects/reference_processing"
include: "%s/AnnotationSnakefile" % REFPROCESSING_DIR
Utilities
gff_attributes_to_tsv.py
Extract selected GFF attributes to additional columns. In addition, optionally flatten child features and roll attributes up to the parent. This is useful to generate lookup tables for GFF files to go from a Gene ID to some other annotation such as product.
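The idea of turning attributes into columns and rolling child values up to the parent can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the function names `parse_gff_attributes` and `roll_up` are hypothetical, and the sketch assumes well-formed GFF3 attributes where each feature has at most one Parent.

```python
from urllib.parse import unquote


def parse_gff_attributes(column9):
    """Parse the ninth GFF3 column ('key=value;key=value') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[key.strip()] = unquote(value.strip())
    return attributes


def roll_up(features, selected):
    """features: list of attribute dicts (as returned above).
    Collect the selected attributes of every feature under its
    top-level ancestor; multiple values are joined with commas."""
    parent_of = {f["ID"]: f.get("Parent") for f in features if "ID" in f}

    def top_level(feature_id):
        # Follow Parent links until a feature without a parent is reached.
        while parent_of.get(feature_id):
            feature_id = parent_of[feature_id]
        return feature_id

    rolled = {}
    for f in features:
        start = f.get("ID") or f.get("Parent")
        if start is None:
            continue
        row = rolled.setdefault(top_level(start), {name: [] for name in selected})
        for name in selected:
            if name in f and f[name] not in row[name]:
                row[name].append(f[name])
    return {
        root: {name: ",".join(values) for name, values in row.items()}
        for root, row in rolled.items()
    }
```

Each entry of the returned dict corresponds to one row of a lookup table: the top-level feature ID followed by one column per selected attribute.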
translate_gtf_attribute.py
Translate the value of an attribute of a GTF file using a separate lookup file. This is useful to convert from one type of gene id (e.g. NCBI GFF gene ID) to another (e.g. EntrezID).
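A minimal sketch of this kind of translation is shown below. It is not the script's actual code: the function name `translate_attribute` is hypothetical, the lookup is assumed to be already loaded into a dict, and values missing from the lookup are left unchanged (the real script may handle them differently).

```python
import re


def translate_attribute(gtf_line, attribute, lookup):
    """Replace the value of `attribute` in one nine-column GTF feature line
    using the `lookup` dict. Unknown values are kept as they are."""
    columns = gtf_line.rstrip("\n").split("\t")
    # Match e.g. gene_id "b0001" and capture the quoted value.
    pattern = re.compile(r'(\b%s\s+")([^"]*)(")' % re.escape(attribute))
    columns[8] = pattern.sub(
        lambda m: m.group(1) + lookup.get(m.group(2), m.group(2)) + m.group(3),
        columns[8],
    )
    return "\t".join(columns)


# Usage with a made-up ID mapping (the numeric ID is a placeholder).
line = 'chr\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "b0001"; transcript_id "rna0";'
translated = translate_attribute(line, "gene_id", {"b0001": "12345"})
```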
fix_gtf.py
Add any missing gene_id attributes to GTF lines, using the transcript_id if available, otherwise using an empty value. This is useful for post-processing the output of gffread from Cufflinks, which leaves the gene_id off of records that are not associated with a gene (which seems reasonable, but can cause issues with downstream analysis).
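The rule described above (keep an existing gene_id, otherwise fall back to the transcript_id, otherwise use an empty value) can be sketched like this. The function name `add_missing_gene_id` is hypothetical, and the sketch assumes each input is a nine-column GTF feature line rather than a comment or header line.

```python
import re


def add_missing_gene_id(gtf_line):
    """Return the GTF line with a gene_id attribute guaranteed to be present."""
    columns = gtf_line.rstrip("\n").split("\t")
    attributes = columns[8]
    if re.search(r'\bgene_id\s+"', attributes):
        return "\t".join(columns)  # already has a gene_id, nothing to do
    match = re.search(r'\btranscript_id\s+"([^"]*)"', attributes)
    gene_id = match.group(1) if match else ""  # empty value as a last resort
    columns[8] = ('gene_id "%s"; %s' % (gene_id, attributes)).strip()
    return "\t".join(columns)
```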
add_seqid_to_gff_id.py
Add the sequence id to the ID (and Parent) attributes in a GFF file. This is useful when there are multiple GFF files provided by NCBI that have duplicate IDs (such as when they include a plasmid with a bacterial sequence).
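A rough sketch of the approach, assuming the sequence ID is simply prepended to each ID and Parent value: the function name `add_seqid_to_ids` and the underscore separator are assumptions for illustration, and the actual script may format the new IDs differently.

```python
def add_seqid_to_ids(gff_line, separator="_"):
    """Prefix the ID and Parent values of one GFF feature line with its seqid."""
    columns = gff_line.rstrip("\n").split("\t")
    seqid = columns[0]
    fields = []
    for field in columns[8].split(";"):
        key, sep, value = field.partition("=")
        if key in ("ID", "Parent") and sep:
            # Parent may list several comma-separated IDs.
            value = ",".join(seqid + separator + v for v in value.split(","))
        fields.append(key + sep + value)
    columns[8] = ";".join(fields)
    return "\t".join(columns)
```

Because the prefix is applied to both ID and Parent, parent-child links within one sequence stay consistent after the rewrite.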
change_gff_id.py
Change the ID (and Parent) attributes in a GFF file. This is useful when there is another suitable identifier to use (such as locus_tag). Note that not all features may have this attribute, so it might be necessary to use both add_seqid_to_gff_id.py and change_gff_id.py to ensure no duplicates.
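One way to picture the ID change is a two-pass rewrite: first map each old ID to the value of the chosen attribute, then rewrite ID and Parent with that map. The sketch below is an assumption about the approach, not the script's code; the function name `change_ids` is hypothetical, and features that lack the attribute simply keep their old ID.

```python
def change_ids(gff_lines, attribute="locus_tag"):
    """Rewrite ID and Parent in a list of GFF feature lines, replacing each
    old ID with that feature's `attribute` value where one exists."""

    def parse(line):
        columns = line.rstrip("\n").split("\t")
        pairs = [f.partition("=") for f in columns[8].split(";") if f]
        return columns, pairs

    # First pass: old ID -> new ID for features that carry the attribute.
    new_id = {}
    for line in gff_lines:
        _, pairs = parse(line)
        attrs = {key: value for key, _, value in pairs}
        if "ID" in attrs and attribute in attrs:
            new_id[attrs["ID"]] = attrs[attribute]

    # Second pass: rewrite ID and Parent, keeping unmapped IDs unchanged.
    rewritten = []
    for line in gff_lines:
        columns, pairs = parse(line)
        fields = []
        for key, sep, value in pairs:
            if key in ("ID", "Parent"):
                value = ",".join(new_id.get(v, v) for v in value.split(","))
            fields.append(key + sep + value)
        columns[8] = ";".join(fields)
        rewritten.append("\t".join(columns))
    return rewritten
```

If a parent and its child carry the same attribute value, this sketch would give them the same ID, which is one way the duplicates mentioned above can arise.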