/reference_processing

Process reference from NCBI for use with various RNA-Seq tools. Creates Fasta, GFF, GTF files. Currently written to work with bacteria.


Reference Processing

These scripts and workflows process NCBI annotations and create GTF, BED12, and annotation tables for use in RNA-Seq and other workflows. While they are intended to be general enough for use with various organisms, take care to examine the output and make modifications when necessary.


NCBI Pipeline - NCBISnakefile

NCBISnakefile is a Snakemake file containing rules for generating GTF and BED annotation files from NCBI GFF files.

NCBI GFF and FASTA files can be obtained from: ftp://ftp.ncbi.nlm.nih.gov/genomes/

Input:

  • {organism}.gff - GFF downloaded from NCBI

  • NCBIconfig.json - Configuration file specifying attributes to be used, etc.

Output:

  • {organism}.gtf - GTF converted using gffread from the Cufflinks suite. The gene_id attribute is set to the gene_id from the GFF file, and missing gene_id attributes are added with empty values.

    • {organism}_gffread.log - Log of warnings produced when running gffread.
  • {organism}_entrezid.gtf - GTF with gene_id replaced by Entrez GeneID.

  • {organism}.bed - BED12 formatted file generated from the GTF by UCSC Kent Utils. The name field is set to the transcript_id from the GFF/GTF files.

  • {organism}_gff_attributes.txt - Tab-delimited file of top-level GFF features, with selected attributes converted to columns and child attribute values flattened (i.e. rolled up) into the parent features. This is useful for generating lookup tables that map a Gene ID in a GFF file to some other annotation, such as product.

  • {organism}_entrezid.bed - (Not yet implemented)

Usage:

  1. Make a copy of the configuration file (NCBIconfig.json) and update it as needed

  2. Run snakemake in the directory containing the input files:

    snakemake --snakefile /path/to/NCBISnakefile


NCBI Bacterial Pipeline - AnnotationSnakefile

AnnotationSnakefile is a Snakemake file containing rules for generating bacterial annotation files from NCBI GFF files.

extract_ids_from_gff.py requires gffutils.

organism_parsers/ contains parsers for specific organisms

NCBI GFF downloads: ftp://ftp.ncbi.nlm.nih.gov/genomes/

Input:

  • {organism}.gff - GFF downloaded from NCBI

  • {organism}.fasta - Fasta sequence

  • ORGANISMS - List of organism names

Output:

  • {organism}.gtf - GTF file generated by 'gffread' from the Cufflinks suite

  • {organism}.bed - BED12 file generated by converting the GTF file

  • {organism}_parsed.gff - GFF file generated by parsing the NCBI GFF through 'gffread'. Note: GFF parsing sometimes fails with a segmentation fault.

  • {organism}_gene_ids_unique_coverage.txt - Gene ids along with the amount of each gene that is unambiguous with respect to other genes on the same strand.

  • {organism}_gene_annotations.txt - Gene annotations for each gene id
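
The idea behind the unique coverage output can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name `unique_coverage`, the tuple layout, and the choice to report a count of bases are all assumptions, and the sketch assumes every gene lies on a single sequence.

```python
def unique_coverage(genes):
    """Count, for each gene, the bases not shared with another same-strand gene.

    genes: list of (gene_id, start, end, strand) tuples using 1-based,
    inclusive coordinates (hypothetical layout, chosen for illustration).
    """
    result = {}
    for gene_id, start, end, strand in genes:
        positions = set(range(start, end + 1))
        for other_id, other_start, other_end, other_strand in genes:
            # Only other genes on the same strand make a base ambiguous.
            if other_id != gene_id and other_strand == strand:
                positions -= set(range(other_start, other_end + 1))
        result[gene_id] = len(positions)
    return result


# Made-up coordinates: geneA and geneB overlap on the + strand,
# geneC overlaps both but sits on the - strand.
example_genes = [
    ("geneA", 1, 100, "+"),
    ("geneB", 91, 150, "+"),
    ("geneC", 95, 120, "-"),
]
coverage = unique_coverage(example_genes)
for gene_id, unique_bases in coverage.items():
    print(gene_id, unique_bases)
```

The set-based approach is quadratic in the number of genes, which is usually acceptable for bacterial genomes; an interval sweep would be a better fit for larger annotations.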

Usage:

  1. Create a Snakefile using the following template, and modify as needed.

  2. Run snakemake in the directory containing the template and the input files:

    snakemake --snakefile TEMPLATE_SNAKEFILE

Template

ORGANISMS = ["escherichia_coli_k12_nc_000913_3",
             "pseudomonas_aeruginosa_pao1_nc_002516_2",
             "staphylococcus_aureus_subsp__aureus_str__newman_nc_009641_1"]

rule all:
    input: expand("{organism}.bed", organism=ORGANISMS),
           # GFF Parsing sometimes fails with segmentation fault
           #expand("{organism}_parsed.gff", organism=ORGANISMS),
           # Don't always need gene names file
           # expand("{organism}_gene_names.txt", organism=ORGANISMS),
           expand("{organism}_gene_ids_unique_coverage.txt", organism=ORGANISMS),
           expand("{organism}_gene_annotations.txt", organism=ORGANISMS)

REFPROCESSING_DIR = "/Users/lparsons/Documents/projects/reference_processing"
include: "%s/AnnotationSnakefile" % REFPROCESSING_DIR

Utilities

gff_attributes_to_tsv.py

Extract selected GFF attributes to additional columns. In addition, optionally flatten child features and roll attributes up to the parent. This is useful to generate lookup tables for GFF files to go from a Gene ID to some other annotation such as product.
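
The column extraction described above can be sketched as follows. This is a minimal illustration and not the script's actual implementation: the function names and the default attribute list are assumptions, and the sketch does not cover rolling child attributes up to the parent.

```python
def parse_gff_attributes(attr_field):
    """Parse a GFF3 attributes column such as 'ID=gene0;Name=thrL' into a dict."""
    attributes = {}
    for pair in attr_field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def gff_line_to_row(line, wanted=("ID", "locus_tag", "product")):
    """Return the nine GFF columns plus one extra column per wanted attribute.

    Attributes missing from the feature become empty strings.
    """
    columns = line.rstrip("\n").split("\t")
    attributes = parse_gff_attributes(columns[8])
    return columns + [attributes.get(name, "") for name in wanted]


# Illustrative GFF line; the values are for demonstration only.
example = "NC_000913.3\tRefSeq\tgene\t190\t255\t.\t+\t.\tID=gene0;Name=thrL;locus_tag=b0001"
row = gff_line_to_row(example)
print("\t".join(row))
```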

translate_gtf_attribute.py

Translate the value of an attribute of a GTF file using a separate lookup file. This is useful to convert from one type of gene id (e.g. NCBI GFF gene ID) to another (e.g. EntrezID).
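
A minimal sketch of this kind of translation is shown below. It is not the script's actual code: the function name, the in-memory dictionary standing in for the lookup file, and the example identifiers are all assumptions. Values without an entry in the lookup are left unchanged.

```python
import re


def translate_gtf_attribute(line, lookup, attribute="gene_id"):
    """Replace the value of one GTF attribute using a lookup dictionary."""
    pattern = re.compile(r'(\b%s ")([^"]*)(")' % re.escape(attribute))

    def replace(match):
        old_value = match.group(2)
        # Keep the original value when the lookup has no translation for it.
        return match.group(1) + lookup.get(old_value, old_value) + match.group(3)

    return pattern.sub(replace, line)


# Made-up identifiers; "12345" is a placeholder, not a real Entrez GeneID.
lookup = {"gene0": "12345"}
gtf_line = 'NC_000913.3\tRefSeq\texon\t190\t255\t.\t+\t.\ttranscript_id "rna0"; gene_id "gene0";'
translated = translate_gtf_attribute(gtf_line, lookup)
print(translated)
```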

fix_gtf.py

Add any missing gene_id attributes to GTF lines, using the transcript_id if available, otherwise, using an empty value. This is useful to post-process output of gffread from Cufflinks which leaves the gene_id off of records that are not associated with a gene (which seems reasonable, but can cause issues with downstream analysis).
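
The rule described above (use the transcript_id when present, otherwise an empty value) might look like the following sketch. The function name is hypothetical and the real script may handle edge cases differently.

```python
import re


def add_missing_gene_id(line):
    """Return a GTF line that is guaranteed to carry a gene_id attribute."""
    if line.startswith("#") or not line.strip():
        return line  # pass comments and blank lines through untouched
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return line  # not a complete GTF record; leave it alone
    attributes = columns[8].strip()
    if re.search(r'\bgene_id "', attributes):
        return line  # gene_id already present
    match = re.search(r'\btranscript_id "([^"]*)"', attributes)
    gene_id = match.group(1) if match else ""
    if attributes and not attributes.endswith(";"):
        attributes += ";"
    prefix = attributes + " " if attributes else ""
    columns[8] = prefix + 'gene_id "%s";' % gene_id
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(columns) + newline


with_transcript = add_missing_gene_id('chr\tsrc\texon\t1\t10\t.\t+\t.\ttranscript_id "rna0";')
without_transcript = add_missing_gene_id('chr\tsrc\texon\t1\t10\t.\t+\t.\texon_number "1";')
print(with_transcript)
print(without_transcript)
```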

add_seqid_to_gff_id.py

Add the sequence id to the ID (and Parent) attributes in a GFF file. This is useful when there are multiple GFF files provided by NCBI that have duplicate IDs (such as when they include a plasmid with a bacterial sequence).
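
As a rough illustration of the idea, the sketch below prefixes ID and Parent values with the sequence id from the first column. The function name and the underscore separator are assumptions; the script itself may join the values differently.

```python
def add_seqid_to_ids(line):
    """Prefix the ID and Parent attribute values of a GFF line with its seqid."""
    if line.startswith("#") or not line.strip():
        return line  # leave comments, directives, and blank lines unchanged
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return line
    seqid = columns[0]
    new_pairs = []
    for pair in columns[8].split(";"):
        key, sep, value = pair.partition("=")
        if sep and key in ("ID", "Parent"):
            # Parent may hold several comma-separated ids in GFF3.
            value = ",".join("%s_%s" % (seqid, v) for v in value.split(","))
        new_pairs.append(key + sep + value)
    columns[8] = ";".join(new_pairs)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(columns) + newline


# Illustrative CDS line with made-up attribute values.
cds_line = "NC_009641.1\tRefSeq\tCDS\t1\t10\t.\t+\t0\tID=cds0;Parent=gene0;product=x"
renamed = add_seqid_to_ids(cds_line)
print(renamed)
```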

change_gff_id.py

Change the ID (and Parent) attributes in a GFF file. This is useful when there is another suitable identifier to use (such as locus_tag). Note that not all features may have this identifier, so it might be necessary to use both add_seqid_to_gff_id.py and change_gff_id.py to ensure there are no duplicates.
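
Renaming IDs requires keeping Parent references in sync, which suggests a two-pass approach: first map each old ID to its replacement, then rewrite both ID and Parent values. The sketch below illustrates this under stated assumptions: the function names are hypothetical, comment lines are dropped for brevity, and features lacking the chosen attribute keep their original ID.

```python
def parse_pairs(attr_field):
    """Split a GFF3 attributes column into (key, '=', value) tuples."""
    return [pair.partition("=") for pair in attr_field.split(";") if pair]


def change_gff_ids(lines, new_id_attribute="locus_tag"):
    """Replace ID values with another attribute and update Parent to match."""
    records = [
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")
    ]
    # Pass 1: map each old ID to its replacement, where one is available.
    id_map = {}
    for columns in records:
        attrs = {key: value for key, _, value in parse_pairs(columns[8])}
        if "ID" in attrs and new_id_attribute in attrs:
            id_map[attrs["ID"]] = attrs[new_id_attribute]
    # Pass 2: rewrite ID and Parent values using the mapping.
    output = []
    for columns in records:
        pairs = []
        for key, sep, value in parse_pairs(columns[8]):
            if key == "ID":
                value = id_map.get(value, value)
            elif key == "Parent":
                value = ",".join(id_map.get(v, v) for v in value.split(","))
            pairs.append(key + sep + value)
        output.append("\t".join(columns[:8] + [";".join(pairs)]))
    return output


# Illustrative records: the gene carries a locus_tag, the CDS does not.
gff_lines = [
    "NC_000913.3\tRefSeq\tgene\t190\t255\t.\t+\t.\tID=gene0;locus_tag=b0001",
    "NC_000913.3\tRefSeq\tCDS\t190\t255\t.\t+\t0\tID=cds0;Parent=gene0",
]
changed = change_gff_ids(gff_lines)
print("\n".join(changed))
```

If a child feature carried the same locus_tag as its parent, both would end up with the same ID, which is the duplicate case the note above warns about.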