/splice_sim

Splice_sim: a nucleotide-conversion enabled RNA-seq simulation and evaluation framework

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

splice_sim

splice_sim is a python based RNA-seq simulation and evaluation package specialized on the simulation nucleotide-conversions and splice isoforms. We use Nextflow and containerization via Docker to wrap splice_sim into a readily executable start-to-end framework for simulating and evaluating complex experimental scenarios with mapping based processing pipelines.

Contents

Features

  • Realistic Illumina short-read simulation using ART
  • Simulation of customizable nucleotide-conversions at configurable rates
  • Simulation of isoforms at configurable splicing states per transcript
  • Mapping accuracies and performance metrics at different scopes of genomic annotation
  • Elaborate output tracks for visual inspection stratified by performance metric

Installation

splice_sim itself is a python package with several dependencies that are best simply installed in a conda environment:

conda env create -f environment.yml

Then clone the splice_sim repository and call its main method:

git clone https://github.com/popitsch/splice_sim.git
cd splice_sim
python main.py

To run our full-blown all-on-one Nextflow based workflow, you simply need to install Nextflow and Docker to have all dependencies available and be ready to go. To run splice_sim on HPC environments, most administrators prefer Singularity (apptainer) containers which can be seamlessly created from our Docker container.

Usage

splice_sim itself is the Python package and includes the simulation engine to both simulate and evaluate datasets. splice_sim provides dedicate commands for the individual logical steps outlined below, starting from creating transcript models, simulating reads from those transcripts and finally evaluating the performance of a given mapper via the produced bam file. The parameters for the highlighted relevant steps for such a process are described in the splice_sim engine.

We have also wrapped a ready-to-use out of the box workflow that executes all step from start to end into our Nextflow workflow with a description provided in the adjacent section. We provide two separate workflows: (i) splice_sim.nf that handles the entire logic contained in the Simulation Pipeline block and (ii) splice_sim_eva.nf that handles everything related to the Evaluation Pipeline block in the diagram below. To avoid supplying an extensive set of parameters to theindividual steps via the command line, we designed splice_sim to be driven by a configuration file in json format that contains all necessary inputs and parameters needed for any of splice_sim steps.

Nextflow workflow

All you need to get started with our workflows is Nextflow and Docker. Typically one would also need to tweak the resource requirements for each of the processes that are executed. We did that already for you for the small (typically expressed genes size) and large (full transcriptome) datasets simulated in our study in the provided nextflow.config.example ready to use. Both workflows need as only input the configuration file file provided by the user.

Simulation workflow splice_sim.nf

This workflow executes the following steps to obtain the simulated truth alignments and mapped read alignments of the mappers under investigation in BAM format as final output:

  • Build a transcript model with splice_sim build_model
  • Simulate reads with ART for all conditions
  • Calculate and encode truth alignments for all conditions
  • Map simulated reads with mappers under investigation for all conditions
  • Postprocess BAM files with splice_sim postfilter_bam

Evaluation workflow splice_sim_eva.nf

This workflow executes the following steps to obtain the evaluation metrics and output files from the simulated truth alignments and mapped read alignments of the mappers under investigation as final output:

  • Evaluate a given read alignment of a mapper with splice_sim evaluate
  • Extract associated metadata of the annotated features under investigation with splice_sim extract_feature_metadata
  • Bundle, package and compress the produced output tables into RDS files to futher process in R with preprocess_results.R

splice_sim engine

The core splice_sim python engine covers isolated steps of the simulation and evaluation process process as highlighted in the previous block diagram. The most important commands along the way a documented in this section.

build_model

The build_model command takes the reference and configuration provided by the user and creates the transcript model and sequence files needed that contains the composition of the transcriptome and serves and input to the read simulation step of splice_sim.

 python splice_sim/main.py build_model --help
usage: main.py [-h] -c config_file [-o outdir]

  Copyright (C) 2021 XXX.  All rights reserved.

  Distributed on an "AS IS" basis without warranties
  or conditions of any kind, either express or implied.

USAGE

optional arguments:
  -h, --help            show this help message and exit
  -c config_file, --config config_file
                        JSON config file
  -o outdir, --outdir outdir
                        output directory (default is current dir)

create_genome_bam

The create_genome_bam command takes the simulated read set with your short-read simulator of choice (we use ART) and calculates the truth alignments that serve as the reference benchmark in splice_sim.

python splice_sim/main.py create_genome_bam --help
usage: main.py [-h] -m model_file -a config_file [-t threads] [-o outdir]

  Copyright (C) 2021 XXX.  All rights reserved.

  Distributed on an "AS IS" basis without warranties
  or conditions of any kind, either express or implied.

USAGE

optional arguments:
  -h, --help            show this help message and exit
  -m model_file, --model model_file
                        model file
  -a config_file, --art_sam config_file
                        ART sam file
  -t threads, --threads threads
                        threads
  -o outdir, --outdir outdir
                        output directory (default is current dir)

postfilter_bam

The postfilter_bam command filters secondary and supplementary reads and highlights isoforms.

python splice_sim/main.py postfilter_bam --help
usage: main.py [-h] -c config_file -b bam_file -o outdir

  Copyright (C) 2021 XXX.  All rights reserved.

  Distributed on an "AS IS" basis without warranties
  or conditions of any kind, either express or implied.

USAGE

optional arguments:
  -h, --help            show this help message and exit
  -c config_file, --config config_file
                        JSON config file
  -b bam_file, --bam bam_file
                        input bam file
  -o outdir, --outdir outdir
                        output dir

evaluate

The evaluate command runs the splice_sim evaluation routine on a given mapper bam file produced for a condition of the splice_sim simulation run.

python splice_sim/main.py evaluate --help
usage: main.py [-h] -b bam_file -c config_file -m model_file [-f filter_bed] [-t THREADS] -o outdir

  Copyright (C) 2021 XXX.  All rights reserved.

  Distributed on an "AS IS" basis without warranties
  or conditions of any kind, either express or implied.

USAGE

optional arguments:
  -h, --help            show this help message and exit
  -b bam_file, --bam_file bam_file
                        input bam file
  -c config_file, --config config_file
                        JSON config file
  -m model_file, --model model_file
                        model file
  -f filter_bed, --filter_bed filter_bed
                        Filter regions BED file
  -t THREADS, --threads THREADS
                        used threads
  -o outdir, --outdir outdir
                        output dir

extract_feature_metadata

The extract_feature_metadata extracts comprehensive metadata that lists various characteristics of the genomic features under evaluation.

python splice_sim/main.py extract_feature_metadata --help
usage: main.py [-h] -c config_file -m model_file -o outdir

  Copyright (C) 2021 XXX.  All rights reserved.

  Distributed on an "AS IS" basis without warranties
  or conditions of any kind, either express or implied.

USAGE

optional arguments:
  -h, --help            show this help message and exit
  -c config_file, --config config_file
                        JSON config file
  -m model_file, --model model_file
                        model file
  -o outdir, --outdir outdir
                        output dir

Running splice_sim with Docker

All the above commands of the splice_sim engine can be run without installing any dependencies out of the box of our Docker container. Simply precede the individual splice_sim calls like illustrated below:

docker run --cpus <number of CPUS> -m <GB memory>g -v $(pwd):$(pwd) -w $(pwd) tobneu/splice_sim:release python <path to splice_sim repository>/main.py

Processing results into R objects (RDS)

We provide a Rscript that takes all splice_sim evaluation outputs and processes them into readily usable RDS objects to be imported in R. The script is located in splice_sim/src/main/R/splice_sim/preprocess_results.R and all dependencies are again wrapped into our splice_sim R Docker container tobneu/splice_sim_r:latest.

Rscript --vanilla splice_sim/src/main/R/splice_sim/preprocess_results.R

Error: usage: preprocess_results.R <splice_sim_config> [<outdir>]
Execution halted

Configuration

We made the deliberate decision to have a central configuration file in json format that encapsulate and document all input files and parameters that are driving all splice_sim related processes. Here we document all required and optional parameters:

Parameter Description Required
dataset_name Name of the simulation run
splice_sim_cmd Path to main.py in the clone splice_sim repository
splice_eva_preprocess_cmd Path to main.py in the cloned splice_sim repository
gene_gff Path to the transcriptome annotation in gff3-format (obtained e.g. from Gencode
intron_gff Annotated introns from the transcriptome annotation in gff3-format
genome_fa Reference genome sequence in fasta format
genome_chromosome_sizes Chromosome lengths file with chromosome name and chromosome lengths separated by a tab as produced by samtools faidx
genome_conservation Genome conservation score in bigwig format (obtained e.g. from UCSC)
genome_mappability Genome mappability score in bigwig format (obtained e.g. from the Hoffman lab)
transcript_data Transcript state configuration file
transcript_ids Subset of transcript IDs from the reference annotation used in this simulation
isoform_mode Fraction of unspliced to mature spliced transcripts
frac_old_mature Fraction of pre-existing fully-spliced transcripts before labelling onset
condition Dictionary of the conditions in this simulation defining the base conversion type (ref and alt base), conversion_rates (list of doubles between 0 and 1) and the base_coverage (integer)
mappers Dictionary of dictionary of the mappers employed and evaluated in this simulation defining the name of the mapper, the command how to call it, the mapping index and known splice sites file
create_tdf Boolean whether TDF files for track visualization in IGV should be produced
max_ilen Integer defining the maximum considered intron length
min_abundance Integer defining the minimum abundance value fo a transcript to be considered
random_seed Integer seed to keep simulation deterministic
readlen Integer defining the short read length
write_reads Boolean value defining whether reads should be output to disk or not
write_intron_bam Boolean value defining whether intronic reads should be written to BAM file not
pre_exist Directory of pre-existing runs to restart the simulation in case a run crashes
snp_file Optional VCF file containing SNVs to be injected into the simulated data. The VCF file may contain the following (optional) properties in the INFO section: prob: conversion probability [0; 1] (optional, default:1), strand: strand specificity [NA, +, -] (optional, default:NA), enable_nc: [yes, no] (optional, default: yes) if yes, configured NC are possible at the same position if no SNV was injected

Find below an example config json:

{
    "dataset_name": "test_simulation",
    "splice_sim_cmd": "python /software/splice_sim/main.py", # should point to your local splice_sim clone
    "splice_eva_preprocess_cmd": "Rscript --vanilla /software/splice_sim/src/main/R/splice_sim/preprocess_results.R", # should point to your local splice_sim clone
    "gene_gff": "/references/gencode.vM21.gff3.gz", # gene annotation GFF3 file
    "genome_fa": "/references/Mus_musculus.GRCm38.dna.primary_assembly.fa",
    "genome_chromosome_sizes": "/references/Mus_musculus.GRCm38.dna.primary_assembly.fa.chrom.sizes",
    "genome_conservation": "/references/mm10.60way.phastCons60wayEuarchontoGlire.bw", # optional
    "genome_mappability": "/references/mm10.k24.umap.bedgraph.gz", # optional
    "transcript_data": "data.config.json", # if this file exists, 
    "transcript_ids": "tids.tsv", # optional (TSV file with one column 'transcript_id' containing all considered transcript ids)
    "isoform_mode": "1:1", # mode for creating isoform data; '1:1': mat and pre form as in paper (default), 'from_file': data will be loaded from 'transcript_id','abundance','frac_mature','frac_old_mature' columns in the configured transcript_ids file
    "frac_old_mature": 0, # fraction of OLD RNA that was not exposed to a nucleotide analog 
    "condition": {
        "ref": "T", # reference base
        "alt": "C", # alternat base
        "conversion_rates": [ 0.02, 0.04 ], # list of conversion rates
        "base_coverage": 10 # used to calculate simulated coverage per tx; 
    },
    "mappers": {
        "STAR": {
            "star_cmd": "STAR-2.7.1a",
            "star_genome_idx": "star_2.7.1",
            "star_splice_gtf": "/indices/gencode.vM21.gtf" # GTF file with known splice sites
            },
        "HISAT3N": {
            "hisat3n_cmd": "hisat-3n",
            "hisat3n_idx": "/indices/Mus_musculus.GRCm38.dna.primary_assembly",
            "hisat3n_kss": "/indices/gencode.vM21.gtf.hisat2_splice_sites.txt" # TXT file with known splice sites
            },
        "MERANGS": {
            "merangs_cmd": "meRanGs",
            "star_cmd": "STAR",
            "merangs_genome_idx": "/indices/meRanTK-1.2.1b/BSgenomeIDX",
            "merangs_splice_gtf": "/indices/gencode.vM21.gtf" # GTF file with known splice sites
            },
        "SEGEMEHL": {
            "segemehl_cmd": "segemehl.x",
            "segemehl_ctidx": "/indices/segemehl/Mus_musculus.GRCm38.dna.primary_assembly.ctidx",
            "segemehl_gaidx": "/indices/segemehl/Mus_musculus.GRCm38.dna.primary_assembly.gaidx"
            }
        },
    "create_tdf": true, 
    "max_ilen": 100000,
    "min_abundance": 1,
    "random_seed": 1234,
    "readlen": 100,
    "write_reads": false,
    "write_intron_bam": false
}

Output Structure

Mapper count tables

These tables contain the performance metrics for a given mapper in the count/*.counts.tsv.gz files.

Column Description Notes
mapper Name of the respective mapper; Supported mappers are STAR, HISAT3N, MERANGS and SEGEMEHL
conversion_rate Conversion rate between 0 and 1.0
fid feature/transcript id
true_isoform name of the isoform (as configured) the read originates from
cv1 1 if read contains at least one NC or 0 otherwise
cv2 1 if read contains at least two NC or 0 otherwise
se1 1 if read contains at least one simulated sequencing error or 0 otherwise
se2 1 if read contains at least two simulated sequencing errors or 0 otherwise
classification Read classification: TP: true positive, FN: false negative: FP_raw: false-positive/not
normalised, FP: false-positive/normalised
count read count. FP classified rows may include fractions
class_type read type: acc: acceptor spanning, don: donor spanning, spl: spliced read SJ only

Metadata tables

These tables contain various metadata for the genomic intervals under investigation stratified at transcript level (tx), exon / intron feature level (fx) or splice-junction level (sj) in the meta/*.metadata.tsv.gz files.

Column Description Notes
tid Transcript ID
fid Feature ID (intron or exon ID) FX+SJ only
ftype Feature type: tx, fx, don, acc or spl
rnk Rank. For transcripts this is the number exons, for introns/exons it is the rank from the transcript 5'-end
chromosome Chromosome of the annotate feature
start / end Genomic start/end position of the annotated feature
strand Strand of the annotation
A/C/T/G Number of A/C/T/G bases in the annotated sequence
mean_map Mean mappability for the annotated feature. Calculated from the configured mappability bedgraph file
tx_rnk Rank in the transcript FX+SJ only
num_exons Number of exons in tx; 1,2,3,4,5,>5
tx_mappability Transcript mappability, factor with levels: low, medium, high FX+SJ only
len Length of annotated feature
mappability Annotation mappability, factor with levels: low, medium, high
GC Fraction of G/C for annotated feature
frac_convertible Fraction of convertible bases for annotation SJ only
convertibility Convertibility, factor with levels: low, medium, high
don_ex_A/C/T/G Number of A/C/T/G bases in exonic part of donor window
(genomic window centred on splice donor site with size: 2xreadlen+1)
SJ only
don_in_A/C/T/G Number of A/C/T/G bases in intronic part of donor window SJ only
don_win_map Mean mappability of donor window SJ only
don_mappability Donor window mappability, factor w levels: low, medium, high SJ only
don_ex_fc Fraction of convertible bases in the exonic part of the donor window SJ only
don_in_fc Fraction of convertible bases in the intronic part of the donor window SJ only
ac_* Analogous to the splice donor columns above, but for splice acceptor site SJ only

Customization

NC simulation

Splice_sim simulates nucleotide conversions in reads based on Bernoulli processes with given (configured) conversion probabilities. While we believe this to be appropriate for simulating BS-seq or SLAM-seq data (see our paper), it might not be suitable for other use-cases where, e.g., conversion prob abilities are affected by local sequence context (e.g., RNA-editing) or the like.

For customization of the NC simulation process, users may alter/extend the splice_sim.simulator.modify_bases method that has access to

  • the sequence of the simulated read (w/o NC)
  • genomic coordinates and orientation (strand) of the simulated read
  • configured NC reference and alternate bases
  • the configured conversion_rate
  • a list of convertible positions in the read
  • a list of SNPs that affect this read (if configured)