/paper-rlsim

Analysis pipeline for assesing biasing factors in Illumina RNA-seq protocols through realistic simulations

Primary LanguagePythonOtherNOASSERTION

Assessing biasing factors in Illumina RNA-seq protocols through realistic simulations

header

Reference

This repository contains the analysis pipeline, the description of workflow and the results presented in the manuscript:

  • Botond Sipos, Greg Slodkowicz, Tim Massingham, Nick Goldman (2013) Realistic simulations reveal extensive sample-specificity of RNA-seq biases arXiv:1308.3172

Introduction

The rlsim package is a collection of tools aiming to integrate in a common simulation framework the documented biases present in the data generated by standard Illumina RNA-seq protocols due to individual experimental steps such as hexamer priming (Hansen et al.; 2010), PCR amplification (Benjamini and Speed; 2011) and size selection (Lee et al.; 2010). The rlsim package provides an implicit simulation model inspired by the library construction protocols (implemented in the rlsim tool) and also simple approaches for estimating the parameters that can be recovered from standard RNA-seq data (implemented in the effest tool) in order to replicate the properties of individual datasets. The details of the simulation and estimation frameworks are described in the rlsim package documentation.

The analyses implemented in the pipelines have the following goals:

  • To survey the magnitude of hexamer priming and PCR amplification biases in various paired-end RNA-seq datasets publicly available through ENA, generated by the standard Illumina protocol and its dUTP-based strand-specific modification (Levin et al.; 2010).
  • To quantify whether the ensemble of knowledge incorporated in the simulation pipeline - parameters estimated from the individual datasets, priming affinities based on a physical model, species-specific information on poly(A) tail length - makes the simulated datasets more similar to the real ones as measured through different summary statistics, hence the "realism" of the rlsim estimation/simulation framework.
  • To study the correlation of the same (or similar) summary statistics between library construction technical replicates in order to establish an expectation regarding the magnitude of correlation potentially achievable between real and simulated datasets.

Cloning the pipeline repository

The pipeline tracks the rlsim and simNGS packages as submodules; hence it is practical to use the --recursive flag when cloning the GitHub repository:

git clone --recursive https://github.com/sbotond/paper-rlsim.git

Downloading the estimated expression levels

Due to the large size of the Fasta files, the canonical transcriptomes with the corrected expression levels estimated from the datasets below are available from a separate repository at http://github.com/sbotond/rlsim-params. The repository also contains the raw parameter JSON files in order to help running modified simulations.

The bias analysis pipeline

Analysis overview

The main goal of the bias analysis pipeline is to quantify the magnitude of biases across different datasets and the ability of the rlsim estimation/simulation framework to reproduce them (hence the degree of "realism" of the simulations). The pipeline relies on comparison of real datasets to simulated datasets obtained under two different simulation settings:

  • Datasets simulated under the "full" model, which use the corrected expression levels and GC-dependent PCR efficiencies estimated by the effest tool in conjunction with a fragmentation method using an approach inspired by a physical model of oligonucleotide binding in order to simulate sequence biases around the start and end of the fragments (after_prim_double).

  • Datasets simulated under the "flat" model, which use uncorrected expression levels calculated by effest and a fixed PCR amplification efficiency (0.87 - the mean efficiency assumed by effest) in conjunction with a fragmentation method assuming uniform priming affinities (after_noprim_double).

The flat model can be considered as a submodel of the the full one, making fewer assumptions about the process generating the data. Both simulations use the insert size distribution estimated from the data and sample the same amount of fragments. The details of the full and flat simulation settings are described below.

Comparing datasets through summary statistics

The similarity between the real and simulated datasets is assessed by calculating Spearman rank correlations between various features of the datasets, summarising their properties on different levels:

  • Relative expression levels (per transcript fragment counts normalised by the number of total fragments) - calculated by the cov_cmp tool from the rlsim package
  • The GC content distribution of the mapped fragments - calculated by the pb_plot tool from the rlsim package.
  • Sequence biases as summarised by the base counts around the start and end of mapped fragments (with a window size of 10) capturing sequence bias - calculated by pb_plot. The rank correlations are calculated for count vectors for the four bases around start/end individually and averaged to obtain a single value.
  • The fragment coverage vectors of individual transcripts - calculated by cov_cmp

The correlations and statistics calculated for individual datasets and pairs of datasets are stored as pickle files. The pipeline provides additional scripts to calculate and summarise the changes in correlation due to the additional simulation factors used in the full model:

  • meta_cov_cmp calculates the shift in correlation for relative expression levels and per-transcript fragment coverage vectors.
  • meta_pb_cmp calculates shift in correlation for GC content distributions and sequence biases around fragment start/end.

By assessing the direction of the shift in the correlations calculated for the features above one can asses whether including the additional factors (GC-dependent efficiencies, priming affinities, etc.) improve the "realism" of the simulation, while the magnitude of the shift quantifies the potential impact of the biasing factors on downstream applications.

Read mapping and parameter estimation

  • The datasets defined by their accession numbers (AN) in the makefiles under data.mk are automatically downloaded from ENA.
  • Reads are mapped to the Ensembl v69 canonical transcriptomes fetched through the Perl API (along with the list of "single isoform" transcripts) using BWA (paired-end mode, default parameters). The same reference transcriptome and mapping strategy is used to map the simulated datasets.
  • Simulation parameters are estimated using effest using an expression multiplier (-e) of 10000.0 (see the rlsim documentation for more information) and the default assumed mean efficiency (0.87). The estimation of GC-dependent efficiencies is restricted to single isoform transcripts (through the -i flag).
  • Note that the estimation and simulation steps ignore the presence of isoforms by mapping to the canonical transcriptome. This is likely to decrease the correlation between real and simulated datasets, however it should not affect significantly the comparison of the full and flat models as they both use the same mapping strategy.

Flat simulations

The flat simulations do not attempt to simulate sequence-specific biases and they generate data under a simple scheme of random fragmentation, amplification under a fixed amplification efficiency and size selection:

  • The number of simulated fragments is equal to the number of properly mapped read pairs from the real dataset.
  • The fragmentation approach used (after_noprim_double) assumes uniform "priming affinities".
  • A fixed PCR amplification efficiency of 0.87 is used when simulating PCR, independent of the fragment GC content.
  • Size selection is simulated according to the empirical fragment size distribution estimated from the data.
  • The length of the simulated poly(A) tails are set to zero.

Full simulations

The full simulations attempt to use the maximum amount of knowledge derived from the dataset and literature, aiming to replicate the biases. Hence, in addition to the parameters used by the flat simulations, they also include the following factors:

  • The fragmentation process (after_prim_double, see the rlsim documentation for more details) uses sequence-specific "priming affinities" based on a nearest neighbor model of oligonucleotide binding in order to simulate sequence specific biases near the start and end of the simulated fragments.
  • Simulation of PCR amplification biases uses the "empirical" GC-dependent efficiencies estimated by the effest tool (with the minimum efficiency set to 0.05).
  • The simulated poly(A) tail distributions in the full simulations are set on a per-species basis based on relevant publications. We assume that the distributions of the tail lengths (specified with the syntax described in the rlsim documentation) follows a geometric distribution with a mean equal to the half of the maximum tail length observed in the respective species and truncated at the maximum value:

Simulating Illumina sequencing

Simulation of paired-end Illumina sequencing of the fragments generated by rlsim was performed using the simNGS package (version 1.6). Simulations were parametrised by the runfile s_3_4x shipped in the simNGS GitHub repository, trained on intensity data from a paired-end run with 101 cycles. The simulated read length was equal to the read length of the respective real datasets.

Datasets

Homo sapiens

Accession Info URL Read length PCR cycles Source
SRR065496 [details] 75 not reported human liver carcinoma (HepG2)
SRR521457 [details] 75 not reported K562 cell line
SRR521448 [details] 75 not reported GM12878 cell line
ERR030885 [details] 50 15 Illumina BodyMap: kidney
ERR030874 [details] 50 15 Illumina BodyMap: ovary

Mus musculus

Accession Info URL Read length PCR cycles Source
SRR040000 [details] 76 16 embryonic stem cells
SRR530635 [details] 101 stranded not reported leukemia cell line (K562 analog), dUTP-based stranded protocol
SRR530634 [details] 101 stranded not reported leukemia cell line (K562 analog), dUTP-based stranded protocol
ERR120702 [details] 72 not reported liver
SRR549333 [details] 99 stranded not reported mouse megakaryocyte erythroid progenitor cells with lineages CD16/32 and CD34, dUTP-based stranded protocol
SRR549337 [details] 99 stranded not reported mouse megakaryocyte erythroid progenitor cells with lineages CD16/32 and CD34, dUTP-based stranded protocol
SRR496440 [details] 100 not reported multipotential cell line that can be converted by 5-azacytidine into three mesodermal stem cell lineages

Drosophila melanogaster

Accession Info URL Read length PCR cycles Source
SRR042423 [details] 75 not reported 30 third-instar wandering larvae
SRR059066 [details] 75 not reported 30 third-instar wandering larvae
SRR034309 [details] 76 not reported mixed stage embryos
SRR031717 [details] 37 not reported S2-DRSC cell line
SRR015074 [details] 50 not reported CME_W1_Cl.8+ cell line

Running the bias analysis pipeline

The bias analysis pipeline can be launched on an LSF cluster by running the Run_bias_analysis.sh script. The script will submit one job per dataset. Job parameters can be tuned by editing the script. The analyses can be run locally as well by editing the run function in the launch script. The runlogs are stored under the runlogs directory.

After all jobs finish running, a global report can be generated by issuing:

make gr

Interpreting the results

The global report summarising the correlations of different features between data simulated under the full and flat conditions and real datasets is at reports/global_bias_report.pdf

The per-dataset output of the bias analysis pipeline (with the exception of the transcriptome fasta files augmented with the estimated expression levels) are available at http://www.ebi.ac.uk/goldman-srv/paper-rlsim/Bias and they are also checked in the repository under the Bias/AN/log directory, where AN is the dataset accession number. They can be regenerated by running the pipeline as instructed above.

The log directories (e.g. Bias/SRR549337/log) store the following output files important for interpreting the results:

  • meta_cov_cmp.pdf - report on the shift of correlations and p-values for the per-transcript fragment coverage vectors (full/flat model vs. real data)
  • meta_cov_cmp.txt - text report on the shift in the correlation of relative expression levels.
  • meta_pb_sim_flat.log - text report on the correlation of GC content distribution and sequence bias between real and flat simulated data
  • meta_pb_sim.log - text report on the correlation of GC content distribution and sequence bias between real and full simulated data
  • effest_report.pdf - effest PDF report
  • cov_cmp_flat.pdf - report on the comparison of relative expression levels and transcript coverage vectors between the real and flat simulated dataset. Contains a scatter-plot of relative expression levels, the histogram of rank correlations and corresponding p-values for per-transcript fragment coverage vectors. Also contains a plot of the simulated and real coverage vectors for 30 transcripts with the lowest p-values.
  • cov_cmp_report_flat.txt - text report on the correlation of relative expression levels and coverage vectors between the real and the flat simulated dataset.
  • cov_cmp.pdf - report on the comparison of relative expression levels and transcript coverage vectors between the real and full simulated dataset
  • cov_cmp_report.txt - text report on the correlation of relative expression levels and coverage vectors between the real and the full simulated dataset.
  • pb_plot_aln.pdf - sequence bias and GC content distribution plots for the real data
  • pb_plot_sim_flat.pdf - sequence bias and GC content distribution plots for the flat simulated data
  • pb_plot_sim.pdf - sequence bias and GC content distribution plots for the full simulated data
  • re_effest_report.pdf - PDF report of the effest run on the full simulated data
  • rlsim_report_flat.json.pdf - PDF report file for flat simulation
  • rlsim_report.json.pdf - PDF report file for the full simulation.

The log directories also contain several intermediary files:

  • effest_log.txt - effest log file
  • effest_raw_params.json - "raw parameters" estimated by effest
  • effest_suggestions.txt - command line parameters suggested by effest
  • rlsim_report_flat.json - JSON report file for the flat simulation
  • rlsim_report.json - JSON report file for the full simulation
  • *.pk - pickle files storing the results of various analyses
  • flat_re_SRR549337_expr.fas - uncorrected relative expression levels re-estimated from the full simulated data (not included in repo)
  • flat_SRR549337_expr.fas - uncorrected relative expression levels estimated from the real data (not included in repo)
  • re_SRR549337_expr.fas - uncorrected relative expression levels estimated from the full simulated data (not included in repo)
  • SRR549337_expr.fas - relative expression levels estimated from the real data (available from a separate repository)
  • files prefixed with re_ - output of effest parameter re-estimation from full simulated data

The replicate analysis pipeline

Analysis overview

In addition to the bias analysis described above, we performed analysis of correlation of summary statistics between pairs of technical replicates.

In the absence of a paired-end dataset with technical replicates, we took advantage of the single-ended datasets from Bullard et al. (2010) where library preparation, lane and flow cell effects have been analysed using Universal Human Reference (UHR) and Ambion's human brain reference RNA. The RNA used was commercial-grade and the sequencing was performed in-house by Illumina; the differences between replicates obtained in less ideal conditions are likely to be larger and so we believe such analysis provides an upper bound on the correlation between any two datasets.

The calculated summary statistics are conceptually the same but there are small differences due the fact that sequencing data have unpaired-end reads:

  • Relative expression levels are calculated in the same way as in the bias analysis
  • GC content is calculated for reads rather than fragments
  • Sequence biases are calculated for beginnings and ends of reads rather than fragments

Datasets

Homo sapiens

Accession Info URL Read length PCR cycles Source
SRR037466 [details] 35 - UHR RNA
SRR037467 [details] 35 - UHR RNA
SRR037468 [details] 35 - UHR RNA
SRR037469 [details] 35 - UHR RNA
SRR037473 [details] 35 - UHR RNA
SRR037474 [details] 35 - UHR RNA
SRR037475 [details] 35 - UHR RNA
SRR037476 [details] 35 - UHR RNA

Running the replicate analysis pipeline

The replicate analysis can be launched in a manner similar to the bias analysis, by running Run_replicate_analysis.sh and also uses the LSF queuing system for job submission.

Interpreting the results

The principal results of the analysis have been added to the global report file in the form of baselines representing the average correlation of each summary statistic between pairs of technical replicates.

The intermediate results of the analysis are stored in the Replicate subdirectory. Results for each pair or replicates are stored in a directory called AN1-AN2 where AN1 and AN2 are accession numbers for the two datasets. The log directories contain the following output files:

  • cov_cmp.pdf - the same as cov_cmp_sim.pdf/cov_cmp_flat.pdf but containing correlations between the replicates
  • cov_cmp_report.txt - text report containing the correlation of relative expression levels between the replicates
  • meta_pb.log - text report containing the correlation of GC content distribution and sequence bias between the replicates
  • pb_plot_ANX.pdf - sequence bias and GC content distribution for AN1/AN2
  • *.pk - pickle files with results of various analyses

Acknowledgements

Thanks for the useful discussions and suggestions to (in no particular order): Kevin Gori, Christophe Dessimoz, Nick Williams, Ernest Turro, John Marioni, Nuno Fonseca, Pär Engström, Paul Bertone.

Files and dependencies

Dependencies

You will need the following software installed in order to compile and run the pipeline:

Directories

  • Bias/ - files produced by the bias analysis pipeline
  • data.mk/ - dataset-specific makefiles
  • Info/ - miscellaneous info
  • lib/ - Python classes used by the analysis tools
  • ref_cache/ - Reference transcriptomes indexed by BWA
  • Replicate/ - files produced by the replicate analysis pipeline
  • reports/ - PDF reports
  • runlogs/ - directory to store runglogs
  • scripts/ - analysis scripts
  • simulators/ - directory containing the simulator (rlsim, simNGS) submodules

Scripts quick reference

ensembl_versions

Print out various Ensembl related info.

py_versions

Print out Python-related versions.

get_ref_transcripts

Fetch canonical transcriptome and the list of single isoform transcipt using the Ensembl v69 Perl API.

mapnsort

usage: mapnsort [-h] [-f ref_fasta] [-s] [-o output_file]
                input file [input file ...]

Map and sort reads.

positional arguments:
  input file      Paired-end reads in separate fastq files.

optional arguments:
  -h, --help      show this help message and exit
  -f ref_fasta    Reference sequences in fasta format.
  -s              Assume reads are single ended.
  -o output_file  Output file name.

simulate_data

usage: simulate_data [-h] -f ref_fasta [-m map_ref] [-r rlsim_opts]
                     [-n simngs_opts] -x simngs_runfile [-o out_dir]
                     [-l log_file] -z tools_path [-y dir_name]

Simulate RNA-seq library construction and Illumina sequencing.

optional arguments:
  -h, --help         show this help message and exit
  -f ref_fasta       Reference sequences in fasta format for simulation.
  -m map_ref         Reference sequences for mapping.
  -r rlsim_opts      Extra rlsim options.
  -n simngs_opts     simNGS options.
  -x simngs_runfile  simNGS runfile.
  -o out_dir         Output directory.
  -l log_file        Log file.
  -z tools_path      Path to simulation tools.
  -y dir_name        Name of the simulation directory

meta_cov_cmp

usage: meta_cov_cmp [-h] -m m_pickle -f f_pickle [-r pdf] [-p el_pickle]

Compare the distributions of pickled coverage stats (version 1.0).

optional arguments:
  -h, --help    show this help message and exit
  -m m_pickle   Pickle generated with full model.
  -f f_pickle   Pickle generated with flat model.
  -r pdf        Report PDF.
  -p el_pickle  Result pickle file.

meta_pb_cmp

usage: meta_pb_cmp [-h] -a a_pickle -s s_pickle [-p res_pickle]

Compare pickled bias stats (version 1.0).

optional arguments:
  -h, --help     show this help message and exit
  -a a_pickle    Pickle generated from real data.
  -s s_pickle    Pickle generated from simulated data.
  -p res_pickle  Result pickle file.

plot_global_report

usage: plot_global_report [-h] [-c cov_pickles [cov_pickles ...]]
                          [-p pb_pickles [pb_pickles ...]] [-l labels]
                          [-r pdf]

Collate the results from different datasets on a single plot(version 1.0).

optional arguments:
  -h, --help            show this help message and exit
  -c cov_pickles [cov_pickles ...]
                        Pickles generated by cov_cmp.
  -p pb_pickles [pb_pickles ...]
                        Pickles generated by pb_plot.
  -l labels             Dataset labels file.
  -r pdf                Report PDF.

githalytics.com alpha