Tosca - proximity ligation data analysis

Tosca is presented and described further in our preprint:

A computationally-enhanced hiCLIP atlas reveals Staufen1 RNA binding features and links 3’ UTR structure to RNA metabolism.

Anob M. Chakrabarti, Ira A. Iosub, Flora C. Y. Lee, Jernej Ule, Nicholas M. Luscombe. bioRxiv (2022).

Introduction
Pipeline summary
Quick start (testing)
Quick start (running)
Pipeline parameters
Pipeline outputs

Introduction

Tosca is a Nextflow pipeline for the analysis of hiCLIP or proximity ligation (e.g. PARIS, SPLASH, COMRADES) sequencing data. It is containerised using Docker to ensure ease of installation. It is optimised for use on high-performance computing (HPC) clusters, but can also run locally depending on the size of the data set.

Pipeline summary

1. Adapter and quality trimming (Cutadapt)
2. Premapping to remove spliced reads (STAR)
3. Hybrid identification (pblat and toscatools)
4. UMI-based deduplication (toscatools and modified UMI-tools)
5. Hybrid clustering (toscatools)
6. Annotation (toscatools)
7. Duplex and structure analysis and binding energy characterisation (toscatools)
8. Visualisation (toscatools):
- BAM
- BED
- Arc plots
- Contact matrices
9. QC (MultiQC)

Quick start (testing)

Ensure Nextflow and Docker or Singularity are installed on your system
Pull the main version of the pipeline from the GitHub repository:

nextflow pull amchakra/tosca -r main

Run the provided test dataset with either Docker (first command) or Singularity (second command):

nextflow run amchakra/tosca -r main -profile test,docker

nextflow run amchakra/tosca -r main -profile test,singularity

Review the results

Quick start (running)

Ensure Nextflow and Docker or Singularity are installed on your system
Pull the main version of the pipeline from the GitHub repository:

nextflow pull amchakra/tosca -r main

Download and unpack pre-generated reference files. We have generated these for human and mouse (they are ~25GB each).

wget -q reference.tar.gz
tar -xzvf reference.tar.gz

Prepare a samplesheet.csv with your sample names and paths to your FASTQ files, following the template:

sample,fastq
sample1,/path/to/file1.fastq.gz
sample2,/path/to/file2.fastq.gz
sample3,/path/to/file3.fastq.gz

Run the pipeline (only the minimum required parameters are shown):

nextflow run amchakra/tosca -r main \
-profile singularity \
--input samplesheet.csv \
--genomesdir /path/to/reference \
--org human
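
For larger experiments, the samplesheet.csv described above can be generated rather than typed by hand. The following is a minimal sketch, not part of Tosca: the function name write_samplesheet is hypothetical, and it assumes that all FASTQ files sit in one directory, end in .fastq.gz, and that the sample name is the file name without that suffix.

```python
import csv
from pathlib import Path


def write_samplesheet(fastq_dir, out_csv="samplesheet.csv"):
    """Write a sample,fastq sheet for every *.fastq.gz file in fastq_dir."""
    rows = []
    for fq in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        # Assumption: the sample name is the file name minus ".fastq.gz".
        sample = fq.name[: -len(".fastq.gz")]
        # Absolute paths keep the sheet valid wherever the pipeline is launched.
        rows.append((sample, str(fq.resolve())))
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fastq"])  # header from the template above
        writer.writerows(rows)
    return rows
```

Calling write_samplesheet("/path/to/fastq_dir") writes samplesheet.csv to the current directory, which can then be passed to --input.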

Pipeline parameters

Profiles

-profile can be used to specify test, docker, singularity and crick depending on the system being used and resources available. Others can be found at nf-core.

General parameters

--input specifies the input sample sheet
--outdir specifies the output results directory
- default: ./results
--tracedir specifies the pipeline run trace directory
- default: ./results/pipeline_info

Genome parameters

Specify either --genomesdir together with --org, or all of the individual reference files (--genome_fai through --transcript_gtf):

--genomesdir specifies the genome reference directory
--org specifies the organism (options are currently: human, mouse)
--genome_fai specifies the genome FASTA index
--star_genome specifies the genome STAR index
--regions_gtf specifies the genome gene/region/biotype annotation GTF (generated by iCount-Mini)
--transcript_fa specifies the pseudo-transcriptome FASTA
--transcript_fai specifies the pseudo-transcriptome FASTA index
--transcript_gtf specifies the pseudo-transcriptome annotation GTF
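
The either/or rule for the genome parameters can be written as a small check. This is an illustrative sketch only, not the pipeline's own validation code: the function name reference_params_ok is hypothetical, and the parameters are assumed to arrive as a plain dictionary keyed by the option names listed above.

```python
# Individual reference files, named after the options listed above.
REFERENCE_KEYS = (
    "genome_fai",
    "star_genome",
    "regions_gtf",
    "transcript_fa",
    "transcript_fai",
    "transcript_gtf",
)


def reference_params_ok(params):
    """Return True if the references are fully specified one way or the other."""
    # Option 1: a pre-generated reference directory plus the organism.
    has_bundle = bool(params.get("genomesdir")) and bool(params.get("org"))
    # Option 2: every individual reference file is given explicitly.
    has_all_files = all(params.get(key) for key in REFERENCE_KEYS)
    return has_bundle or has_all_files
```

For example, a dictionary holding only genomesdir and org passes the check, whereas one holding genomesdir alone, or only some of the individual files, does not.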

Read trimming and alignment parameters

--adapter specifies the adapter sequence for Cutadapt
- default: AGATCGGAAGAGC
--min_quality specifies the minimum quality score for Cutadapt
- default: 10
--min_readlength specifies the minimum read length after trimming for Cutadapt
- default: 16
--split_size specifies the number of reads per FASTQ file when splitting for parallelised alignment
- default: 100000
--star_args specifies optional additional STAR alignment parameters
--step_size specifies pblat step size
- default: 5
--tile_size specifies pblat tile size
- default: 11
--min_score specifies pblat minimum score
- default: 15
--evalue specifies pblat e-value threshold
- default: 0.001
--maxhits specifies the maximum number of pblat alignments per read
- default: 100
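
To picture what --split_size controls, the sketch below groups FASTQ lines into chunks of at most split_size reads (four lines per read). It is an illustration of the idea under that four-line assumption, not the splitting code the pipeline itself uses, and the function name split_fastq_lines is hypothetical.

```python
from itertools import islice


def split_fastq_lines(lines, split_size=100000):
    """Yield lists of FASTQ lines, each holding at most split_size reads."""
    lines = iter(lines)
    while True:
        # Assumption: every read occupies exactly four lines.
        chunk = list(islice(lines, 4 * split_size))
        if not chunk:
            return
        yield chunk
```

With the default of 100000, a file of one million reads would yield ten chunks, each of which could then be aligned in parallel.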

Hybrid identification and characterisation

--dedup_method specifies the UMI deduplication method (options are: none, unique, percentile, cluster, adjacency, directional)
- default: directional
--umi_separator specifies the UMI separator in the FASTQ read name
- default: _
--chunk_number specifies the number of chunks into which to split the hybrid files for parallelised processing
- default: 100
--percent_overlap specifies the minimum overlap, expressed as a fraction, that one of the two hybrid arms needs for hybrids to be counted as overlapping
- default: 0.75
--sample_size specifies the number of hybrid reads to subsample per gene prior to clustering
- default: -1 (i.e. no subsampling)
--analyse_structure specifies whether to analyse the duplex structure for each hybrid read
- default: false
--shuffled_mfe specifies whether to generate a shuffled control mean minimum free energy for each hybrid read
- default: false
--clusters_only specifies whether to analyse the structure only for hybrid reads that are in a cluster
- default: true
--atlas specifies whether to generate an atlas of duplexes by combining hybrids from all the samples
- default: true
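
Two of the parameters above, --umi_separator and --percent_overlap, may be easier to picture with a small example. The sketch below is illustrative and is not taken from toscatools: both function names are hypothetical, the UMI is assumed to be the last separator-delimited field of the read name, and the overlap is assumed to be measured as a fraction of the first arm's length on half-open (start, end) intervals. The pipeline's exact definitions may differ.

```python
def extract_umi(read_name, separator="_"):
    """Return the UMI, assumed to be the last separator-delimited field."""
    # Drop any FASTQ comment after whitespace and the leading "@".
    identifier = read_name.split()[0].lstrip("@")
    return identifier.rsplit(separator, 1)[-1]


def overlap_fraction(arm_a, arm_b):
    """Overlap of two half-open (start, end) intervals as a fraction of arm_a."""
    overlap = min(arm_a[1], arm_b[1]) - max(arm_a[0], arm_b[0])
    # Disjoint intervals give a negative value, which is clamped to zero.
    return max(overlap, 0) / (arm_a[1] - arm_a[0])
```

Under these assumptions, an arm spanning 100 to 200 that shares positions 125 to 200 with another arm has an overlap fraction of 0.75, which would just meet the default threshold.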

Visualisation

--goi specifies a plain text file listing the genes of interest to be visualised, one per line
--bin_size specifies the size of each bin when generating the contact map matrices
- default: 100
--breaks specifies the breaks for grouping the arcs by colour
- default: 0,0.3,0.8,1
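
The effect of --bin_size on a contact map can be sketched as follows. This is a simplified illustration rather than the pipeline's implementation: the function name bin_contacts is hypothetical, contacts are assumed to be pairs of 0-based positions (one per hybrid arm), and each bin is labelled by its start coordinate.

```python
from collections import Counter


def bin_contacts(contacts, bin_size=100):
    """Count (left, right) position pairs per (left_bin_start, right_bin_start)."""
    counts = Counter()
    for left, right in contacts:
        # Integer division maps each position to the start of its bin.
        key = ((left // bin_size) * bin_size, (right // bin_size) * bin_size)
        counts[key] += 1
    return counts
```

A larger bin size gives a coarser matrix with more contacts per cell; a smaller one gives finer resolution at the cost of sparser counts.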

Optional pipeline modules

--skip_premap skips premapping to the genome and filtering of spliced reads
--skip_atlas skips generation of an atlas by combining all the samples
--skip_qc skips generation of QC plots and MultiQC report

Pipeline outputs

Tosca outputs results in a number of subfolders:

.
├── mapped
├── hybrids
├── clusters
├── igv
├── maps
├── nonhybrids
└── pipeline_info

Files

mapped contains all the partial read alignments used for calculating valid hybrids:
- *.blast8.gz
hybrids contains files that have the identified hybrids as TSV files:
- *.hybrids.tsv.gz contains all the hybrids
- *.hybrids.dedup.tsv.gz contains the deduplicated hybrids
- *.hybrids.clustered.tsv.gz contains the deduplicated hybrids assigned to clusters, each of which identifies the unique duplex/RNA structure its hybrids represent
- *.hybrids.gc.tsv.gz contains the deduplicated hybrids with genomic coordinates calculated
- *.hybrids.gc.annotated.tsv.gz contains the deduplicated hybrids with genomic coordinates, gene, region and biotypes calculated.
clusters contains files that have the identified clusters as TSV files:
- *.clusters.tsv.gz contains all the collapsed clusters
- *.clusters.gc.tsv.gz contains the collapsed clusters with genomic coordinates calculated
- *.clusters.gc.annotated.tsv.gz contains the collapsed clusters with genomic coordinates, gene, region and biotypes calculated.
igv contains files that can be used to visualise the results in IGV:
- *.bam contains all the hybrids in BAM format. Optional flags can be used to colour/group by experiment, hybrid cluster, read orientation, and hybridisation energy
- *.bed contains the clusters (i.e. unique duplexes) in BED format
- *.bp contains arc representations of the clusters coloured by number
maps contains contact map files (if genes of interest have been specified):
- *.mat.rds is an R matrix with the raw contact map matrix
- *.{bin_size}_binned.map.tsv.gz is the matrix in long format binned using {bin_size}
nonhybrids contains those sequencing reads that did not contain a hybrid:
- *.nonhybrid.fastq.gz
pipeline_info contains the execution reports, traces and timelines generated by Nextflow:
- execution_report.html
- execution_timeline.html
- execution_trace.txt
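
The gzipped TSV outputs can be inspected with standard tools. As a minimal sketch, the function below reports the column names and the number of rows in one of these files. It assumes the file starts with a header row, which is not stated above, and the function name summarise_tsv is hypothetical.

```python
import csv
import gzip


def summarise_tsv(path):
    """Return (column_names, n_rows) for a gzipped, tab-separated file."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # Assumption: the first line is a header naming the columns.
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return header, n_rows
```

Running it on a *.hybrids.tsv.gz file and on the matching *.hybrids.dedup.tsv.gz file gives a quick view of how many hybrids were removed by deduplication.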
