Tosca is presented and described further in our preprint:
Anob M. Chakrabarti, Ira A. Iosub, Flora C. Y. Lee, Jernej Ule, Nicholas M. Luscombe. bioRxiv (2022).
- Introduction
- Pipeline summary
- Quick start (testing)
- Quick start (running)
- Pipeline parameters
- Pipeline outputs
Tosca is a Nextflow pipeline for the analysis of hiCLIP or proximity ligation (e.g. PARIS, SPLASH, COMRADES) sequencing data. It is containerised using Docker to ensure ease of installation. It is optimised for use on high-performance computing (HPC) clusters, but can also run locally depending on the size of the data set.
- Adapter and quality trimming (Cutadapt)
- Premapping to remove spliced reads (STAR)
- Hybrid identification (pblat and toscatools)
- UMI-based deduplication (toscatools and modified UMI-tools)
- Hybrid clustering (toscatools)
- Annotation (toscatools)
- Duplex and structure analysis and binding energy characterisation (toscatools)
- Visualisation (toscatools)
  - BAM
  - BED
  - Arc plots
  - Contact matrices
- QC (MultiQC)
- Ensure Nextflow and Docker or Singularity are installed on your system
- Pull the main version of the pipeline from the GitHub repository:
nextflow pull amchakra/tosca -r main
- Run the provided test dataset:
nextflow run amchakra/tosca -r main -profile test,docker
or
nextflow run amchakra/tosca -r main -profile test,singularity
- Review the results
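Before running the test profile, it can help to confirm that the prerequisites are on your `PATH`. The following is a minimal sketch and not part of Tosca; the function names `have_tool` and `check_prerequisites` are hypothetical, and the check only looks for the executables, not for a working container daemon.

```shell
# Return 0 if the named executable is on PATH, 1 otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Check for Nextflow plus at least one container engine (Docker or Singularity).
check_prerequisites() {
    status=0
    if ! have_tool nextflow; then
        echo "missing: nextflow"
        status=1
    fi
    if ! have_tool docker && ! have_tool singularity; then
        echo "missing: docker or singularity"
        status=1
    fi
    return "$status"
}

if check_prerequisites; then
    echo "prerequisites found"
fi
```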
- Ensure Nextflow and Docker or Singularity are installed on your system
- Pull the main version of the pipeline from the GitHub repository:
nextflow pull amchakra/tosca -r main
- Download and unpack pre-generated reference files. We have generated these for human and mouse (they are ~25GB each).
wget -q reference.tar.gz
tar -xzvf reference.tar.gz
- Prepare a `samplesheet.csv` with your sample names and paths to your FASTQ files, following the template:
sample,fastq
sample1,/path/to/file1.fastq.gz
sample2,/path/to/file2.fastq.gz
sample3,/path/to/file3.fastq.gz
- Run the pipeline (the minimum parameters have been specified):
nextflow run amchakra/tosca -r main \
-profile singularity \
--input samplesheet.csv \
--genomesdir /path/to/reference \
--org human
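If your FASTQ files sit in a single directory, the sample sheet can be assembled with a few lines of shell. This is only a sketch: the function name `make_samplesheet` is hypothetical, and it assumes that each sample name should be the file name without its `.fastq.gz` suffix. Pass an absolute directory path so that the paths in the sheet are absolute, as in the template.

```shell
# Write a sample sheet (header: sample,fastq) for every *.fastq.gz file in a directory.
# Assumption: the sample name is the file name without the .fastq.gz suffix.
make_samplesheet() {
    fastq_dir=$1
    out=$2
    echo "sample,fastq" > "$out"
    for fq in "$fastq_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip the unexpanded pattern if nothing matches
        sample=$(basename "$fq" .fastq.gz)
        echo "$sample,$fq" >> "$out"
    done
}

# Usage (hypothetical path): make_samplesheet /path/to/fastq samplesheet.csv
```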
- `-profile` can be used to specify `test`, `docker`, `singularity` and `crick`, depending on the system being used and the resources available. Others can be found at nf-core.
- `--input` specifies the input sample sheet
- `--outdir` specifies the output results directory
  - default: `./results`
- `--tracedir` specifies the pipeline run trace directory
  - default: `./results/pipeline_info`
Either `--genomesdir` and `--org`, or all of the other reference files, need to be specified:
- `--genomesdir` specifies the genome reference directory
- `--org` specifies the organism (options are currently: `human`, `mouse`)
- `--genome_fai` specifies the genome FASTA index
- `--star_genome` specifies the genome STAR index
- `--regions_gtf` specifies the genome gene/region/biotype annotation GTF (generated by iCount-Mini)
- `--transcript_fa` specifies the pseudo-transcriptome FASTA
- `--transcript_fai` specifies the pseudo-transcriptome FASTA index
- `--transcript_gtf` specifies the pseudo-transcriptome annotation GTF
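The either/or rule for the reference options can be expressed as a small pre-flight check in a wrapper script. This is an illustrative sketch only, not how Tosca validates its inputs; the function name `reference_is_specified` and its positional argument order are hypothetical.

```shell
# Return 0 if the reference is fully specified, 1 otherwise.
# Arguments (empty string = option not given), in this order:
#   genomesdir org genome_fai star_genome regions_gtf transcript_fa transcript_fai transcript_gtf
reference_is_specified() {
    genomesdir=$1
    org=$2
    shift 2
    # Option 1: --genomesdir together with --org
    if [ -n "$genomesdir" ] && [ -n "$org" ]; then
        return 0
    fi
    # Option 2: every one of the six individual reference files
    [ "$#" -eq 6 ] || return 1
    for value in "$@"; do
        [ -n "$value" ] || return 1
    done
    return 0
}
```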
- `--adapter` specifies the adapter sequence for Cutadapt
  - default: `AGATCGGAAGAGC`
- `--min_quality` specifies the minimum quality score for Cutadapt
  - default: `10`
- `--min_readlength` specifies the minimum read length after trimming for Cutadapt
  - default: `16`
- `--split_size` specifies the number of reads per FASTQ file when splitting for parallelised alignment
  - default: `100000`
- `--star_args` specifies optional additional STAR alignment parameters
- `--step_size` specifies the pblat step size
  - default: `5`
- `--tile_size` specifies the pblat tile size
  - default: `11`
- `--min_score` specifies the pblat minimum score
  - default: `15`
- `--evalue` specifies the pblat e-value threshold
  - default: `0.001`
- `--maxhits` specifies the maximum number of pblat alignments per read
  - default: `100`
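To give a feel for what `--split_size` means in practice: a FASTQ record spans four lines, so a chunk of 100000 reads holds 400000 lines. The sketch below splits an uncompressed FASTQ file accordingly. It illustrates the idea only and is not necessarily how Tosca splits files internally; the function name `split_fastq` and the file names in the usage comment are hypothetical, and it assumes each record occupies exactly four lines.

```shell
# Split an uncompressed FASTQ file into chunks of at most split_size reads.
# Assumption: four lines per record, so each chunk holds split_size * 4 lines.
split_fastq() {
    fastq=$1
    split_size=$2
    prefix=$3
    # split names the chunks <prefix>aa, <prefix>ab, ...
    split -l $((split_size * 4)) "$fastq" "$prefix"
}

# Usage (hypothetical file names): split_fastq sample1.fastq 100000 sample1.chunk.
```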
- `--dedup_method` specifies the UMI deduplication method (options are: `none`, `unique`, `percentile`, `cluster`, `adjacency`, `directional`)
  - default: `directional`
- `--umi_separator` specifies the UMI separator in the FASTQ read name
  - default: `_`
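As an illustration of `--umi_separator`, the sketch below pulls the UMI out of a read name. It assumes the UMI is the text after the last occurrence of the separator, which is the usual UMI-tools convention; the function name `umi_from_read_name` and the example read name are hypothetical. A read name without the separator is returned unchanged, so it is worth checking a few real read names first.

```shell
# Print the UMI from a read name.
# Assumption: the UMI is the text after the last separator (default "_").
umi_from_read_name() {
    read_name=$1
    separator=${2:-_}
    # Strip the longest prefix that ends in the separator
    printf '%s\n' "${read_name##*"$separator"}"
}

umi_from_read_name "read001_ACGTAC"   # prints ACGTAC
```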
- `--chunk_number` specifies the number of chunks into which to split the hybrid files for parallelised processing
  - default: `100`
- `--percent_overlap` specifies the minimum percentage (given as a fraction) by which one of the two hybrid arms needs to overlap to be counted as overlapping
  - default: `0.75`
- `--sample_size` specifies the number of hybrid reads to subsample per gene prior to clustering
  - default: `-1` (i.e. no subsampling)
- `--analyse_structure` specifies whether to analyse the duplex structure for each hybrid read
  - default: `false`
- `--shuffled_mfe` specifies whether to generate a control shuffled mean minimum free energy for each hybrid read
  - default: `false`
- `--clusters_only` specifies whether to analyse the structure only for hybrid reads that are in a cluster
  - default: `true`
- `--atlas` specifies whether to generate an atlas of duplexes by combining hybrids from all the samples
  - default: `true`
- `--goi` specifies a plain text file listing the genes of interest to be visualised, one per line
- `--bin_size` specifies the size of each bin when generating the contact map matrices
  - default: `100`
- `--breaks` specifies the breaks for grouping the arcs by colour
  - default: `0,0.3,0.8,1`
- `--skip_premap` skips premapping to the genome and filtering of spliced reads
- `--skip_atlas` skips generation of an atlas by combining all the samples
- `--skip_qc` skips generation of QC plots and MultiQC report
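When several of these options are changed from their defaults, it can be tidier to keep them in a parameters file and pass it with Nextflow's `-params-file` option. The sketch below writes such a file; the file name `tosca_params.yaml` and the chosen values are only examples, and it assumes the keys are the option names without the leading dashes.

```shell
# Write a parameters file that sets a few of the documented options.
# Assumption: keys are the option names without the leading "--".
cat > tosca_params.yaml <<'EOF'
input: samplesheet.csv
genomesdir: /path/to/reference
org: human
outdir: ./results
dedup_method: directional
umi_separator: "_"
analyse_structure: true
EOF

# Then pass the file to Nextflow (not run here):
# nextflow run amchakra/tosca -r main -profile singularity -params-file tosca_params.yaml
```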
Tosca outputs results in a number of subfolders:
.
├── mapped
├── hybrids
├── clusters
├── igv
├── maps
├── nonhybrids
└── pipeline_info
- `mapped` contains all the partial read alignments used for calculating valid hybrids:
  - `*.blast8.gz`
- `hybrids` contains the identified hybrids as TSV files:
  - `*.hybrids.tsv.gz` contains all the hybrids
  - `*.hybrids.dedup.tsv.gz` contains the deduplicated hybrids
  - `*.hybrids.clustered.tsv.gz` contains the deduplicated hybrids with clusters calculated that identify the unique duplexes/RNA structure they represent
  - `*.hybrids.gc.tsv.gz` contains the deduplicated hybrids with genomic coordinates calculated
  - `*.hybrids.gc.annotated.tsv.gz` contains the deduplicated hybrids with genomic coordinates, gene, region and biotypes calculated
- `clusters` contains the identified clusters as TSV files:
  - `*.clusters.tsv.gz` contains all the collapsed clusters
  - `*.clusters.gc.tsv.gz` contains the collapsed clusters with genomic coordinates calculated
  - `*.clusters.gc.annotated.tsv.gz` contains the collapsed clusters with genomic coordinates, gene, region and biotypes calculated
- `igv` contains files that can be used to visualise the results in IGV:
  - `*.bam` contains all the hybrids in BAM format. Optional flags can be used to colour/group by experiment, hybrid cluster, read orientation, and hybridisation energy
  - `*.bed` contains the clusters (i.e. unique duplexes) in BED format
  - `*.bp` contains arc representations of the clusters coloured by number
- `maps` contains contact map files (if genes of interest have been specified):
  - `*.mat.rds` is an R matrix with the raw contact map matrix
  - `*.{bin_size}_binned.map.tsv.gz` is the matrix in long format binned using `{bin_size}`
- `nonhybrids` contains those sequencing reads that did not contain a hybrid:
  - `*.nonhybrid.fastq.gz`
- `pipeline_info` contains the execution reports, traces and timelines generated by Nextflow:
  - `execution_report.html`
  - `execution_timeline.html`
  - `execution_trace.txt`
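A quick way to sanity-check a finished run is to compare the number of hybrids before and after deduplication. The following sketch does this from the `hybrids` subfolder. It rests on two assumptions that should be checked against your own output: that each TSV starts with a single header line, and that the `*` in the file names is the sample name. The function names `count_tsv_rows` and `summarise_hybrids` are hypothetical.

```shell
# Count the data rows in a gzipped TSV.
# Assumption: the first line is a header and is not counted.
count_tsv_rows() {
    gzip -dc "$1" | awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Print sample, total hybrids and deduplicated hybrids as tab-separated columns.
summarise_hybrids() {
    results_dir=$1
    for all in "$results_dir"/hybrids/*.hybrids.tsv.gz; do
        [ -e "$all" ] || continue   # skip the unexpanded pattern if nothing matches
        sample=$(basename "$all" .hybrids.tsv.gz)
        dedup="$results_dir/hybrids/$sample.hybrids.dedup.tsv.gz"
        [ -e "$dedup" ] || continue
        printf '%s\t%s\t%s\n' "$sample" "$(count_tsv_rows "$all")" "$(count_tsv_rows "$dedup")"
    done
}

# Usage: summarise_hybrids ./results
```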