This is a Nextflow workflow to run analysis of ScaleBio Single Cell DNA methylation sequencing libraries. It processes data from sequencing reads to alignments, single-cell outputs (methylation coverage, binned genome matrix, etc.), and QC reports.
- First install Nextflow (23.10 or later)
- Download this workflow to your machine
- Set up dependencies
- Launch the small pipeline test run
- Download / configure a reference genome for your samples
- Create a Sample Barcode Table
- Create runParams.yml, specifying inputs and analysis options for your run
- Launch the workflow for your data
- Linux system with GLIBC >= 2.17 (such as CentOS 7 or later)
- Java 11 or later
- 64GB of RAM and 12 CPU cores
- For large datasets a distributed compute system (HPC or cloud) is strongly recommended
- Working storage space 5x the input data size
    - E.g. approximately 4TB of temporary storage for deep sequencing of a small kit
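The requirements above can be checked before installing anything. The following is a minimal sketch, not part of the workflow: the helper names `version_ge` and `report` are hypothetical, the thresholds are copied from the list above, and the memory check allows some slack because the kernel reports slightly less than the installed RAM.

```shell
#!/usr/bin/env bash
# Pre-flight check sketch. The GLIBC, Java, RAM and core thresholds are taken
# from the requirements list; the helper names are hypothetical.

# Succeeds when version $1 is greater than or equal to version $2 (GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints an OK or WARN line for one named check.
report() {
    if [ "$2" = "ok" ]; then echo "OK:   $1"; else echo "WARN: $1"; fi
}

# GLIBC: getconf prints e.g. "glibc 2.31" on glibc-based systems.
glibc=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')
if [ -n "$glibc" ] && version_ge "$glibc" 2.17; then
    report "GLIBC $glibc" ok
else
    report "GLIBC >= 2.17 not detected" warn
fi

# Java: the version string is quoted in the first line of `java -version`.
java_ver=""
if command -v java >/dev/null 2>&1; then
    java_ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
fi
if [ -n "$java_ver" ] && version_ge "$java_ver" 11; then
    report "Java $java_ver" ok
else
    report "Java 11 or later not detected" warn
fi

# CPU cores available to this shell.
cores=$(nproc 2>/dev/null || echo 0)
if [ "$cores" -ge 12 ]; then
    report "$cores CPU cores" ok
else
    report "$cores CPU cores (12 listed above)" warn
fi

# Total RAM in GiB. The threshold of 60 is an assumption that leaves slack
# for the memory a 64GB machine reserves for the kernel.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
mem_gib=$(( ${mem_kb:-0} / 1024 / 1024 ))
if [ "$mem_gib" -ge 60 ]; then
    report "${mem_gib} GiB RAM" ok
else
    report "${mem_gib} GiB RAM (64GB listed above)" warn
fi
```

The script only prints warnings and never exits with an error, so it is safe to run on a login node before deciding where to launch the workflow.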
- Sequencing reads, either
    - `--runFolder`: Path to the Illumina Sequencer RunFolder (`bcl` files)
    - `--fastqDir`: Path to the fastq files, generated outside (before) this workflow; see Fastq generation.
- Sample Barcode Table
    - `--samples`: A .csv file listing all samples in the analysis; see samples.csv.
- Reference Genome
    - `--genome`: A reference genome, including a BSBolt index for alignment, and gene annotation; see Reference Genomes.
The workflow produces per-sample and per-library QC reports (`html`), alignments (`bam`), per-cell methylation coverage calls (bismark-like `cov`), genomic-bin methylation score matrix files (`mtx`) and more; see Outputs for a full list.
A small test run, with all input data stored online, can be run with the following command:
nextflow run /PATH/TO/ScaleMethyl -profile PROFILE -params-file /PATH/TO/ScaleMethyl/docs/examples/runParams.yml --outDir ScaleMethyl.out
`-profile docker` is the preferred option if the system supports Docker containers; see Dependency Management for alternatives.
With this command, nextflow will automatically download the example data from the internet (AWS S3), so please ensure that the compute nodes have internet access and sufficient storage space. Alternatively, you can manually download the data first (using the AWS CLI):
aws s3 sync s3://scale.pub/testData/methylation/reference/ reference --no-sign-request
aws s3 sync s3://scale.pub/testData/methylation/downsampled_pbmcs_v1.1/ fastqs --no-sign-request
and then run with:
nextflow run /PATH/TO/ScaleMethyl/ -profile PROFILE -params-file /PATH/TO/ScaleMethyl/docs/examples/runParams.yml --genome reference/genome.json --fastqDir fastqs --outDir ScaleMethyl.out
Note that this test run is merely a quick and easy way to verify that the pipeline executes properly and does not represent a complete dataset.
See the Nextflow command-line documentation for the options to run `nextflow` on different systems, including HPC clusters and cloud compute.
Analysis parameters (inputs, options, etc.) can be defined either in a runParams.yml file or directly on the nextflow command-line. See analysisParameters for details on the options.
Note that `nextflow` options are given with a single dash `-` (e.g. `-profile`), while workflow parameters (e.g. `--outDir`) are given with a double dash `--`.
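As an illustration of the two ways to pass parameters, the sketch below writes a small runParams.yml and assembles a launch command around it. The keys mirror the command-line parameters named in this README (`fastqDir`, `samples`, `genome`); every path is a placeholder, and the script prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Write a minimal runParams.yml. The keys mirror the workflow parameters
# named in this README; all paths are placeholders to be replaced.
cat > runParams.yml <<'EOF'
fastqDir: /PATH/TO/fastqs
samples: /PATH/TO/samples.csv
genome: /PATH/TO/reference/genome.json
EOF

# Nextflow options take a single dash, workflow parameters a double dash.
cmd=(
    nextflow run /PATH/TO/ScaleMethyl
    -profile docker               # Nextflow option
    -params-file runParams.yml    # Nextflow option
    --outDir ScaleMethyl.out      # workflow parameter
)

# Print the assembled command for review.
printf '%s ' "${cmd[@]}"; echo

# To launch for real, uncomment the next line:
# "${cmd[@]}"
```

A workflow parameter given on the command line takes precedence over the same key in the params file, so one-off changes such as a different `--outDir` do not require editing runParams.yml.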
In addition to the analysis parameters, a user-specific nextflow configuration file can be used for system settings (compute and storage resources, resource limits, storage paths, etc.):
-c path/to/user.config
See Nextflow configuration for the way different configuration files, parameter files and the command-line interact.
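A user configuration file might look like the sketch below. The `process`, `executor` and `workDir` settings are standard Nextflow configuration options, but the values shown (the SLURM executor, the queue name, the scratch path) are purely illustrative assumptions and need to be replaced with settings for your own system.

```shell
#!/usr/bin/env bash
# Write an example user.config. All values are illustrative assumptions.
cat > user.config <<'EOF'
// System settings for this site; adjust before use.
process {
    executor = 'slurm'     // scheduler that runs the tasks
    queue    = 'normal'    // hypothetical queue name
}
executor {
    queueSize = 50         // maximum number of tasks queued at once
}
workDir = '/scratch/nextflow-work'   // hypothetical path for temporary files
EOF

# Pass the file to Nextflow with -c (single dash: a Nextflow option).
cmd=(
    nextflow run /PATH/TO/ScaleMethyl
    -profile docker
    -c user.config
    -params-file runParams.yml
)

# Print the assembled command for review.
printf '%s ' "${cmd[@]}"; echo

# To launch for real, uncomment the next line:
# "${cmd[@]}"
```

Keeping system settings in a separate file means the same runParams.yml can be reused unchanged on a workstation, a cluster or a cloud environment.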
The Nextflow workflow can automatically use pre-built Docker containers with all dependencies included. Activating the included `-profile docker` enables the required Nextflow settings. For details and alternatives, see Dependencies.
Nextflow itself supports running using AWS, Azure and Google Cloud.
In addition, Nextflow Tower offers another simple way to manage and execute Nextflow workflows on AWS.
Alternate start points let you resume the analysis from an intermediate step instead of from the beginning. The available options are described below. A start point is only available once a previous run has completed the steps it depends on.
Setting `--startPostAlignment true` on the command line, or `startPostAlignment: true` in the runParams yaml file, restarts the pipeline directly after the 'Dedup and Extract' module and before 'Matrix Generation'. This lets you add outputs to a run, or change the binning used for downstream cell clustering, without re-aligning. Additional outputs such as allc and amethyst can be enabled this way. The 'DEDUP_AND_EXTRACT:Extract' and 'METRICS_AND_REPORTING:GenerateMetrics' processes must have finished completely in the previous run before this option can be used.
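A post-alignment restart could be assembled as in the sketch below. Only the `--startPostAlignment true` parameter comes from the description above; the paths and profile are placeholders, and any further parameters needed to point the workflow at the previous run's results are not shown here (see analysisParameters).

```shell
#!/usr/bin/env bash
# Sketch of a post-alignment restart. Only --startPostAlignment is taken from
# the README text; the paths and the profile are placeholders.
cmd=(
    nextflow run /PATH/TO/ScaleMethyl
    -profile docker
    -params-file runParams.yml
    --startPostAlignment true     # skip alignment, restart before matrix generation
    --outDir ScaleMethyl.out
)

# Print the assembled command for review.
printf '%s ' "${cmd[@]}"; echo

# To launch for real, uncomment the next line:
# "${cmd[@]}"
```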
See the change log
By purchasing product(s) and downloading the software product(s) of ScaleBio, You accept all of the terms of the License Agreement. If You do not agree to these terms and conditions, You may not use or download any of the software product(s) of ScaleBio.