CURRENTLY A WORK IN PROGRESS
`ngs-pipeline` is a pipeline for calling variants in WES and WGS data. The Broad's GATK software is used for much of the preprocessing and variant calling, and ANNOVAR annotates the identified variants. This README documents all the steps needed to run variant calling on raw FASTQ data, from quality control through variant annotation. In its current state, `ngs-pipeline` only works on the Sherlock HPC cluster at Stanford.
Before preprocessing can begin, the raw FASTQ files should be evaluated for quality control (QC). Sequencing can introduce artefacts and other biases that affect variant calling and downstream analysis, so it is vital to be aware of any potential defects early on. We use two tools, `fastqc` and `multiqc`, to produce QC reports for FASTQ data.
First, use `fastqc` to generate individual QC reports for each FASTQ file.

```
fastqc [input fastq]
```

evaluates a single FASTQ file. If all the FASTQ files are stored in a single directory, type

```
fastqc *.fastq.gz
```

This will produce a report for each individual FASTQ file.
Once these reports have been generated, `multiqc` can summarize all of them. Type

```
multiqc [data directory]
```

and replace `[data directory]` with the folder containing the `fastqc` reports. The file `multiqc_report.html` will contain the summarized report info. The MultiQC documentation provides an overview of how to use the report.
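Putting the two steps together, a typical QC pass over a directory of FASTQ files (paths hypothetical) looks like:

```
# Generate a per-file QC report for every FASTQ in the directory
fastqc /path/to/fastq_dir/*.fastq.gz

# Aggregate all the fastqc reports into multiqc_report.html
multiqc /path/to/fastq_dir
```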
It's up to the user how to proceed once a QC problem has been identified. Depending on the quality of the sample, the user may decide to discard it altogether. Although trimming programs were often recommended in the past, the GATK 4.0 guidelines recommend against trimming because it can hinder the base quality score recalibration (BQSR) preprocessing step. Instead, BQSR can usually account for poor-quality reads before they are used by Mutect2.
`ngs-pipeline` follows a modified version of the Broad's Best Practices Pipelines for Variant Discovery. Not all scatter operations are performed, which saves time otherwise spent waiting on excess SLURM jobs.
Preprocessing requires two items from the user. First, the user must provide an inputs file describing each sample: its FASTQ files, library ID, and the directory where the preprocessed files should be saved. This info should be saved in a `.csv` or `.xlsx` file. Second, a references file must be specified that tells `ngs-pipeline` where to look for items like the reference genome FASTA, the `cromwell` jar, and the email address used to notify the user of pipeline failures. Reference genome files can be downloaded from the Broad's Resource Bundle over FTP. Cromwell executables can be found on GitHub.
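For example, a specific Cromwell release jar can be fetched directly from the GitHub releases page (the version shown matches the jar referenced elsewhere in this README; adjust as needed):

```
# Download the Cromwell 35 jar from the broadinstitute/cromwell GitHub releases
wget https://github.com/broadinstitute/cromwell/releases/download/35/cromwell-35.jar
```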
Input File Columns (`.csv` or `.xlsx`):

- Sample
- Type
- FASTQ1
- FASTQ2
- OutputDirectory
- Library
- SamplePrefix (optional, but the column must still be included even if it will be left blank)

A minimal example inputs file is sketched below.
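For illustration only, here is a hypothetical inputs file; the sample names, types, and paths are made up, and only the column order comes from the list above:

```
# Hypothetical inputs.csv (values are examples only)
cat > inputs.csv <<'EOF'
Sample,Type,FASTQ1,FASTQ2,OutputDirectory,Library,SamplePrefix
sampleA,normal,/scratch/fastq/sampleA_R1.fastq.gz,/scratch/fastq/sampleA_R2.fastq.gz,/scratch/preprocessed/sampleA,LIB001,
sampleA_tumor,tumor,/scratch/fastq/sampleA_T_R1.fastq.gz,/scratch/fastq/sampleA_T_R2.fastq.gz,/scratch/preprocessed/sampleA_tumor,LIB002,
EOF
```

Note that the trailing comma leaves `SamplePrefix` blank while keeping the column present.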
Reference File Columns (must be `.csv` formatted):

- Reference
- Path

Use absolute paths for the reference files. See `preprocessing_references_template.csv` for a template file with all the required references you need to specify.
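As a rough sketch, the references file is a two-column key/path table; the key names below are hypothetical, so consult `preprocessing_references_template.csv` for the real ones:

```
# Hypothetical references.csv (key names illustrative; use the template's names)
cat > references.csv <<'EOF'
Reference,Path
reference_fasta,/home/groups/mylab/refs/Homo_sapiens_assembly38.fasta
cromwell_jar,/home/groups/mylab/software/cromwell-35.jar
email,user@stanford.edu
EOF
```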
To call the preprocessing pipeline, type

```
python preprocess.py [-P] [inputs.csv] [references.csv] [log_directory]
```

`[log_directory]` is the directory used to store all the Cromwell logs and intermediate files. Cromwell produces a lot of intermediate files and logs, so it's best to put this directory somewhere with a large storage volume (like `$SCRATCH`) and to clean it out often once you no longer need the logs or intermediates. `-P` will also generate an input file for variant calling from your inputs file.
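A concrete invocation, assuming the example files above and a scratch log directory, might be:

```
# Run preprocessing and (-P) also emit a variant-calling input file
python preprocess.py -P inputs.csv references.csv $SCRATCH/cromwell-logs
```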
Modified implementation of the Broad's Best Practices Pipelines for Variant Discovery for use on Stanford's Sherlock Compute Cluster. The project consists of two pipelines: PreProcessing and VariantCalling. PreProcessing performs data preprocessing on an individual sample. VariantCalling performs Mutect2 variant calling for a normal/tumor pair of samples.
- `/scripts` contains the WDL files that specify the workflows.
- `/inputs` contains the `.json` files that dictate which files are processed by the workflow.
`preprocess_samples.py` preprocesses each sample within a specified directory by running it through the pipeline. Given a directory with the `-d` parameter, the script creates a customized input `.json` file for each sample, named `preprocessing_SAMPLENAME.json`, which is passed to an sbatch job that controls the preprocessing workflow for that sample. All logs are stored in the `cromwell-monitoring-logs` directory in `$PI_SCRATCH`.
`preprocessing.wdl` specifies the preprocessing tasks that create an analysis-ready BAM given paired-end FASTQ files as input. `preprocessing.json` specifies the variables used for the `preprocessing.wdl` workflow.
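For orientation, Cromwell input `.json` files map fully qualified workflow input names to values. The workflow and input names in this sketch are hypothetical, not the actual keys used by `preprocessing.json`:

```
{
  "PreProcessing.sample_name": "CTR119_short",
  "PreProcessing.fastq_1": "/scratch/groups/carilee/cromwell-test/short-data/CTR119_short_R1.fastq.gz",
  "PreProcessing.fastq_2": "/scratch/groups/carilee/cromwell-test/short-data/CTR119_short_R2.fastq.gz"
}
```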
```
python /home/groups/carilee/software/ngs-pipeline/preprocess_samples.py -d /scratch/groups/carilee/cromwell-test/short-data/
```

This creates a customized input file for CTR119_short and then launches an sbatch job to control the Cromwell process.
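The submitted jobs are ordinary SLURM jobs, so they can be monitored with standard SLURM commands:

```
# List your pending/running jobs
squeue -u $USER

# Inspect a finished job's state and runtime (replace <jobid>)
sacct -j <jobid> --format=JobID,JobName,State,Elapsed
```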
With the local backend:

```
java -jar /home/groups/carilee/software/cromwell-35.jar run /home/groups/carilee/software/ngs-pipeline/scripts/preprocessing.wdl -i /home/groups/carilee/software/ngs-pipeline/inputs/preprocessing.json
```
To use the SLURM backend, specify the config file as `your.conf`:

```
java -Dconfig.file=/home/groups/carilee/software/ngs-pipeline/your.conf -jar /home/groups/carilee/software/cromwell-35.jar run /home/groups/carilee/software/ngs-pipeline/scripts/preprocessing.wdl -i /home/groups/carilee/software/ngs-pipeline/inputs/preprocessing.json
```
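The contents of `your.conf` are not reproduced here. As a point of reference, a minimal SLURM backend definition adapted from the Cromwell documentation looks roughly like the sketch below; the runtime attributes and sbatch options would need to be tuned for Sherlock:

```
# Minimal sketch of a Cromwell SLURM backend config (adapted from Cromwell's docs);
# adjust runtime attributes and sbatch flags for your cluster.
include required(classpath("application"))

backend {
  default = SLURM
  providers {
    SLURM {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        Int runtime_minutes = 600
        Int cpus = 2
        """
        submit = """
        sbatch -J ${job_name} -D ${cwd} -o ${out} -e ${err} \
          -t ${runtime_minutes} -n ${cpus} --wrap "/bin/bash ${script}"
        """
        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
  }
}
```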