- The first step is to preprocess the input FASTQ files. The pipeline assumes the reads are in a directory arranged in the following format:
/input_directory/sample_name_R1_001.fastq.gz # Read1 of DNA fragment
/input_directory/sample_name_R2_001.fastq.gz # Index (barcode) read
/input_directory/sample_name_R3_001.fastq.gz # Read2 of DNA fragment
For example:
/n/scratch/users/m/meb521/methyl_seq/nextseq/xBO87_ATAC_S1_R1_001.fastq.gz
/n/scratch/users/m/meb521/methyl_seq/nextseq/xBO87_ATAC_S1_R2_001.fastq.gz
/n/scratch/users/m/meb521/methyl_seq/nextseq/xBO87_ATAC_S1_R3_001.fastq.gz
The following script wraps methyl_fastq_pipeline.py in a SLURM job and requires two input arguments: input_directory and sample_name
~/methylation/scripts/SLURM_fastq_pipeline.sh /n/scratch/users/m/meb521/methyl_seq/nextseq xBO87_ATAC_S1
Other examples, submitted with sbatch:
sbatch ~/methylation/scripts/SLURM_fastq_pipeline.sh /n/scratch/users/m/meb521/xBO140/fastqs xBO140a_S1
sbatch ~/methylation/scripts/SLURM_fastq_pipeline.sh /n/scratch/users/m/meb521/xBO140_nova xBO140_novaseq
sbatch ~/methylation/scripts/SLURM_fastq_pipeline.sh /n/scratch/users/m/meb521/xBO153 xBO153_ATAC_240606_S1
The pipeline is currently hardcoded with the assumption that R2 is 24 nt long, with the 8 nt splint adapter CAGACGCG at the beginning and the reverse complement of the 10x ATAC barcode at positions 9-24. It is also hardcoded to clip the first 11 nt and last 2 nt of R1, and the first 2 nt and last 2 nt of R3. These options can be changed by modifying the extract_clean_fastq function in the methyl_utils.py script.
The pipeline performs several steps: splitting, trimming, transferring barcodes from the index reads into the cDNA reads, optional quality filtering of reads, and collecting raw barcodes for barcode matching.
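The per-read logic described above can be sketched as follows. This is a simplified illustration only, not the actual extract_clean_fastq implementation; the function names here are hypothetical:

```python
# Sketch of the hardcoded read structure: R2 = 8 nt splint + 16 nt RC barcode,
# R1 clipped by 11/2 nt, R3 clipped by 2/2 nt. Illustrative only.

SPLINT = "CAGACGCG"  # 8 nt splint adapter expected at the start of R2

def revcomp(seq: str) -> str:
    """Reverse-complement a DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def process_read_trio(r1: str, r2: str, r3: str):
    """Clip the genomic reads and recover the cell barcode from the index read."""
    if not r2.startswith(SPLINT):
        return None                 # splint mismatch -> drop this read trio
    barcode = revcomp(r2[8:24])     # positions 9-24 hold the RC of the 10x barcode
    clean_r1 = r1[11:-2]            # clip first 11 nt and last 2 nt of R1
    clean_r3 = r3[2:-2]             # clip first 2 nt and last 2 nt of R3
    return barcode, clean_r1, clean_r3
```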
- After this step we run a simple script that submits many jobs to the HPC, one per FASTQ chunk:
python ~/methylation/methyl_alignment_pipeline.py \
-r /n/scratch/users/m/meb521/GRCm39_full \
-i /n/scratch/users/m/meb521/xBO140/fastqs \
-s xBO140a_S1 -b -j
python ~/methylation/methyl_alignment_pipeline.py \
-r /n/scratch/users/m/meb521/GRCm39_full \
-i /n/scratch/users/m/meb521/xBO140_nova \
-s xBO140_novaseq -b -j
To monitor the state of each alignment job, we can look at the last line of its log, which contains the total number of reads processed so far:
find . -type f -name 'methylation_extractor_job_*' -exec tail -n 1 {} \;
- After alignment, postprocessing and count matrix generation are done with the second SLURM pipeline:
Required arguments are window_size for binning; context, which can take one of two values,
Non_CpG_context
or CpG_context;
and the FASTA index of the reference used in alignment. Two such indices are available in the data folder: humanGRCh38_v44_chrs.fasta.fai
and mouseGRCm39_v34_allcontigs.fasta.fai
~/methylation/scripts/SLURM_bam_mtx_pipeline.sh /n/scratch/users/m/meb521/methyl_seq/nextseq xBO87_ATAC_S1 50000
~/methylation/scripts/SLURM_bam_mtx_pipeline.sh /n/scratch/users/m/meb521/xBO140/fastqs xBO140a_S1 100000 Non_CpG_context ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
sbatch ~/methylation/scripts/SLURM_bam_mtx_pipeline.sh \
/n/scratch/users/m/meb521/xBO140/fastqs \
xBO140a_S1 \
200000 \
CpG_context \
~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
sbatch ~/methylation/scripts/SLURM_bam_mtx_pipeline.sh \
/n/scratch/users/m/meb521/xBO140_nova \
xBO140_novaseq \
100000 \
CpG_context \
~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
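As an illustration of the binning, here is a minimal sketch of how a genomic position maps to a fixed-size window given the chromosome sizes from a .fasta.fai index (column 1 is the contig name, column 2 its length). The helper names are hypothetical, not the pipeline's actual code:

```python
# Hypothetical sketch: assign methylation calls to fixed-size genomic windows.

def load_chrom_sizes(fai_path: str) -> dict:
    """Read contig lengths from a samtools .fai index (name, length, ...)."""
    sizes = {}
    with open(fai_path) as fh:
        for line in fh:
            name, length = line.split("\t")[:2]
            sizes[name] = int(length)
    return sizes

def window_index(pos: int, window_size: int) -> int:
    """0-based bin index of a 0-based genomic position."""
    return pos // window_size

def n_windows(chrom_len: int, window_size: int) -> int:
    """Number of bins needed to cover a chromosome (ceiling division)."""
    return (chrom_len + window_size - 1) // window_size
```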
- For very large datasets we split the pipeline into pieces and submit them as array jobs.
This is an array job: SLURM_ARRAY_TASK_ID
is passed as an input argument to save_quad_batch.py,
which splits the parts into batches of 12 and processes them in pools of two using 2 cores. TASK_ID
determines which 12-part batch the task will process: TASK_ID
1 processes parts 000
to 011,
TASK_ID
4 processes parts 036
to 047,
and so on.
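The TASK_ID-to-parts mapping described above can be sketched as follows (a hypothetical helper, assuming 1-based task IDs and zero-padded three-digit part names):

```python
BATCH_SIZE = 12  # parts per array task, as described above

def parts_for_task(task_id: int, batch_size: int = BATCH_SIZE) -> list:
    """Map a 1-based SLURM_ARRAY_TASK_ID to its batch of part names."""
    start = (task_id - 1) * batch_size
    return [f"{p:03d}" for p in range(start, start + batch_size)]
```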
sbatch ~/methylation/scripts/SLURM_save_quad_batch.sh \
/n/scratch/users/m/meb521/xBO140_nova \
xBO140_novaseq
sbatch ~/methylation/scripts/SLURM_aggregate_quad_parts.sh \
/n/scratch/users/m/meb521/xBO140_nova \
xBO140_novaseq \
CpG_context
- To build count matrices from the batches, we first make a matrix from each batch and then stack them:
sbatch ~/methylation/scripts/SLURM_make_count_mtx.sh \
/n/scratch/users/m/meb521/xBO140_nova \
xBO140_novaseq \
100000 \
Non_CpG_context \
~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
sbatch ~/methylation/scripts/SLURM_stack_mtx.sh \
/n/scratch/users/m/meb521/xBO140_nova \
xBO140_novaseq \
100000 \
CpG_context \
~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
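The stacking step can be illustrated with plain dicts standing in for the sparse matrices the pipeline actually writes; this is a hypothetical sketch of the idea, not the SLURM_stack_mtx.sh implementation:

```python
# Sketch: each batch yields a (barcode x window) count matrix; stacking
# concatenates them along the barcode axis by offsetting the row indices.

def stack_matrices(batches):
    """batches: list of (matrix, n_rows) pairs, where matrix maps
    (row, col) -> count. Returns the stacked matrix and total row count."""
    stacked, offset = {}, 0
    for mtx, n_rows in batches:
        for (r, c), v in mtx.items():
            stacked[(r + offset, c)] = v
        offset += n_rows
    return stacked, offset
```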
- To build the final BAM with duplicates and barcodes marked, we first tag the BAM parts:
sbatch ~/methylation/scripts/SLURM_tag_bam_parts.sh \
/n/scratch/users/m/meb521/xBO140_nova \
xBO140_novaseq
After tagging all BAM parts, we can merge them and mark duplicates using the following:
sbatch ~/methylation/scripts/bam_merge.sh \
/n/scratch/users/m/meb521/xBO140_nova \
xBO140_novaseq
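A hedged sketch of the tagging idea: assuming the preprocessing step stored the cell barcode at the end of the read name (the "_" separator here is an assumption for illustration, not necessarily what the pipeline uses), tagging moves it into a CB:Z: SAM tag so that downstream duplicate marking can be barcode-aware:

```python
# Illustrative only: operates on SAM text lines rather than the BAM records
# the pipeline actually processes.

def tag_sam_line(line: str, sep: str = "_") -> str:
    """Move a barcode suffix from the read name into a CB:Z: tag."""
    if line.startswith("@"):
        return line  # header lines pass through untouched
    fields = line.rstrip("\n").split("\t")
    name, barcode = fields[0].rsplit(sep, 1)  # assumed name_BARCODE layout
    fields[0] = name
    fields.append(f"CB:Z:{barcode}")
    return "\t".join(fields)
```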