A pipeline which demultiplexes unpaired FASTQ files containing libraries with 5' 'inline' barcodes and A-tails (i.e. the barcodes are the first N bases read, followed by a 'T').
- demultiplexing is done using fastq-multx, allowing one mismatch to the barcode (including A-tail)
- barcodes and A-tails are trimmed from the reads, but demultiplexed FASTQ files with the barcodes left in place are also generated. These untrimmed files are suitable for GEO submission, which requires demultiplexed but otherwise unmodified files.
- no 3' quality trimming is applied
- Unix-like operating system (tested on CentOS 7.2.1511)
- Git
- conda
- Multiplexed unpaired FASTQ files of libraries with 5' 'inline' barcodes and A-tails. This pipeline has only been tested with Illumina data.
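To make the expected read layout concrete, here is a minimal Python sketch of how a read with a 5' inline barcode and A-tail could be assigned to a sample and trimmed. It is an illustration only and not what the pipeline runs: the pipeline uses fastq-multx, whose matching rules are more involved. The sample names, the barcode sequences, and the rule of discarding tied matches are all hypothetical assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

# A FASTQ record as (header, sequence, separator, qualities).
Record = Tuple[str, str, str, str]


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_read(seq: str, barcodes: Dict[str, str],
                max_mismatches: int = 1) -> Optional[Tuple[str, int]]:
    """Return (sample, prefix_length) for the unique best match, else None.

    The expected prefix is the barcode followed by a 'T' (the A-tail),
    and the mismatch allowance covers the whole prefix.
    """
    best = None
    best_dist = None
    ambiguous = False
    for sample, barcode in barcodes.items():
        prefix = barcode + "T"
        if len(seq) < len(prefix):
            continue
        dist = hamming(seq[:len(prefix)], prefix)
        if dist > max_mismatches:
            continue
        if best_dist is None or dist < best_dist:
            best, best_dist, ambiguous = (sample, len(prefix)), dist, False
        elif dist == best_dist:
            # Assumption for this sketch: tied matches are left unassigned.
            ambiguous = True
    return None if ambiguous else best


def demultiplex(record: Record, barcodes: Dict[str, str]):
    """Return (sample, untrimmed_record, trimmed_record), or None if unmatched."""
    header, seq, sep, qual = record
    hit = assign_read(seq, barcodes)
    if hit is None:
        return None
    sample, n = hit
    # Trim the barcode and A-tail from both the sequence and its qualities.
    trimmed = (header, seq[n:], sep, qual[n:])
    return sample, record, trimmed


# Hypothetical barcodes, for illustration only.
BARCODES = {"sample_A": "ATCACG", "sample_B": "CGATGT"}

read = ("@read1", "ATCACGTGGGTTTAAACCC", "+", "I" * 19)
result = demultiplex(read, BARCODES)
# result[0] is "sample_A"; result[2][1] is "GGGTTTAAACCC"
```

The untrimmed record is returned alongside the trimmed one to mirror the two sets of output files described above.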
0. Clone this repository.
git clone https://github.com/winston-lab/demultiplex-single-end.git
1. Create and activate the demultiplex_single_end virtual environment for the pipeline using conda. The virtual environment creation can take a while.
# navigate into the pipeline directory
cd demultiplex-single-end
# create the demultiplex_single_end environment
conda env create -v -f envs/demultiplex_single_end.yaml
# activate the environment
source activate demultiplex_single_end
# to deactivate the environment
# source deactivate
2. Make a copy of the configuration file template config_template.yaml called config.yaml, and edit config.yaml to suit your needs.
# make a copy of the configuration template file
cp config_template.yaml config.yaml
# edit the configuration file
vim config.yaml # or use your favorite editor
3. With the demultiplex_single_end environment activated, do a dry run of the pipeline to see what files will be created.
snakemake -p --dryrun
4. If running the pipeline on a local machine, you can run the pipeline using the above command, omitting the --dryrun flag. You can also use N cores by specifying the --cores N flag. The first time the pipeline is run, conda will create separate virtual environments for some of the jobs to operate in. Running the pipeline on a local machine can take a long time, especially for many samples, so using an HPC cluster is recommended if possible. On the HMS O2 cluster, which uses the SLURM job scheduler, entering sbatch slurm_submit.sh will submit the pipeline as a single job which spawns individual subjobs as necessary. This can be adapted to other job schedulers and clusters by modifying slurm_submit.sh, which submits the pipeline to the cluster; slurm_status.sh, which handles detection of job status on the cluster; and cluster.yaml, which specifies the resource requests for each type of job.
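When adapting the status script to another scheduler, it may help to see the contract it likely needs to satisfy. Snakemake's cluster status option expects a script that takes a job ID and prints one of success, running, or failed. The Python sketch below shows one way to map SLURM job states onto those three words. It is not the contents of slurm_status.sh, which are not shown here; the function name, the grouping of states, and the handling of an empty state are assumptions made for illustration.

```python
# SLURM states treated as "still in progress" in this sketch.
RUNNING_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"}


def classify_slurm_state(state: str) -> str:
    """Map a SLURM job state string to 'success', 'running', or 'failed'."""
    tokens = state.strip().split()
    if not tokens:
        # Assumption: a job not yet visible to accounting is still queued.
        return "running"
    # sacct may decorate states, e.g. "CANCELLED by 12345" or "CANCELLED+".
    token = tokens[0].rstrip("+").upper()
    if token == "COMPLETED":
        return "success"
    if token in RUNNING_STATES:
        return "running"
    # Everything else (FAILED, CANCELLED, TIMEOUT, NODE_FAIL, ...) is a failure.
    return "failed"
```

A full status script would obtain the state string from the scheduler for the job ID passed as its first argument (on SLURM, for example, from sacct) and print the result of this function. For a different scheduler, only the state names would need to change.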