A simple pipeline for performing quality control (QC) and adapter trimming. This pipeline may be used for either single-end or paired-end reads. Here is a brief summary of what is executed:
- Raw reads quality control (FastQC)
- Adapter trimming (Trim Galore!)
- Trimmed reads quality control (FastQC)
This pipeline is currently dependent on Nextflow and Anaconda. You don't need to have Trim Galore! or FastQC installed because a separate conda environment is created for all dependencies.
Note: If you already have Nextflow installed, then you may skip to the next section.
The workflow manager Nextflow is required to run this pipeline. It requires Bash 3.2 (or later) and Java 8 (or later, up to 15) to be installed. To download and install Nextflow into the current directory, run the following in your terminal.
$ curl -s https://get.nextflow.io | bash
Then, make the nextflow binary executable by running chmod +x:
$ chmod +x nextflow
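Before running Nextflow, it may be worth checking that the installed Java falls within the range stated above (8 to 15). The following is a minimal sketch, not part of the pipeline: java_major and check_java are hypothetical helper names, and the parsing assumes that java -version prints the version as a quoted string on its first line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the Java requirement stated above.

# Extract the major version from a Java version string.
# Handles the legacy scheme ("1.8.0_292" -> 8) and the modern one ("11.0.11" -> 11).
java_major() {
    local v="$1"
    v="${v#1.}"          # drop the legacy "1." prefix, if present
    v="${v%%[!0-9]*}"    # keep only the leading run of digits
    echo "$v"
}

# Report whether a version string falls in the supported range (8 to 15).
check_java() {
    local major
    major="$(java_major "$1")"
    if [ -z "$major" ]; then
        echo "unknown"
    elif [ "$major" -ge 8 ] && [ "$major" -le 15 ]; then
        echo "ok"
    else
        echo "unsupported"
    fi
}

if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr; the version is the quoted string on the first line.
    installed="$(java -version 2>&1 | head -n 1 | cut -d '"' -f 2)"
    echo "Java $installed: $(check_java "$installed")"
else
    echo "Java not found"
fi
echo "Bash ${BASH_VERSION:-unknown}"
```

If the check reports "unsupported" or "unknown", install a Java version in the supported range before continuing.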
Note: If you already have Anaconda installed, then you may skip to the next section.
Download Anaconda from here (I recommend the command-line installer) and run the installer.
To download this pipeline, go to the folder where you wish to store it and run the following:
$ git clone https://github.com/Animal-Evolution-and-Biodiversity/trimming
The input data may be single-end or paired-end reads. Paired-end reads are the
default. These reads should be in the FASTQ format, either compressed or
uncompressed. The resulting reads are always gzip-compressed (.gz).
If your reads are single-end, use the --single-end option. If your reads
are paired-end, you don't need to supply any additional options.
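Before starting a paired-end run, it can help to confirm that every forward-read file has a mate. The following is a minimal sketch under stated assumptions: it relies on the _1.fastq.gz / _2.fastq.gz naming seen in the example output later in this README, and find_unpaired is a hypothetical helper, not something the pipeline provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print forward-read files that lack a reverse-read mate.
# Assumes files are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
find_unpaired() {
    local dir="$1" forward mate
    for forward in "$dir"/*_1.fastq.gz; do
        [ -e "$forward" ] || continue              # the glob matched nothing
        mate="${forward%_1.fastq.gz}_2.fastq.gz"   # expected name of the mate
        [ -e "$mate" ] || basename "$forward"
    done
}

# Small demonstration with empty placeholder files in a temporary directory.
demo="$(mktemp -d)"
touch "$demo/sampleA_1.fastq.gz" "$demo/sampleA_2.fastq.gz" "$demo/sampleB_1.fastq.gz"
find_unpaired "$demo"    # prints sampleB_1.fastq.gz
rm -r "$demo"
```

An empty result means every forward file has a mate; any file name printed points to a sample that should either be completed or run with the --single-end option.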
Go to the folder where you downloaded this pipeline and cd into its folder.
For example:
$ cd ~/Downloads
$ cd trimming
$ ls
bin environment.yml main.nf nextflow.config test_data
CHANGELOG.md LICENSE Makefile README.md
Running this pipeline with the --help option both tests whether it was
installed correctly and lists all of the options that you might want to tweak.
$ nextflow run . --help
Here's an example where I'm running the pipeline on a dataset that is approximately 3 GB in size. This took 35 minutes and 34 seconds as seen in the report.
$ nextflow run ~/ownCloud/pipelines/trimming --reads '../Perenereis_calmani/raw_reads/*{1,2}*' --output Alitta_virens_trimming_output
N E X T F L O W ~ version 20.10.0
Launching `/home/feli/ownCloud/pipelines/trimming/main.nf` [insane_darwin] - revision: 618c37f28a
executor > local (2)
[ef/95c6dd] process > rawReadsQuality (1) [100%] 1 of 1 ✔
[c5/e9d75d] process > trimming (1) [100%] 1 of 1 ✔
Completed at: 2021-05-18T13:20:21.791129+02:00
Duration : 35m 34s
Success : true
Exit status : 0
Completed at: 18-May-2021 13:20:23
Duration : 35m 36s
CPU hours : 0.7
Succeeded : 2
The output is stored in trimming_output (default) or in the directory
provided with the --output option. Here's an example output. Note that the
FastQC output for the trimmed reads is stored together with the trimmed reads
themselves.
trimming_output/
├── quality_control_pre-trimming
│   ├── fastqc_command.txt
│   ├── NG-26745_B4_lib454432_7213_2_1_pre-trimming_fastqc.html
│   ├── NG-26745_B4_lib454432_7213_2_1_pre-trimming_fastqc.zip
│   ├── NG-26745_B4_lib454432_7213_2_2_pre-trimming_fastqc.html
│   └── NG-26745_B4_lib454432_7213_2_2_pre-trimming_fastqc.zip
└── trimmed_reads
    ├── NG-26745_B4_lib454432_7213_2_1.fastq.gz_trimming_report.txt
    ├── NG-26745_B4_lib454432_7213_2_1_unpaired_1.fq.gz
    ├── NG-26745_B4_lib454432_7213_2_1_val_1_fastqc.html
    ├── NG-26745_B4_lib454432_7213_2_1_val_1_fastqc.zip
    ├── NG-26745_B4_lib454432_7213_2_1_val_1.fq.gz
    ├── NG-26745_B4_lib454432_7213_2_2.fastq.gz_trimming_report.txt
    ├── NG-26745_B4_lib454432_7213_2_2_unpaired_2.fq.gz
    ├── NG-26745_B4_lib454432_7213_2_2_val_2_fastqc.html
    ├── NG-26745_B4_lib454432_7213_2_2_val_2_fastqc.zip
    ├── NG-26745_B4_lib454432_7213_2_2_val_2.fq.gz
    └── trim_galore_command.txt
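Each FastQC archive contains a tab-separated summary.txt file that lists PASS, WARN or FAIL for every analysis module. As a rough sketch of how the reports in the output directory could be scanned without opening each HTML file, the snippet below prints the failed modules per archive. failed_modules is a hypothetical helper, the loop assumes the default trimming_output directory, and unzip must be installed for it to read the archives.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of FastQC modules flagged FAIL.
# Reads a FastQC summary.txt (columns: status, module, file name) on stdin.
failed_modules() {
    awk -F '\t' '$1 == "FAIL" { print $2 }'
}

# Scan every FastQC archive for the trimmed reads (default output directory assumed).
for report in trimming_output/trimmed_reads/*_fastqc.zip; do
    [ -e "$report" ] || continue    # nothing to do if no archives are present
    echo "== $report"
    # -p streams the matching archive member to stdout instead of extracting it
    unzip -p "$report" '*/summary.txt' | failed_modules
done
```

Changing the path in the loop to the quality_control_pre-trimming folder gives the same overview for the raw reads, which makes it easy to compare the two.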
The files fastqc_command.txt and trim_galore_command.txt contain all of the
parameters used for each program. This information is useful when documenting
your methods in a publication.
$ cat trimming_output/trimmed_reads/trim_galore_command.txt
trim_galore --fastqc --gzip --quality 20 --length 55 --cores 1 --paired --retain_unpaired NG-26745_B4_lib454432_7213_2_1.fastq.gz NG-26745_B4_lib454432_7213_2_2.fastq.gz
Nextflow produces a lot of additional files, which take up space on your drive. These files are useful when running and troubleshooting the pipeline, but they may safely be removed afterwards, since everything you need is saved to the output directory. Run these commands in the folder where you ran the pipeline to remove the unnecessary files:
$ rm -f .nextflow.log*
$ rm -rf .nextflow*
$ rm -rf work
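Because the cleanup is only safe once the results have been saved, a slightly more cautious variant could refuse to delete anything unless the output directory exists and is not empty. This is a minimal sketch rather than part of the pipeline: cleanup_run is a hypothetical helper, and it assumes it is called from the folder where the pipeline was run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove Nextflow's scratch files from the current folder,
# but only if the output directory exists and contains at least one entry.
cleanup_run() {
    local output_dir="${1:-trimming_output}"    # default output directory name
    if [ -d "$output_dir" ] && [ -n "$(ls -A "$output_dir")" ]; then
        rm -rf .nextflow* work
        echo "Removed .nextflow* and work; kept $output_dir"
    else
        echo "Skipped: $output_dir is missing or empty" >&2
        return 1
    fi
}
```

Calling cleanup_run without arguments checks the default trimming_output directory; pass the directory name you gave to --output otherwise, for example cleanup_run Alitta_virens_trimming_output.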
© Animal Evolution and Biodiversity 2021