/trimming

Primary language: Nextflow. License: MIT.

Quality Control and Trimming Pipeline

A simple pipeline for performing quality control (QC) and adapter trimming. This pipeline may be used for either single-end or paired-end reads. Here is a brief summary of what is executed:

  1. Raw reads quality control (FastQC)
  2. Adapter trimming (Trim Galore!)
  3. Trimmed reads quality control (FastQC)

This pipeline currently depends on Nextflow and Anaconda. You don't need to install Trim Galore! or FastQC yourself, because the pipeline creates a separate conda environment that contains all dependencies.

Installing Nextflow

Note: If you already have Nextflow installed, then you may skip to the next section.

The workflow manager Nextflow is required to run this pipeline. It requires Bash 3.2 (or later) and Java 8 (or later, up to 15) to be installed. To download and install Nextflow into the current directory, run the following in your terminal.

$ curl -s https://get.nextflow.io | bash

Then, make the nextflow binary executable by running chmod +x:

$ chmod +x nextflow
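If the installation fails, it may help to confirm the prerequisites stated above. The following is a minimal sketch, not part of Nextflow or this pipeline: `version_ge` is a hypothetical helper that compares dotted version numbers, and the Java version is only printed for manual inspection against the supported range (8 to 15).

```shell
# Hypothetical helper: succeed if dotted version $1 >= $2
# (compares up to three numeric fields; missing fields count as 0).
version_ge() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        for (i = 1; i <= 3; i++) {
            if (h[i] + 0 > w[i] + 0) exit 0
            if (h[i] + 0 < w[i] + 0) exit 1
        }
        exit 0
    }'
}

# Ask bash itself for its version, so this also works from another shell.
if command -v bash >/dev/null 2>&1; then
    bash_version=$(bash -c 'echo "${BASH_VERSINFO[0]}.${BASH_VERSINFO[1]}"')
    if version_ge "$bash_version" 3.2; then
        echo "Bash $bash_version: OK"
    else
        echo "Bash $bash_version: older than 3.2"
    fi
else
    echo "Bash not found on PATH"
fi

# java -version writes to stderr; print the first line for manual inspection.
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | head -n 1
else
    echo "Java not found on PATH"
fi
```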

Installing Anaconda

Note: If you already have Anaconda installed, then you may skip to the next section.

Download Anaconda from the official Anaconda website (I recommend the command-line installer) and run the installer.

Downloading this Pipeline

To download this pipeline, go to the folder where you wish to store it and then run the following:

$ git clone https://github.com/Animal-Evolution-and-Biodiversity/trimming

Formatting the Input Data

The input data may be single-end or paired-end reads. Paired-end reads are the default. These reads should be in the FASTQ format, either compressed or uncompressed. The resulting reads are always gzip-compressed (.gz).

If your reads are single-end, use the --single-end option. If your reads are paired-end, then you don't need to supply any additional options.
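Before a paired-end run, it can be useful to confirm that every forward-read file has a mate. The sketch below is not part of the pipeline: `find_unpaired` is a hypothetical helper, and it assumes that mates are named `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`. Adjust the suffixes to match your own files.

```shell
# Hypothetical helper: print forward-read files in directory $1 that lack a mate.
# Assumes mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
find_unpaired() {
    for forward in "$1"/*_1.fastq.gz; do
        # If the glob matched nothing, the pattern itself is returned; skip it.
        [ -e "$forward" ] || continue
        reverse="${forward%_1.fastq.gz}_2.fastq.gz"
        [ -e "$reverse" ] || echo "$forward"
    done
}

# Example usage (raw_reads is a placeholder directory name):
# find_unpaired raw_reads
```

If the function prints nothing, every forward file has a partner. Any file it does print would need either its missing mate or a separate single-end run.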

Running this Pipeline

Go to the folder where you downloaded this pipeline and cd into its folder. For example:

$ cd ~/Downloads
$ cd trimming
$ ls
bin           environment.yml  main.nf   nextflow.config  test_data
CHANGELOG.md  LICENSE          Makefile  README.md

Running this pipeline with the --help option both tests whether it was installed correctly and lists all of the options that you might want to tweak.

$ nextflow run . --help

Example Run

Here's an example where I'm running the pipeline on a dataset that is approximately 3 GB in size. This took 35 minutes and 34 seconds as seen in the report.

$ nextflow run ~/ownCloud/pipelines/trimming --reads '../Perenereis_calmani/raw_reads/*{1,2}*' --output Alitta_virens_trimming_output
N E X T F L O W  ~  version 20.10.0
Launching `/home/feli/ownCloud/pipelines/trimming/main.nf` [insane_darwin] - revision: 618c37f28a
executor >  local (2)
[ef/95c6dd] process > rawReadsQuality (1) [100%] 1 of 1 ✔
[c5/e9d75d] process > trimming (1)        [100%] 1 of 1 ✔
Completed at: 2021-05-18T13:20:21.791129+02:00
Duration    : 35m 34s
Success     : true
Exit status : 0
Completed at: 18-May-2021 13:20:23
Duration    : 35m 36s
CPU hours   : 0.7
Succeeded   : 2

The output is stored in trimming_output (default) or in the directory provided with the --output option. Here's an example output. Note that the FastQC output for the trimmed reads is stored together with the trimmed reads themselves.

trimming_output/
├── quality_control_pre-trimming
│   ├── fastqc_command.txt
│   ├── NG-26745_B4_lib454432_7213_2_1_pre-trimming_fastqc.html
│   ├── NG-26745_B4_lib454432_7213_2_1_pre-trimming_fastqc.zip
│   ├── NG-26745_B4_lib454432_7213_2_2_pre-trimming_fastqc.html
│   └── NG-26745_B4_lib454432_7213_2_2_pre-trimming_fastqc.zip
└── trimmed_reads
    ├── NG-26745_B4_lib454432_7213_2_1.fastq.gz_trimming_report.txt
    ├── NG-26745_B4_lib454432_7213_2_1_unpaired_1.fq.gz
    ├── NG-26745_B4_lib454432_7213_2_1_val_1_fastqc.html
    ├── NG-26745_B4_lib454432_7213_2_1_val_1_fastqc.zip
    ├── NG-26745_B4_lib454432_7213_2_1_val_1.fq.gz
    ├── NG-26745_B4_lib454432_7213_2_2.fastq.gz_trimming_report.txt
    ├── NG-26745_B4_lib454432_7213_2_2_unpaired_2.fq.gz
    ├── NG-26745_B4_lib454432_7213_2_2_val_2_fastqc.html
    ├── NG-26745_B4_lib454432_7213_2_2_val_2_fastqc.zip
    ├── NG-26745_B4_lib454432_7213_2_2_val_2.fq.gz
    └── trim_galore_command.txt
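As an optional sanity check on the trimmed reads listed above, the compressed files can be tested with `gzip -t`. This is a sketch under assumptions rather than something the pipeline does itself: `check_gz` is a hypothetical helper, and it only looks at files ending in `.fq.gz`.

```shell
# Hypothetical helper: test every .fq.gz file in directory $1 with gzip -t
# and print the names of files that fail the integrity check.
# Returns non-zero if at least one file is damaged.
check_gz() {
    failed=0
    for file in "$1"/*.fq.gz; do
        # Skip the literal pattern when no file matches.
        [ -e "$file" ] || continue
        if ! gzip -t "$file" 2>/dev/null; then
            echo "$file"
            failed=1
        fi
    done
    return "$failed"
}

# Example usage:
# check_gz trimming_output/trimmed_reads
```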

The files fastqc_command.txt and trim_galore_command.txt contain all of the parameters used for each program. This information is useful when documenting your methods in a publication.

$ cat trimming_output/trimmed_reads/trim_galore_command.txt
trim_galore --fastqc --gzip --quality 20 --length 55 --cores 1 --paired --retain_unpaired NG-26745_B4_lib454432_7213_2_1.fastq.gz NG-26745_B4_lib454432_7213_2_2.fastq.gz
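To pull a single parameter out of such a command file, for example when writing a methods section, something like the following could be used. `get_option` is a hypothetical helper, not part of the pipeline, and it assumes that the option and its value are separated by spaces, as in the file shown above.

```shell
# Hypothetical helper: print the value that follows option $2 in command file $1.
# Assumes options and values are separated by spaces.
get_option() {
    tr -s ' ' '\n' < "$1" |
        awk -v opt="$2" 'found { print; exit } $0 == opt { found = 1 }'
}

# Example usage:
# get_option trimming_output/trimmed_reads/trim_galore_command.txt --quality
```

For the command file shown above, asking for `--quality` would print 20 and asking for `--length` would print 55.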

Cleaning Up

Nextflow produces a lot of additional files, which take up space on your drive. These files are useful when running and troubleshooting the pipeline, but they may safely be removed afterwards, since everything you need is saved to the output directory. Run these commands in the folder where you ran the pipeline to remove the unnecessary files:

$ rm -f .nextflow.log*
$ rm -rf .nextflow*
$ rm -rf work
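A more cautious variant is to remove these files only after confirming that the output directory actually contains results. The sketch below wraps the same `rm` commands in a hypothetical helper, `clean_nextflow`, which is not part of the pipeline. It takes the run directory and the output directory as arguments.

```shell
# Hypothetical helper: remove Nextflow's working files from run directory $1,
# but only if output directory $2 exists and is not empty.
clean_nextflow() {
    run_dir=$1
    output_dir=$2
    if [ -d "$output_dir" ] && [ -n "$(ls -A "$output_dir")" ]; then
        rm -rf "$run_dir"/.nextflow* "$run_dir"/work
    else
        echo "No results found in $output_dir; nothing was removed." >&2
        return 1
    fi
}

# Example usage, run from the folder where the pipeline was started:
# clean_nextflow . trimming_output
```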

© Animal Evolution and Biodiversity 2021