Fastq Utilities

Overview

The Fastq Utilities service provides capabilities for analyzing and processing raw sequencing data in FASTQ format. This service helps researchers assess read quality, trim adapters and low-quality sequences, align reads to reference genomes, ensure proper pairing of paired-end reads, and remove host contamination.

Key features:

Quality Control: Generate FastQC reports to visualize base call quality metrics, identify potential sequencing biases, and assess the overall quality of sequencing data.
Trimming: Remove adapter sequences and low-quality bases from reads using Trim Galore, improving downstream analysis accuracy.
Alignment: Align reads to a reference genome using Bowtie2, producing BAM files and generating SAMStat reports to evaluate alignment quality and coverage.
Paired-End Read Synchronization: Synchronize paired-end reads using Fastq-Pair, resolving unordered sequences or missing mates that can hinder downstream analyses such as genome assembly.
Host Read Removal: Remove reads originating from the host organism using Hostile, improving the analysis of microbial or other target sequences in complex samples.

Service Inputs

The Fastq Utilities service accepts a variety of inputs to accommodate different data sources and analysis needs:

1. Input Data:

Paired-end reads: Two separate FASTQ files representing each end of the DNA fragment. Files can be gzipped (.gz) or uncompressed.
Single-end reads: A single FASTQ file containing reads. Files can be gzipped (.gz) or uncompressed.
Sequence Read Archive (SRA) accession numbers: Directly submit data from the NCBI SRA database using their accession numbers (e.g., SRR1234567).
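
As a rough illustration of the three input types above, the following sketch tells an SRA run accession apart from a FASTQ file name and opens compressed or uncompressed files the same way. The function names are hypothetical and not part of this module, and the accession pattern assumes run accessions of the form SRR, ERR, or DRR followed by digits.

```python
import gzip
import re

# Assumption: run accessions start with SRR, ERR, or DRR followed by digits.
SRA_RUN_PATTERN = re.compile(r"^[SED]RR\d+$")


def classify_input(value):
    """Return 'sra' for a run accession, 'fastq_gz' or 'fastq' for a file name."""
    if SRA_RUN_PATTERN.match(value):
        return "sra"
    lower = value.lower()
    if lower.endswith((".fastq.gz", ".fq.gz")):
        return "fastq_gz"
    if lower.endswith((".fastq", ".fq")):
        return "fastq"
    raise ValueError(f"Unrecognized input: {value!r}")


def open_fastq(path):
    """Open a FASTQ file as text, detecting gzip by its two-byte magic number."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")
```

Checking the magic number rather than the extension lets `open_fastq` cope with gzipped files that were renamed without the `.gz` suffix.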

2. Pipeline Options:

Specify the processing steps to be performed. These options can be selected independently or in any desired combination:

Trim: Perform adapter and quality trimming using Trim Galore.
FastQC: Generate a FastQC report to assess read quality.
Align: Align reads to a reference genome using Bowtie2.
Paired Filter: Synchronize paired-end reads using Fastq-Pair.
Hostile: Remove human reads from the data using Hostile, which aligns reads with either Bowtie2 (for short reads) or Minimap2 (for long reads). Paired or unpaired FASTQ files are supported, both compressed and uncompressed.
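
To make the idea of independently selectable steps concrete, here is a minimal sketch that turns a list of selected steps into command-line argument lists for Trim Galore, FastQC, and Bowtie2. It is not how the scripts in this module build their commands; the function name, the default output directory, the `ref_index` prefix, and the `aligned.sam` file name are all assumptions. Fastq-Pair and Hostile are left out to keep the sketch short, and every step receives the original input files, whereas a real pipeline would likely pass trimmed reads on to later steps.

```python
def build_commands(steps, read1, read2=None, out_dir="out", index_prefix="ref_index"):
    """Return one argument list per selected step, in the order given.

    Nothing is executed here; the lists could be handed to subprocess.run.
    """
    reads = [read1] if read2 is None else [read1, read2]
    commands = []
    for step in steps:
        if step == "trim":
            cmd = ["trim_galore", "--output_dir", out_dir]
            if read2 is not None:
                cmd.append("--paired")  # trim both mates together
            commands.append(cmd + reads)
        elif step == "fastqc":
            commands.append(["fastqc", "--outdir", out_dir] + reads)
        elif step == "align":
            cmd = ["bowtie2", "-x", index_prefix]
            # -U takes unpaired reads; -1/-2 take the two mates of a pair
            cmd += ["-U", read1] if read2 is None else ["-1", read1, "-2", read2]
            cmd += ["-S", f"{out_dir}/aligned.sam"]
            commands.append(cmd)
        else:
            raise ValueError(f"Step not covered by this sketch: {step!r}")
    return commands


if __name__ == "__main__":
    for command in build_commands(["trim", "fastqc", "align"], "s_1.fq.gz", "s_2.fq.gz"):
        print(" ".join(command))
```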

3. Parameters:

Output Folder: Specify the workspace folder where the results will be stored.
Output Name: Provide a unique name to identify the analysis results.
Target Genome (for Alignment): Select the reference genome for read alignment.
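
The relationship between pipeline options and parameters can be sketched as a small validation routine. The dictionary keys used here (`steps`, `output_folder`, `output_name`, `target_genome`) and the step identifiers are hypothetical and may not match the field names the service actually uses; the only rule taken from the text above is that the target genome applies to alignment.

```python
# Hypothetical step identifiers mirroring the pipeline options listed above.
ALLOWED_STEPS = {"trim", "fastqc", "align", "paired_filter", "hostile"}


def validate_job(job):
    """Return a list of problems found in a job description (empty if none)."""
    problems = []
    steps = job.get("steps", [])
    unknown = [step for step in steps if step not in ALLOWED_STEPS]
    if unknown:
        problems.append("unknown steps: " + ", ".join(unknown))
    if not job.get("output_folder"):
        problems.append("output folder is missing")
    if not job.get("output_name"):
        problems.append("output name is missing")
    if "align" in steps and not job.get("target_genome"):
        problems.append("align requires a target genome")
    return problems
```

A caller could refuse to submit a job whenever the returned list is non-empty and show the messages to the user.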

Service Outputs

The service generates various output files depending on the selected pipeline options:

Trim:

xxx.fastq_trimming_report.txt: A detailed report summarizing trimming parameters, detected adapter sequences, processed reads, and trimming statistics for each input file.
xxx_val_1.fq.gz / xxx_val_2.fq.gz (paired-end reads) or xxx_trimmed.fq.gz (single-end reads): Trimmed FASTQ files.

FastQC:

xxx_fastqc.html: An HTML report containing comprehensive quality control metrics and visualizations generated by FastQC for each read file.

Align:

xxx.bam: A compressed binary file (BAM format) containing aligned reads.
xxx.bam.bai: An index file for the BAM file, enabling efficient data access.
xxx.bam.samstat.html: An HTML report summarizing alignment statistics, including mapping quality (MAPQ) distributions and error profiles, generated by SAMStat.
xxx.unmapped#.fq.gz: FASTQ files containing unmapped reads, where # represents the read pair (1 or 2).
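
One quick check on the alignment outputs is counting how many reads ended up in an unmapped FASTQ file. The sketch below assumes the conventional four-line FASTQ record layout (no wrapped sequence lines); the function name is illustrative and not part of this module.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming exactly four lines per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4
```

Comparing this count with the number of reads in the input gives a rough unmapped fraction without opening the BAM file.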

Paired Filter:

xxx_1.paired.fq.gz / xxx_2.paired.fq.gz: Synchronized paired-end FASTQ files, ensuring each read has a corresponding mate in the paired file.
xxx_meta.txt: A report file providing details about the paired filtering process.
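
The idea behind paired-end synchronization can be sketched in a few lines of Python. This is an in-memory illustration of the concept only, not Fastq-Pair's implementation: it assumes four-line records, assumes mates share a read name once a trailing `/1` or `/2` is removed, and all function names are hypothetical.

```python
def read_id(header):
    """Extract the read name from a FASTQ header, dropping a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def parse_fastq(lines):
    """Yield (read_id, record) pairs; a record is a (header, seq, plus, qual) tuple."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping one iterator with itself four times consumes four lines per record.
    for header, seq, plus, qual in zip(stripped, stripped, stripped, stripped):
        yield read_id(header), (header, seq, plus, qual)


def synchronize(lines1, lines2):
    """Return (paired1, paired2, singles1, singles2) for two FASTQ line iterables."""
    index = dict(parse_fastq(lines1))  # read name -> record from the first file
    paired1, paired2, singles2 = [], [], []
    for rid, record in parse_fastq(lines2):
        mate = index.pop(rid, None)
        if mate is None:
            singles2.append(record)  # no mate in the first file
        else:
            paired1.append(mate)
            paired2.append(record)
    singles1 = list(index.values())  # reads never matched by the second file
    return paired1, paired2, singles1, singles2
```

After synchronization, record i in `paired1` and record i in `paired2` are mates, which is the property the synchronized paired files are meant to guarantee. Holding the first file in a dictionary is only practical for small inputs.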

Hostile:

nonhuman_*.fastq.gz: Output files containing reads identified as non-human.

Common Output Files:

meta.txt: A general metadata file containing information about the job submission and parameters.
fqutils.err.txt: A log file capturing any errors or warnings encountered during the pipeline execution.
fqutils.out.txt: A log file containing the executed commands for the pipeline.
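
The file-name patterns listed in this section can be used to sort a results folder by pipeline step. The following sketch encodes those patterns with shell-style wildcards; the step labels and function names are illustrative, and `xxx` is treated as an arbitrary prefix.

```python
from fnmatch import fnmatchcase

# Patterns taken from the output descriptions above; checked in this order.
OUTPUT_PATTERNS = [
    ("trim", ["*_trimming_report.txt", "*_val_1.fq.gz", "*_val_2.fq.gz", "*_trimmed.fq.gz"]),
    ("fastqc", ["*_fastqc.html"]),
    ("align", ["*.bam", "*.bam.bai", "*.bam.samstat.html", "*.unmapped?.fq.gz"]),
    ("paired_filter", ["*.paired.fq.gz", "*_meta.txt"]),
    ("hostile", ["nonhuman_*.fastq.gz"]),
    ("common", ["meta.txt", "fqutils.err.txt", "fqutils.out.txt"]),
]


def classify_output(filename):
    """Return the step whose pattern matches the file name, or 'other'."""
    for step, patterns in OUTPUT_PATTERNS:
        if any(fnmatchcase(filename, pattern) for pattern in patterns):
            return step
    return "other"


def group_outputs(filenames):
    """Group file names by the step that most likely produced them."""
    groups = {}
    for name in filenames:
        groups.setdefault(classify_output(name), []).append(name)
    return groups
```

`fnmatchcase` is used instead of `fnmatch` so that matching is case-sensitive on every platform.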

Scripts and Utilities

This module contains the following scripts, which power the Fastq Utilities service:

Script Name	Purpose
App-FastqUtils.pl	The primary application script for the Fastq Utilities service. It handles user input, job submission, and result management.
FQUtils.pm	A Perl module containing functions and subroutines specific to the Fastq Utilities service. It handles tasks such as reading configuration files, parsing input parameters, generating command-line arguments for external tools, and processing output files.
p3-fqutils.py	A Python interface for invoking the tool from the command line.
fastq_uils.py	A Python library that builds and runs the tool commands for the given inputs.

References

Andrews, S. FastQC: a quality control tool for high throughput sequence data. 2010.
Edwards, J.A. and Edwards, R.A. Fastq-pair: efficient synchronization of paired-end fastq files. bioRxiv, 2019: p. 552885.
Constantinides, B., Hunt, M. and Crook, D.W. Hostile: accurate decontamination of microbial host sequences. Bioinformatics, 2023. 39(12): btad728. Preprint: https://www.biorxiv.org/content/10.1101/2023.07.04.547735v1. Software: https://github.com/bede/hostile
Krueger, F. Trim Galore: a wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries. URL http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/. (Date of access: 28/04/2016), 2012.
Langmead, B. and Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nature methods, 2012. 9(4): p. 357.
Lassmann, T., Hayashizaki, Y. and Daub, C.O. SAMStat: monitoring biases in next generation sequencing data. Bioinformatics, 2010. 27(1): p. 130-131.
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 2011. 17(1): p. 10-12.