A Nextflow pipeline for performing bacterial RNA-Seq data analysis.

Primary language: R. License: MIT.

BactSeq

Introduction

BactSeq is a Nextflow pipeline for performing bacterial RNA-Seq analysis.

Pipeline summary

The pipeline will perform the following steps:

  1. Trim adaptors from reads (Trim Galore!)
  2. Read QC (FastQC)
  3. Align reads to reference genome (BWA-MEM), or pseudo-align to coding gene sequences (kallisto)
  4. Normalise read counts: size-factor scaling (TMM from edgeR) and gene-length scaling (RPKM)
  5. Principal component analysis (PCA) of normalised expression values
  6. Differential gene expression (DESeq2) (optional)
  7. Functional enrichment of differentially expressed genes (topGO) (optional)
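
The normalisation in step 4 can be illustrated with a minimal sketch in Python. This is not the pipeline's own code (the pipeline uses edgeR); it only shows the arithmetic behind CPM and RPKM values. The function names, the `norm_factor` argument (standing in for a TMM normalisation factor, left at 1.0 here, i.e. no TMM adjustment), and the toy gene IDs, counts and lengths are all assumptions made for illustration.

```python
from typing import Dict


def cpm(counts: Dict[str, float], norm_factor: float = 1.0) -> Dict[str, float]:
    """Counts per million for one sample.

    The effective library size is total counts * norm_factor;
    norm_factor = 1.0 means no TMM adjustment is applied.
    """
    effective_lib_size = sum(counts.values()) * norm_factor
    if effective_lib_size <= 0:
        raise ValueError("library size must be positive")
    return {gene: n * 1e6 / effective_lib_size for gene, n in counts.items()}


def rpkm(counts: Dict[str, float], gene_lengths_bp: Dict[str, float],
         norm_factor: float = 1.0) -> Dict[str, float]:
    """Reads per kilobase per million: CPM divided by gene length in kb."""
    per_million = cpm(counts, norm_factor)
    return {gene: per_million[gene] / (gene_lengths_bp[gene] / 1000.0)
            for gene in counts}


# Toy data (hypothetical gene IDs, read counts and gene lengths in bp).
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 1500, "geneC": 3000}

for gene, value in rpkm(counts, lengths).items():
    print(f"{gene}\t{value:.1f}")
```

In this toy sample geneB and geneC get the same RPKM although geneC has twice the reads, because geneC is also twice as long.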

Installation

You will need to install Nextflow (version 21.10.3+).

Usage:
nextflow run BactSeq --data_dir [dir] --sample_file [file] --ref_genome [file] --ref_ann [file] -profile docker [other_options]

Mandatory arguments:
  --data_dir [dir]                Path to directory containing FastQ files.
  --ref_genome [file]             Path to FASTA file containing reference genome sequence (bwa) or multi-FASTA file containing coding gene sequences (kallisto).
  --ref_ann [file]                Path to GFF file containing reference genome annotation.
  --sample_file [file]            Path to file containing sample information.
  -profile [str]                  Configuration profile to use.
                                  Available: conda, docker, singularity.

Other options:
  --aligner [str]                 (Pseudo-)aligner to be used. Options: `bwa`, `kallisto`. Default = bwa.
  --cont_tabl [file]              Path to TSV file containing contrasts to be performed for differential expression.
  --fragment_len [str]            Estimated average fragment length for kallisto transcript quantification (only required for single-end reads). Default = 150.
  --fragment_sd [str]             Estimated standard deviation of fragment length for kallisto transcript quantification (only required for single-end reads). Default = 20.
  --func_file [file]              Path to GMT-format file containing functional annotation.
  --l2fc_thresh [str]             Absolute log2(FoldChange) threshold for identifying differentially expressed genes. Default = 1.
  --outdir [dir]                  Path to the output directory where the results will be saved. Default = ./results.
  --paired [str]                  Is data paired-end? Default = FALSE.
  --p_thresh [str]                Adjusted p-value threshold for identifying differentially expressed genes. Default = 0.05.
  --skip_trimming [bool]          Do not trim adaptors from FastQ files.
  --strandedness [str]            Is data stranded? Options: `unstranded`, `forward`, `reverse`. Default = reverse.
  -name [str]                     Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
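
As an illustration of how the options above fit together, here is a small Python sketch that assembles a `nextflow run BactSeq` command line. The helper `build_command` and every file name are hypothetical placeholders, and the handling of boolean options (emitting a bare `--skip_trimming`) is an assumption about how such options are switched on; check it against your Nextflow version.

```python
import shlex

PROFILES = {"conda", "docker", "singularity"}


def build_command(data_dir, sample_file, ref_genome, ref_ann,
                  profile="docker", **options):
    """Assemble a `nextflow run BactSeq` argument list from the documented options."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    cmd = [
        "nextflow", "run", "BactSeq",
        "--data_dir", data_dir,
        "--sample_file", sample_file,
        "--ref_genome", ref_genome,
        "--ref_ann", ref_ann,
        "-profile", profile,
    ]
    for name, value in options.items():
        if value is True:
            # Assumption: a bare --name switches a boolean option on.
            cmd.append(f"--{name}")
        elif value is not False and value is not None:
            cmd.extend([f"--{name}", str(value)])
    return cmd


# Hypothetical input paths; replace them with your own files.
cmd = build_command(
    "fastq/", "samples.tsv", "genome.fna", "genome.gff",
    profile="docker",
    cont_tabl="contrasts.tsv",
    p_thresh=0.05,
    l2fc_thresh=1,
    skip_trimming=True,
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the pipeline from a script.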

Explanation of parameters:

  • ref_genome: genome sequence for mapping reads.
  • ref_ann: annotation of genes/features in the reference genome.
  • sample_file: TSV file containing sample information (see below).
  • data_dir: path to directory containing FASTQ files.
  • paired: whether the data are paired-end. Default = FALSE (single-end).
  • strandedness: strandedness of the data. Options: unstranded, forward, reverse. Default = reverse.
  • cont_tabl: (optional) table of contrasts to be performed for differential expression.
  • func_file: (optional) functional annotation file - if provided, functional enrichment of DE genes will be performed.
  • p_thresh: adjusted p-value threshold for identifying differentially expressed genes. Default = 0.05.
  • l2fc_thresh: absolute log2(FoldChange) threshold for identifying differentially expressed genes. Default = 1.
  • skip_trimming: do not trim adaptors from reads.
  • outdir: the output directory where the results will be saved. Default = ./results.
  • -resume: resume a previous run, reusing cached results from steps that have already completed.
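
To show how `p_thresh` and `l2fc_thresh` combine, here is a minimal Python sketch of the filtering rule. It assumes a gene is called differentially expressed when its adjusted p-value is below `p_thresh` and its absolute log2 fold change is above `l2fc_thresh`; whether the pipeline uses strict or non-strict inequalities is not stated here, so treat that as an assumption. The column names `padj` and `log2FoldChange` follow DESeq2 result tables, and the example rows are invented.

```python
import math


def is_de(row, p_thresh=0.05, l2fc_thresh=1.0):
    """Return True if a result row passes both thresholds.

    Rows with a missing or NaN adjusted p-value or fold change
    are treated as not differentially expressed.
    """
    padj = row.get("padj")
    l2fc = row.get("log2FoldChange")
    if padj is None or l2fc is None:
        return False
    if math.isnan(padj) or math.isnan(l2fc):
        return False
    return padj < p_thresh and abs(l2fc) > l2fc_thresh


# Invented example rows (not real results).
results = [
    {"gene": "geneA", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": -1.8, "padj": 0.020},
    {"gene": "geneC", "log2FoldChange": 0.4, "padj": 0.001},
    {"gene": "geneD", "log2FoldChange": 3.1, "padj": 0.200},
    {"gene": "geneE", "log2FoldChange": 1.5, "padj": float("nan")},
]

de_genes = [r["gene"] for r in results if is_de(r)]
print(de_genes)
```

Here geneC fails the fold-change threshold, geneD fails the p-value threshold, and geneE has no usable adjusted p-value, so only geneA and geneB are kept.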

Required input

  • Genome sequence: FASTA file containing the genome sequence. Can be retrieved from NCBI.

  • Gene annotation file: GFF file containing the genome annotation. Can be retrieved from NCBI.

  • Sample file: TSV file containing sample information. Must contain the following columns:

    • sample: sample ID.
    • file1: name of the FASTQ file (the first file of the pair, if data are paired-end).
    • file2: name of the second FASTQ file (paired-end data only).
    • group: grouping factor for differential expression and exploratory plots.
    • rep_no: repeat number (if more than one sample per group).
    • paired: are the data paired-end? (0 = single-end, 1 = paired-end).

    Example:

    If data are single-end, leave the file2 column blank.

    sample	file1	file2	group	rep_no	paired
    AS_1	SRX1607051_T1.fastq.gz		Artificial_Sputum	1	1
    AS_2	SRX1607052_T1.fastq.gz		Artificial_Sputum	2	1
    AS_3	SRX1607053_T1.fastq.gz		Artificial_Sputum	3	1
    MB_1	SRX1607054_T1.fastq.gz		Middlebrook	1	1
    MB_2	SRX1607055_T1.fastq.gz		Middlebrook	2	1
    MB_3	SRX1607056_T1.fastq.gz		Middlebrook	3	1
    ER_1	SRX1607060_T1.fastq.gz		Erythromycin	1	1
    ER_2	SRX1607061_T1.fastq.gz		Erythromycin	2	1
    ER_3	SRX1607062_T1.fastq.gz		Erythromycin	3	1
    KN_1	SRX1607066_T1.fastq.gz		Kanamycin	1	1
    KN_2	SRX1607067_T1.fastq.gz		Kanamycin	2	1
    KN_3	SRX1607068_T1.fastq.gz		Kanamycin	3	1
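
Before launching a run it can help to sanity-check the sample file. The following Python sketch is not part of BactSeq. It assumes the column names shown in the example header above and checks only three things: that all columns are present, that `file1` is filled in, and that `paired` is 0 or 1. The function name `check_sample_file` and the two-row example are hypothetical.

```python
import csv
import io

# Column names taken from the example header above.
REQUIRED_COLUMNS = ["sample", "file1", "file2", "group", "rep_no", "paired"]


def check_sample_file(handle):
    """Return a list of problems found in a tab-separated sample file."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for row in reader:
        line_no = reader.line_num  # line of the file this row came from
        # Short rows yield None for absent fields, hence the `or ""`.
        if not (row["file1"] or "").strip():
            problems.append(f"line {line_no}: file1 is blank")
        if (row["paired"] or "").strip() not in {"0", "1"}:
            problems.append(f"line {line_no}: paired must be 0 or 1")
    return problems


# Hypothetical single-end example: file2 is left blank.
example = (
    "sample\tfile1\tfile2\tgroup\trep_no\tpaired\n"
    "S1\tS1.fastq.gz\t\tcontrol\t1\t0\n"
    "S2\tS2.fastq.gz\t\ttreated\t1\t0\n"
)
print(check_sample_file(io.StringIO(example)))
```

For a file on disk, pass an open handle instead, for example `open("samples.tsv", newline="")`.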

Output

  1. trim_galore directory containing adaptor-trimmed RNA-Seq files and FastQC results.
  2. read_counts directory containing:
    1. ref_gene_df.tsv: table of genes in the annotation.
    2. gene_counts.tsv: raw read counts per gene.
    3. cpm_counts.tsv: size-factor-scaled counts per million (CPM).
    4. rpkm_counts.tsv: size-factor- and gene-length-scaled counts, expressed as reads per kilobase per million mapped reads (RPKM).
  3. PCA_samples directory containing principal component analysis results.
  4. diff_expr directory containing differential expression results (optional).
  5. func_enrich directory containing functional enrichment results (optional).
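
After a run, the listing above can be turned into a quick completeness check. This Python sketch assumes the listed directories sit directly under the output directory, and treats `diff_expr` and `func_enrich` as optional because differential expression and functional enrichment are optional steps. The function name `missing_outputs` is hypothetical, and the demo uses a temporary directory rather than real results.

```python
import tempfile
from pathlib import Path

EXPECTED_DIRS = ["trim_galore", "read_counts", "PCA_samples", "diff_expr", "func_enrich"]
# Differential expression and functional enrichment are optional steps.
OPTIONAL_DIRS = {"diff_expr", "func_enrich"}
EXPECTED_COUNT_FILES = ["ref_gene_df.tsv", "gene_counts.tsv",
                        "cpm_counts.tsv", "rpkm_counts.tsv"]


def missing_outputs(outdir="./results", include_optional=False):
    """Return the expected output paths (relative to outdir) that do not exist."""
    root = Path(outdir)
    missing = []
    for name in EXPECTED_DIRS:
        if name in OPTIONAL_DIRS and not include_optional:
            continue
        if not (root / name).is_dir():
            missing.append(name)
    for name in EXPECTED_COUNT_FILES:
        if not (root / "read_counts" / name).is_file():
            missing.append(f"read_counts/{name}")
    return missing


# Demo on a temporary directory holding only read_counts/gene_counts.tsv.
with tempfile.TemporaryDirectory() as tmp:
    counts_dir = Path(tmp) / "read_counts"
    counts_dir.mkdir()
    (counts_dir / "gene_counts.tsv").touch()
    missing = missing_outputs(tmp)
print(missing)
```

Pass `include_optional=True` when contrasts and a functional annotation file were supplied, so that the optional directories are checked as well.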