BactSeq is a Nextflow pipeline for performing bacterial RNA-Seq analysis.
The pipeline will perform the following steps:
- Trim adaptors from reads (Trim Galore!)
- Read QC (FastQC)
- Align reads to reference genome (BWA-MEM)
- Size-factor scaling and gene length (RPKM) scaling of counts (TMM from edgeR)
- Principal component analysis (PCA) of normalised expression values
- Differential gene expression (DESeq2) (optional)
- Functional enrichment of differentially expressed genes (topGO) (optional)
You will need to install Nextflow (version 21.10.3+).
Usage:
nextflow run BactSeq --data_dir [dir] --sample_file [file] --ref_genome [file] --ref_ann [file] -profile docker [other_options]
Mandatory arguments:
--data_dir [dir] Path to directory containing FastQ files.
--ref_genome [file] Path to FASTA file containing reference genome sequence (bwa) or multi-FASTA file containing coding gene sequences (kallisto).
--ref_ann [file] Path to GFF file containing reference genome annotation.
--sample_file [file] Path to file containing sample information.
-profile [str] Configuration profile to use.
Available: conda, docker, singularity.
Other options:
--aligner [str] (Pseudo-)aligner to be used. Options: `bwa`, `kallisto`. Default = bwa.
--cont_tabl [file] Path to tsv file containing contrasts to be performed for differential expression.
--fragment_len [str] Estimated average fragment length for kallisto transcript quantification (only required for single-end reads). Default = 150.
--fragment_sd [str] Estimated standard deviation of fragment length for kallisto transcript quantification (only required for single-end reads). Default = 20.
--func_file [file] Path to GMT-format file containing functional annotation.
--l2fc_thresh [str] Absolute log2(FoldChange) threshold for identifying differentially expressed genes. Default = 1.
--outdir [dir] The output directory where the results will be saved. Default = ./results.
--paired [str] Is data paired-end? Default = FALSE.
--p_thresh [str] Adjusted p-value threshold for identifying differentially expressed genes. Default = 0.05.
--skip_trimming [bool] Do not trim adaptors from FastQ files.
--strandedness [str] Is data stranded? Options: `unstranded`, `forward`, `reverse`. Default = reverse.
-name [str] Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.
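
When runs are launched from a script, it can help to assemble the command line from a dictionary of options. The following is a minimal sketch, not part of BactSeq: the helper `build_bactseq_command` and all file paths are hypothetical, and only flags listed above are used.

```python
import shlex


def build_bactseq_command(params, profile="docker"):
    """Assemble a `nextflow run BactSeq` command from a dict of options.

    Hypothetical helper for illustration; it is not shipped with BactSeq.
    """
    cmd = ["nextflow", "run", "BactSeq"]
    for name, value in params.items():
        if value is True:
            # Boolean switches such as --skip_trimming take no value.
            cmd.append(f"--{name}")
        elif value is not None and value is not False:
            cmd.extend([f"--{name}", str(value)])
    # -profile is a Nextflow option, hence the single dash.
    cmd.extend(["-profile", profile])
    return cmd


command = build_bactseq_command({
    "data_dir": "fastq",           # hypothetical paths
    "sample_file": "samples.tsv",
    "ref_genome": "genome.fasta",
    "ref_ann": "genome.gff",
    "aligner": "bwa",
    "strandedness": "reverse",
    "skip_trimming": False,        # False: the switch is left out
})
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with paths that contain spaces.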
Explanation of parameters:
- `ref_genome`: genome sequence for mapping reads.
- `ref_ann`: annotation of genes/features in the reference genome.
- `sample_file`: TSV file containing sample information (see below).
- `data_dir`: path to directory containing FASTQ files.
- `paired`: data are paired-end (default is to assume single-end).
- `strandedness`: is data stranded? Options: `unstranded`, `forward`, `reverse`. Default = `reverse`.
- `cont_tabl`: (optional) table of contrasts to be performed for differential expression.
- `func_file`: (optional) functional annotation file. If provided, functional enrichment of DE genes will be performed.
- `p_thresh`: adjusted p-value threshold for identifying differentially expressed genes. Default = 0.05.
- `l2fc_thresh`: absolute log2(FoldChange) threshold for identifying differentially expressed genes. Default = 1.
- `skip_trimming`: do not trim adaptors from reads.
- `outdir`: the output directory where the results will be saved (Default: `./results`).
- `-resume`: will re-start the pipeline if it has been previously run.
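
To illustrate how the `p_thresh` and `l2fc_thresh` cut-offs combine, here is a minimal sketch. It assumes a gene is called differentially expressed when its adjusted p-value is below `p_thresh` and its absolute log2 fold change is at least `l2fc_thresh`; whether the pipeline uses strict or non-strict comparisons is an assumption, and the gene names and values are made up.

```python
def is_differentially_expressed(log2fc, padj, p_thresh=0.05, l2fc_thresh=1.0):
    """Apply both thresholds to a single gene.

    Assumption: padj < p_thresh and |log2fc| >= l2fc_thresh.
    The exact comparison operators used by the pipeline may differ.
    """
    if padj is None:
        # DESeq2 can report NA for the adjusted p-value; treat as not significant.
        return False
    return padj < p_thresh and abs(log2fc) >= l2fc_thresh


# Made-up results for illustration only.
results = [
    {"gene": "geneA", "log2fc": 2.3, "padj": 0.001},
    {"gene": "geneB", "log2fc": -1.4, "padj": 0.03},
    {"gene": "geneC", "log2fc": 0.4, "padj": 0.0001},  # significant but small change
    {"gene": "geneD", "log2fc": 3.1, "padj": 0.2},     # large change, not significant
    {"gene": "geneE", "log2fc": 1.8, "padj": None},    # no adjusted p-value
]

de_genes = [
    r["gene"]
    for r in results
    if is_differentially_expressed(r["log2fc"], r["padj"])
]
print(de_genes)
```

Because the threshold is on the absolute log2 fold change, both up-regulated and down-regulated genes pass the filter.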
- Genome sequence: FASTA file containing the genome sequence. Can be retrieved from NCBI.
- Gene annotation file: GFF file containing the genome annotation. Can be retrieved from NCBI.
- Sample file: TSV file containing sample information. Must contain the following columns:
  - `sample`: sample ID.
  - `file_name`: name of the FASTQ file.
  - `group`: grouping factor for differential expression and exploratory plots.
  - `rep_no`: repeat number (if more than one sample per group).
  - `paired`: data are paired-end? (0 = single-end, 1 = paired-end).
Example:

If data are single-end, leave the `file2` column blank.

| sample | file1 | file2 | group | rep_no | paired |
|---|---|---|---|---|---|
| AS_1 | SRX1607051_T1.fastq.gz | | Artificial_Sputum | 1 | 1 |
| AS_2 | SRX1607052_T1.fastq.gz | | Artificial_Sputum | 2 | 1 |
| AS_3 | SRX1607053_T1.fastq.gz | | Artificial_Sputum | 3 | 1 |
| MB_1 | SRX1607054_T1.fastq.gz | | Middlebrook | 1 | 1 |
| MB_2 | SRX1607055_T1.fastq.gz | | Middlebrook | 2 | 1 |
| MB_3 | SRX1607056_T1.fastq.gz | | Middlebrook | 3 | 1 |
| ER_1 | SRX1607060_T1.fastq.gz | | Erythromycin | 1 | 1 |
| ER_2 | SRX1607061_T1.fastq.gz | | Erythromycin | 2 | 1 |
| ER_3 | SRX1607062_T1.fastq.gz | | Erythromycin | 3 | 1 |
| KN_1 | SRX1607066_T1.fastq.gz | | Kanamycin | 1 | 1 |
| KN_2 | SRX1607067_T1.fastq.gz | | Kanamycin | 2 | 1 |
| KN_3 | SRX1607068_T1.fastq.gz | | Kanamycin | 3 | 1 |
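
A sample file like the example above can be written and sanity-checked with a short script. This is a sketch under assumptions: it takes the header of the example (`sample`, `file1`, `file2`, `group`, `rep_no`, `paired`) as the expected layout, and the helpers `write_sample_sheet` and `check_sample_sheet` are hypothetical, not part of BactSeq. The checks are illustrative and may not match what the pipeline itself validates.

```python
import csv
import io

# Column order taken from the example sample file (assumed to be the expected header).
COLUMNS = ["sample", "file1", "file2", "group", "rep_no", "paired"]


def write_sample_sheet(rows, handle):
    """Write rows (dicts keyed by COLUMNS) as a tab-separated sample file."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def check_sample_sheet(handle):
    """Read a sample file back and run a few basic checks on it."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    sample_ids = [row["sample"] for row in rows]
    if len(sample_ids) != len(set(sample_ids)):
        raise ValueError("sample IDs must be unique")
    for row in rows:
        if not row["file1"]:
            raise ValueError(f"no FASTQ file given for sample {row['sample']}")
    return rows


# First two rows of the example; file2 is left blank.
example_rows = [
    {"sample": "AS_1", "file1": "SRX1607051_T1.fastq.gz", "file2": "",
     "group": "Artificial_Sputum", "rep_no": "1", "paired": "1"},
    {"sample": "AS_2", "file1": "SRX1607052_T1.fastq.gz", "file2": "",
     "group": "Artificial_Sputum", "rep_no": "2", "paired": "1"},
]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_sample_sheet(example_rows, buffer)
buffer.seek(0)
checked_rows = check_sample_sheet(buffer)
print(buffer.getvalue())
```

Writing through the `csv` module keeps the blank `file2` field as an empty cell between two tabs, which is easy to lose when editing the file by hand.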
- trim_galore directory containing adaptor-trimmed RNA-Seq files and FastQC results.
- read_counts directory containing:
  - `ref_gene_df.tsv`: table of genes in the annotation.
  - `gene_counts.tsv`: raw read counts per gene.
  - `cpm_counts.tsv`: size factor scaled counts per million (CPM).
  - `rpkm_counts.tsv`: size factor scaled and gene length-scaled counts, expressed as reads per kilobase per million mapped reads (RPKM).
- PCA_samples directory containing principal component analysis results.
- diff_expr directory containing differential expression results.
- func_enrich directory containing functional enrichment results (optional).
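
The relationship between the raw, CPM and RPKM count tables can be sketched with the textbook formulas. This is a simplified illustration, not the pipeline's implementation: it scales by the plain library size (total counts per sample), whereas the pipeline uses TMM size factors from edgeR, so the numbers will not match the output files exactly. The gene names, counts and lengths are made up.

```python
def counts_per_million(counts):
    """Scale raw counts to counts per million using the plain library size."""
    library_size = sum(counts.values())
    return {gene: n * 1e6 / library_size for gene, n in counts.items()}


def reads_per_kb_per_million(counts, gene_lengths_bp):
    """RPKM = count * 1e9 / (library size * gene length in bp)."""
    library_size = sum(counts.values())
    return {
        gene: n * 1e9 / (library_size * gene_lengths_bp[gene])
        for gene, n in counts.items()
    }


# Made-up counts and gene lengths for a single sample.
raw_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
gene_lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

cpm = counts_per_million(raw_counts)
rpkm = reads_per_kb_per_million(raw_counts, gene_lengths_bp)

for gene in raw_counts:
    print(f"{gene}\t{raw_counts[gene]}\t{cpm[gene]:.1f}\t{rpkm[gene]:.1f}")
```

In this example geneB has three times the raw count of geneA but is also three times as long, so both end up with the same RPKM. That is the reason for gene length scaling when comparing expression between genes.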