This is a Snakemake workflow for downloading and quality-controlling metagenomic read files (FASTQ) from ENA. It uses fastq-dl to first download a set of paired-end reads from ENA, and subsequently QCs the data using the metagen-fastqc script.
-
Prepare and index a FASTA file of your host genome for decontamination. You can follow the instructions provided with the metagen-fastqc script.
-
Clone this repository
git clone https://github.com/alexmsalmeida/metagen-fetch.git
-
Edit the configuration file config/config.yml:
input_file: TSV file (no header) with run accessions listed in the first column and corresponding study accessions in the second column.
output_dir: Output directory to store cleaned files.
host_ref: Location of the indexed FASTA file for host decontamination.
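As a rough illustration of the settings above, the sketch below writes a two-row input TSV and a matching configuration file into a scratch directory named example/. The accessions, the example/ directory, and all paths are placeholders, and the flat key layout is an assumption based on the three keys listed above. Check it against the config/config.yml shipped with the repository before relying on it.

```shell
# Hypothetical example: all accessions and paths below are placeholders.
mkdir -p example

# Two tab-separated columns, no header: run accession, then study accession.
printf 'ERR0000001\tPRJEB00001\n' > example/runs.tsv
printf 'ERR0000002\tPRJEB00001\n' >> example/runs.tsv

# Config sketch using the three keys described above (assumed flat layout).
cat > example/config.yml <<'EOF'
input_file: example/runs.tsv
output_dir: example/cleaned
host_ref: /path/to/host_genome.fa
EOF

# Sanity check: report any row that does not have exactly two columns.
awk -F'\t' 'NF != 2 { print "line " NR ": expected 2 columns, found " NF; bad = 1 }
            END { exit bad + 0 }' example/runs.tsv
```

The final awk command exits with a non-zero status if any row is malformed, which can catch a space-separated or headered file before the pipeline is submitted.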
-
Run the pipeline on a cluster (e.g., SLURM)
snakemake --use-conda -k -j 25 --profile config/slurm
- Cleaned files will be stored in the specified output directory, under [study]/[run].
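To preview where results should land, the output layout described above can be derived from the input TSV. This is a minimal sketch with a placeholder accession pair and a placeholder output directory named cleaned. It only prints the expected paths and does not inspect the pipeline's actual output.

```shell
# Placeholder input: substitute the input_file and output_dir from your config.
input_file=$(mktemp)
printf 'ERR0000001\tPRJEB00001\n' > "$input_file"
output_dir="cleaned"

# Column 1 is the run accession, column 2 the study accession,
# so each expected directory is <output_dir>/<study>/<run>.
expected=$(awk -F'\t' -v out="$output_dir" '{ print out "/" $2 "/" $1 }' "$input_file")
echo "$expected"
```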