MetaGen-Fetch - Processing public metagenomes

This is a Snakemake workflow for downloading and quality-controlling metagenomic read files (FASTQ) from the European Nucleotide Archive (ENA). It first downloads a set of paired-end reads with fastq-dl, then quality-controls them with the metagen-fastqc script.

Installation

Install conda and Snakemake (tested with v7.32.3).
Prepare and index a FASTA file of your host genome for decontamination, following the instructions provided with the metagen-fastqc script.
Clone this repository:

git clone https://github.com/alexmsalmeida/metagen-fetch.git

How to run

Edit the configuration file config/config.yml:
- input_file: TSV file (no header) with run accessions in the first column and the corresponding study accessions in the second column.
- output_dir: Output directory to store cleaned files.
- host_ref: Location of the indexed FASTA file for host decontamination.
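
Before launching, it can help to confirm that the input_file matches the two-column layout described above. The following is a minimal sketch of such a check, not part of the workflow itself; the function name read_accessions and the sample accessions are hypothetical and used only for illustration.

```python
import csv
import io


def read_accessions(handle):
    """Return (run, study) pairs from a headerless TSV.

    Column 1 is the run accession, column 2 the study accession.
    """
    pairs = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            raise ValueError(
                f"line {line_no}: expected run and study accessions, got {row!r}"
            )
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


# Made-up accessions, for illustration only.
example = "ERR0000001\tPRJEB00001\nERR0000002\tPRJEB00001\n"
pairs = read_accessions(io.StringIO(example))
print(pairs)
```

To check a real file, pass an open file handle (for example `open(path, newline="")`) instead of the in-memory example.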
Run the pipeline on a cluster (e.g., SLURM):

snakemake --use-conda -k -j 25 --profile config/slurm

Cleaned files will be stored under the specified output directory, in a [study]/[run] subdirectory for each run.
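
As an illustration of that layout, the sketch below builds the directory where the cleaned files for one run would be expected. It assumes that [study]/[run] means nested subdirectories under output_dir; the helper name expected_output_dir and the accessions are hypothetical, and the names of the files inside the directory are not specified here.

```python
from pathlib import PurePosixPath


def expected_output_dir(output_dir, study, run):
    """Build <output_dir>/<study>/<run> (assumed layout, see note above)."""
    return PurePosixPath(output_dir) / study / run


# Made-up values, for illustration only.
path = expected_output_dir("results", "PRJEB00001", "ERR0000001")
print(path)  # results/PRJEB00001/ERR0000001
```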