mleprae-genomics_workflow

Snakemake workflow for processing Mycobacterium leprae whole genome sequencing data


💻 Workflow for bacterial genomics

About

This workflow takes unprocessed fastq.gz files from short-read sequencers and pre-processes, maps, and calls variants against the M. leprae TN reference genome (pre-configured). The original workflow was built upon Python 2 code from Chloe Loiseau and was adapted by me to Python 3 and the Snakemake workflow manager. The workflow is delivered through GitHub plus a conda environment and runs locally with the installed dependencies.

The workflow was designed to be user-friendly for those without coding skills. However, a basic understanding is still necessary to run things without problems and to fix any issues that may arise.

If you have any questions, please write to me.

Pre-requisites

  • A computer with any Linux distribution or macOS (not tested on Windows, but it should work under a virtual machine or WSL);

  • Snakemake;

  • conda and mamba;

  • MEGA 11 and megacc for phylogeny;

  • Sample-demultiplexed FASTQ.gz files, separated by lane or not, single-end (SE) or paired-end (PE);

  • ❗ Please make sure that your FASTQ files follow the official Illumina naming scheme;

  • Internet connection for downloading software (at least for the first time).
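
A quick way to confirm that the command-line prerequisites are available is sketched below (the tool list is an assumption based on the items above; conda, mamba, and Snakemake are installed during the setup steps that follow):

    # Sketch: report any required command-line tool that is not on your PATH yet
    for tool in conda mamba snakemake megacc; do
        command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
    done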

    Instructions

    Setting up

  1. Download this repository with git clone https://github.com/thyagoleal/mleprae-genomics_workflow.git or through the green Code -> Download ZIP button.
  2. Unzip the contents and navigate inside the directory;
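
    For reference, a sketch of steps 1 and 2 from a terminal (using git instead of the ZIP download):

        git clone https://github.com/thyagoleal/mleprae-genomics_workflow.git
        cd mleprae-genomics_workflow/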


  3. Install Miniconda following the instructions on its website;

  4. Install Mamba with: conda install mamba -n base -c conda-forge;

  5. Launch a terminal and make sure you're inside the mleprae-genomics_workflow directory. Then, create the environment with the specified dependencies using mamba:

    conda activate base
    mamba env create --file docs/dependencies.yaml
  6. If everything installed correctly, run conda activate mleprae-genomics;

  7. Install MEGA 11 according to your operating system. megacc must be available on your command line; if not, try installing the MEGA 11 command-line (CC) version specifically as well;

  8. Now you can use the workflow and its installed tools. It is likely that all of these steps have already been done for you;

    ❗ Every time you wish to run the pipeline or one of its tools, you must have the mleprae-genomics environment active. To activate it, just run conda activate mleprae-genomics.
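
    As a quick sanity check (a sketch, not part of the workflow itself): with the environment active, the main tools should resolve from your shell:

        conda activate mleprae-genomics
        snakemake --version          # prints the installed Snakemake version
        command -v megacc || echo "megacc not found - install the MEGA 11 command-line (CC) version"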

    Configuring and preparing to run the workflow

Now that you have a conda environment with the installed tools, you can configure your samples.

  1. Copy your FASTQ files into data/samples/;

  2. Make sure you are in the root workflow directory mleprae-genomics_workflow.

  3. Important Make sure your files are named according to the Illumina scheme, with underscores _ between the sample name, sample ID, lane, and so on. Use a dash - within the sample name if you need a delimiter. For example:

    Correct                                        Incorrect
    SampleName_S1_L001_R1_001.fastq.gz             SampleName-S1-L001-R1-001.fastq.gz
    Sample1_L001_R1.fastq.gz                       Sample1.L001.R1.fastq.gz
    Sample5-bait-Capture_S5_L005_R2_001.fastq.gz   Sample5_bait-Capture_S5-L005-R2-001.fastq.gz

    🚫 If your files violate these rules, you should rename them manually or they will cause issues in the next steps. See the rename Perl utility included in most Debian-based Linux distributions (a sketch is given after this list).

  4. Then, to create the sample files and feed the workflow, execute the code below:

    python3 scripts/prepare_input.py /path/to/samples/

    This script will check the names of your fastq files, merge lanes (the default), fix suffixes (i.e., standardize fastq.gz instead of fq.gz, etc.), compress, and check which samples are paired-end or not. Then it will build the samples.tsv and units.tsv files necessary to start the workflow, with soft links inside data/samples pointing to the original files.

    The original files will remain untouched in the data/samples dir. The workflow-ready files will be in data/samples/prepared-fastqs/ by default (you can change this; run python3 scripts/prepare_input.py --help).
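
    If you do need to rename files (step 3 above), the Perl rename utility accepts a substitution expression. A minimal sketch, assuming the Debian/Perl rename (File::Rename) and a hypothetical dot-separated naming pattern - adapt the expression to your actual file names:

        # Dry run first (-n prints what would change without renaming anything),
        # e.g. Sample1.L001.R1.fastq.gz -> Sample1_L001_R1.fastq.gz
        rename -n 's/\.(L00[0-9])\.(R[12])\./_$1_$2./' *.fastq.gz
        # Re-run without -n to actually rename the files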

    Running the workflow

    🚫 Before running the workflow, make sure you have configured the options by editing the file config/config.yaml. All parameters should be changed only in this file. If you need to pass a flag or change a parameter that is not in the config file, you can add it in the "extra" section of the config file.

    The directory tree of the workflow and its files should be like this:
    .
    ├── config/
    │   ├── config.yaml
    │   └── README.md
    ├── data/
    │   ├── refs/
    │   └── samples/
    ├── docs/
    │   ├── dependencies.yaml
    │   ├── Dockerfile
    │   ├── enviroment.yaml
    │   ├── illumina.txt
    │   └── requirements.yml
    ├── envs/
    │   ├── basic.yml
    │   ├── bowtie.yml
    │   ├── mapping_tools.yml
    │   ├── picard.yml
    │   ├── qc_trimming.yml
    │   ├── qualimap.yml
    │   ├── samtools.yml
    │   ├── seqprep.yml
    │   ├── snpeff.yml
    │   ├── varscan.yml
    │   └── vcf2bed.yml
    ├── README.md
    ├── rules/
    │   ├── fastqc.smk
    │   ├── helpers.smk
    │   ├── mapping.smk
    │   ├── markduplicates.smk
    │   ├── merge_overlapping.smk
    │   ├── multiqc.smk
    │   ├── prepare_phylogeny.smk
    │   ├── qualimap.smk
    │   ├── trimming.smk
    │   ├── variant_calling.smk
    │   ├── vcf2table.smk
    │   └── workflow_recap.smk
    ├── scripts/
    │   ├── collect_metrics.py
    │   ├── collect_seqPrep-metrics.py
    │   ├── prepare4phylogeny.py
    │   ├── prepare-input.py
    │   └── SNP-table_to_SNPeff-table_v2.sh
    └── Snakefile

    Running

    With the mleprae-genomics environment activated, you can:

    • Run a dry-run, that is, just show what would be done without actually running anything;
    snakemake --cores 5 -pn
    • Run the workflow quietly, only showing progress;
    snakemake --cores 5 --quiet progress
    • Run the workflow in verbose mode (--verbose), printing the shell commands (-p);
    snakemake --cores 5 -p --verbose
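
    Putting it together, a typical first run might look like this (a sketch that simply combines the commands above):

        conda activate mleprae-genomics
        snakemake --cores 5 -pn     # dry-run: list the jobs that would be executed
        snakemake --cores 5 -p      # real run, printing each shell command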

    📌 To see all the Snakemake options and flags, just run snakemake -h

    ✔️ After successfully running the workflow, two directories will be created, namely results/ and logs/.