/mapache

mapping pipeline for ancient DNA

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

Mapache

Mapache ([maˈpa.t͡ʃe]) is a lightweight mapping pipeline for ancient DNA using the workflow manager Snakemake.

Visit the Wiki for extensive documentation on how to use mapache and follow the tutorial to map and impute the test dataset.

If you already have some experience with DNA mapping and/or Snakemake, you can follow the quick guide below.

The usage of this workflow is described in the Snakemake Workflow Catalog.

The following main steps are included in mapache:

for each fastq file:
    - subsampling of a small subset of reads (options, not run by default; useful for quick tests)
    - adapter removal with AdapterRemoval (optional; run by default)
    - mapping with bwa aln (default), bwa mem or bowtie2
for each library:
    - marking duplicates with MarkDuplicates (optional; duplicates are removed by default)
    - rescaling of base qualitiy scores with mapDamage (optional, not run by default)
for each sample:
    - re-alignment of reads with GATK (optional; run by default)
    - re-computation of the md flag with samtools (optional; run by default)
    - imputation of low-coverage genomes with GLIMPSE (optional, not run by default)

Moreover, mapache allows you to map large datasets to a single reference genome or to multiple genomes.

Mapache is designed having in mind that one or multiple DNA libraries are generated per sample, and such libraries are sequenced at least once (e.g., for screening), but usually multiple times (normally prioritizing the highest-quality libraries). This results in several FASTQ files, which have to be mapped several times over the course of a project in order to generate a single BAM file.

The goal of mapache is to make the mapping process as easy and transparent as possible, whether you are mapping to a single or multiple genomes, mapping for the first time, needing to update a BAM file or even impute low-coverage genomes.

Quick guide

To install mapache, you need to clone the repository and create a conda environment following the instructions below. This will automatically install the software needed to map FASTQ files to a reference genome and create a report with the mapping statistics.

To execute mapache, you can move to the cloned directory (mapache) or symlink its content (the directories config/, results/, workflow/, and test_data/ if you want to run the test) to your working directory.

You will mostly interact with mapache through its configuration file (config/config.yaml), where you can tweak the parameteres of the pipeline, and the samples file (config/samples.tsv), in which you can list all the input FASTQ files.

Installation

#--------------------------------------------------------------#
## create mamba environment
conda create -n base-mamba -c conda-forge mamba

## activate mamba environment
conda activate base-mamba

#--------------------------------------------------------------#
## clone mapache repository
git clone https://github.com/sneuensc/mapache.git
cd mapache

## create conda environment for mapache
mamba env create -n mapache --file config/mapache-env.yaml

#--------------------------------------------------------------#
## now you can activate the mapache environment
conda activate mapache

Prepare samples file

A samples file for the test dataset is provided (config/samples.tsv). If you want to run it on your own datasets, you can prepare a tab-separated file containing the same columns as the example.

Example of a sample file for single-end libraries:

SM      LB    ID       Data
ind1    L1    L1_1     reads/ind1.L1_R1_001.fastq.gz
ind1    L1    L1_2     reads/ind1.L1_R1_002.fastq.gz
ind1    L2    L2_1     reads/ind1.L2_R1_001.fastq.gz
ind1    L2    L2_2     reads/ind1.L2_R1_002.fastq.gz

Example of a sample file for paired-end and single-end libraries:

SM      LB    ID       Data1                            Data2
ind2    L1    L1_1     reads/ind2.L1_R1_001.fastq.gz    reads/ind2.L1_R2_001.fastq.gz
ind2    L1    L1_2     reads/ind2.L1_R1_002.fastq.gz    reads/ind2.L1_R2_001.fastq.gz
ind2    L2    L2_1     reads/ind2.L2_R1_002.fastq.gz    NULL

In the first example, four fastq files will be mapped. They were generated from two different libraries (here, labelled as L1 and L2) from a single sample (ind1).

In the second example, there is still only one sample (ind2), and two libraries, sequenced in paired-end (L1) and single-end (L2) mode.

You can add more samples/libraries/fastq files in the input dataset by adding a new row including the 4 or 5 fields indicated above.

The columns of the sample file are:

  • SM: Sample name. Libraries are merged according to this name.
  • LB: Library name. Fastq files are merged according to this name.
  • ID: An ID for the fastq library (examples: id1, fq_1, ind1_lib1_fq2, etc.)
  • Data (single-end format): Path to the fastq file. The file may be gzipped or not. Path may be absolute or relative to the working directory.
  • Data1 (paired-end format): Path to the forward fastq file (R1) for paired-end data or the fastq file for single-end data. The file may be gzipped or not. Path may be absolute or relative to the working directory.
  • Data2 (paired-end format): Path to the reverse fastq file (R2) for paired-end data or NULL for single-end data. The file may be gzipped or not. Path may be absolute or relative to the working directory.

👆🏿 Note that you can select any label you want for the columns SM, LB, and ID, but they may not contain points '.'. That is, they do not need to match any substring or the name of your FASTQ files (FASTQ file names may in contrast contain points). However, we recommend that you stick to meaningful names (e.g.: Denisova, Mota, UstIshim, Anzick, lib1, lib2, lib3_USER, etc.) and ACII characters.

Adapt config file

By default, mapache is configured to map ancient data to a human reference genome. You need to specify the path to the reference genome in FASTA format in the configuration file (config/config.yaml) provided by mapache.

Specify reference genome

Mapache will map the input FASTQ files to the reference genome(s) indicated in the config file under the keyword genome. The path to the genome in FASTA format should be indicated with the keyword fasta.

In the example below, all the samples will be mapped to two versions of the human genome. The final BAM files will be named with the format {sample_ID}.{genome_name}.bam (e.g., ind1.hg19.bam, ind1.GRCh38.bam).

genome: 
    GRCh37: path_to_reference/GRCh37.fasta
    # GRCh38: path_to_reference/GCA_000001405.15_GRCh38.fa

Enable/disable steps and specify memory and runtime

Mapache will perform different steps (see below) in order to produce a BAM file. Most of the steps are optional, but run by default.

The config file contains entries corresponding to each of the steps that can be run and customized. For example, the entry with the options to clean the fastq files looks like this:

# fastq cleaning (optional)
cleaning:
    run: 'adapterremoval'   # options: adapterremoval (default), fastp, False
    params_adapterremoval: '--minlength 30 --trimns --trimqualities'
    params_fastp: ''
    threads: 4 
    mem: 4 ## in GB
    time: 2

If you need to modify the pipeline, for instance to ommit the step step, you can set in the config file its option to run: False, or run: 'adapterremovel' and run: 'fastp' to run AdapterRemoval2 and fastp, respectively. Additionally, if you intend to run mapache on an HPC system, you can specify the number of threads, memory (mem in GB), and runtime (time in hours) to be allocated to each step. Finally, you can pass additional parameters to the tool to be executed (e.g. --minlength 30 to AdapterRemoval2) with the keyword params_adapterremoval.

Running mapache

To run mapache on your working directory, you will need to copy or create symbolic links to the following directories:

  • workflow (mandatory)
  • config (mandatory)
  • test_data (optional, only if you want to run the tutorial)
  • slurm (optional, useful to submit jobs in HPC systems with slurm as cluster manager)

Assuming that you are in the mapache directory, you can use the next lines to create an alias to copy or symlink those directories. You can adjust it by including only the directories that you want to copy/symlink.

# alias to cp directories
 echo "alias copy_mapache='cp -r $(pwd -P)/{config,workflow,test_data,slurm} . ' " >> ~/.bash_profile
# alias to symlink directories
 echo "alias symlink_mapache='ln -s $(pwd -P)/{config,workflow,test_data,slurm} . ' ">> ~/.bash_profile

source ~/.bash_profile

Make sure that the paths to

  • the reference genome(s) in the config files
  • the file listing the FASTQ files (samples.tsv)

as well as the paths listed in samples.tsv are properly specified

Dry run

We highly recommend to make use of dry runs to get an idea of the jobs that will be executed.

# print jobs to be executed
snakemake -pn
# Visualization of the pipeline in a directed acyclic graph (DAG).
snakemake dag --dag | dot -Tpng > dag.png               

The command below will produce a figure with the rules that will be run.

snakemake --rulegraph |dot -Tpng > rulegraph.png

Examples for the DAG and rulegraph can be found here and here, respectively.

Optional: configure a profile to automatically submit jobs to a queue

This is a very handy utility that will allow mapache to automatically schedule jobs for submission to a queuing system. If you work with a system managed by slurm, you can skip this step, as mapache comes with a slurm profile (mapache/slurm).

Other profiles are available at: https://github.com/Snakemake-Profiles

Note however that the exact configuration might vary depending on your system (i.e., some users might have different accounts on a server for billing purposes). We thus highly encourage you to seek for help with your IT team if you are in doubt about the configuration that suits your system the best. Please note that we are not in charge of developing these profiles.

Run mapping pipeline

mapache can be run locally by indicating the number of cores available or with a high-performance computing system, by configuring a profile.

Example of execution using only one CPU:

snakemake --cores 1

If you work on an HPC system managed by slurm, you can use the slurm profile in the repository (mapache/slurm/) by symlinking it to your working directory. We recommend to start a screen session prior to the job submission.

Example of a submission of 999 jobs (simoultaneously) with the slurm profile:

snakemake --jobs 999 --profile slurm

Create report with mapping statistics

The template using to produce the html report will change depending on the version of Snakemake that you are using with mapache. Although we try to include one of the most recent distributions of Snakemake, we think that the template used in older versions is more useful to navigate through the results in the report. We thus recommend you to create a new environment with conda or mamba to create such report:

mamba create -n snakemake610 snakemake=6.10.0
mamba activate snakemake610

Once you have activated the new environment, creating the report is straightforward:

snakemake --report report.zip

or

snakemake --report report.html

We recommend creating the zip version of the report, as it contains the html report in it and it allows you to download any of the output tables or plots by clicking on the links of the report, making it easier to share with your colleagues.

Explore the zipped report produced with Snakemake 6.10.0.

Summary of useful commands

#------------------------------------------
# Get an idea of the jobs that will be executed. Not mandatory but very useful to spot possible mistakes in the configuration or input files

snakemake dag --dag | dot -Tsvg > dag.svg                Visualization of the pipeline in a directed acyclic graph (DAG).
snakemake dag --rulegraph | dot -Tsvg > rulegraph.svg    Visualization the interplay of the rules.
snakemake -n                                             Dry run
snakemake -p -n                                          Print out the commands in a dry run

#------------------------------------------
# recommended workflow on a cluster (e.g. slurm)
snakemake --jobs 999 --profile slurm                     Run all mappings and imputation (note that the profile has to be the last argument)

#------------------------------------------
# recommended workflow on a single machine without a queuing system
snakemake --cores 16                                     Run all mappings and imputation (assuming that you have 16 CPUs available)

#------------------------------------------
# Create report (after execution)

snakemake --report report.html                           Create a html report 
snakemake --report report.zip                            Create a zip report; useful if you want to download the files by clicking on the HTML report

#------------------------------------------
# Options to control the execution.
# Visit https://snakemake.readthedocs.io/en/stable/ for full documentation

--profile slurm                           To send it on the cluster (must be the last argument).
-R RULE_NAME                              Force a start at least at the given rule.
--until RULE_NAME                         Run until the given rule (included).
--rerun-incomplete                        Re-run all jobs the output of which is recognized as incomplete.
--configfile FILE                         Define the config file (default config/config.yaml).
-t                                        Reset the timestamp that the output is not re-computed.
-p                                        Print out the shell commands that will be executed.
--rerun-triggers mtime                    for a rerun, consider only the time stamp (previously the default 
                                          setting, now any change (code, config,..) lead to a rerun of 
                                          the rule)

Citing mapache

If you use mapache in your study, please cite Neuenschwander S, Cruz Dávalos DI, et al. (2023) Mapache: a flexible pipeline to map ancient DNA. Bioinformatics (https://doi.org/10.1093/bioinformatics/btad028) and the tools that you used within mapache. See the table below for a list of tools used at each step.

Tools included in mapache

Version Reference Link
Workflow manager
Snakemake 7.18.2 Mölder, et al. (2021) https://github.com/snakemake/snakemake
Subsample
seqtk 1.3 https://github.com/lh3/seqtk
Clean
AdapterRemoval2 2.3.2 Schubert, et al. (2016) https://github.com/MikkelSchubert/adapterremoval
fastp 0.23.2 Chen, et al. (2018) https://github.com/OpenGene/fastp
Map
BWA aln 0.7.17 Li and Durbin (2009) https://github.com/lh3/bwa
BWA mem 0.7.17 Li and Durbin (2009) https://github.com/lh3/bwa
Bowtie2 2.4.4 Langmead and Salzberg (2012) https://github.com/BenLangmead/bowtie2
Sort
SAMtools 1.14 Danecek, et al. (2021) https://github.com/samtools/samtools
Filter
SAMtools 1.14 Danecek, et al. (2021) https://github.com/samtools/samtools
Merge lanes
SAMtools 1.14 Danecek, et al. (2021) https://github.com/samtools/samtools
Remove duplicates
Picard MarkDuplicates 2.25.5 Broad Institute (2019) http://broadinstitute.github.io/picard
dedup 0.12.8 Peltzer, et al.(2016) https://github.com/apeltzer/DeDup
Rescale damage
mapDamage2 2.2.1 Jonsson, et al. (2013) https://github.com/ginolhac/mapDamage
Trim alignments
BamUtil 1.0.15 Jun, et al. (2015) https://github.com/statgen/bamUtil
Merge libraries
SAMtools 1.14 Danecek, et al. (2021) https://github.com/samtools/samtools
Realign indels
GATK IndelRealigner 3.8 DePristo, et al. (2011) https://gatk.broadinstitute.org
Recompute md flag
SAMtools 1.14 Danecek, et al. (2021) https://github.com/samtools/samtools
Imputation
GLIMPSE 1.1.1 Rubinacci, et al. (2021) https://github.com/odelaneau/GLIMPSE
BCFtools 1.15 Danecek, et al. (2021) https://github.com/samtools/bcftools
Reports
FastQC 0.11.9 Andrews (2010) https://www.bioinformatics.babraham.ac.uk/projects/fastqc
Qualimap 2.2.2d Okonechnikov, et al. (2016) http://qualimap.conesalab.org
MultiQC 1.13 Ewels, et al. (2016) https://multiqc.info
Statistics
BEDTools 2.30.0 Quinlan and Hall (2010) https://github.com/arq5x/bedtools2
bamdamage modified Malaspinas, et al. (2014) https://savannah.nongnu.org/projects/bammds
R 4.0 R Core Team (2022) https://www.r-project.org