A fully reproducible and state-of-the-art ancient DNA analysis pipeline

Primary language: Nextflow. License: MIT.

nf-core/eager

A fully reproducible and state-of-the-art genomics pipeline for ancient DNA.

Introduction

nf-core/eager is a bioinformatics best-practice pipeline for analysing ancient DNA (aDNA) data generated by next-generation sequencing (NGS).

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. The pipeline pre-processes raw data from FASTQ inputs, or takes preprocessed BAM inputs. It aligns reads and performs extensive general NGS and aDNA-specific quality control on the results. It comes with Docker and Singularity containers, as well as a Conda environment, making installation straightforward and results highly reproducible.

Figure: nf-core/eager schematic workflow

Pipeline steps

Default Steps

By default the pipeline currently performs the following:

  • Create reference genome indices for mapping (bwa, samtools, and picard)
  • Sequencing quality control (FastQC)
  • Sequencing adapter removal and, for paired-end data, read merging (AdapterRemoval)
  • Read mapping to a reference genome (bwa aln, bwa mem, CircularMapper, or bowtie2)
  • Post-mapping processing, statistics and conversion to BAM (samtools)
  • Ancient DNA C-to-T damage pattern visualisation (DamageProfiler)
  • PCR duplicate removal (DeDup or MarkDuplicates)
  • Post-mapping statistics and BAM quality control (Qualimap)
  • Library Complexity Estimation (preseq)
  • Overall pipeline statistics summaries (MultiQC)

Additional Steps

Additional functionality provided by the pipeline currently includes:

Input

  • Automatic merging of complex sequencing setups (e.g. multiple lanes, sequencing configurations, library types)

Preprocessing

  • Poly-G tail removal for data from Illumina two-colour chemistry sequencers (fastp)
  • Automatic conversion of unmapped reads to FASTQ (samtools)
  • Host DNA (mapped reads) stripping from input FASTQ files (for sensitive samples)

aDNA Damage manipulation

  • Damage removal/clipping for UDG+/UDG-half treatment protocols (BamUtil)
  • Damaged reads extraction and assessment (PMDTools)
  • Nuclear DNA contamination estimation of human samples (angsd)

Genotyping

  • Creation of VCF genotyping files (GATK UnifiedGenotyper, GATK HaplotypeCaller and FreeBayes)
  • Creation of EIGENSTRAT genotyping files (pileupCaller)
  • Creation of Genotype Likelihood files (angsd)
  • Consensus sequence FASTA creation (VCF2Genome)
  • SNP Table generation (MultiVCFAnalyzer)

Biological Information

  • Mitochondrial to Nuclear read ratio calculation (MtNucRatioCalculator)
  • Statistical sex determination of human individuals (Sex.DetERRmine)

Metagenomic Screening

  • Taxonomic binner with alignment (MALT)
  • Taxonomic binner without alignment (Kraken2)
  • aDNA characteristic screening of taxonomically binned data from MALT (MaltExtract)

Quick Start

  1. Install Nextflow (version >= 20.04.0)

  2. Install either Docker or Singularity for full pipeline reproducibility (please only use Conda as a last resort; see docs)

  3. Download the pipeline and test it on a minimal dataset with a single command:

    nextflow run nf-core/eager -profile test,<docker/singularity/conda/institute>

    Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.

  4. Start running your own analysis!

    nextflow run nf-core/eager -profile <docker/singularity/conda> --input '*_R{1,2}.fastq.gz' --fasta '<your_reference>.fasta'

  5. Once your run has completed successfully, clean up the intermediate files.

    nextflow clean -f -k
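
Because the forced clean-up in step 5 cannot be undone, it can help to preview what would be removed first. The following is a minimal sketch, not part of nf-core/eager: `clean_last_run` is a hypothetical helper name, and it relies only on the `-n` (dry run), `-f` (force) and `-k` (keep logs) options of `nextflow clean`, which acts on the most recent run when no run name is given.

```shell
# Hypothetical helper (not part of nf-core/eager or Nextflow itself).
# Lists what `nextflow clean` would delete for the latest run,
# then deletes only after an explicit confirmation.
clean_last_run() {
    # -n: dry run, only list what would be removed; -k: keep log files
    nextflow clean -n -k || return 1

    printf 'Delete the files listed above? [y/N] '
    read -r answer
    case "$answer" in
        y|Y) nextflow clean -f -k ;;     # same command as in step 5
        *)   echo 'Nothing deleted.' ;;
    esac
}
```

Calling `clean_last_run` from the directory where the pipeline was launched prints the dry-run listing and waits for a `y` before deleting anything.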

See usage docs for all of the available options when running the pipeline.
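
Before launching, the version requirement from step 1 can be checked with a short script. This is a sketch under stated assumptions rather than an official check: the function names are hypothetical, `grep -o` and `sort -V` are assumed to be available (as in GNU coreutils and grep), and the output of `nextflow -v` is assumed to contain a version number of the form X.Y.Z.

```shell
# Hypothetical pre-flight helpers (not part of nf-core/eager or Nextflow).

# Read text on stdin and print the first X.Y.Z version number found in it.
extract_version() {
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Succeed if version $1 is greater than or equal to version $2.
# Assumes a `sort` that supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed Nextflow with the minimum stated in the Quick Start.
check_nextflow() {
    have=$(nextflow -v 2>/dev/null | extract_version)
    if [ -z "$have" ]; then
        echo 'Nextflow not found on PATH' >&2
        return 1
    fi
    if version_ge "$have" 20.04.0; then
        echo "Nextflow $have is new enough"
    else
        echo "Nextflow $have is older than 20.04.0" >&2
        return 1
    fi
}
```

Running `check_nextflow` returns a non-zero exit status when Nextflow is missing or too old, so it can be chained in front of the run command with `&&`.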

N.B. You can see an overview of the run in the MultiQC report located at ./results/MultiQC/multiqc_report.html

Modifications to the default pipeline are easily made using various options as described in the documentation.
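
For reproducible analyses it is common to pin the pipeline revision and to make runs resumable. The sketch below wraps the command from step 4 with Nextflow's own `-r` (revision) and `-resume` options. The wrapper `run_eager` and the `NEXTFLOW` override variable are hypothetical names introduced here for illustration; pipeline-specific options are deliberately left out and should be taken from the usage docs.

```shell
# Hypothetical wrapper (not part of nf-core/eager).
# Arguments: <revision> <profile> <input pattern> <reference FASTA>
# Set NEXTFLOW=echo to print the arguments instead of running the pipeline.
run_eager() {
    if [ "$#" -ne 4 ]; then
        echo 'usage: run_eager <revision> <profile> <input> <fasta>' >&2
        return 2
    fi
    # -r pins the pipeline revision (tag, branch or commit);
    # -resume reuses cached results from an earlier, interrupted run.
    ${NEXTFLOW:-nextflow} run nf-core/eager \
        -r "$1" \
        -profile "$2" \
        --input "$3" \
        --fasta "$4" \
        -resume
}
```

For example, `run_eager '<version>' docker '*_R{1,2}.fastq.gz' '<your_reference>.fasta'` launches the same analysis as step 4, fixed to the chosen release. Quoting the input pattern keeps the shell from expanding it before Nextflow sees it.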

Documentation

The nf-core/eager pipeline comes with documentation, which you can read at https://nf-co.re/eager/usage or find in the docs/ directory.

  1. Nextflow installation
  2. Pipeline configuration
  3. Running the pipeline
    • This includes tutorials, FAQs, and troubleshooting instructions
  4. Output and how to interpret the results

Credits

This pipeline was mostly written by Alexander Peltzer (apeltzer) and James A. Fellows Yates, with contributions from Stephen Clayton, Thiseas C. Lamnidis, Maxime Borry, Zandra Fagernäs, Aida Andrades Valtueña, Maxime Garcia, and the nf-core community.

If you would like to contribute to this pipeline, please open an issue (or even better, a pull request; please see the contributing guidelines) and ask to be added to the project. Everyone is welcome to contribute here!

For further information or help, don't hesitate to get in touch on the Slack #eager channel (you can join with this invite).

Authors (alphabetical)

Additional Contributors (alphabetical)

Those who have provided conceptual guidance, suggestions, bug reports etc.

If you've contributed and are missing from this list, please let us know and we will add you!

Tool References

Data References

This repository uses test data from the following studies:

  • Fellows Yates, J. A. et al. (2017) ‘Central European Woolly Mammoth Population Dynamics: Insights from Late Pleistocene Mitochondrial Genomes’, Scientific reports, 7(1), p. 17714. doi: 10.1038/s41598-017-17723-1.
  • Gamba, C. et al. (2014) ‘Genome flux and stasis in a five millennium transect of European prehistory’, Nature communications, 5, p. 5257. doi: 10.1038/ncomms6257.
  • Star, B. et al. (2017) ‘Ancient DNA reveals the Arctic origin of Viking Age cod from Haithabu, Germany’, Proceedings of the National Academy of Sciences of the United States of America, 114(34), pp. 9152–9157. doi: 10.1073/pnas.1710186114.

Citation

If you use nf-core/eager for your analysis, please cite the eager preprint as follows:

James A. Fellows Yates, Thiseas Christos Lamnidis, Maxime Borry, Aida Andrades Valtueña, Zandra Fagernäs, Stephen Clayton, Maxime U. Garcia, Judith Neukamm, Alexander Peltzer. Reproducible, portable, and efficient ancient genome reconstruction with nf-core/eager. bioRxiv 2020.06.11.145615; doi: https://doi.org/10.1101/2020.06.11.145615

You can cite the eager Zenodo record for a specific version using the following DOI: 10.5281/zenodo.3698082

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.