ARChES

Dosage Compensation Analyses Pipeline using Snakemake

Primary language: Python. License: MIT.

ARChES: Ancestral Reconstruction of Chromosome Expression States to analyze Dosage Compensation using Snakemake.

Uses: OrthoFinder, salmon, TransDecoder, CD-HIT, BUSCO, Trinity, DESeq2, edgeR

Depends: snakemake

Summary

This repository contains a Snakemake pipeline that handles all the steps needed for a sex chromosome dosage compensation analysis. The workflow:

  1. Constructs a non-redundant transcriptome assembly (for each species) using Trinity.
  2. Adds annotation by identifying orthologs using OrthoFinder.
  3. Estimates differential expression (DE) between males and females within each species, based on expression quantified with salmon.
    • This step identifies Dosage Balance between males and females.
  4. Compares (weighted) ancestral-X chromosome expression against neo-X chromosome expression, using a user-supplied dated phylogenetic tree of all the species in the analysis.
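As a rough illustration of what "dosage balance" in step 3 refers to, the sketch below computes per-gene log2 male-to-female expression ratios and takes the median separately for X-linked and autosomal genes. The gene names, expression values, chromosome labels, and the helper names `log2_mf_ratio` and `median_ratio` are all hypothetical; the pipeline derives its values from its own quantification and DE steps, which are not reproduced here.

```python
import math
from statistics import median

def log2_mf_ratio(male_expr, female_expr, pseudocount=1.0):
    """Return log2(male / female), with a pseudocount to avoid division by zero."""
    return math.log2((male_expr + pseudocount) / (female_expr + pseudocount))

# Hypothetical mean expression per gene: (chromosome, male, female).
# "X" marks X-linked genes and "A" marks autosomal genes.
genes = {
    "geneA": ("X", 15.0, 31.0),
    "geneB": ("X", 7.0, 7.0),
    "geneC": ("A", 63.0, 63.0),
    "geneD": ("A", 7.0, 7.0),
}

def median_ratio(chrom):
    """Median log2(M/F) ratio over all genes on the given chromosome class."""
    ratios = [log2_mf_ratio(m, f) for c, m, f in genes.values() if c == chrom]
    return median(ratios)

x_median = median_ratio("X")  # X-linked genes
a_median = median_ratio("A")  # autosomal genes
```

A median near 0 indicates similar expression in both sexes. In a male-heterogametic system, a median near -1 for X-linked genes is what gene dose alone would predict when males carry a single X.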

Rules Graph

Dependencies.

  • conda
    • conda should be available in the $PATH
    • The workflow was tested with Python 3.7 and conda 4.8.3.
  • snakemake
    • Snakemake runs from its own conda environment, which can be installed as follows.
     # Install mamba to resolve all Snakemake dependencies
     conda install -c conda-forge mamba
     # create/install a snakemake environment
     mamba create -c conda-forge -c bioconda -n snakemake snakemake
    • The workflow was created and tested with Snakemake 5.19.3.

Data Input.

User-specific data and reference files can be configured in the config.yml file. For a complete analysis, the following files are needed.

  1. RNA-Seq reads
  2. Reference Protein file URL
  3. Reference GFF file URL
  4. Reference Orthodb file URL
  5. Dated Phylogenetic Tree
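Before starting a full run, it can help to confirm that all five inputs are set. The sketch below checks an already-loaded configuration mapping for missing or empty entries. The key names in `REQUIRED_KEYS`, the example values, and the function `missing_inputs` are hypothetical; the actual keys in config.yml may differ and should be substituted.

```python
# Hypothetical key names, one per required input; replace them with the
# keys actually used in config.yml.
REQUIRED_KEYS = ["reads", "protein_url", "gff_url", "orthodb_url", "tree"]

def missing_inputs(config):
    """Return the required keys that are absent or empty in a config mapping."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]

# Example mapping, as it might look after loading a YAML file (values are made up).
example_config = {
    "reads": "data/reads",
    "protein_url": "https://example.org/proteins.fa.gz",
    "gff_url": "",
    "tree": "data/species.nwk",
}

missing = missing_inputs(example_config)
```

Here `missing` lists `gff_url` (empty) and `orthodb_url` (absent), so those two entries would need to be filled in.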

Usage.

# Activate the snakemake conda environment
conda activate snakemake
# Do a dry run to verify the workflow and the jobs
snakemake --cores 16 --use-conda -np
# Run the pipeline
snakemake --cores 16 --use-conda

The number of cores can be increased to match the available compute resources.
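For scripted runs, the same commands can be assembled in Python so that the core count follows the machine. This is a minimal sketch: the helper `snakemake_command` is hypothetical, and it uses only the flags shown above (`--cores`, `--use-conda`, `-np`).

```python
import os

def snakemake_command(cores=None, dry_run=False):
    """Build the snakemake invocation as an argument list.

    If cores is None, fall back to the number of CPUs reported by the OS.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    cmd = ["snakemake", "--cores", str(cores), "--use-conda"]
    if dry_run:
        cmd.append("-np")  # dry run that also prints the shell commands
    return cmd

dry_run_cmd = snakemake_command(cores=16, dry_run=True)
full_run_cmd = snakemake_command(cores=16)
```

Either list can be passed to `subprocess.run(cmd, check=True)` from within the activated snakemake environment.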

Pipeline Output / Directory Structure:

├── config.yml - (analysis specific configuration file)
├── data - (reads, tree and other user input data)
├── envs - (environments for different programs)
├── logs - (log file for each job) - created
├── results - (all the major results) - created
├── report.html - (the final output file with interactive graphs) - created
├── scripts - (scripts for processing)
├── Snakefile - (the driver script for the workflow)
└── tmp_dir - (all temporary files and supporting result files) - created
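After a run, the entries marked "created" above should exist in the working directory. The sketch below reports any that are missing. The helper `missing_outputs` is hypothetical, and it assumes the pipeline was run from the repository root.

```python
from pathlib import Path

# Entries marked "created" in the directory listing above.
CREATED_PATHS = ["logs", "results", "report.html", "tmp_dir"]

def missing_outputs(workdir="."):
    """Return the expected output paths that do not exist under workdir."""
    root = Path(workdir)
    return [name for name in CREATED_PATHS if not (root / name).exists()]
```

An empty list from `missing_outputs()` means all four entries are present; it does not check their contents.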

(Expected) Releases.

The current release is highlighted in bold below and is also tagged on GitHub.

  • α
    • Modular workflow
    • Test on sample dataset
  • β
    • Parallelize BLAST in 2 bash scripts. (or)
    • Eliminate BLAST and annotate based on the Homology BLAST.
    • Snakemake, Python and R Best Practices
      • Create Rules directory
  • Pre-Release
    • Test using complete beetle dataset
  • Release
    • Test using some other species dataset (Drosophila?)

See Wiki for further information.

Citation:

Ramesh, Balan and Demuth, Jeff. "A General Framework for Dosage Compensation Analyses using Snakemake" (in prep.), 2020.

References:

  1. Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.
  2. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotechnology, (2011), 29 (7): 644-652.
  3. Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4), 417-419.
  4. Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics, (2012), 28 (23): 3150-3152.
  5. Haas, B., & Papanicolaou, A. J. G. S. (2016). TransDecoder (find coding regions within transcripts).
  6. Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology, 20(1), 1-14.
  7. Julien, P., Brawand, D., Soumillon, M., Necsulea, A., Liechti, A., Schütz, F., ... & Kaessmann, H. (2012). Mechanisms and evolutionary patterns of mammalian and avian dosage compensation. PLoS Biol, 10(5), e1001328.
  8. Schield, D.R., Card, D.C., Hales, N.R., Perry, B.W., Pasquesi, G.M., Blackmon, H., Adams, R.H., Corbin, A.B., Smith, C.F., Ramesh, B. and Demuth, J.P., 2019. The origins and evolution of chromosomes, dosage compensation, and mechanisms underlying venom regulation in snakes. Genome research, 29(4), pp.590-601.