nea-over-time: An HTML repository from bodkan

This repository contains source code and Jupyter notebooks for data processing, simulations and analyses used in this paper.

To reproduce everything from scratch, you'll need to install all dependencies listed below.

Full disclosure: I've been very lucky to have access to amazing computational resources (60-core machines with 1 TB of RAM and a cluster with hundreds of nodes), and I often used them to their full potential. Unless you have similar resources, reproducing all results from scratch is not going to be trivial. At the very least, the simulations will take much longer to run if you cannot parallelize them effectively.

If you don't want to re-run the whole simulation and analysis pipeline but still want to play around with the results and plots, you can use the rds and RData files in the data/ subdirectory. The notebooks/figures_for_paper.ipynb notebook is a good place to start, as it loads those processed R data files and uses them to generate the plots for the paper.

Python

I used Python version 3.6.5 and the following Python modules:

pip install numpy pandas msprime pybedtools jupyter

The full list of Python modules I had installed in the project environment can be found in the requirement.txt file.
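Before launching any notebooks, it can help to confirm that the modules from the pip command above are importable in the active environment. The following is a minimal sketch, not part of the repository: the helper name missing_modules is hypothetical, and it assumes that each package's import name matches its pip name.

```python
import importlib.util

# Modules named in the pip command above.
# Assumption: the import name of each package matches its pip name.
REQUIRED = ["numpy", "pandas", "msprime", "pybedtools", "jupyter"]


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing modules:", ", ".join(missing) if missing else "none")
```

The check only locates each module without importing it, so it is fast but does not verify versions.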

R

I used R version 3.4.3.

Packages from CRAN:

install.packages(c("broom", "forcats", "future", "ggbeeswarm", "ggrepel",
                   "here", "magrittr", "modelr", "purrr", "stringr", "tidyverse"))

Packages from Bioconductor:

install.packages("BiocManager")
BiocManager::install(c("biomaRt", "VariantAnnotation", "BSgenome.Hsapiens.UCSC.hg19",
                       "GenomicRanges",  "rtracklayer"))

Packages from GitHub:

install.packages("devtools")
devtools::install_github("bodkan/bdkn")
devtools::install_github("bodkan/slimr", ref = "v0.1")
devtools::install_github("bodkan/admixr", ref = "v0.6.2")

To be able to run the Jupyter notebooks that contain all my analyses and figures, you will also need to install IRkernel.

SLiM

I used SLiM v2.6. Be aware that SLiM has introduced some backwards-incompatible changes since its 2.0 release, so make sure to use exactly version 2.6.
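Since the exact version matters, a quick check before running any simulations may save some debugging. The sketch below is an illustration only: it assumes the SLiM binary is on the PATH as slim, that it accepts a -v flag, and that its output contains text of the form "version 2.6". The function names are hypothetical.

```python
import re
import subprocess

EXPECTED = (2, 6)  # the SLiM version named above


def parse_slim_version(text):
    """Extract (major, minor) from text such as 'SLiM version 2.6, built ...'."""
    match = re.search(r"version\s+(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return int(match.group(1)), int(match.group(2))


def installed_slim_version(binary="slim"):
    # Assumption: the binary prints its version to stdout when given -v.
    result = subprocess.run([binary, "-v"], stdout=subprocess.PIPE,
                            universal_newlines=True)
    return parse_slim_version(result.stdout)


if __name__ == "__main__":
    try:
        found = installed_slim_version()
    except (FileNotFoundError, ValueError) as error:
        print("could not determine the SLiM version:", error)
    else:
        status = "OK" if found == EXPECTED else "MISMATCH"
        print("SLiM %d.%d found, %d.%d expected: %s" % (found + EXPECTED + (status,)))
```

If the output format of your SLiM build differs, only the regular expression in parse_slim_version should need adjusting.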

HOWTO

In principle, different notebooks in the notebooks/ directory use different data generated by "pipeline scripts" in the root of the repository (00_...sh, 01_...sh, etc.).

However, there's no strict sequential order for executing everything. In fact, I mostly ran those scripts piece by piece, adding commands as the project developed and analyzing new data as they were being generated.
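To get an overview of the pipeline scripts before deciding which pieces to run, they can be listed in their numbered order. This is a minimal sketch that assumes the scripts follow a two-digit prefix pattern (such as 00_ and 01_) and sit in the repository root. The function name pipeline_scripts is hypothetical, and nothing is executed.

```python
from pathlib import Path


def pipeline_scripts(repo_root="."):
    """Return shell scripts named like NN_<name>.sh in repo_root, sorted by name."""
    # Assumption: every pipeline script starts with two digits and an underscore.
    return sorted(Path(repo_root).glob("[0-9][0-9]_*.sh"))


if __name__ == "__main__":
    for script in pipeline_scripts():
        print(script.name)
```

Because the scripts were not written to be run end to end, the listing is best used as a map of what exists, with individual commands copied out as needed.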
