**Important:** If you want to perform spatio-temporal similarity search with the mobest R package as presented in the paper, you can directly consult its documentation here: https://nevrome.de/mobest
# Research compendium for 'Estimating human mobility in Holocene Western Eurasia with large-scale ancient genomic data'
Schmid, C., & Schiffels, S. (2023). Estimating human mobility in Holocene Western Eurasia with large-scale ancient genomic data. Proceedings of the National Academy of Sciences, 120(9), e2218375120. doi:10.1073/pnas.2218375120
A preprint based on the analysis in this old release is available here: https://doi.org/10.1101/2021.12.20.473345
An archived version of this repository is stored at the OSF: http://dx.doi.org/10.17605/OSF.IO/6UWM5

The files in this archived storage will generate the results as found in the publication. The files hosted on GitHub are the development versions and may have changed since the paper was published.
This repository contains the following main top-level directories:

- `code`: The R and shell scripts necessary to reproduce the analysis and create the figures. They are organised in subdirectories for different domains and roughly ordered with a leading number. Some scripts provide code beyond what is required to reproduce figures and results in the publication (e.g. scripts to create didactic figures for presentations).
- `data`: The scaffold of a directory structure to hold the intermediate data output of the scripts. The actual data is too big to be uploaded here and is therefore not tracked by Git.
- `data_tracked`: Small input data files manually created for this analysis.
- `plots`, `tables`, `plots_renamed`: Directories not tracked by Git that catch rendered versions of tables and plots for the publication.
- `schemata`: Schematic drawings created for the paper.
The `DESCRIPTION` and `.Rbuildignore` files define this repository as an R package. This mechanism is only used for R package dependency management, so that all necessary packages can be installed automatically (e.g. with `remotes::install_github("nevrome/mobest.analysis.2020", dependencies = TRUE, repos = "https://mran.microsoft.com/snapshot/2022-10-03")`). The `.Rproj` config file defines an RStudio project for conveniently opening this repository in the RStudio IDE.
The other additional files are part of a mechanism to simplify running and reproducing the code. `singularity_mobest.def` defines a singularity container that includes all software necessary to run the code in this repository. It can be built with a script like `singularity_build_sif.sh`, which requires an empty directory `stempdir` for temporary data. To run arbitrary scripts through singularity and the SGE scheduler of the HPC at the MPI-EVA, we used the script `singularity_qsub.sh`.
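For readers unfamiliar with the format, a singularity definition file is structured roughly as follows. This is a minimal hypothetical sketch, not the content of `singularity_mobest.def`, which pins many more tools and exact versions; the base image is an assumption, and only the install command is taken from this README.

```
Bootstrap: docker
From: rocker/r-ver:4.1.2

%post
    # Install the R package dependencies declared in DESCRIPTION,
    # using the snapshot repository quoted above.
    R -e 'install.packages("remotes", repos = "https://mran.microsoft.com/snapshot/2022-10-03")'
    R -e 'remotes::install_github("nevrome/mobest.analysis.2020", dependencies = TRUE, repos = "https://mran.microsoft.com/snapshot/2022-10-03")'

%runscript
    exec Rscript "$@"
```

A container built from such a definition bundles a fixed R version and package set, which is what makes the pre-built image suitable for long-term archiving.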
The Haskell stack scripts (`Shake*.hs`) define a build pipeline for the complete analysis with the build tool shake.
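To illustrate the idea, a single pull-based shake rule might look roughly like this. This is a hedged sketch only; the file names and rule structure here are assumptions, not excerpts from the actual `Shake*.hs` scripts.

```haskell
import Development.Shake

-- Minimal sketch of a pull-based Shake rule: an output file is declared
-- to depend on an R script and its input data, and Shake reruns the
-- script only when one of these dependencies changes.
main :: IO ()
main = shakeArgs shakeOptions $ do
  want ["data/intermediate/result.RData"]
  "data/intermediate/result.RData" %> \out -> do
    need ["code/01_analysis.R", "data_tracked/input.csv"]
    cmd_ "Rscript" "code/01_analysis.R"
```

Requesting the final outputs ("pulling") causes shake to walk the dependency graph backwards and run every required download, transformation and analysis step in order.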
This repository features a lot of the code and data necessary to reproduce the analysis performed for the paper. The code usually relies on relative paths and is generally independent of a specific computational environment. There are a number of exceptions, though, for which we can only provide partial solutions. So if you want to rerun this analysis in its entirety, you will have to apply some tweaks.
- **Data**: This analysis depends on one major dataset of genotype data with (archaeological) context information: the Allen Ancient DNA Resource. For our analysis we worked with version 50 of the dataset, which we have reason to assume will be permanently hosted on the website of the Reich Lab. We therefore refrained from copying this large dataset to a data repository. All other datasets necessary to run the analysis are available in the `data_tracked` directory.
- **Software dependencies**: All necessary R packages and command-line software tools are listed either in `DESCRIPTION` or implicitly mentioned in the `singularity_mobest.def` file. The latter even provides a mechanism to download the software and create an independent, self-sufficient software environment (with `singularity build`). But as software develops rapidly, this will soon download and install software versions that are no longer compatible with what we used here. That is why we pre-built a version of the singularity container (with singularity v3.6), which is part of the long-term archive for the paper (see above). As long as singularity is available and sufficiently stable, this container should feature the exact software versions used to compile the paper. singularity was recently renamed to apptainer, but the mechanisms to use the image should stay mostly as described.
- **High performance computing**: Some analyses and data transformations in this repository are computationally expensive. As of today, they can only be run in a high performance computing environment. The MPI-EVA provides such an environment for us, which we access and manage with the scheduling software SGE v8.1.6. We therefore wrote wrapper scripts to submit our code specifically to this system and environment (see for example `singularity_qsub.sh`). If you want to run the respective scripts with whatever system/environment is available to you, you will have to rewrite these wrapper scripts.
- **Pipeline**: For our own convenience we structured the analysis as a series of shake build scripts (`Shake*.hs`). They list all R and shell scripts, their input files and their expected output (except for scripts creating figures and tables). Shake constructs a pull-based build order from this to run the whole pipeline, including download, transformation and finally analysis of the data. Theoretically, this should be the most reliable way to reproduce the complete analysis, but as it depends on stack and Stackage for its Haskell dependencies, it may not be long-term stable. It is also hard-wired for our HPC environment, so if you want to run it, you will most likely have to create a new instance of the `Settings` datatype in `ShakeUtils.hs` and replace our `mpiEVAClusterSettings`.
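To give an idea of what rewriting such a scheduler wrapper involves, here is a hypothetical, scheduler-agnostic sketch in the spirit of `singularity_qsub.sh`. It only composes and prints the SGE submission command instead of calling `qsub`; the container name, job name and flags are assumptions, not the repository's actual values.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a scheduler wrapper: prints the qsub command
# that would run a given script inside the pre-built container.
set -euo pipefail

submit_cmd() {
  local script_to_run="$1"                 # e.g. code/01_prepare_data.R
  local container="singularity_mobest.sif" # assumed name of the built image
  # -b y: treat the command as a binary; -cwd: run in the current directory.
  echo "qsub -b y -cwd -N mobest_job singularity exec ${container} Rscript ${script_to_run}"
}

submit_cmd "code/01_prepare_data.R"
```

Adapting it to another scheduler (e.g. SLURM) mainly means swapping the submission command while keeping the `singularity exec` invocation intact.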
So while we did our best to make this repository as accessible and reproducible as possible, we admit that there are some hurdles to overcome. We believe that for most users interested in a specific part of the analysis, it might be more convenient to build custom scripts around the `mobest` R package (https://github.com/nevrome/mobest), where we provide functions for the core tasks. The README there describes a minimal workflow, on top of which applications like the one for our paper can be built.