Diffusion Approximation of Demographic Inference (dadi) is a powerful software tool for simulating the joint frequency spectrum (FS) of genetic variation among multiple populations and using the FS for population-genetic inference. Here we introduce donni, a supervised machine learning framework that makes dadi's demographic models easier to apply. The machine learning models were trained on dadi-simulated data and can quickly predict the parameters of a specified dadi demographic model given FS input data from the user. The pipeline we used to train these machine learning models is also available here for users interested in training a new predictor for their own custom demographic models.
If you've found an apparent bug, please submit an Issue so we can address it. You can submit questions about usage to the dadi-user Google Group (http://groups.google.com/group/dadi-user).
Shallow clone this repo to your local directory and cd into the donni directory. We recommend a shallow clone for faster cloning.
$ git clone https://github.com/lntran26/donni.git --depth 1
$ cd donni/
We recommend you start by creating a new conda environment. This can be done using the command below, which will create a new conda env called donni and install the required packages into it. The env can then be activated for each subsequent use.
$ conda env create -f environment.yml
$ conda activate donni
After installation, users can verify that it succeeded or display help information using:
$ donni -h
usage: donni [-h] {generate_data,train,infer,validate} ...
Demography Optimization via Neural Network Inference
positional arguments:
{generate_data,train,infer,validate}
generate_data Simulate allele frequency data from demographic history models
train Train MLPR with simulated allele frequency data
infer Infer demographic history parameters from allele frequency with trained MLPRs
validate Validate trained MLPRs inference accuracy and CI coverage
optional arguments:
-h, --help show this help message and exit
There are four subcommands in donni; detailed usage for each is described below.
To display help information for a subcommand, use -h. For example:
$ donni generate_data -h
To infer demographic model parameters with donni, users need to specify the desired demographic model name and the path to the allele frequency data.
$ donni infer --input_fs examples/data/1d_ns20_sfs.fs --model three_epoch
This command will automatically download the relevant trained MLPRs used for inference to the user's home directory; the download location will be printed to the command line. The file size of one trained MLPR varies from a few KB to a few MB, depending on the sample size of the input data and the number of populations in the demographic model. The number of trained MLPRs downloaded corresponds to the number of parameters in the requested demographic model (i.e., one trained MLPR per parameter).
Downloading model: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/tuned_models/param_01_predictor.keras to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20
Downloading model: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/tuned_models/param_02_predictor.keras to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20
Downloading model: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/tuned_models/param_03_predictor.keras to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20
Downloading model: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/tuned_models/param_04_predictor.keras to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20
Downloading model: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/tuned_models/param_05_predictor.keras to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_coverage_coverage.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_01_accuracy_95_ci.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_01_accuracy.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_02_accuracy_95_ci.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_02_accuracy.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_03_accuracy_95_ci.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_03_accuracy.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_04_accuracy_95_ci.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_04_accuracy.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_05_accuracy_95_ci.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_param_05_accuracy.png to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Downloading QC: /iplant/home/shared/donni/three_epoch/unfolded/ss_20/v0.9.0/plots/three_epoch_test_theta_1000_report.txt to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC
Finished downloading files to /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20 folder.
Once downloaded, donni will use the trained MLPRs to infer the demographic parameter values and confidence intervals (default: 95% CI) for the user's input allele frequency data.
***Inferred demographic model parameters***
# nuB nuF TB TF misid theta nuB_lb_95 nuB_ub_95 nuF_lb_95 nuF_ub_95 TB_lb_95 TB_ub_95 TF_lb_95 TF_ub_95 misid_lb_95 misid_ub_95
0.4146571254600148 0.3869760036609383 0.5405870676040649 0.6910597085952759 0.01265975832939148 2489.12434337297 0.0022699636903605234 75.74593920815283 0.033626422287440526 4.45335593924669 -0.3420185887813568 1.4231927239894868 -0.13996889114379885 1.5220883083343506 -0.028875216841697693 0.05419473350048065
# CIs: |----------95----------|
# nuB: [ 0.002270, 75.745939]
# nuF: [ 0.033626, 4.453356]
# TB: [ -0.342019, 1.423193]
# TF: [ -0.139969, 1.522088]
# misid: [ -0.028875, 0.054195]
Check the plots in /home/lntran/.cache/donni/0.9.0/three_epoch_unfolded_ns_20_QC for accuracy scores of the downloaded model.
Users can save this output as a text file using the --output_prefix argument:
$ donni infer --input_fs examples/data/1d_ns20_sfs.fs --model three_epoch --output_prefix three_epoch_ns20_results
This will create a text file three_epoch_ns20_results containing the inferred demographic model parameters from the previous command.
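If needed, the saved results can be parsed back into Python for downstream scripting. The snippet below is a minimal sketch, not part of donni itself; it assumes the file follows the format shown above (a header line beginning with # followed by a single whitespace-separated row of values).

# Minimal sketch: read donni's saved inference output into a dictionary.
# Assumes a '#'-prefixed header line followed by one row of values;
# the banner and trailing CI-summary lines are ignored.
with open("three_epoch_ns20_results") as fh:
    lines = [line.strip() for line in fh
             if line.strip() and not line.startswith("***")]
header = lines[0].lstrip("#").split()
values = [float(x) for x in lines[1].split()]
results = dict(zip(header, values))
print(results["nuB"], results["nuB_lb_95"], results["nuB_ub_95"])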
If users want to use dadi with the results from donni for plotting or further analysis, they can export a file that can be read by dadi-cli, a command-line interface for dadi. The --export_dadi_cli flag stores results readable by dadi-cli in a file with a user-specified name and the extension .donni.pseudofit.
For example, the command
$ donni infer --input_fs examples/data/1d_ns20_sfs.fs --model three_epoch --export_dadi_cli three_epoch_ns20
will generate a file three_epoch_ns20.donni.pseudofit, which can then be read by dadi-cli subcommands that use the --demo-pop and --bestfit-p0-file flags.
donni can also remove the files that were downloaded for a specific combination of data and demographic model using the --cleanup flag:
$ donni infer --input_fs examples/data/1d_ns20_sfs.fs --model three_epoch --cleanup
donni currently supports all demographic models in the dadi API as well as the models from Portik et al. The supported sample sizes are 10, 20, 40, 80, and 160 chromosomes per population (up to 20 chromosomes only for three-population models). Input allele frequency spectra with a different sample size will be automatically down-projected to the closest supported size before inference. donni will also automatically detect whether the input data is a folded or unfolded spectrum.
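For reference, the projection and folding that donni performs correspond to standard operations on a dadi Spectrum object. The sketch below shows the equivalent manual steps on the example spectrum shipped with this repo; it is an illustration, not donni's internal code.

import dadi

# Load an allele frequency spectrum (dadi's .fs text format).
fs = dadi.Spectrum.from_file("examples/data/1d_ns20_sfs.fs")

# Down-project a one-population spectrum from 20 to 10 chromosomes;
# donni does something equivalent when the input sample size is unsupported.
fs_small = fs.project([10])

# fs.folded reports whether a spectrum is folded; fold() folds it.
print(fs.folded)
fs_folded = fs.fold()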
If the requested model is not available, users will see this message:
$ donni infer --input_fs examples/data/1d_ns20_sfs.fs --model foo
The requested demographic model does not exist on the CyVerse Data Store.
Users can check for available models at https://de.cyverse.org/data/ds/iplant/home/shared/donni
If users have generated their own trained MLPRs, they can point donni to them with the --mlpr_dir argument (see below).
The trained MLPRs (along with accuracy score and confidence interval coverage plots) are publicly available on the University of Arizona CyVerse Data Store (https://de.cyverse.org/data/ds/iplant/home/shared/donni).
All generated MLPRs are organized by the version of donni they were generated with. Users or developers can specify the version of the MLPRs they want to download with the --version argument.
Note that if the packages used to generate MLPRs change, older versions (for example, MLPRs generated by donni v0.0.1) will not be supported by the latest version of donni.
Users can specify the confidence intervals they want with the --cis argument. For example, 80% and 90% confidence intervals can be requested with the following command:
$ donni infer --input_fs examples/data/1d_ns20_sfs.fs --model three_epoch --cis 80 90
***Inferred demographic model parameters***
# nuB nuF TB TF misid theta nuB_lb_80 nuB_ub_80 nuF_lb_80 nuF_ub_80 TB_lb_80 TB_ub_80 TF_lb_80 TF_ub_80 misid_lb_80 misid_ub_80 nuB_lb_90 nuB_ub_90 nuF_lb_90 nuF_ub_90 TB_lb_90 TB_ub_90 TF_lb_90 TF_ub_90 misid_lb_90 misid_ub_90
0.4146571254600148 0.3869760036609383 0.5405870676040649 0.6910597085952759 0.01265975832939148 2489.12434337297 0.013825473303355517 12.436502383830257 0.07848449596883701 1.9080255986972294 -0.03580846309661867 1.1169825983047486 0.14834715366363527 1.2337722635269164 -0.014465123414993286 0.039784640073776245 0.005312160541557728 32.367344764837036 0.05010784243050499 2.9885626709447877 -0.19791970610618592 1.2790938413143158 -0.00429075241088861 1.3864101696014404 -0.0220939964056015 0.04741351306438446
# CIs: |----------80----------| |----------90----------|
# nuB: [ 0.013825, 12.436502] [ 0.005312, 32.367345]
# nuF: [ 0.078484, 1.908026] [ 0.050108, 2.988563]
# TB: [ -0.035808, 1.116983] [ -0.197920, 1.279094]
# TF: [ 0.148347, 1.233772] [ -0.004291, 1.386410]
# misid: [ -0.014465, 0.039785] [ -0.022094, 0.047414]
If users have trained their own MLPRs, they can direct donni to the directory containing them with the --mlpr_dir argument.
$ donni infer --input_fs examples/data/1d_ns20_sfs.fs --model two_epoch --mlpr_dir examples/data/two_epoch_20_mlprs/
***Inferred demographic model parameters***
# nu T misid theta nu_lb_95 nu_ub_95 T_lb_95 T_ub_95 misid_lb_95 misid_ub_95
0.24181668926375313 0.9589159488677979 0.0008354485034942627 4046.0925977734805 0.02504349414857273 2.334950181455216 0.1795339167118073 1.7382979810237884 -0.018864348903298377 0.020535245910286902
# CIs: |----------95----------|
# nu: [ 0.025043, 2.334950]
# T: [ 0.179534, 1.738298]
# misid: [ -0.018864, 0.020535]
Sometimes the trained MLPRs will output a negative parameter estimate, leading to an undefined theta value. An example is included and can be accessed by running the following command:
$ donni infer --input_fs examples/data/1d_ns20_sfs.fs --model two_epoch --mlpr_dir examples/data/two_epoch_20_mlprs_negative/
In this case, the pipeline will output something like the following, with a warning about an undefined theta:
***Inferred demographic model parameters***
# nu T misid theta nu_lb_95 nu_ub_95 T_lb_95 T_ub_95 misid_lb_95 misid_ub_95
0.13990851968127813 1.1827903985977173 -0.010053243488073349 nan 0.010451441687755284 1.8728893548092593 0.3536277294158936 2.011953067779541 -0.02659363556653261 0.006487148590385912
# CIs: |----------95----------|
# nu: [ 0.010451, 1.872889]
# T: [ 0.353628, 2.011953]
# misid: [ -0.026594, 0.006487]
WARNING: Theta is not defined. Check inferred demographic model parameters for negative values.
As indicated in the output here, the misid parameter is very slightly negative (-0.010), causing theta to be undefined. How to handle such cases depends on the situation: which parameter is negative, how accurately the trained MLPRs predict that parameter (check the downloaded QC validation plots), and the magnitude of the negative estimate. In this example, misid is likely simply very close to 0, which is good, but further optimization with dadi/dadi-cli is recommended.
The three subcommands generate_data, train, and validate are used for training and testing MLPRs. This is the procedure we used to produce all the MLPRs in the current library. Users can follow the same method, outlined below, to define a custom demographic model with dadi and produce the corresponding trained MLPRs.
donni uses dadi to simulate allele frequency data for training the MLPRs. Hence, the custom model must be a Python script (.py file) in the format that dadi uses (see https://dadi.readthedocs.io/en/latest/user-guide/specifying-a-model/). The path to this file is passed to donni with --model_file and the specific model in the file with --model. We can start building our command:
$ donni generate_data --model out_of_africa --model_file donni/custom_models.py
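For orientation, a model file like custom_models.py only needs to define functions in dadi's standard form, taking (params, ns, pts) and returning a Spectrum. The minimal, hypothetical single-population example below illustrates the format; it is not the out_of_africa model itself.

import dadi

# Hypothetical example of a dadi-style model function (not out_of_africa):
# an instantaneous size change to nu that occurred T (in 2*Na generations) ago.
def my_two_epoch(params, ns, pts):
    nu, T = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.Integration.one_pop(phi, xx, T, nu)
    return dadi.Spectrum.from_phi(phi, ns, (xx,))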
Three additional arguments are required for generating the training data: --n_samples, --sample_sizes, and --outfile.
The --sample_sizes argument defines the number of sampled chromosomes per population and hence the size of the simulated AFS. A three-population model should have three numbers separated by spaces, giving the number of chromosomes in each population. For example:
--sample_sizes 10 10 10
The --n_samples argument specifies the number of AFS with unique parameter labels to generate. We used 5000 AFS for the training set and 1000 AFS for the test set. The optional --seed argument can be used for reproducibility as well as to ensure that different AFS are generated for training vs. testing.
--n_samples 5000 --seed 1
The --outfile argument specifies the file name and path for saving the output data. The output is a pickled Python dictionary with parameter labels as keys and simulated AFS as values.
--outfile data/train_5000
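Since the output is a standard pickled dictionary, it can be inspected directly with Python. A minimal sketch, assuming the structure described above (parameter labels as keys, simulated AFS as values):

import pickle

# Load the simulated training data saved by generate_data.
with open("data/train_5000", "rb") as fh:
    train_data = pickle.load(fh)

print(len(train_data))                      # number of simulated AFS
params, fs = next(iter(train_data.items()))
print(params)                               # one set of parameter labels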
The --save_individual_fs argument can be used to save each simulated AFS as a separate file instead of all in one dictionary. The parameter labels will be saved separately in a pickled file, true_log_params. This usage requires an additional argument, --outdir, which specifies the directory where all the individual AFS files and the true_log_params file will be saved. Note that for this usage, --outfile still needs to be specified, as it is also used to save the corresponding QC files (for checking simulated AFS quality).
--save_individual_fs --outdir test_fs --outfile test_fs/test_100_theta_1000
The --grids argument is used by the dadi simulation engine to calculate the AFS. By default, donni will calculate appropriate grid points based on the specified --sample_sizes. Higher grid points can improve the quality of the simulated AFS. donni will automatically check the quality of the generated spectra; this check can be turned off with the --no_fs_qual_check option.
The --theta argument controls the variance (noise) in the simulated AFS by scaling the spectra by the given theta value and then resampling. By default, --theta is 1 (no scaling, no sampling), which is used for generating the training AFS (no noise). For generating test AFS, we often use --theta 1000 to simulate moderately noisy AFS. If the value passed to --theta is greater than 1, the generated AFS will be scaled by that value. Because we usually only do this to generate noisy AFS for testing, donni will by default also Poisson-sample from the scaled AFS. If this is not desired, use the --no_sampling argument for scaling without sampling. Similarly, all simulated AFS are normalized by default before being saved; the --non_normalize argument bypasses this and generates non-normalized AFS.
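In dadi terms, the scaling, Poisson sampling, and normalization described above correspond roughly to the following Spectrum operations (an illustration only, not donni's internal code):

import dadi

fs = dadi.Spectrum.from_file("examples/data/1d_ns20_sfs.fs")

# Scale by theta to set the expected number of variants, then
# Poisson-sample to add noise (roughly what --theta 1000 does for test AFS).
fs_noisy = (fs * 1000).sample()

# Normalize so that entries sum to 1 (the default before saving).
fs_norm = fs_noisy / fs_noisy.sum()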
Users can use the --folded argument to generate folded AFS. By default, unfolded AFS are generated.
donni can also generate bootstrapped AFS data with the --bootstrap argument. For this usage, the --n_bstr argument is required to specify how many bootstrapped AFS to generate per simulated AFS.
By default, donni will use all available CPUs to simulate the AFS in parallel. Users can control the number of CPUs used with --n_cpu.
To generate 5000 training AFS for the out_of_africa model found in the donni/donni/custom_models.py file, using the arguments described above:
$ donni generate_data --model out_of_africa --model_file donni/custom_models.py --n_samples 5000 \
--sample_sizes 10 10 10 --seed 1 --outfile data/train_5000
To generate 1000 test AFS with moderate noise for the same model:
$ donni generate_data --model out_of_africa --model_file donni/custom_models.py --n_samples 1000 \
--sample_sizes 10 10 10 --seed 100 --theta 1000 --outfile data/test_1000_theta_1000
After generating the data for training and testing, we can use these data to tune and train the MLPRs for each demographic model parameter. This is done with the donni subcommand train and two required flags: --data_file, pointing to the training data output from the previous step, and --mlpr_dir, indicating the path where the trained MLPRs will be saved.
$ donni train --data_file data/train_5000 --mlpr_dir trained_models
While it is possible to train MLPRs using the default set of hyperparameters, we recommend that users first run the tuning procedure to find an optimized set of hyperparameters. This can be done by adding the --tune argument.
$ donni train --data_file data/train_5000 --mlpr_dir tuned_models --tune
Finally, we can use the simulated test data to measure the accuracy of the trained MLPRs with the validate subcommand. The required arguments are:
- --mlpr_dir: path to the trained MLPRs to be validated
- --test_dict: path to the test data
- --results_dir: path to save the output plots
- --plot_prefix: prefix for each plot's filename
- --model: the demographic model used (required to obtain the demographic model parameters). If a custom model is used, also provide a path to it with --model_file.
- --folded: required if the MLPRs were trained on folded AFS data (no misid parameter in the demographic model).
Below is an example of a full command:
$ donni validate --model two_epoch --mlpr_dir examples/data/two_epoch_20_mlprs \
--test_dict examples/data/two_epoch_test_1000_theta_1000 \
--results_dir examples/data/plots --plot_prefix two_epoch_theta_1000
For descriptions of all arguments, use:
$ donni validate -h
- Python 3.11+
- dadi
- TensorFlow 2.12
- keras-tuner 1.4.6