Fast demography-aware inference of fine-scale recombination rates based on fused-LASSO
NB: in version 0.1.0 we switched everything from coalescent units to natural scale -- that is inputs should now be in terms of Ne, generations, and per-base, per-generation mutation rates. The output is now also automatically scaled to be the per-generation recombination rate.
We have inferred recombination maps for each of the 26 populations in phase 3 of the 1000 Genomes Project. Those maps are available at this link including maps inferred for the two most recent genome builds hg19/GRCh37 and hg38/GRCh38.
The recombination maps for hg38/GRCh38 are now also available for simulations using the wonderful stdpopsim package.
When making the recombination maps for the 1000 Genomes Project populations we used
smc++
to infer population size histories for each population.
Those size histories are plotted in
Figure 2 of the pyrho paper,
and the data used to make those plots are available in smcpp_popsizes_1kg.csv
.
The x
column is time in years
assuming a generation time of 29 years,
and the y
column
contains the population size in Ne.
pyrho is compatible with both python 2 and python 3. pyrho makes use of a number of external packages including the excellent numba, msprime, and cyvcf2 packages. As such it is recommended to install pyrho in a virtual environment.
If using conda this can be accomplished by running
conda create -n my-pyrho-env
conda activate my-pyrho-env
or using
virtualenv my-pyrho-env
source my-pyrho-env/bin/activate
Once you have set up and activated your virtual environment, you must first install ldpop. Change to a directory where you want to store ldpop and then run
git clone https://github.com/popgenmethods/ldpop.git ldpop
pip install ldpop/
Note that this will create a directory ldpop.
pyrho makes use of
msprime,
which requires
gsl
and
hdf.
pyrho also has a dependency on
openssl.
If you do not have these installed, these can be installed using
apt-get
, yum
, conda
, brew
etc...
For example, to install openssl on Ubuntu run:
sudo apt-get install libssl-dev
You will also need to have cython installed. If you do not yet have it installed, run
pip install cython
You should be able to then just clone and install pyrho by running
git clone https://github.com/popgenmethods/pyrho.git pyrho
pip install pyrho/
You can check that everything is running smoothly by running
python -m pytest pyrho/tests/tests.py
NB: the first time you run pyrho, numba will compile and cache a number of functions, which can take up to ~30 seconds.
pyrho has a command line interface and consists of a number of separate commands. A typical workflow is to first use make_table to build a lookup table and then use hyperparam to find reasonable hyperparameter settings for your data. Finally, use optimize to infer a fine-scale recombination map from data. There is also a command, compute_r2, that computes statistics of the theoretical distribution of r2.
Before performing any inference, you must first compute a lookup table. A standard use case would be
pyrho make_table --samplesize <n> --approx --moran_pop_size <N> \
--numthreads <par> --mu <mu> --outfile <outfile> \
--popsizes <size1>,<size2>,<size3> --epochtimes <breakpoint1>,<breakpoint2>
which indicates that we should compute a lookup table for a sample of size
<n>
, from a population where at present the size is
<size1>
, at
<breakpoint1>
generations in the past the size was
<size2>
and so on, with a per-generation
mutation rate
<mu>
The --numthreads option tells pyrho to use
<par>
processors
when computing the lookup table. Finally, --approx with --moran_pop_size
tells pyrho to compute an approximate lookup table for a larger sample size
<N>
and then downsample to a table for size
<n>
. In general
<N>
should be
about 25-50% larger than
<n>
. Without using the --approx flag, pyrho can
compute lookup tables for
<n>
< ~50, whereas with the --approx flag, pyrho
can handle sample sizes in the hundreds (e.g.,
<n>
= 200,
<N>
= 256) with little loss in accuracy as long as <n>
<< <N>
.
The output is an hdf format table containing all of the pre-computed likelihoods needed to run hyperparam, optimize, and compute_r2.
Note that make_table can consume significant amounts of memory (N=210 requires about 20G of RAM using the --approx flag).
To see a full list of options and their meaning, run
pyrho make_table --help
.
After computing a lookup table with make_table, it is a good idea to find reasonable settings for the main hyperparameters of pyrho: the window size and the smoothness penalty. This command simulates data and performs optimization under a number of different hyperparameter settings and outputs the accuracy in terms of a number of different metrics. A typical usage is
pyrho hyperparam --samplesize <n> --tablefile <make_table_output> \
--mu <mu> --ploidy <ploidy> \
--popsizes <size1>,<size2>,<size3> --epochtimes <breakpoint1>,<breakpoint2> \
--outfile <output_file>
where <n>
is your haploid sample size, <make_table_output>
is
the output of running make_table, and
<mu>
, <size1>
..., and <breakpoint1>
... are as above.
<ploidy>
should be set to 1
if using phased data and 2
for
unphased genotype data. Ploidies other than 1
or 2
are not
currently supported.
The output is a table containing various measures of accuracy of the inferred maps compared to the recombination maps used in the simulations (drawn from the hapmap recombination map). The optimal hyperparameters will depend on your application -- it may be important to maximize a particular measure of correlation or to minimize the L2 norm between the true and inferred maps. In any case, you can use the results in the output table to choose the hyperparameters for running optimize.
To see a full list of options and their meaning, run
pyrho hyperparam --help
.
After computing a lookup table using make_table and choosing reasonable hyperparameters (optionally using hyperparam) you are ready to infer recombination maps from real data.
pyrho supports data in fasta format and LDhat's sites and locs format, but it is easiest to directly use VCF formatted data. pyrho supports VCF, bgzipped VCF, and BCF format files. If using LDhat's formats see the note about using sites and locs.
A typical usage is
pyrho optimize --vcffile <data> --windowsize <w> --blockpenalty <bpen> \
--tablefile <make_table_output> --ploidy <ploidy> --outfile <output_file> \
--numthreads <par>
with <data>
being a VCF, bgzipped VCF, or BCF file containing a
single chromosome, <w>
and
<bpen>
being hyperparameters chosen using hyperparam,
and <ploidy>
should be 1
for phased data and 2
for unphased
data.
The output file has three columns -- the first column is the zero-indexed start of an interval, the second column is the end of the interval (non-inclusive), and the third column is r, which is the per-base per-generation recombination rate in that interval.
To see a full list of options and their meaning, run
pyrho optimize --help
.
pyrho optimize can be slow if there is a lot of missing data.
Consider removing SNPs with more than a few missing genotypes
to increase the speed. Another option is to use the
--fast_missing
option. This option uses a bit more memory,
and throws away some data, but can be substantially faster
for datasets with lots of missing data, and does not require
pre-filtering SNPs for missingness.
The preferred input format for pyrho is a VCF format file containing a single chrmosome. LDhat format files may also be used with the following important caveat.
If using LDhat formatted data (i.e., a sites file and a locs file) note that we use a slighly different convention than LDhat (sorry!). For unphased data, LDhat uses the convention 0 = homozygous reference, 1 = homozygous alternate, 2 = heterozygous. We do not use this convention.
We use 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, and N = missing. That is, each entry should be the number of alternate alleles, or N for missing. We otherwise follow the formatting (including the headers) of LDhat as described here.
compute_r2 computes statistics of the theoretical distribution of r2 using a lookup table generated using make_table. In particular, it can compute the mean and/or quantiles of the distribution of r2.
A typical usage is
pyrho compute_r2 --tablefile <make_table_output> --samplesize <n> \
--quantiles 0.25,0.5,0.75 --compute_mean
which will compute the 25th, 50th, and 75th percentiles as well as the mean
of the distribution of r2 for the lookup table stored in
<make_table_output>
. It is possible to compute statistics for a smaller
sample size by setting <n>
to be less than the sample size for which
<make_table_output>
was computed.
The example folder contains a well-commented shell script example.sh which runs through a typical use-case, taking a VCF file and the output of smc++ and ultimately computing a fine-scale recombination map.
Does pyrho
return the population-scaled recombination rate "rho" or the per-generation recombination rate "r"?
Short answer: pyrho
infers the per-generation, per-base recombination rate "r".
Long answer: internally pyrho
infers "rho" and then uses the user-specified Ne to
convert "rho" to "r". This scaling works well in humans, but has not been tested
extensively in species with different mutation rates or ratios of mutation rate
to recombination rate. Overall, differences across the genome will be stable (i.e.,
relative rates) but the absolute scaling should be taken with some degree of
skepticism, especially when working with species whose evolutionary parameters
differ from those of humans.
Missing data makes the default implementation of pyrho optimize
run very slowly.
To get reasonable runtimes either remove sites with more that a few missing genotypes
or use the --fast-missing
flag. The --fast-missing
flag causes pyrho
to only
use individuals who are genotyped at both loci when calculating two-locus likelihoods.
This effectively throws away the information provided by individuals who are genotyped
at only one of the two loci. This information is not particularly useful for inferring
recombination rates, and ignoring it can result in a significant speedup in the
presence of missing data.
If you use pyrho please cite
and