Genealogical Likelihood (gLike) is a maximum likelihood method to infer the demographic history of given populations that best explains the observed genealogical relationships between sample haplotypes. This is, to our knowledge, the first attempt to infer all types of parameters (split times, admixture times and proportions, migration rates) in a complex demography model under a unified framework.
download the package and install from local:
git clone https://github.com/Ephraim-usc/glike.git
python3 -m pip install ./glike
There are three pre-requisite packages: tskit
, numpy
and scipy
.
Because glike contains a C extension module, a proper C environment is required.
See a simple tutorial to go though a toy example showing how glike works.
Core functionality
logp = glike.glike_trees(trees, demo, samples, prune)
Where trees
is any enumerable that contains tskit
genealogical trees.
Note that it is the user's duty to manually pick out trees that are selective neutral and independent, and to wrap them in a list or other iterable objects.
A tskit.TreeSequence
object is not directly iterable and thus not a legitimate input.
Although calling ts.trees()
to enumerate all trees in the tskit.TreeSequence
would work grammatically, it is not recommended in most cases, since neighboring trees are generally not independent from each other.
demo
is the hypothesized Demography created manually (see the following section) or from provided models in models.py
.
samples
is the dict that contains sample:pop
pairs, which specifies which sample is collected from which population.
The population can be identified either by an integer representing the population index or a string denoting the population name.
Samples not mentioned in the dictionary are considered potentially from any available population at the time of the sample.
The default is an empty dictionary.
prune
is the proportion of discarding low likelihood trees, enabling this feature often reduces noise when dealing with reconstructed trees.
The default is zero, meaning that all trees are reserved.
This function returns the log probability that such genealogical trees are generated under the hypothesized demography.
A Demography object is initialized with
demo = glike.Demo()
And a number of Phases are added into this Demography
demo.add_phase(phase)
A Phase is created by
phase = glike.Phase(t, t_end, ns, grs, P, Q, populations)
Where t
is the starting (most recent) time, t_end
is the ending (most ancient) time, ns
is the vector of coalescent rates, grs
is the vector of growth rates (a positive growth rate means the population size is growing forward in time, or equivalently, the coalescent rate is growing backward in time), P
is the mass migration matrix at the beginning of this phase, Q
is the continuous migration rate matrix, and populations
is the vector of population names. Only t
, t_end
and ns
are required , other arguments are optional. The number of populations in this phase should be consistent among parameters, so it is required that
len(ns) == len(grs) == P.shape[1] == Q.shape[1] == len(populations)
When adding new Phases into Demogrpahy, the times and dimensions should match. Specifically, if the last added phase is phases[i]
, and we are trying to add another phase phases[i+1]
, then it is required that
phases[i].t < phases[i].t_end = phases[i+1].t < phases[i+1].t_end
phases[i].P.shape[1] == phases[i+1].P.shape[0]
gLike cannot be applied directly to a demography that contains continuous migrations, and requires it to be discretized first:
demo.add_phase(phase, discretize)
Where discretize
is a float number indicating the time interval (that is, if phase.Q != None
, phase
will be sliced into chunks of discretize
generations).
The resulting demography can be visualized by
demo.print()
To make a demographic model containing variable parameters, the idiom is
def model(...):
demo = glike.Demography()
# add Phases that depend on the parameters
return demo
We provide an function for estimating parameters
glike.estimate(trees, model, search)
Which runs a smart maximum likelihood protocol specially designed for glike to find the estimated parameters. We use a Search object to tell the information about initial values and restrictions of the parameters. It is created with
search = Search(names, values, limits, fixed)
Where names
is a list of the names of the parameters, values
is a list of initial parameter values, fixed
is a list of the names of fixed parameters so that their values will keep untouched. limits
is a list of tuples (low, high)
where low
and high
could be a number or a string expression involving names of the parameters (which will be evaluated at runtime by eval()
). For example, if our model has three events happening between 0 and 100 generations ago, in order to estimate the times of these three events we could do
names = ["t1", "t2", "t3"]
values = [25, 50, 75] # or other initial values you see fit
limits = [(0, "t2"), ("t1", "t3"), ("t2", 100)]
Empirically, if the output parameters take values very close to the lower or upper limits, it's likely that the estimation is stuck in a local optimal, or the proposed model is not well compatible with the genealogical trees. If that's the case, it's suggested to try other initial values or demography models.
This project is introduced here.
If you are interested in knowing more, please let us know. Email the author: fcq1116@gmail.com