xin-huang/dadi-cli

different proportions from gamma and lognormal distributions

xin-huang opened this issue · 12 comments

I used the simulated data from https://github.com/popsim-consortium/analysis2 to fit a gamma or lognormal DFE with dadi.

However, the proportions of mutations with different selection coefficients are different from the inferred gamma and lognormal DFEs.

The inferred gamma DFE for pop0 from the Constant demographic model without annotation is shape=0.187013 and scale=0.05817376022120527
The inferred lognormal DFE from the same data is mu=-1.40769 and sigma=4.12304

To calculate the proportions of mutations with different selection coefficients, I used the following codes:

from scipy.stats import lognorm, gamma
def lognormal_mut_prop(mu, sigma):
    ps = lognorm.cdf([1e-5, 1e-4, 1e-3, 1e-2], s=sigma, scale=np.exp(mu))
    props = [ps[0], ps[1]-ps[0], ps[2]-ps[1], ps[3]-ps[2], 1-ps[3]]
    
    return props

def gamma_mut_prop(shape, scale):
    ps = gamma.cdf([1e-5, 1e-4, 1e-3, 1e-2], a=shape, scale=scale)
    props = [ps[0], ps[1]-ps[0], ps[2]-ps[1], ps[3]-ps[2], 1-ps[3]]
    
    return props

Then the proportions from the gamma DFE are [0.21445487739385305, 0.11533939611182581, 0.17626500815128215, 0.2542754681968391, 0.23966525014619988]
the proportions from the lognormal DFE are [0.00712460844025545, 0.022090862661779957, 0.06188924246452784, 0.1279129303758305, 0.7809823560576062]

If I used the parameters from Huber et al. 2017, then the proportions are similar.
For the gamma DFE, shape=0.19 and scale=0.074, the proportions are [0.19981642594537893, 0.10960250183554734, 0.16888719378656925, 0.24865042521127445, 0.27304345322123].
For the lognormal DFE, mu=-6.86 and sigma=4.89, the proportions are [0.17067061591167176, 0.14471479180194546, 0.18071862151578205, 0.18153626737811707, 0.32235970339248365].

Reminder... To explore this, we'll need the data file you were fitting and the cache you were using.

Reminder... To explore this, we'll need the data file you were fitting and the cache you were using.

He sent me the file location on the HPC, I'll look at them today.

I'm seeing similar things to Xin. If I recall, Huber et al. converted parameters from gamma to s. @xin-huang it looks like you are converting parameters you get from dadi-cli, how are you doing the conversion for the lognormal parameters?

Hi @tjstruck

I used the following codes to convert parameters to s.

from scipy.stats import lognorm, gamma
def lognormal_mut_prop(mu, sigma):
    ps = lognorm.cdf([1e-5, 1e-4, 1e-3, 1e-2], s=sigma, scale=np.exp(mu))
    props = [ps[0], ps[1]-ps[0], ps[2]-ps[1], ps[3]-ps[2], 1-ps[3]]
    
    return props

def gamma_mut_prop(shape, scale):
    ps = gamma.cdf([1e-5, 1e-4, 1e-3, 1e-2], a=shape, scale=scale)
    props = [ps[0], ps[1]-ps[0], ps[2]-ps[1], ps[3]-ps[2], 1-ps[3]]
    
    return props

And I think we are inferring deleterious DFEs here, so after I got the bestfit parameters from dadi-cli, I added a negative sign before mu as Huber et al. did.

To be clear, these codes do not convert from gamma to s. They calculate proportions within ranges of gamma values. (Converting gamma to s requires an estimate of the effective population size.) Note also that there is no need to change the sign of mu. (The lognormal distribution is only defined on one side of 0, not matter the sign of mu.) Are you actually inferring mu = 1.4 or mu = -1.4?

To be clear, these codes do not convert from gamma to s. They calculate proportions within ranges of gamma values. (Converting gamma to s requires an estimate of the effective population size.) Note also that there is no need to change the sign of mu. (The lognormal distribution is only defined on one side of 0, not matter the sign of mu.) Are you actually inferring mu = 1.4 or mu = -1.4?

I think the s you mean should be population-scaled selection coefficient?
And my codes estimate the proportion of selection coefficient (not scaled with the population size).
Anyway, Huber et al. provided both proportions (population-scaled or not scaled). Please see Fig. S10 in their paper. I think my estimates are closed to their plots (Fig. S10A the right plot and Fig. S10C the right plot).
And I verified my codes gamma_mut_prop(shape, scale) with the true parameters in Table 1 (shape = 0.184, scale=3266/(2*10085) from Kim et al. 2017. My result is the same as theirs (Table 1 Row 1).

I think the sign matters. If you put the positive mu into the function lognormal_mut_prop(mu, sigma), you will get more proportion for mutations with |s|>1e-2

I got mu=1.4 and put it into lognormal_mut_prop(mu, sigma), the proportions are [0.0008684119679748278, 0.004166439747958517, 0.016919637095415885, 0.05067399471767366, 0.9273715164709772]

I think this is too strange, and added a negative sign into mu, then the proportions are [0.00712460844025545, 0.022090862661779957, 0.06188924246452784, 0.1279129303758305, 0.7809823560576062]

The reason that the sign matters/can be a negative value is that a negative mu will be converted into 1/e^|mu| (e here being np.exp) for the lognormal equation.
ex.

import numpy as np
print(1/np.exp(1) == np.exp(-1))

So negative mu will be closer to 0, and so you will get a higher proportion of |s| values near zero. So it probably makes sense that if you scale s by the population size, the mean will be larger and there will be fewer |s| near zero.

Huber et al. do compare lognormal models scaled and not scaled by population size in Table S3 for their "Constrained model (H0)". The inferred parameters for the model using gamma (Nes) units are mu = 5.92 and sigma = 4.26, and the parameters for the selection coefficient (s) units model are mu = 5.92 and sigma = 4.26.

Using the functions Xin provided with Table S3 All Data inferences, the proportion results are:
s units, mu = -8.74 and sigma = 4.88 : [0.28494197097185714, 0.17666693100678837, 0.1847318972613387, 0.15524556938197398, 0.1984136313780418]

Nes units, mu = 5.92 and sigma = 4.26 : [2.1361673360203138e-05, 0.0001699970257673072, 0.0011096847433310067, 0.005441453385729758, 0.9932575031718117]

Similarly, using the "Constrained model (H0)" for Table S2 (the gamma distribution) All Data inferences, the proportion results are:
s units, shape = 0.24 and scale = 4.28e-3 : [0.2569935271777528, 0.18780443060685165, 0.2987019977331315, 0.24580759161984078, 0.010692452862423263]

Nes, shape = 0.33 and scale = 2.53e3 : [0.001887708572730054, 0.0021481407561730914, 0.004592642773648407, 0.009818880619862677, 0.9815526272775857]

So from Huber's paper, no matter what distribution it is (gamma or lognormal), the proportions are similar if they are in the same unit (Ns or s). However, this is different from the results I inferred from the simulated data or real data.

I used the YRI population from the 1000 Genomes Project, I can get similar result with Huber's if I used a gamma distribution. But the result is different if I used a lognormal distribution. Can you reanalyze the YRI population with a gamma and lognormal DFE @tjstruck ?

Here are the results from 5 seeds for pop0 (I believe this is YRI, but I think the results would be the same for all three since they all seem to fit the same two epoch parameters). I'm also putting the Huber et al results I got at the end so that it is easier for others to compare:

Seed 1, (137773603)
Gamma, shape = 0.17945057416475152 and scale = 368.4954714228547:
[0.04748759871048619, 0.02429691576867407, 0.036728338045816905, 0.055519736793774116, 0.8359674106812487]
Lognormal, mu = 1.3780547310338864 and sigma = 4.251817090347809:
[0.0012151969938256717, 0.0051661672071724235, 0.019280135082577227, 0.05402017925598655, 0.9203183214604381]

Seed 2, (1069045866)
Gamma, shape = 0.1673423806906049 and scale = 452.10569706547767:
[0.0564461913109368, 0.026534492940645582, 0.039007921718736474, 0.05734447498698611, 0.8206669190426951]
Lognormal, mu = 1.3003808312050886 and sigma = 4.514123274750947:
[0.0022663572838178803, 0.007678864660326917, 0.024562135040771527, 0.06088881331759297, 0.9046038296974908]

Seed 3, (1402032511)
Gamma, shape = 0.1920702308965862 and scale = 295.7171347017453:
[0.039914597209378774, 0.02220118529579855, 0.034549823642992736, 0.05376634200258361, 0.8495680518492463]
Lognormal, mu = 1.4160329388692905 and sigma = 3.9825457797124386:
[0.0005843612690058565, 0.003228168899276579, 0.014493352624230994, 0.046974019659861685, 0.9347200975476249]

Seed 4, (1358822686)
Gamma, shape = 0.199202892821409 and scale = 240.00946831129045:
[0.03688199602489587, 0.021464836230457744, 0.03395703664127213, 0.0537187362848751, 0.8539773948184992]
Lognormal, mu = 1.3462726312028233 and sigma = 3.919919257258311:
[0.0005181617888196735, 0.003021719068434502, 0.014076948580578924, 0.04685804753488855, 0.9355251230272783]

Seed 5, (1605979228)
Gamma, shape = 0.19086285721280774 and scale = 291.69717342954937:
[0.04084408503141787, 0.02254171955326012, 0.03498237851059806, 0.0542883374706702, 0.8473434794340537]
Lognormal, mu = 1.37887123125695 and sigma = 3.978532386254991:
[0.0005969367530048246, 0.003291720790061259, 0.014744475415171964, 0.04764691436127324, 0.9337199526804887]

Huber et al. (or at least what I calculated)
Gamma, shape = 0.33 and scale = 2.53e3:
[0.001887708572730054, 0.0021481407561730914, 0.004592642773648407, 0.009818880619862677, 0.9815526272775857]
Lognormal, 5.92 and sigma = 4.26:
[2.1361673360203138e-05, 0.0001699970257673072, 0.0011096847433310067, 0.005441453385729758, 0.9932575031718117]

I think the simulations have similar differences between each other compared to the Huber results:
For the |s|<1e-5| proportion the gamma Nes is ~100 times bigger than lognormal
For the 1e-5<|s|<1e-4 proportion the gamma Nes is ~10 times bigger than lognormal
For the 1e-4<|s| proportions, the gamma and lognormal are relatively similar in the same way that the s results are.

For the results from Huber et al., I don't understand why you used the parameters from the constrained model assuming the DFE in humans is equal to the DFE in Drosophila. I think we should the parameters from the full model (for the gamma distribution, shape=0.19, scale=0.0741; for the lognormal distribution, mu=-6.86, sigma=4.89), because Izabel also chose these parameters and put them into stdpopsim, see https://github.com/popsim-consortium/stdpopsim/blob/main/stdpopsim/catalog/HomSap/dfes.py#L57.

From your results, we can see the shape parameters is ~0.19, this is similar to Huber's results.
But the parameters (mu and sigma) from the lognormal distribution are different from Huber's results. This is also what I observed from my results.

OK.

The first function

def lognormal_mut_prop(mu, sigma):
    ps = lognorm.cdf([1e-5, 1e-4, 1e-3, 1e-2], s=sigma, scale=np.exp(mu))
    props = [ps[0], ps[1]-ps[0], ps[2]-ps[1], ps[3]-ps[2], 1-ps[3]]
    
    return props

calculates the proportions in Ns unit.
To calculate the proportions in s unit with a lognormal distribution, the mu parameter should be transformed into mu-ln(2Ne).

while the second function

def gamma_mut_prop(shape, scale):
    ps = gamma.cdf([1e-5, 1e-4, 1e-3, 1e-2], a=shape, scale=scale)
    props = [ps[0], ps[1]-ps[0], ps[2]-ps[1], ps[3]-ps[2], 1-ps[3]]
    
    return props

can calculate the proportions in either Ns or s units, depending on the scale parameter in Ns or s unit.