Alpha values for EDGE encoding are not correct
tomszar opened this issue · 2 comments
Pandas-genomics generate different alpha values when compared to PLATO
Reproduce example
To generate the environment:
conda create -n alphas python pandas numpy
conda activate alphas
pip install clarite pandas-genomics
To reproduce, open a python terminal:
import pandas as pd
import numpy as np
import clarite
from pandas_genomics import sim, io, scalars
additive_model = sim.BAMS.from_model(eff1=sim.SNPEffectEncodings.ADDITIVE,
eff2=sim.SNPEffectEncodings.ADDITIVE,
penetrance_base=0.45,
main1=1,
main2=0,
interaction=0)
sim_data = additive_model.generate_case_control(n_cases=5000,
n_controls=5000)
alpha_values = sim_data.genomics.\
calculate_edge_encoding_values(data=sim_data,
outcome_variable='Outcome')
alpha_values
The output is the following:
Variant ID Alpha Value Ref Allele Alt Allele Minor Allele Frequency
0 rs1 0.273881 A a 0.30265
1 rs2 0.003415 B b 0.29895
When using PLATO, the alpha value is 0.49798733
for both SNPs
It seems that PLATO and pandas_genomics generate different beta coefficients. For future reference, these are the regression models that pandas_genomics generate from the simulated data defined previously
import patsy
import statsmodels.api as sm
family = sm.families.Binomial(link=sm.families.links.logit())
counts = sim_data['Outcome'].value_counts().to_dict()
categories = sim_data['Outcome'].cat.categories
codes, categories = zip(*enumerate(categories))
sim_data['Outcome'].replace(categories, codes, inplace=True)
gt_col_names = ['SNP1',
'SNP2']
outcome = sim_data['Outcome']
genos = sim_data[gt_col_names]
for gt in gt_col_names:
encoded = genos[gt].genomics.encode_codominant()
df = pd.concat([outcome, encoded], axis=1)
formula = f"Q('{'Outcome'}') ~ Q('{gt}')"
y, X = patsy.dmatrices(formula,
df,
return_type="dataframe",
NA_action="drop")
# Drop the intercept column
X = X.drop(columns=["Intercept"])
# Run Regression
est = sm.GLM(y,
X,
family=family).\
fit(use_t=False)
print(est.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Q('Outcome') No. Observations: 10000
Model: GLM Df Residuals: 9998
Model Family: Binomial Df Model: 1
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -6917.4
Date: Tue, 26 Oct 2021 Deviance: 13835.
Time: 11:11:01 Pearson chi2: 1.00e+04
No. Iterations: 4
Covariance Type: nonrobust
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Q('SNP1')[T.Het] -0.0819 0.031 -2.630 0.009 -0.143 -0.021
Q('SNP1')[T.Hom] -0.2992 0.065 -4.591 0.000 -0.427 -0.171
====================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Q('Outcome') No. Observations: 10000
Model: GLM Df Residuals: 9998
Model Family: Binomial Df Model: 1
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -6929.4
Date: Tue, 26 Oct 2021 Deviance: 13859.
Time: 11:11:01 Pearson chi2: 1.00e+04
No. Iterations: 4
Covariance Type: nonrobust
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Q('SNP2')[T.Het] 0.0005 0.031 0.015 0.988 -0.060 0.061
Q('SNP2')[T.Hom] 0.1382 0.068 2.033 0.042 0.005 0.271
====================================================================================
It seems that the original paper incorrectly stated that the regression model should be without intercept, but it needs to be with intercept. When running the model on the same simulated dataset sim_data
with intercept estimation, these are the results
import patsy
import statsmodels.api as sm
family = sm.families.Binomial(link=sm.families.links.logit())
counts = sim_data['Outcome'].value_counts().to_dict()
categories = sim_data['Outcome'].cat.categories
codes, categories = zip(*enumerate(categories))
sim_data['Outcome'].replace(categories, codes, inplace=True)
gt_col_names = ['SNP1',
'SNP2']
outcome = sim_data['Outcome']
genos = sim_data[gt_col_names]
for gt in gt_col_names:
encoded = genos[gt].genomics.encode_codominant()
df = pd.concat([outcome, encoded], axis=1)
formula = f"Q('{'Outcome'}') ~ Q('{gt}')"
y, X = patsy.dmatrices(formula,
df,
return_type="dataframe",
NA_action="drop")
# Run Regression
est = sm.GLM(y,
X,
family=family).\
fit(use_t=False)
print(est.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Q('Outcome') No. Observations: 10000
Model: GLM Df Residuals: 9997
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -6907.4
Date: Mon, 01 Nov 2021 Deviance: 13815.
Time: 14:52:21 Pearson chi2: 1.00e+04
No. Iterations: 4
Covariance Type: nonrobust
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 0.1273 0.029 4.450 0.000 0.071 0.183
Q('SNP1')[T.Het] -0.2092 0.042 -4.946 0.000 -0.292 -0.126
Q('SNP1')[T.Hom] -0.4265 0.071 -5.992 0.000 -0.566 -0.287
====================================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: Q('Outcome') No. Observations: 10000
Model: GLM Df Residuals: 9997
Model Family: Binomial Df Model: 2
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -6929.0
Date: Mon, 01 Nov 2021 Deviance: 13858.
Time: 14:52:21 Pearson chi2: 1.00e+04
No. Iterations: 4
Covariance Type: nonrobust
====================================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept -0.0249 0.029 -0.872 0.383 -0.081 0.031
Q('SNP2')[T.Het] 0.0254 0.042 0.606 0.545 -0.057 0.108
Q('SNP2')[T.Hom] 0.1631 0.074 2.212 0.027 0.019 0.308
====================================================================================
Resulting in 0.4905
and 0.1557
alpha values for rs1
and rs2
respectively.