HallLab/pandas-genomics

Alpha values for EDGE encoding are not correct

tomszar opened this issue · 2 comments

Pandas-genomics generate different alpha values when compared to PLATO

Reproduce example

To generate the environment:

conda create -n alphas python pandas numpy
conda activate alphas
pip install clarite pandas-genomics

To reproduce, open a python terminal:

import pandas as pd
import numpy as np
import clarite
from pandas_genomics import sim, io, scalars

additive_model = sim.BAMS.from_model(eff1=sim.SNPEffectEncodings.ADDITIVE,
                                     eff2=sim.SNPEffectEncodings.ADDITIVE,
                                     penetrance_base=0.45,
                                     main1=1,
                                     main2=0,
                                     interaction=0)

sim_data = additive_model.generate_case_control(n_cases=5000,
                                                n_controls=5000)

alpha_values = sim_data.genomics.\
                calculate_edge_encoding_values(data=sim_data,
                                               outcome_variable='Outcome')

alpha_values

The output is the following:

  Variant ID  Alpha Value Ref Allele Alt Allele  Minor Allele Frequency
0        rs1     0.273881          A          a                 0.30265
1        rs2     0.003415          B          b                 0.29895

When using PLATO, the alpha value is 0.49798733 for both SNPs

It seems that PLATO and pandas_genomics generate different beta coefficients. For future reference, these are the regression models that pandas_genomics generate from the simulated data defined previously

import patsy
import statsmodels.api as sm

family = sm.families.Binomial(link=sm.families.links.logit())
counts = sim_data['Outcome'].value_counts().to_dict()
categories = sim_data['Outcome'].cat.categories
codes, categories = zip(*enumerate(categories))
sim_data['Outcome'].replace(categories, codes, inplace=True)
gt_col_names = ['SNP1',
                'SNP2']
outcome = sim_data['Outcome']
genos   = sim_data[gt_col_names]

for gt in gt_col_names:
    encoded = genos[gt].genomics.encode_codominant()
    df      = pd.concat([outcome, encoded], axis=1)
    formula = f"Q('{'Outcome'}') ~ Q('{gt}')"
    y, X = patsy.dmatrices(formula,
                           df,
                           return_type="dataframe",
                           NA_action="drop")

    # Drop the intercept column
    X = X.drop(columns=["Intercept"])
    # Run Regression
    est = sm.GLM(y,
                 X,
                 family=family).\
                     fit(use_t=False)

    print(est.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           Q('Outcome')   No. Observations:                10000
Model:                            GLM   Df Residuals:                     9998
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -6917.4
Date:                Tue, 26 Oct 2021   Deviance:                       13835.
Time:                        11:11:01   Pearson chi2:                 1.00e+04
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Q('SNP1')[T.Het]    -0.0819      0.031     -2.630      0.009      -0.143      -0.021
Q('SNP1')[T.Hom]    -0.2992      0.065     -4.591      0.000      -0.427      -0.171
====================================================================================


                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           Q('Outcome')   No. Observations:                10000
Model:                            GLM   Df Residuals:                     9998
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -6929.4
Date:                Tue, 26 Oct 2021   Deviance:                       13859.
Time:                        11:11:01   Pearson chi2:                 1.00e+04
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Q('SNP2')[T.Het]     0.0005      0.031      0.015      0.988      -0.060       0.061
Q('SNP2')[T.Hom]     0.1382      0.068      2.033      0.042       0.005       0.271
====================================================================================

It seems that the original paper incorrectly stated that the regression model should be without intercept, but it needs to be with intercept. When running the model on the same simulated dataset sim_data with intercept estimation, these are the results

import patsy
import statsmodels.api as sm

family = sm.families.Binomial(link=sm.families.links.logit())
counts = sim_data['Outcome'].value_counts().to_dict()
categories = sim_data['Outcome'].cat.categories
codes, categories = zip(*enumerate(categories))
sim_data['Outcome'].replace(categories, codes, inplace=True)
gt_col_names = ['SNP1',
                'SNP2']
outcome = sim_data['Outcome']
genos   = sim_data[gt_col_names]

for gt in gt_col_names:
    encoded = genos[gt].genomics.encode_codominant()
    df      = pd.concat([outcome, encoded], axis=1)
    formula = f"Q('{'Outcome'}') ~ Q('{gt}')"
    y, X = patsy.dmatrices(formula,
                           df,
                           return_type="dataframe",
                           NA_action="drop")

    # Run Regression
    est = sm.GLM(y,
                 X,
                 family=family).\
                     fit(use_t=False)

    print(est.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           Q('Outcome')   No. Observations:                10000
Model:                            GLM   Df Residuals:                     9997
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -6907.4
Date:                Mon, 01 Nov 2021   Deviance:                       13815.
Time:                        14:52:21   Pearson chi2:                 1.00e+04
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            0.1273      0.029      4.450      0.000       0.071       0.183
Q('SNP1')[T.Het]    -0.2092      0.042     -4.946      0.000      -0.292      -0.126
Q('SNP1')[T.Hom]    -0.4265      0.071     -5.992      0.000      -0.566      -0.287
====================================================================================
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           Q('Outcome')   No. Observations:                10000
Model:                            GLM   Df Residuals:                     9997
Model Family:                Binomial   Df Model:                            2
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -6929.0
Date:                Mon, 01 Nov 2021   Deviance:                       13858.
Time:                        14:52:21   Pearson chi2:                 1.00e+04
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           -0.0249      0.029     -0.872      0.383      -0.081       0.031
Q('SNP2')[T.Het]     0.0254      0.042      0.606      0.545      -0.057       0.108
Q('SNP2')[T.Hom]     0.1631      0.074      2.212      0.027       0.019       0.308
====================================================================================

Resulting in 0.4905 and 0.1557 alpha values for rs1 and rs2 respectively.