FixedEffectModel is a Python package designed and built by the Kuaishou DA ecology group. It estimates the class of linear models that handle panel data, that is, data in which time series and cross-sectional observations are combined.
- Linear model
- Linear model with high dimensional fixed effects
- Difference-in-difference model with parallel checking plot
- Instrumental variable model
- Robust/white standard error
- Multi-way cluster standard error
- Instrumental variable model tests, including the weak IV test (Cragg-Donald statistic + Stock and Yogo critical values), the over-identification test (Sargan/Basmann test), and the endogeneity test (Durbin test)
For the instrumental variable model, we currently provide only the two-stage least squares estimator and report the second-stage regression results. Our next release will include the GMM method and robust standard errors based on GMM.
Install this package directly from PyPI
$ pip install FixedEffectModel
This simple case study is designed to get you up and running quickly with FixedEffectModel. We will show the steps needed.
After installing the package and its dependencies (statsmodels among them), we load a few modules and functions:
import numpy as np
import pandas as pd
from fixedeffect.iv import iv2sls, ivgmm, ivtest
from fixedeffect.fe import fixedeffect, did, getfe
from fixedeffect.utils.panel_dgp import gen_data
gen_data is the function we use to simulate data.
We use a simulated dataset with 100 cross-sectional units and 10 time units.
N = 100
T = 10
beta = [-3,1,2,3,4]
ate = 1
exp_date = 5
df = gen_data(N, T, beta, ate, exp_date)
In the simulated dataset above, "beta" holds the true coefficients, "ate" is the true treatment effect, and "exp_date" is the start date of the experiment.
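To make the roles of these arguments concrete, here is a minimal sketch of a panel data generating process written with NumPy only. The function `simulate_panel` and everything inside it are assumptions made for illustration; this is not the implementation of `gen_data`, and the real function may build its columns differently.

```python
import numpy as np

def simulate_panel(n_units, n_periods, beta, ate, exp_date, seed=0):
    """Hypothetical stand-in for a panel DGP; NOT the package's gen_data."""
    rng = np.random.default_rng(seed)
    unit = np.repeat(np.arange(n_units), n_periods)   # cross-sectional id
    time = np.tile(np.arange(n_periods), n_units)     # time id
    unit_effect = rng.normal(size=n_units)[unit]      # one effect per unit
    time_effect = rng.normal(size=n_periods)[time]    # one effect per period
    x = rng.normal(size=(n_units * n_periods, len(beta)))
    treated = (unit < n_units // 2).astype(float)     # assumption: first half of units treated
    post = (time >= exp_date).astype(float)           # 1 once the experiment has started
    noise = rng.normal(size=n_units * n_periods)
    y = (x @ np.asarray(beta, dtype=float)
         + ate * treated * post
         + unit_effect + time_effect + noise)
    return {"id": unit, "time": time, "x": x,
            "treatment": treated, "post": post, "y": y}

panel = simulate_panel(100, 10, [-3, 1, 2, 3, 4], ate=1, exp_date=5)
```

Each of the 100 units appears once per period, so the result is a balanced panel in long format with 1000 rows.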
We include two functions, "iv2sls" and "ivgmm", for instrumental variable regression.
The iv2sls function returns two-stage least squares estimation results. Define y as the dependent variable, x_1 as the exogenous variable, x_2 as the endogenous variable, and x_3 and x_4 as the instrumental variables; id and time are the cross-sectional ID and the time ID. An IV two-way fixed effect model is estimated by two-stage least squares using:
formula = 'y ~ x_1|id+time|0|(x_2~x_3+x_4)'
model_iv2sls = iv2sls(data_df = df,
formula = formula)
result = model_iv2sls.fit()
result.summary()
or
exog_x = ['x_1']
endog_x = ['x_2']
iv = ['x_3','x_4']
y = ['y']
model_iv2sls = iv2sls(data_df = df,
dependent = y,
exog_x = exog_x,
endog_x = endog_x,
category = ['id','time'],
iv = iv)
result = model_iv2sls.fit()
result.summary()
The two grammars above yield identical results. We provide specification tests for IV models:
ivtest(result)
Three tests are included: the weak IV test (Cragg-Donald statistic + Stock and Yogo critical values), the over-identification test (Sargan/Basmann test), and the endogeneity test (Durbin test).
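For readers who want to see what two-stage least squares does mechanically, the following sketch runs both stages by hand with NumPy on simulated data. It is an illustration under simplifying assumptions (no fixed effects, no standard errors, made-up variable names and coefficients) and does not reproduce the internals of `iv2sls`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=(n, 2))                  # two instruments (playing the role of x_3, x_4)
x1 = rng.normal(size=n)                      # exogenous regressor (role of x_1)
u = rng.normal(size=n)                       # unobserved confounder
x2 = z @ np.array([1.0, 1.0]) + u + rng.normal(size=n)    # endogenous regressor (role of x_2)
y = 1.0 * x1 + 2.0 * x2 + 2.0 * u + rng.normal(size=n)    # true coefficients: 1 and 2

def ols(X, y):
    """Least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# First stage: regress the endogenous variable on exogenous variables and instruments.
W = np.column_stack([const, x1, z])
x2_hat = W @ ols(W, x2)

# Second stage: replace the endogenous variable by its first-stage fitted values.
beta_2sls = ols(np.column_stack([const, x1, x2_hat]), y)

# Plain OLS for comparison; it is biased because x2 is correlated with u.
beta_ols = ols(np.column_stack([const, x1, x2]), y)
```

In this simulation the 2SLS coefficient on the endogenous regressor lands close to the true value of 2, while the OLS coefficient is pulled upward by the confounder.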
The ivgmm function returns one-step GMM estimation results. With the same variable definitions, estimation is achieved by:
formula = 'y ~ x_1|id+time|0|(x_2~x_3+x_4)'
model_ivgmm = ivgmm(data_df = df,
formula = formula)
result = model_ivgmm.fit()
result.summary()
or
exog_x = ['x_1']
endog_x = ['x_2']
iv = ['x_3','x_4']
y = ['y']
model_ivgmm = ivgmm(data_df = df,
dependent = y,
exog_x = exog_x,
endog_x = endog_x,
category = ['id','time'],
iv = iv)
result = model_ivgmm.fit()
result.summary()
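Judging from the paired examples above, the formula string has four `|`-separated parts after `~`: regressors, fixed effects, cluster variables, and the IV specification, with `0` marking an empty part. The helper below is a hypothetical sketch of that reading; `split_formula` is not part of the package, and the package's own parser may accept or reject other inputs.

```python
def split_formula(formula):
    """Split 'y ~ x|fe|cluster|(endog~iv)' into named parts (hypothetical helper)."""
    lhs, rhs = formula.split("~", 1)
    parts = [p.strip() for p in rhs.split("|")]
    if len(parts) != 4:
        raise ValueError("expected four '|'-separated parts after '~'")

    def names(part):
        # '0' marks an empty slot; otherwise variables are joined by '+'
        part = part.strip()
        return [] if part == "0" else [v.strip() for v in part.split("+")]

    endog, iv = [], []
    if parts[3] != "0":
        # IV part looks like '(endogenous ~ instruments)'
        endog_part, iv_part = parts[3].strip("()").split("~", 1)
        endog, iv = names(endog_part), names(iv_part)

    return {
        "dependent": [lhs.strip()],
        "exog_x": names(parts[0]),
        "category": names(parts[1]),
        "cluster": names(parts[2]),
        "endog_x": endog,
        "iv": iv,
    }

spec = split_formula("y ~ x_1|id+time|0|(x_2~x_3+x_4)")
```

The resulting dictionary uses the same keys as the keyword arguments in the second grammar, which is why the two call styles can describe the same model.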
The fixedeffect function returns fixed effect model estimation results. Define y as the dependent variable and x_1 as the independent variable; id and time are the cross-sectional ID and the time ID. The following code estimates a two-way fixed effect model with two-way clustered standard errors:
formula = 'y ~ x_1|id+time|id+time|0'
model_fe = fixedeffect(data_df = df,
formula = formula,
no_print=True)
result = model_fe.fit()
result.summary()
or
exog_x = ['x_1']
y = ['y']
category = ['id','time']
cluster = ['id','time']
model_fe = fixedeffect(data_df = df,
dependent = y,
exog_x = exog_x,
category = category,
cluster = cluster)
result = model_fe.fit()
result.summary()
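As background on what a two-way fixed effect estimate is, the sketch below removes unit and time effects by demeaning and then runs OLS on the transformed data. It assumes a balanced panel, where the simple formula "subtract unit mean, subtract time mean, add overall mean" is exact; all names and numbers are illustrative, and the package may use a different algorithm for high dimensional fixed effects.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 100, 10
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
alpha = rng.normal(size=n_units)        # unit fixed effects
gamma = rng.normal(size=n_periods)      # time fixed effects

# The regressor is correlated with both sets of effects.
x = alpha[unit] + gamma[time] + rng.normal(size=unit.size)
y = 2.0 * x + 3.0 * alpha[unit] - 1.0 * gamma[time] + rng.normal(size=unit.size)

def group_mean(v, g):
    """Mean of v within each group g, broadcast back to the rows."""
    return (np.bincount(g, weights=v) / np.bincount(g))[g]

def demean_two_way(v, unit, time):
    # Exact within transformation for a balanced panel.
    return v - group_mean(v, unit) - group_mean(v, time) + v.mean()

x_dm = demean_two_way(x, unit, time)
y_dm = demean_two_way(y, unit, time)
beta_fe = (x_dm @ y_dm) / (x_dm @ x_dm)   # OLS slope on demeaned data
```

After the transformation the demeaned regressor sums to zero within every unit and every period, and the slope is close to the true value of 2 used in the simulation.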
DID is simply a specific type of fixed effect model. We provide a did function to simplify the estimation process. The regular DID estimation is achieved using the following command:
formula = 'y ~ 0|0|0|0'
model_did = did(data_df = df,
formula = formula,
treatment = ['treatment'],
csid = ['id'],
tsid = ['time'],
exp_date = 2)
result = model_did.fit()
result.summary()
"exp_date" is the first date of the experiment, and "treatment" is the column name of the treatment variable. This command estimates the equation below:
We also provide DID with individual effects:
formula = 'y ~ 0|0|0|0'
model_did = did(data_df = df,
formula = formula,
treatment = ['treatment'],
group_effect='individual',
csid = ['id'],
tsid = ['time'],
exp_date = 2)
result = model_did.fit()
result.summary()
The command above estimates the equation below:
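As a sketch of the general shape of the two DID regressions, written in the standard difference-in-differences form: the notation is assumed rather than taken from the package, with post_t equal to 1 from "exp_date" onward, c_i an individual effect, and alpha_t a time effect. The package's exact specification may differ in detail.

```latex
\begin{aligned}
\text{regular DID:}\quad y_{it} &= \beta_0 + \beta_1\,\mathrm{treatment}_i \cdot \mathrm{post}_t + \beta_2\,\mathrm{treatment}_i + \beta_3\,\mathrm{post}_t + e_{it} \\
\text{individual effects:}\quad y_{it} &= \beta_0 + \beta_1\,\mathrm{treatment}_i \cdot \mathrm{post}_t + c_i + \alpha_t + e_{it}
\end{aligned}
```

In both forms the coefficient on the interaction term, beta_1, is the treatment effect of interest.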
The main functions you can call are:
Function name | Description | Usage |
---|---|---|
fixedeffect | define class for fixed effect estimation | fixedeffect (data_df = None, dependent = None, exog_x = None, category = None, cluster = None, formula = None, robust = False, noint = False, c_method = 'cgm', psdef = True) |
iv2sls | define class for 2sls estimation | iv2sls (data_df = None, dependent = None, exog_x = None, endog_x = None, iv = None, category = None, cluster = None, formula = None, robust = False, noint = False) |
ivgmm | define class for gmm estimation | ivgmm (data_df = None, dependent = None, exog_x = None, endog_x = None, iv = None, category = None, cluster = None, formula = None, robust = False, noint = False) |
did | define class for did estimation | did (data_df = None, dependent = None, exog_x = None, treatment = None, csid = None, tsid = None, exp_date = None, group_effect = 'treatment', cluster = None, formula = None, robust = False, noint = False, c_method = 'cgm', psdef = True) |
model.fit | fit pre-defined models | result = model.fit() |
result.summary | print a summary of the result object | result.summary() |
fit_multi_model | fit multiple models | models = [model,model_did,model_iv2sls], fit_multi_model (models) |
getfe | get fixed effects | getfe(result) |
ivtest | get iv post estimation tests results | ivtest (result) |
Provide results for a fixed effect model:
model = fixedeffect (data_df = None, dependent = None, exog_x = None, category = None, cluster = None, formula = None, robust = False, noint = False, c_method = 'cgm', psdef = True)
Input parameters | Type | Description |
---|---|---|
data_df | pandas dataframe | Dataframe with relevant data. |
dependent | list | List object of dependent variables |
exog_x | list | List object of independent variables |
category | list, default [] | List object of category variables, i.e., fixed effects |
cluster | list, default [] | List object of cluster variables, i.e., the cluster level of your standard errors |
formula | string, default None | Formula used to parse grammar. |
robust | bool, default False | Whether or not to calculate df-adjusted White standard errors (HC1) |
noint | bool, default False | Whether or not to suppress the intercept |
c_method | str, default 'cgm' | Method to calculate multi-way cluster standard errors. Possible choices are 'cgm' and 'cgm2'. |
psdef | bool, default True | If True, replace negative eigenvalues of the variance matrix with 0 (only for multi-way cluster variance) |
Return an object of results:
Attribute | Description |
---|---|
params | Estimated coefficients |
df | Degrees of freedom |
bse | Standard errors |
variance_matrix | Variance-covariance matrix of the coefficients |
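To show how the `robust` option relates to these attributes, here is a minimal NumPy sketch of OLS with the df-adjusted White (HC1) covariance. The function `ols_hc1` is a hypothetical helper, not the package's code, and it ignores fixed effects; the variable names simply mirror the attribute names in the table.

```python
import numpy as np

def ols_hc1(X, y):
    """OLS coefficients with HC1 covariance (hypothetical helper)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    meat = (X * resid[:, None] ** 2).T @ X             # X' diag(e^2) X
    cov = n / (n - k) * xtx_inv @ meat @ xtx_inv       # n/(n-k) is the HC1 df adjustment
    return beta, cov

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# Heteroskedastic noise: the error variance grows with |x|.
y = 1.0 + 0.5 * x + rng.normal(size=n) * (1.0 + np.abs(x))

params, variance_matrix = ols_hc1(X, y)
bse = np.sqrt(np.diag(variance_matrix))                # standard errors
```

The standard errors are the square roots of the diagonal of the variance-covariance matrix, which is how `bse` and `variance_matrix` relate to each other.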
model = iv2sls (data_df = None, dependent = None, exog_x = None, endog_x = None, iv = None, category = None, cluster = None, formula = None, robust = False, noint = False)
model = ivgmm (data_df = None, dependent = None, exog_x = None, endog_x = None, iv = None, category = None, cluster = None, formula = None, robust = False, noint = False)
Input parameters | Type | Description |
---|---|---|
data_df | pandas dataframe | Dataframe with relevant data. |
dependent | list | List object of dependent variables |
exog_x | list | List object of exogenous variables |
endog_x | list | List object of endogenous variables |
iv | list | List object of instrumental variables |
category | list, default [] | List object of category variables, i.e., fixed effects |
formula | string, default None | Formula used to parse grammar. |
robust | bool, default False | Whether or not to calculate df-adjusted White standard errors (HC1) |
noint | bool, default False | Whether or not to suppress the intercept |
Return the same object of results as fixedeffect does.
We also provide a two-step GMM estimator if you set the option "gmm2=True". The estimators and their variance-covariance matrices are organized as follows:
- "ivgmm": the one-step GMM estimator.
- "ivgmm" with "gmm2=True": the two-step GMM estimator, with three variance-covariance options:
  - Unadjusted.
  - Heteroskedasticity robust, based on the diagonal matrix generated using the residuals from the two-step GMM.
  - Cluster.
model = did (data_df = None, dependent = None, exog_x = None, treatment = None, csid = None, tsid = None, exp_date = None, group_effect = 'treatment', cluster = None, formula = None, robust = False, noint = False, c_method = 'cgm', psdef = True)
Input parameters | Type | Description |
---|---|---|
data_df | pandas dataframe | Dataframe with relevant data. |
dependent | list | List object of dependent variables |
exog_x | list | List object of independent variables |
treatment | list | List object of treatment variables |
csid | list | List object of cross sectional id variables |
tsid | list | List object of time variables |
exp_date | string | Experiment start date |
group_effect | string, default 'treatment' | Either equals 'treatment' or 'individual' |
cluster | list, default [] | List object of cluster variables, i.e., the cluster level of your standard errors |
formula | string, default None | Formula used to parse grammar. |
robust | bool, default False | Whether or not to calculate df-adjusted White standard errors (HC1) |
noint | bool, default False | Whether or not to suppress the intercept |
c_method | str, default 'cgm' | Method to calculate multi-way cluster standard errors. Possible choices are 'cgm' and 'cgm2'. |
psdef | bool, default True | If True, replace negative eigenvalues of the variance matrix with 0 (only for multi-way cluster variance) |
Return the same object of results as fixedeffect does.
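The `psdef` option is described as replacing negative eigenvalues of the variance matrix with 0. A minimal sketch of that idea is shown below; `make_psd` is a hypothetical helper and the package's actual routine may differ. Multi-way cluster variance estimators add and subtract one-way cluster matrices, so the result is not guaranteed to be positive semi-definite, which is presumably why the option exists.

```python
import numpy as np

def make_psd(cov):
    """Clip negative eigenvalues of a symmetric matrix to zero (hypothetical helper)."""
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition for symmetric matrices
    clipped = np.clip(eigvals, 0.0, None)       # negative eigenvalues become 0
    return eigvecs @ np.diag(clipped) @ eigvecs.T

# A symmetric matrix with eigenvalues 3 and -1, so it is not a valid covariance.
cov = np.array([[1.0, 2.0],
                [2.0, 1.0]])
cov_psd = make_psd(cov)
```

For this input only the eigenvalue 3 survives, so every entry of the repaired matrix equals 1.5.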
The fit_multi_model function fits several models on one dataframe and collects their results. When analyzing large and complicated datasets, we usually have several candidate model assumptions; this function makes it easy to compare the results of the different models.
Input parameters | Type | Description |
---|---|---|
data_df | pandas dataframe | Dataframe with relevant data |
models | list, default [] | List of models |
table_header | str, default None | Title of summary table |
Return a summary table of results of the different models.
The getfe function retrieves the estimated fixed effects.
Input parameters | Type | Description |
---|---|---|
result | object | output object of fixedeffect function |
epsilon | double, default 1e-8 | tolerance for projection |
normalize | bool, default False | Whether or not to normalize fixed effects. |
category_input | list, default [] | List of category variables to calculate fixed effect. |
Returns a summary table of the fixed effect estimates and their standard errors.
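To illustrate what "getting the fixed effects" means, the sketch below recovers unit effects in a one-way model as the unit means of y minus the fitted slope times x. This is a simplified illustration with assumed names and simulated data; it is not how `getfe` is implemented, and it does not cover multiple fixed effects, normalization, or standard errors.

```python
import numpy as np

rng = np.random.default_rng(3)
n_units, n_periods = 50, 20
unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(size=n_units)                     # true unit fixed effects
x = alpha[unit] + rng.normal(size=unit.size)
y = 1.5 * x + alpha[unit] + 0.1 * rng.normal(size=unit.size)

def group_mean(v, g):
    """One mean per group."""
    return np.bincount(g, weights=v) / np.bincount(g)

# Step 1: estimate the slope on unit-demeaned data (within estimator).
x_dm = x - group_mean(x, unit)[unit]
y_dm = y - group_mean(y, unit)[unit]
beta_hat = (x_dm @ y_dm) / (x_dm @ x_dm)

# Step 2: the fixed effect of each unit is the unit mean of what the slope leaves unexplained.
alpha_hat = group_mean(y - beta_hat * x, unit)
```

With 20 periods per unit and little noise, the recovered effects track the simulated ones closely.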
The ivtest function returns the IV test results.
Input parameters | Type | Description |
---|---|---|
result | object | output object of ivgmm/iv2sls function |
Returns a table of IV test results.
# need to install from kuaishou product base
import numpy as np
import pandas as pd
from fixedeffect.iv import iv2sls, ivgmm, ivtest
from fixedeffect.fe import fixedeffect, did, getfe
from fixedeffect.utils.panel_dgp import gen_data
N = 100
T = 10
beta = [-3,1,2,3,4]
ate = 1
exp_date = 5
#generate sample data
df = gen_data(N, T, beta, ate, exp_date)
#------------------------------#
#define instrumental variable model
# iv2sls
formula = 'y ~ x_1|id+time|0|(x_2~x_3+x_4)'
model_iv2sls = iv2sls(data_df = df,
formula = formula)
result = model_iv2sls.fit()
result.summary()
# ivgmm
formula = 'y ~ x_1|id|0|(x_2~x_3+x_4)'
model_ivgmm = ivgmm(data_df = df,
formula = formula)
result = model_ivgmm.fit()
result.summary()
# obtain iv test results
ivtest(result)
#------------------------------#
#define fixed effect model
exog_x = ['x_1']
y = ['y']
category = ['id','time']
cluster = ['id','time']
model_fe = fixedeffect(data_df = df,
dependent = y,
exog_x = exog_x,
category = category,
cluster = cluster)
result = model_fe.fit()
result.summary()
#obtain fixed effect
getfe(result)
#------------------------------#
#define DID model
formula = 'y ~ 0|0|0|0'
model_did = did(data_df = df,
formula = formula,
treatment = ['treatment'],
csid = ['id'],
tsid = ['time'],
exp_date=2)
result = model_did.fit()
result.summary()
- Python 3.6+
- Pandas and its dependencies (Numpy, etc.)
- Scipy and its dependencies
- statsmodels and its dependencies
- networkx
If you use FixedEffectModel in your research, please cite us as follows:
Kuaishou DA Ecology. FixedEffectModel: A Python Package for Linear Model with High Dimensional Fixed Effects. https://github.com/ksecology/FixedEffectModel, 2020. Version 0.x
BibTex:
@misc{FixedEffectModel,
author={Kuaishou DA Ecology},
title={{FixedEffectModel: A Python Package for Linear Model with High Dimensional Fixed Effects}},
howpublished={https://github.com/ksecology/FixedEffectModel},
note={Version 0.x},
year={2020}
}
This package welcomes feedback. If you have any additional questions or comments, please contact da_ecology@kuaishou.com.
[1] Simen Gaure. lfe: Linear Group Fixed Effects. R package, version 2.8-5.1, 2019. URL: https://www.rdocumentation.org/packages/lfe/versions/2.8-5.1
[2] A Colin Cameron and Douglas L Miller. A practitioner’s guide to cluster-robust inference. Journal of human resources, 50(2):317–372, 2015.
[3] Simen Gaure. Ols with multiple high dimensional category variables. Computational Statistics & Data Analysis, 66:8–18, 2013.
[4] Douglas L Miller, A Colin Cameron, and Jonah Gelbach. Robust inference with multi-way clustering. Technical report, Working Paper, 2009.
[5] Jeffrey M Wooldridge. Econometric analysis of cross section and panel data. MIT press, 2010.