Clustered standard errors in `linearmodels` and `statsmodels`
spring-haru opened this issue · 3 comments
spring-haru commented
Clustered standard errors calculated in linearmodels differ from those in statsmodels. I am wondering what makes them differ.
Consider the following code:
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.panel import PooledOLS
from linearmodels.datasets import wage_panel
# data
df = wage_panel.load()
df['lwage10000'] = 10000*df['lwage'] # (1) see below
dfm = df.set_index(['nr','year'])
# statsmodels ---------------------
mod_sm = smf.ols('lwage10000 ~ hours', data=df)
res_sm = mod_sm.fit(cov_type='cluster',
                    cov_kwds={'groups': df['nr']},
                    use_t=True)
res_sm.bse[1]
# 0.27068014199116713
# linearmodels --------------------
mod_lm = PooledOLS.from_formula('lwage10000 ~ 1 + hours', data=dfm)
res_lm = mod_lm.fit(cov_type='clustered',
                    cluster_entity=True)
res_lm.std_errors[1]
# 0.27046271571316427
Line (1) rescales the dependent variable so that the standard errors are deliberately large.
The two standard errors differ, though only slightly. I expected them to be identical. If that expectation is wrong, what am I missing?
bashtage commented
I suspect that it is a small sample correction. The math for the correction is here.
You should try computing the ratio
r = res_lm.std_errors[1] / res_sm.bse[1]
print(r**2)
print(1/(r**2))
Then see whether one of these looks like G/(G-1), where G is the number of clusters.
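The suggested check can be run directly on the two standard errors reported in the first post, without refitting either model. The sketch below is only arithmetic on those numbers. It assumes the panel has G = 545 entities and N = 4360 observations (545 entities observed for 8 years); neither figure is stated in the thread, so confirm them with `df['nr'].nunique()` and `len(df)`.

```python
# Standard errors copied from the first post in this thread.
se_sm = 0.27068014199116713  # statsmodels, cov_type='cluster'
se_lm = 0.27046271571316427  # linearmodels, cov_type='clustered' (defaults)

# Assumptions, not stated in the thread: sizes of the wage_panel data.
G = 545    # assumed number of clusters (unique values of nr)
N = 4360   # assumed number of observations (545 entities x 8 years)

# Ratio of the variances (squared standard errors), larger over smaller.
var_ratio = (se_sm / se_lm) ** 2

# Candidate small-sample factors to compare against.
g_factor = G / (G - 1)
g_n_factor = g_factor * (N - 1) / N

print(f"variance ratio      : {var_ratio:.10f}")
print(f"G/(G-1)             : {g_factor:.10f}")
print(f"G/(G-1) * (N-1)/N   : {g_n_factor:.10f}")
```

Under these assumed sizes the variance ratio is close to G/(G-1) but agrees more tightly with G/(G-1) multiplied by (N-1)/N. That is a numerical observation from the posted values, not a statement about how either library implements its correction.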
spring-haru commented
Thanks for the hint. The following code generates the same result as in statsmodels:
# linearmodels ------------------------------
mod_lm = PooledOLS.from_formula('lwage10000 ~ 1 + hours', data=dfm)
res_lm = mod_lm.fit(cov_type='clustered',
                    cluster_entity=True,
                    group_debias=True) # this line is added
res_lm.std_errors[1]
# 0.2706801419911707
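To illustrate why a finite-sample factor moves the standard errors by such a small amount, here is a minimal NumPy sketch of a one-way cluster-robust sandwich estimator on synthetic data. The function `cluster_se` and the factor G/(G-1) * (N-1)/(N-K) are illustrative assumptions (a commonly used correction). The thread does not say which exact factors `statsmodels` or `linearmodels` apply, so this is not a reproduction of either library's code.

```python
import numpy as np

def cluster_se(X, y, groups, small_sample=False):
    """Hypothetical helper: OLS standard errors from a one-way cluster sandwich."""
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    bread = np.linalg.inv(X.T @ X)
    labels = np.unique(groups)
    # Meat: sum over clusters of the outer product of the cluster score X_g' e_g.
    meat = np.zeros((k, k))
    for g in labels:
        mask = groups == g
        score = X[mask].T @ resid[mask]
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    if small_sample:
        # Assumed correction: G/(G-1) * (N-1)/(N-K).
        n_groups = labels.size
        cov = cov * (n_groups / (n_groups - 1)) * ((n - 1) / (n - k))
    return np.sqrt(np.diag(cov))

# Synthetic panel: 50 clusters, 8 observations each, with a shared cluster shock.
rng = np.random.default_rng(0)
G, T = 50, 8
groups = np.repeat(np.arange(G), T)
x = rng.normal(size=G * T)
u = rng.normal(size=G)[groups] + rng.normal(size=G * T)
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(G * T), x])

se_plain = cluster_se(X, y, groups)
se_adj = cluster_se(X, y, groups, small_sample=True)

n, k = X.shape
factor = (G / (G - 1)) * ((n - 1) / (n - k))
print(se_adj / se_plain)   # same ratio for every coefficient
print(np.sqrt(factor))     # square root of the variance scaling factor
```

Because the correction multiplies the whole covariance matrix by a scalar, every standard error scales by the square root of that scalar. With hundreds of clusters the factor is very close to 1, which is consistent with the small gap reported above.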
bashtage commented
Closing as answered.