bashtage/linearmodels

Clustered standard errors in `linearmodels` and `statsmodels`

spring-haru opened this issue · 3 comments

Clustered standard errors calculated in linearmodels differ from those in statsmodels. I am wondering what makes them differ.

Consider the following code:

import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.panel import PooledOLS
from linearmodels.datasets import wage_panel

# data
df = wage_panel.load()
df['lwage10000'] = 10000*df['lwage']    # (1) see below
dfm = df.set_index(['nr','year'])

# statsmodels ---------------------
mod_sm = smf.ols('lwage10000 ~ hours', data=df)
res_sm = mod_sm.fit(cov_type='cluster',
                    cov_kwds={'groups': df['nr']},
                    use_t=True)
res_sm.bse[1]
# 0.27068014199116713

# linearmodels --------------------
mod_lm = PooledOLS.from_formula('lwage10000 ~ 1 + hours', data=dfm)
res_lm = mod_lm.fit(cov_type='clustered',
                    cluster_entity=True)
res_lm.std_errors[1]
# 0.27046271571316427

Line (1) rescales the outcome to make the standard errors deliberately large.

The difference in the standard errors is small, but I expected them to give exactly the same value. If I am wrong, what am I missing?

I suspect it is a small-sample correction. The math for the correction is documented here:

https://bashtage.github.io/linearmodels/panel/mathematical-formula.html#clustered-covariance-estimator

You should try computing the ratio

r = res_lm.std_errors[1] / res_sm.bse[1]
print(r**2)
print(1/(r**2))

And see if one of these looks like G/(G-1), where G is the number of clusters.

Thanks for the hint. The following code generates the same result as in statsmodels:

# linearmodels ------------------------------
mod_lm = PooledOLS.from_formula('lwage10000 ~ 1 + hours', data=dfm)
res_lm = mod_lm.fit(cov_type='clustered',
                    cluster_entity=True,
                    group_debias=True)     # this line is added
res_lm.std_errors[1]
# 0.2706801419911707

Closing as answered.