PanelOLS: Produced different Std. errors from Stata when clustered by the same variable in linearmodels
yuz0101 opened this issue · 15 comments
Hi there, it seems that we can only use one cov_type. While comparing results from a model clustered by a variable between linearmodels and Stata, I found that Stata's std. errors are robust-adjusted while linearmodels' are not (all coefficients are the same).
Please may I know how I can get the same results as Stata by using linearmodels?
You need to change the cov_type in a call to fit.
Hi bashtage, thanks for your quick reply. I followed that. For example:
mod = PanelOLS.from_formula(formula, data)
reg = mod.fit(cov_type='clustered', clusters=data['var'])
The standard errors from reg are not robust-adjusted. If I change it to reg = mod.fit(cov_type='robust'), then the results are no longer clustered by var.
Could you please help me with it? How can I get results that are both robust and clustered? I appreciate your time.
What is formula? The definition of robust depends on whether entity effects are included. Clustered std errors are robust to heteroskedasticity.
Thanks for that. y ~ 1 + a*b*c + a + b + c is the formula I'm using. There are no entity effects in my regression model. : )
What Stata command are you using?
gen abc = a*b*c
reg y abc a b c, vce(cluster var)
What you have looks correct to me
mod = PanelOLS.from_formula(formula, data)
reg = mod.fit(cov_type='clustered', clusters=data['var'])
You could also use
mod = PooledOLS.from_formula(formula, data)
reg = mod.fit(cov_type='clustered', clusters=data['var'])
and the results should be the same.
Thank you Kevin, for all your efforts.
I still got different results. I noticed that the std. errors from Stata are robust (please have a look at the screenshot below). Sorry the data below are messy; they are shown in different orders. Would you please look at the last line of pic 1 (black background) and the first line of pic 2 (white background)? They are the same variable.
What happens when you take the ratio of the parameter variance from Stata to that from linearmodels? Stata has a lot of magic small-sample adjustments it makes. If this ratio is the same for all parameters, this indicates that it is a scalar adjustment.
I did a quick check and this looks to be the issue. What does changing the value of debiased do?
I just did a check, and most of the std. errors from Stata are about 70% of the std. errors from linearmodels. As for debiased, I checked and found it did not affect the results.
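The ratio check suggested above can be run like this (the numbers here are placeholders, since the actual outputs in this thread are in screenshots): if the ratio of the two sets of standard errors is constant across parameters, the difference is a scalar small-sample adjustment.

```python
import numpy as np

# Hypothetical standard errors from the two tools, aligned by parameter
se_stata = np.array([0.021, 0.035, 0.014])
se_linearmodels = np.array([0.030, 0.050, 0.020])

# A constant ratio across all parameters points to a scalar adjustment
ratios = se_stata / se_linearmodels
if np.allclose(ratios, ratios[0]):
    print(f"constant ratio {ratios[0]:.3f}: likely a scalar adjustment")
```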
Hi Kevin, you are right. Massive thanks for your time. The std. errors in my model are clustered by only one dummy variable. However, I tried that with xtreg and found that it produced errors, mainly due to the usage of clusters. I then tried clustering by the id, which worked in both linearmodels and Stata (using reg), and they produced the same results. However, the same setting with xtreg in Stata produces different results from both reg and linearmodels.
Q: Is it reasonable to cluster the std. errors by a dummy variable rather than by a variable consisting of a number of clusters?
Hi Kevin, adding my issue here because it is most likely related. By the way, thanks for all your work on linearmodels (and the awesome documentation!), it was the missing piece to my Python workflow.
I'm preparing notes for my students in which I was hoping to replicate Mitch Petersen's results using his sample dataset, which he obtained with Stata using the following command:
regress dependent_variable independent_variables, robust cluster(cluster_variable)
For reference, his Stata code is here, and the sample data and associated estimations are here.
His sample dataset contains 500 firms but only 10 years, so that dimension is small. I'm able to replicate his results using statsmodels, but not with linearmodels. More specifically, I get the same results when clustering by firm, but not when clustering by year or by both firm and year. Are you aware of a small-sample adjustment that is implemented in statsmodels but not in linearmodels? debiased does not affect my results.
Here is the code to replicate the difference:
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels import PanelOLS
df = pd.read_table(
"http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.txt",
names=["firmid", "year", "x", "y"],
delim_whitespace=True,
)
smf.ols(formula="y ~ x", data=df).fit(
cov_type="cluster", cov_kwds={"groups": df["year"]}, use_t=True
).summary()
With statsmodels:
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.208
Model: OLS Adj. R-squared: 0.208
Method: Least Squares F-statistic: 960.6
Date: Fri, 19 Jan 2024 Prob (F-statistic): 1.86e-10
Time: 06:51:42 Log-Likelihood: -10573.
No. Observations: 5000 AIC: 2.115e+04
Df Residuals: 4998 BIC: 2.116e+04
Df Model: 1
Covariance Type: cluster
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.0297 0.023 1.269 0.236 -0.023 0.083
x 1.0348 0.033 30.993 0.000 0.959 1.110
==============================================================================
Omnibus: 4.912 Durbin-Watson: 1.096
Prob(Omnibus): 0.086 Jarque-Bera (JB): 4.862
Skew: 0.070 Prob(JB): 0.0880
Kurtosis: 3.063 Cond. No. 1.01
==============================================================================
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
df = df.set_index(["firmid", "year"])
mod = PanelOLS.from_formula("y ~ 1 + x", df)
res = mod.fit(cov_type="clustered", cluster_entity=False, cluster_time=True)
res.summary
With linearmodels:
PanelOLS Estimation Summary
================================================================================
Dep. Variable: y R-squared: 0.2078
Estimator: PanelOLS R-squared (Between): 0.2208
No. Observations: 5000 R-squared (Within): 0.1907
Date: Fri, Jan 19 2024 R-squared (Overall): 0.2078
Time: 06:51:42 Log-likelihood -1.057e+04
Cov. Estimator: Clustered
F-statistic: 1310.7
Entities: 500 P-value 0.0000
Avg Obs: 10.0000 Distribution: F(1,4998)
Min Obs: 10.0000
Max Obs: 10.0000 F-statistic (robust): 1067.1
P-value 0.0000
Time periods: 10 Distribution: F(1,4998)
Avg Obs: 500.00
Min Obs: 500.00
Max Obs: 500.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Intercept 0.0297 0.0222 1.3376 0.1811 -0.0138 0.0732
x 1.0348 0.0317 32.667 0.0000 0.9727 1.0969
==============================================================================
You probably need group_debias=True.
mod = PanelOLS.from_formula("y ~ 1 + x", df)
res = mod.fit(cov_type="clustered", cluster_entity=False, cluster_time=True, group_debias=True)
res.summary
PanelOLS Estimation Summary
================================================================================
Dep. Variable: y R-squared: 0.2078
Estimator: PanelOLS R-squared (Between): 0.2208
No. Observations: 5000 R-squared (Within): 0.1907
Date: Fri, Jan 19 2024 R-squared (Overall): 0.2078
Time: 12:00:52 Log-likelihood -1.057e+04
Cov. Estimator: Clustered
F-statistic: 1310.7
Entities: 500 P-value 0.0000
Avg Obs: 10.0000 Distribution: F(1,4998)
Min Obs: 10.0000
Max Obs: 10.0000 F-statistic (robust): 960.59
P-value 0.0000
Time periods: 10 Distribution: F(1,4998)
Avg Obs: 500.00
Min Obs: 500.00
Max Obs: 500.00
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
Intercept 0.0297 0.0234 1.2691 0.2045 -0.0162 0.0755
x 1.0348 0.0334 30.993 0.0000 0.9694 1.1003
==============================================================================
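For what it's worth, the adjusted numbers above are consistent with the usual cluster small-sample correction. A sketch, assuming group_debias applies the factor G/(G-1) * (N-1)/(N-k) to the variance (inferred from the numbers in this thread, not from documented internals):

```python
import math

# G = number of clusters (10 years), N = observations, k = parameters
G, N, k = 10, 5000, 2

# Square root because the factor scales the variance, not the std. error
factor = math.sqrt(G / (G - 1) * (N - 1) / (N - k))

# Rescale the undebiased std. errors from the first linearmodels run
print(round(0.0317 * factor, 4))  # x: matches the group_debias output 0.0334
print(round(0.0222 * factor, 4))  # Intercept: matches 0.0234
```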
Thanks, that's what I was missing! I get the same results now.