Fixed-effect `p-value` with clustered covariance differs from Stata
spring-haru opened this issue · 2 comments
The p-value for a fixed-effects estimation with clustered covariance differs from Stata's (there is no problem with unadjusted covariance).
First, consider the case of unadjusted covariance to show that no problem arises.
```python
from linearmodels import PanelOLS
from linearmodels.datasets import jobtraining

data = jobtraining.load()
mi_data = data.set_index(['fcode', 'year']).dropna(subset=['lscrap', 'hrsemp'])

formula = 'lscrap ~ 1 + hrsemp + EntityEffects'
mod = PanelOLS.from_formula(formula, data=mi_data)
res = mod.fit()
print(res.summary.tables[1])
# the result is not shown here
```
Turning to Stata,
```stata
import delimited jobtraining.csv
xtset fcode year
xtreg lscrap hrsemp, fe
```
```
           hrsemp      _cons
b      -.00541856  .49817637
se      .00240525  .06087802
t      -2.2528065    8.18319
pvalue  .02667494  1.597e-12
```
To confirm that the linearmodels values match those of Stata, take the `hrsemp` estimates/statistics from both and compute their ratios:
```python
print("--- hrsemp ------")
print(f"Parameter: {res.params['hrsemp']/(-.00541856)}")
print(f"Std. Err.: {res.std_errors['hrsemp']/(.00240525)}")
print(f"T-stat: {res.tstats['hrsemp']/(-2.2528065)}")
print(f"P-value: {res.pvalues['hrsemp']/(.02667494)}")
```
giving
```
--- hrsemp ------
Parameter: 0.9999991122293972
Std. Err.: 0.9999986473078816
T-stat: 0.9999999418819207
P-value: 1.000000205050997
```
As expected, the ratios are very close to 1, meaning the values match.
Second, consider the case of clustered covariance.
```python
resc = mod.fit(cov_type='clustered',
               cluster_entity=True,
               group_debias=True)
print(resc.summary.tables[1])
# the result is not shown here
```
The Stata result follows (note that `robust` means cluster-robust by design in Stata):
```stata
xtreg lscrap hrsemp, fe vce(robust)
mat list r(table)
```
```
           hrsemp      _cons
b      -.00541856  .49817637
se      .00192429  .02954154
t      -2.8158738  16.863589
pvalue  .00708766  1.452e-21
```
Again, calculate the ratios of the estimates/statistics:
```python
print("--- hrsemp ------")
print(f"Parameter: {resc.params['hrsemp']/(-.00541856)}")
print(f"Std. Err.: {resc.std_errors['hrsemp']/(.00192429)}")
print(f"T-stat: {resc.tstats['hrsemp']/(-2.8158738)}")
print(f"P-value: {resc.pvalues['hrsemp']/(.00708766)}")
```
giving
```
--- hrsemp ------
Parameter: 0.9999991122293972
Std. Err.: 0.9999995325700923
T-stat: 0.999999986667238
P-value: 0.8414508380471644
```
The first three ratios are close to 1, whereas the last one (the p-value) is not. This seems to suggest some problem in calculating the clustered p-value. Or am I missing something?
Stata is choosing to use a Student's t. Since there is no obvious justification aside from strong assumptions (Gaussian residuals), I do not follow that practice. If you use an appropriate Student's t to compute the p-value, you will get the same value.
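This point can be checked with a short scipy sketch (illustrative only, not linearmodels internals): for a large degrees-of-freedom value, the Student's t p-value is nearly identical to the normal-based one, so the distribution family alone cannot produce a sizeable gap — what matters is which degrees of freedom you plug in.

```python
from scipy.stats import t, norm

def two_sided_pvalue(tstat, df):
    """Two-sided p-value from a Student's t with the given degrees of freedom."""
    return 2 * t.sf(abs(tstat), df)

# Clustered t-statistic reported in the example above
tstat = -2.8158738

# With df large, the t-based p-value converges to the normal-based one
print(two_sided_pvalue(tstat, 10_000))
print(2 * norm.sf(abs(tstat)))
```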
It does not seem to be a matter of Student's t vs. Gaussian residuals. Rather, the difference appears to be caused by the degrees of freedom. Define the following:

- `N`: number of observations
- `n`: number of entities
- `k`: number of parameters
In the example above, linearmodels uses a Student's t with `df_resid` (which is `N - n - k`) degrees of freedom, whereas Stata uses `n - 1` (not `n - k`; see below). This is numerically confirmed by the following code, reusing `resc` from above:
```python
import numpy as np
from scipy.stats import t

# two-sided p-value (the t-statistic is negative here, so the cdf gives the lower tail)
pval_manual_calc = 2*t.cdf(resc.tstats['hrsemp'], resc.df_resid)
pval_auto_calc = resc.pvalues['hrsemp']
np.isclose(pval_auto_calc, pval_manual_calc)
```
which gives `True`.
To reproduce the Stata result:
```python
n_1 = int(resc.entity_info['total'] - 1)
2*t.cdf(resc.tstats['hrsemp'], n_1)
```
giving `0.007087663950431981`, which is the value Stata returns.
See the following links for discussion of Stata's use of `n - 1` (rather than `n - k`).