bashtage/linearmodels

Fixed-effect `p-value` with clustered covariance differs from Stata

spring-haru opened this issue · 2 comments

The p-value for a fixed-effects estimation with clustered covariance differs from Stata's (there is no problem with unadjusted covariance).

First, consider the case of unadjusted covariance to show that no problem arises.

from linearmodels import PanelOLS
from linearmodels.datasets import jobtraining

data = jobtraining.load()
mi_data = data.set_index(['fcode', 'year']).dropna(subset=['lscrap','hrsemp'])

formula = 'lscrap ~ 1 + hrsemp + EntityEffects'
mod = PanelOLS.from_formula(formula, data=mi_data)
res = mod.fit()
print(res.summary.tables[1])
# the result not shown here

Turning to Stata,

import delimited jobtraining.csv
xtset fcode year
xtreg lscrap hrsemp, fe
mat list r(table)

            hrsemp       _cons
     b  -.00541856   .49817637
    se   .00240525   .06087802
     t  -2.2528065     8.18319
pvalue   .02667494   1.597e-12

To confirm that the linearmodels values match those from Stata, take the hrsemp estimates/statistics from both and compute the corresponding ratios:

print("--- hrsemp ------")
print(f"Parameter: {res.params['hrsemp']/(-.00541856)}")
print(f"Std. Err.: {res.std_errors['hrsemp']/(.00240525)}")
print(f"T-stat: {res.tstats['hrsemp']/(-2.2528065)}")
print(f"P-value: {res.pvalues['hrsemp']/(.02667494)}")

giving

--- hrsemp ------
Parameter: 0.9999991122293972
Std. Err.: 0.9999986473078816
T-stat: 0.9999999418819207
P-value: 1.000000205050997

As expected, the ratios are all very close to 1, meaning the values match.

Second, consider the case of clustered covariance.

resc = mod.fit(cov_type='clustered',
               cluster_entity=True,
               group_debias=True)
print(resc.summary.tables[1])
# the result not shown here

The Stata result is below (note that for xtreg, vce(robust) means cluster-robust by design in Stata):

xtreg lscrap hrsemp, fe vce(robust)
mat list r(table)

            hrsemp       _cons
     b  -.00541856   .49817637
    se   .00192429   .02954154
     t  -2.8158738   16.863589
pvalue   .00708766   1.452e-21

Again, calculate the ratios of the estimates/statistics:

print("--- hrsemp ------")
print(f"Parameter: {resc.params['hrsemp']/(-.00541856)}")
print(f"Std. Err.: {resc.std_errors['hrsemp']/(.00192429)}")
print(f"T-stat: {resc.tstats['hrsemp']/(-2.8158738)}")
print(f"P-value: {resc.pvalues['hrsemp']/(.00708766)}")

giving

--- hrsemp ------
Parameter: 0.9999991122293972
Std. Err.: 0.9999995325700923
T-stat: 0.999999986667238
P-value: 0.8414508380471644

The first three ratios are close to 1, whereas the last one (P-value) is not.

This seems to suggest a problem in calculating the clustered p-value. Or am I missing something?

Stata is choosing to use a Student's t. Since there is no obvious justification for this aside from strong assumptions (Gaussian residuals), I do not follow that practice. If you use an appropriate Student's t to compute the p-value, you will get the same value.
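
As an illustration of how much the choice of reference distribution matters here, a minimal sketch using only scipy and the clustered result resc from above; this is not the linearmodels internals, just two-sided p-values for the same t-statistic under different distributions:

from scipy.stats import norm, t

tstat = resc.tstats['hrsemp']
# two-sided p-values for the same t-statistic under different reference distributions
print(2 * norm.cdf(-abs(tstat)))              # standard normal
print(2 * t.cdf(-abs(tstat), resc.df_resid))  # Student's t with linearmodels' df_resid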

It does not seem to be a problem of Student's t vs Gaussian residuals. Rather, the difference seems to be caused by the degrees of freedom.

Define the following:
N: number of observations
n: number of entities
k: number of parameters

In the example above, linearmodels uses a Student's t with degrees of freedom df_resid, which is N - n - k, whereas Stata uses n - 1 (note: n - 1, not n - k; see below). This is confirmed numerically by the following code (resc from the code above is reused):

import numpy as np
from scipy.stats import t

# two-sided p-value computed by hand from the t-statistic and df_resid
pval_manual_calc = 2 * t.cdf(-np.abs(resc.tstats['hrsemp']), resc.df_resid)
pval_auto_calc = resc.pvalues['hrsemp']

np.isclose(pval_auto_calc, pval_manual_calc)

which gives True.
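
For reference, the quantities behind that df can be inspected directly on the result object (the attribute names below are standard linearmodels result attributes; this is only a quick check, not part of the original report):

# inspect the pieces defined above: N (observations), n (entities), k (parameters)
N = resc.nobs
n = int(resc.entity_info['total'])
k = resc.params.shape[0]
print(f"df_resid = {resc.df_resid}, N = {N}, n = {n}, k = {k}")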

To reproduce the Stata result:

n_1 = int(resc.entity_info['total']) - 1        # number of entities minus 1
2 * t.cdf(-np.abs(resc.tstats['hrsemp']), n_1)  # two-sided p-value with n - 1 df

giving 0.007087663950431981, which is the value Stata returns.

See the following links for discussion of Stata's use of n - 1 (rather than n - k):

https://www.statalist.org/forums/forum/general-stata-discussion/general/1479383-df-in-clustered-regression-linear-coefficient-test

https://www.statalist.org/forums/forum/general-stata-discussion/general/1576484-estimating-p-value-from-t-value-in-regression-with-cluster-robust-standard-errors
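
For anyone who wants to line linearmodels output up with Stata's xtreg, fe vce(robust) p-values across all coefficients, here is a small sketch along the lines of the calculation above (the helper name stata_style_pvalues is made up here; it simply applies the n - 1 degrees of freedom to every coefficient):

import numpy as np
import pandas as pd
from scipy.stats import t

def stata_style_pvalues(res):
    """Two-sided p-values using n - 1 degrees of freedom, where n is the
    number of entities (the df Stata uses for xtreg, fe with vce(robust))."""
    df = int(res.entity_info['total']) - 1
    return pd.Series(2 * t.cdf(-np.abs(res.tstats), df), index=res.tstats.index)

# compare with the default linearmodels p-values
print(pd.DataFrame({'linearmodels': resc.pvalues, 'stata df': stata_style_pvalues(resc)}))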