pydata/patsy

Allow DataFrame index levels to be used as variables and categorical variables

In panel regressions, fixed effects are very often implemented as categorical variables. Currently, short of a hacky workaround, patsy cannot read index levels as variables. Consider the following example panel dataset:

import statsmodels.api as sm
df_raw = sm.datasets.get_rdataset('pwt_sample', 'stevedata').data.set_index(['isocode', 'year']).drop(['country'], axis=1)
df = df_raw.dropna()
print(df)

The panel DataFrame looks like this:

                     pop        hc        rgdpna         rgdpo         rgdpe     labsh          avh         emp          rnna
isocode year                                                                                                                 
AUS     1950    8.354106  2.667302  1.274612e+05  1.141350e+05  1.219940e+05  0.680492  2170.923406    3.429873  6.399912e+05
        1951    8.599923  2.674344  1.307031e+05  1.105431e+05  1.139294e+05  0.680492  2150.846928    3.523916  6.901136e+05
        1952    8.782430  2.681403  1.253531e+05  1.088834e+05  1.112199e+05  0.680492  2130.956115    3.591675  7.045624e+05
        1953    8.950892  2.688482  1.389522e+05  1.226885e+05  1.233289e+05  0.680492  2111.249251    3.653409  7.331073e+05
        1954    9.159148  2.695580  1.500607e+05  1.318364e+05  1.314721e+05  0.680492  2091.724634    3.731083  7.714542e+05
...                  ...       ...           ...           ...           ...       ...          ...         ...           ...
USA     2015  320.878310  3.728116  1.877616e+07  1.878487e+07  1.890040e+07  0.595646  1770.023174  150.248474  6.505781e+07
        2016  323.015995  3.733411  1.909750e+07  1.909468e+07  1.928048e+07  0.593773  1766.744125  152.396957  6.597406e+07
        2017  325.084756  3.738714  1.954298e+07  1.954298e+07  1.975004e+07  0.596151  1763.726676  154.672318  6.694270e+07
        2018  327.096265  3.744024  2.012858e+07  2.015604e+07  2.036575e+07  0.594326  1774.703811  156.675903  6.800735e+07
        2019  329.064917  3.749341  2.056359e+07  2.059635e+07  2.085650e+07  0.597091  1765.346390  158.299591  6.905906e+07

We often need patsy to run a regression through from_formula, which uses patsy.dmatrices internally:

sm.OLS.from_formula('pop ~ rgdpna + year + C(isocode)', df_raw).fit().summary()

This raises an error:

PatsyError: Error evaluating factor: NameError: name 'isocode' is not defined
    pop ~ rgdpna + year + C(isocode)
                          ^^^^^^^^^^

Very often the panel dimensions live in the index levels, and users would like to use them in fixed effects and endog. Any chance patsy could support reading variables from the DataFrame index? Thanks.
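For reference, one common workaround is to flatten the index with pandas' `reset_index()` before handing the frame to patsy. The sketch below is only an illustration: it uses a small synthetic panel with made-up numbers (not the `pwt_sample` data) and calls `patsy.dmatrices` directly rather than going through statsmodels.

```python
import pandas as pd
import patsy

# Small synthetic panel (made-up numbers) with the same index layout as above.
df = pd.DataFrame(
    {
        "isocode": ["AUS", "AUS", "AUS", "USA", "USA", "USA"],
        "year": [1950, 1951, 1952, 1950, 1951, 1952],
        "pop": [8.4, 8.6, 8.8, 155.6, 158.2, 160.9],
        "rgdpna": [1.27, 1.31, 1.25, 22.9, 24.7, 25.7],
    }
).set_index(["isocode", "year"])

# reset_index() turns the index levels back into ordinary columns,
# so the formula can refer to 'isocode' and 'year' by name.
flat = df.reset_index()

y, X = patsy.dmatrices(
    "pop ~ rgdpna + year + C(isocode)", flat, return_type="dataframe"
)
print(X.columns.tolist())
```

The same flattened frame can be passed to `sm.OLS.from_formula` in place of the indexed one. The cost is an extra copy of the data and the loss of the MultiIndex on the design matrices.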

patsy is in maintenance (only) mode, and so this behavior is unlikely to change. Looking in the index is also potentially problematic since there might be named indices with the same names as columns. Enabling this could mean perfectly valid code under the existing rules becomes ambiguous.
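The name clash described above can be reproduced with pandas alone. The frame below is hypothetical and deliberately gives an index level the same name as a column. In that case a formula term like `year` would have two candidate sources if the index were also searched.

```python
import pandas as pd

# Hypothetical frame where the index and a column are both named "year".
df = pd.DataFrame(
    {"year": [0, 1, 2], "pop": [8.4, 8.6, 8.8]},
    index=pd.Index([1950, 1951, 1952], name="year"),
)

column_values = df["year"].tolist()                       # values of the column
index_values = df.index.get_level_values("year").tolist()  # values of the index level
print(column_values, index_values)

# pandas itself refuses to flatten this frame because of the clash.
try:
    df.reset_index()
    clash = False
except ValueError:
    clash = True
print(clash)
```

Today a formula using `year` on this frame resolves to the column. Any rule that also consulted the index would have to pick a winner between the two.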