pydata/patsy

Builtins for numpy recarrays

Opened this issue · 4 comments

icam0 commented

When using the patsy builtin identity function I() to add two features, a numpy recarray raises an error, while the equivalent pandas DataFrame works without a problem. Code to reproduce the error:

from patsy import dmatrix
import numpy as np
import pandas as pd
recarray = np.array([(1.0, 2), (3.0, 4)], dtype=[('x', float), ('y', int)])
df = pd.DataFrame.from_records(recarray)
result_df = dmatrix("I(x+y)-1",df)
result_rec = dmatrix("I(x+y)-1",recarray)

python 3.6.5
patsy 0.5.0
pandas 0.23.4
numpy 1.14.1

traceback:

Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/patsy/compat.py", line 36, in call_and_wrap_exc
    return f(*args, **kwargs)
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 166, in eval
    + self._namespaces))
  File "<string>", line 1, in <module>
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 48, in __getitem__
    return d[key]
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 48, in __getitem__
    return d[key]
ValueError: no field of name I
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydev_run_in_console.py", line 52, in run_file
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/daviddejong/PycharmProjects/quick_test/patsy_test.py", line 7, in <module>
    result_rec = dmatrix("I(x+y)-1",recarray)
  File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 291, in dmatrix
    NA_action, return_type)
  File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "/anaconda3/lib/python3.6/site-packages/patsy/build.py", line 696, in design_matrix_builders
    NA_action)
  File "/anaconda3/lib/python3.6/site-packages/patsy/build.py", line 443, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 566, in eval
    data)
  File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 551, in _eval
    inner_namespace=inner_namespace)
  File "/anaconda3/lib/python3.6/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
    exec("raise new_exc from e")
  File "<string>", line 1, in <module>
patsy.PatsyError: Error evaluating factor: ValueError: no field of name I
    I(x+y)-1
    ^^^^^^

Looks like the issue is that recarray objects don't actually follow the standard Python "mapping" interface, which is what patsy expects from the data object – in particular, trying to access an undefined field name raises ValueError, instead of KeyError, so patsy can't tell whether I is supposed to be a field name or what.

Is this a problem? Pandas dataframes are better than recarrays in pretty much every way...?
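
For anyone who needs to keep structured arrays, here is a minimal workaround sketch, assuming it is acceptable to hand patsy a plain dict instead of the array itself. The helper name `structured_to_mapping` is made up for illustration and is not part of patsy or numpy. A dict raises KeyError for unknown names, which is the behaviour described above as what patsy expects from the data object.

```python
import numpy as np

def structured_to_mapping(arr):
    """Return a plain dict mapping each field name to its column.

    Hypothetical helper for illustration; not part of patsy or numpy.
    """
    if arr.dtype.names is None:
        raise TypeError("expected a structured or record array")
    return {name: arr[name] for name in arr.dtype.names}

data = np.array([(1.0, 2), (3.0, 4)], dtype=[("x", float), ("y", int)])

# Record which exception a missing field raises on the installed numpy.
# The traceback in this issue shows ValueError (numpy 1.14.1).
try:
    data["I"]
except KeyError:
    missing_error = "KeyError"
except ValueError:
    missing_error = "ValueError"

# The dict follows the normal mapping protocol: unknown names raise KeyError.
columns = structured_to_mapping(data)
```

The resulting `columns` dict can then be passed as the data argument in place of the array, e.g. `dmatrix("I(x+y)-1", columns)`.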

icam0 commented

Apparently, it's raising a ValueError to maintain backwards compatibility as explained here. Maybe it's worthwhile to check the dtype before catching the exception. Also, the example is a structured array, not a record array. I think it's worthwhile to add this behaviour to make sure it works on numpy structured/record arrays because a lot of functionality fails otherwise. I raised this issue because the documentation states:

You may prefer to store your data in a pandas DataFrame, or a numpy record array… whatever makes you happy.

Personally, I was using this in an environment where a (design) choice was made to use record arrays instead of pandas dataframes.
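
To make the "check the dtype first" idea concrete, here is a rough sketch of a thin wrapper that looks up the name in `dtype.names` and raises KeyError itself, without copying the data. The class name `StructuredArrayMapping` is hypothetical, and this is not how patsy handles it internally; it only illustrates the suggestion.

```python
from collections.abc import Mapping

import numpy as np

class StructuredArrayMapping(Mapping):
    """Read-only mapping view over the fields of a structured/record array.

    Hypothetical wrapper for illustration; not part of patsy or numpy.
    """

    def __init__(self, arr):
        if arr.dtype.names is None:
            raise TypeError("expected a structured or record array")
        self._arr = arr

    def __getitem__(self, key):
        # Check the dtype first so a missing field raises KeyError,
        # as the mapping protocol requires.
        if key not in self._arr.dtype.names:
            raise KeyError(key)
        return self._arr[key]

    def __iter__(self):
        return iter(self._arr.dtype.names)

    def __len__(self):
        return len(self._arr.dtype.names)

data = np.array([(1.0, 2), (3.0, 4)], dtype=[("x", float), ("y", int)])
wrapped = StructuredArrayMapping(data)
```

Here `wrapped["x"]` returns the column, while `wrapped["I"]` raises KeyError, so a caller that catches KeyError can fall through to its other namespaces.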

I'm facing the same issue with this code:

>>> import statsmodels.formula.api as smf
>>> import numpy as np
>>> x = np.linspace(0.001, 5, 200)
>>> y = (0.3 * x**3 + 1.2 * x**2 + 70/x**4) * 1.1 * np.exp(0.1)
>>> data = np.array(list(zip(y, x)), dtype=[('y', np.float64), ('x', np.float64)])
>>> model = smf.ols(formula='y ~ I(x**3) + I(x**2) + I(x**4)', data=data)
Traceback (most recent call last):
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/compat.py", line 36, in call_and_wrap_exc
    return f(*args, **kwargs)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 166, in eval
    + self._namespaces))
  File "<string>", line 1, in <module>
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 48, in __getitem__
    return d[key]
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 48, in __getitem__
    return d[key]
ValueError: no field of name I

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "DataScienceVenv/lib/python3.7/site-packages/statsmodels/base/model.py", line 170, in from_formula
    missing=missing)
  File "DataScienceVenv/lib/python3.7/site-packages/statsmodels/formula/formulatools.py", line 67, in handle_formula_data
    NA_action=na_action)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 310, in dmatrices
    NA_action, return_type)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
    NA_action)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
    NA_action)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/build.py", line 696, in design_matrix_builders
    NA_action)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/build.py", line 443, in _examine_factor_types
    value = factor.eval(factor_states[factor], data)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
    data)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
    inner_namespace=inner_namespace)
  File "DataScienceVenv/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
    exec("raise new_exc from e")
  File "<string>", line 1, in <module>
patsy.PatsyError: Error evaluating factor: ValueError: no field of name I
    y ~ I(x**3) + I(x**2) + I(x**4)
                            ^^^^^^^
>>>

The documentation for statsmodels.formula.api.ols explicitly says (emphasis mine):

data must define __getitem__ with the keys in the formula terms args and kwargs are passed on to the model instantiation. E.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame.

Yet in practice structured arrays don't work, or at least not all features of the formula interface can be used with them, which is highly confusing.

Just as an FYI, statsmodels no longer officially supports recarrays. Any remaining references are vestigial and should be removed.