Builtins for numpy recarrays
When I use the patsy builtin identity function I() to add two features, a numpy recarray raises an error, while the equivalent pandas DataFrame works without a problem. Code to reproduce the error:
from patsy import dmatrix
import numpy as np
import pandas as pd
recarray = np.array([(1.0, 2), (3.0, 4)], dtype=[('x', float), ('y', int)])
df = pd.DataFrame.from_records(recarray)
result_df = dmatrix("I(x+y)-1", df)  # works with the DataFrame
result_rec = dmatrix("I(x+y)-1", recarray)  # raises PatsyError (traceback below)
python 3.6.5
patsy 0.5.0
pandas 0.23.4
numpy 1.14.1
traceback:
Traceback (most recent call last):
File "/anaconda3/lib/python3.6/site-packages/patsy/compat.py", line 36, in call_and_wrap_exc
return f(*args, **kwargs)
File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 166, in eval
+ self._namespaces))
File "<string>", line 1, in <module>
File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 48, in __getitem__
return d[key]
File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 48, in __getitem__
return d[key]
ValueError: no field of name I
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/helpers/pydev/pydev_run_in_console.py", line 52, in run_file
pydev_imports.execfile(file, globals, locals) # execute the script
File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/daviddejong/PycharmProjects/quick_test/patsy_test.py", line 7, in <module>
result_rec = dmatrix("I(x+y)-1",recarray)
File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 291, in dmatrix
NA_action, return_type)
File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
NA_action)
File "/anaconda3/lib/python3.6/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
NA_action)
File "/anaconda3/lib/python3.6/site-packages/patsy/build.py", line 696, in design_matrix_builders
NA_action)
File "/anaconda3/lib/python3.6/site-packages/patsy/build.py", line 443, in _examine_factor_types
value = factor.eval(factor_states[factor], data)
File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 566, in eval
data)
File "/anaconda3/lib/python3.6/site-packages/patsy/eval.py", line 551, in _eval
inner_namespace=inner_namespace)
File "/anaconda3/lib/python3.6/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
exec("raise new_exc from e")
File "<string>", line 1, in <module>
patsy.PatsyError: Error evaluating factor: ValueError: no field of name I
I(x+y)-1
^^^^^^
Looks like the issue is that recarray objects don't actually follow the standard Python "mapping" interface, which is what patsy expects from the data object. In particular, trying to access an undefined field name raises ValueError instead of KeyError, so patsy can't tell whether I is supposed to be a field name or something else.
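The difference described above can be checked directly. The following is a minimal sketch, assuming the NumPy behavior reported in this thread (a missing field on a structured array raises ValueError); the helper name `missing_key_error` is hypothetical and is not part of patsy or NumPy.

```python
import numpy as np


def missing_key_error(data, key):
    """Return the exception type raised by data[key], or None if the lookup succeeds."""
    try:
        data[key]
    except Exception as exc:  # deliberately broad: we want to see which type is raised
        return type(exc)
    return None


# The same two columns, once as a structured array and once as a plain dict.
structured = np.array([(1.0, 2), (3.0, 4)], dtype=[("x", float), ("y", int)])
as_dict = {"x": np.array([1.0, 3.0]), "y": np.array([2, 4])}

dict_error = missing_key_error(as_dict, "I")      # standard mapping behavior
array_error = missing_key_error(structured, "I")  # structured-array behavior

print("dict raises:", dict_error.__name__)
print("structured array raises:", array_error.__name__)
```

A lookup that treats only KeyError as "name not found here, try the next namespace" will move on to the builtins for the dict, but not for the structured array.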
Is this a problem? Pandas dataframes are better than recarrays in pretty much every way...?
Apparently, it's raising a ValueError to maintain backwards compatibility as explained here. Maybe it's worthwhile to check the dtype before catching the exception. Also, the example is a structured array, not a record array. I think it's worthwhile to add this behaviour to make sure it works on numpy structured/record arrays because a lot of functionality fails otherwise. I raised this issue because the documentation states:
You may prefer to store your data in a pandas DataFrame, or a numpy record array… whatever makes you happy.
Personally, I was using this in an environment where a (design) choice was made to use record arrays instead of pandas dataframes.
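For anyone who has to keep structured or record arrays, one possible workaround is to wrap the fields in a plain dict before handing the data to patsy, so that a missing name raises KeyError. This is only a sketch: the helper name `structured_to_columns` is an assumption for illustration and does not exist in patsy or NumPy.

```python
import numpy as np


def structured_to_columns(arr):
    """Expose each named field of a structured/record array as a dict entry.

    The values are views onto the original array, not copies.
    """
    if arr.dtype.names is None:
        raise TypeError("expected an array with named fields")
    return {name: np.asarray(arr[name]) for name in arr.dtype.names}


recarray = np.array([(1.0, 2), (3.0, 4)], dtype=[("x", float), ("y", int)])
columns = structured_to_columns(recarray)

# A dict follows the mapping interface: unknown names raise KeyError.
print(sorted(columns))
print(columns["x"] + columns["y"])
```

The resulting dict could then be passed as the data argument, for example `dmatrix("I(x+y)-1", columns)`. That call has not been run against the patsy version in the report, so treat it as untested.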
I'm facing the same issue with this code:
>>> import statsmodels.formula.api as smf
>>> import numpy as np
>>> x = np.linspace(0.001, 5, 200)
>>> y = (0.3 * x**3 + 1.2 * x**2 + 70/x**4) * 1.1 * np.exp(0.1)
>>> data = np.array([y, x], dtype=[('y', np.float64), ('x', np.float64)])
>>> model = smf.ols(formula='y ~ I(x**3) + I(x**2) + I(x**4)', data=data)
Traceback (most recent call last):
File "DataScienceVenv/lib/python3.7/site-packages/patsy/compat.py", line 36, in call_and_wrap_exc
return f(*args, **kwargs)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 166, in eval
+ self._namespaces))
File "<string>", line 1, in <module>
File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 48, in __getitem__
return d[key]
File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 48, in __getitem__
return d[key]
ValueError: no field of name I
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "DataScienceVenv/lib/python3.7/site-packages/statsmodels/base/model.py", line 170, in from_formula
missing=missing)
File "DataScienceVenv/lib/python3.7/site-packages/statsmodels/formula/formulatools.py", line 67, in handle_formula_data
NA_action=na_action)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 310, in dmatrices
NA_action, return_type)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 165, in _do_highlevel_design
NA_action)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/highlevel.py", line 70, in _try_incr_builders
NA_action)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/build.py", line 696, in design_matrix_builders
NA_action)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/build.py", line 443, in _examine_factor_types
value = factor.eval(factor_states[factor], data)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 566, in eval
data)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/eval.py", line 551, in _eval
inner_namespace=inner_namespace)
File "DataScienceVenv/lib/python3.7/site-packages/patsy/compat.py", line 43, in call_and_wrap_exc
exec("raise new_exc from e")
File "<string>", line 1, in <module>
patsy.PatsyError: Error evaluating factor: ValueError: no field of name I
y ~ I(x**3) + I(x**2) + I(x**4)
^^^^^^^
>>>
The documentation for statsmodels.formula.api.ols explicitly says (emphasis mine):
data must define __getitem__ with the keys in the formula terms args and kwargs are passed on to the model instantiation. E.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame.
Yet in fact, structured arrays don't work, or not all features of the formula interface can be used with them, which is highly confusing.
Just as an FYI, statsmodels no longer officially supports recarrays. Any remaining references are vestigial and should be removed.