pydata/patsy

In case of KeyErrors, Patsy doesn't try ints

khs opened this issue · 1 comment

khs commented

Pandas DataFrames can have column names that are integers rather than strings, but a Patsy ModelDesc can only be made up of strings. When Patsy encounters a KeyError while looking up a column name, it could check whether the name is a valid int and, if so, whether that int is a valid column identifier in the DataFrame. This would save developers from renaming their columns just to comply with their statistical software. The problem is avoidable with a single line (df.columns=[str(i) for i in df.columns]), but I don't think the time cost of the extra lookup is high enough to justify giving up this useful behavior, given that an error would be thrown otherwise.
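For reference, the one-line workaround mentioned above can be sketched as follows. This is only an illustration: the DataFrame contents are made up, and `rename(columns=str)` is used as an equivalent of the list comprehension that returns a new frame instead of modifying the original.

```python
import pandas as pd

# Illustrative data (not from the issue): a frame whose column labels are ints.
df = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]})

# Convert every column label to its string form. The original frame is left
# untouched because rename() returns a new DataFrame.
renamed = df.rename(columns=str)

print(list(df.columns))       # integer labels
print(list(renamed.columns))  # string labels
```

Note that a column literally named "0" or "1" would probably still need quoting inside a formula (for example with Patsy's `Q("0")`), since bare digits in a formula are read as numbers rather than as names.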

This is an interesting problem, but not one that I think should be solved in patsy (or in its successor, Formulaic). I think it is better to avoid any ambiguity in the behaviour of the design/model matrix generator, and ambiguity would inevitably result if we looked up names as strings, integers, and floats based on their string representations. Since we don't want formulas to look like "field" + "field", with every name quoted, we lose the ability to distinguish between "0" and 0, and so always assume that the incoming field name is a string. If we were to try integers too, then things would get weird for a dataframe that has both an integer column and a column named with its string representation, and that is just not worth the extra confusion. I'm going to close this one out for now, since I am not planning to change patsy's behaviour at all going forward, except to fix bugs. If this is still important to you, let's have that discussion in Formulaic.
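To make the ambiguity concrete, here is a minimal pure-Python sketch of the proposed fallback. The function `lookup_with_int_fallback` is hypothetical and is not part of patsy or Formulaic; plain dicts stand in for a dataframe.

```python
def lookup_with_int_fallback(data, name):
    """Hypothetical lookup: try the string name, then its integer form."""
    try:
        return data[name]
    except KeyError:
        try:
            return data[int(name)]
        except (ValueError, KeyError):
            # Neither the string nor the integer form is a valid key.
            raise KeyError(name) from None


# Only an integer column exists: the name "0" silently resolves to column 0.
only_int = {0: [1, 2, 3]}

# Both forms exist: "0" resolves to the string column, and the integer
# column 0 can no longer be reached from a formula at all.
both = {0: [1, 2, 3], "0": [10, 20, 30]}

print(lookup_with_int_fallback(only_int, "0"))  # [1, 2, 3]
print(lookup_with_int_fallback(both, "0"))      # [10, 20, 30]
```

Under this sketch, the same formula term selects different data depending on which other columns happen to exist, which is the kind of confusion described above.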