Better error message for DataFrame.apply
datapythonista opened this issue · 8 comments
The next example seems reasonable to me:
import pandas
df = pandas.DataFrame({'col1': ['1', '2', '3'],
                       'col2': ['9', '9', '9']})
df.apply(int)
It looks like it should convert the data in the DataFrame to integers by calling the int() function on every element.
This would be true for Series.apply, but the DataFrame.apply parameter is a function that receives a whole Series at a time, not individual (scalar) values. The function that receives one value at a time is DataFrame.applymap.
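To make the distinction concrete, here is a minimal sketch using the DataFrame from the example above. It relies only on `DataFrame.apply` and `Series.apply`. The lambda is an illustration, not something proposed in this issue: it is simply a function that accepts a whole Series and converts it element by element.

```python
import pandas

df = pandas.DataFrame({'col1': ['1', '2', '3'],
                       'col2': ['9', '9', '9']})

# Series.apply calls the function once per element, so int() works here.
col1_as_int = df['col1'].apply(int)

# DataFrame.apply calls the function once per column, passing a whole Series.
# A function that accepts a Series can do the element-wise work itself.
converted = df.apply(lambda column: column.apply(int))

# Passing int directly fails, because int() receives a Series, not a scalar.
try:
    df.apply(int)
    failed = False
except TypeError:
    failed = True
```

The second call succeeds only because the lambda is written to take a Series, which is exactly the contract that `DataFrame.apply` expects.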
This is how pandas is designed and, while probably a bit confusing, it is reasonable. So the previous example actually fails. The error is:
TypeError: ("cannot convert the series to <class 'int'>", 'occurred at index col1')
Feel free to disagree, but personally I think the error message doesn't do a great job of telling the user what's wrong or giving hints on how to fix it. I think something like the following would be more useful:
TypeError: The function `int` passed to `DataFrame.apply` should expect a `Series` as the argument. To apply a function that receives a single item at a time use `DataFrame.applymap`.
While this may look straightforward, it is not easy, and surely not as easy as replacing the error message. The message currently reported is raised by the Series when something tries to convert it to an integer, as in int(pandas.Series()), so it has nothing to do with apply.
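A short check of that claim, as a sketch: calling `int()` on a Series directly, with no DataFrame and no `apply` involved, should raise the same kind of error. The exact wording of the message may vary between pandas versions, so it is printed rather than assumed.

```python
import pandas

series = pandas.Series(['1', '2', '3'])

direct_error = None
try:
    int(series)  # no DataFrame and no apply involved
except TypeError as exc:
    direct_error = exc

# The Series itself refuses the cast; apply only propagates the error.
print(type(direct_error).__name__, direct_error)
```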
I think it's doable to have an appropriate error message, but not sure about the implications.
Feel free to discuss your proposals on how to fix it here, or to try your approach and open a PR, and have the discussion there.
I agree with you. I've been noticing better error messages and suggestions on how to fix them in some libraries like sklearn.
It's better for an error message to be useful than succinct. If we come up with something both very useful and succinct, great.
I will experiment with this and get back.
I don't know if there is a pandas convention on how to add suggestions to error messages.
TypeError: ("cannot convert series to <class 'int'>", 'occurred at index col1', "to convert individual value to <class 'int'>, try DataFrame.applymap")
The proposals are done with issues in the pandas repository. In this case, the problem is not so much the exact wording of the error message, but rather how to identify the problem in the code. As explained at the end of the description, the error message being raised is not from the apply function, but is a "generic" Series error.
Hmmmmm
How is it going @WuraolaOyewusi? I'm interested in collaborating on this.
@datapythonista would it be useful to add a test at the beginning of df.apply asserting that the function to apply is not int or float (otherwise throw the error you described)? Or are you thinking of an even more abstract solution, like throwing one type of error if the error comes from the Series class, and another type if it comes from apply?
int and float are examples; you can also have math.log, which expects a scalar and should fail with the same new error. The solution is not trivial. An idea would be that when the Series is cast to something it can't be, it raises a subclass of TypeError (e.g. InvalidCastError). Then apply could catch this specific exception, and I guess it's safe at that point to tell the user that the function should receive a Series as its parameter but it doesn't.
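The idea above can be sketched in plain Python, without touching pandas internals. Everything here is hypothetical: `ToySeries` and `toy_apply` are stand-ins invented for this illustration, and `InvalidCastError` is only the name suggested in this comment, not an existing pandas exception.

```python
import math


class InvalidCastError(TypeError):
    """Hypothetical exception: a whole series was cast to a scalar type."""


class ToySeries:
    """Stand-in for pandas.Series, just enough to show the idea."""

    def __init__(self, values):
        self.values = list(values)

    def _refuse_cast(self, target):
        raise InvalidCastError(f"cannot convert the series to {target}")

    def __int__(self):
        self._refuse_cast(int)

    def __float__(self):
        # math.log and similar functions go through __float__.
        self._refuse_cast(float)


def toy_apply(columns, func):
    """Call func once per column, like DataFrame.apply does by default."""
    results = {}
    for name, values in columns.items():
        try:
            results[name] = func(ToySeries(values))
        except InvalidCastError as exc:
            # Only the specific subclass is caught, so unrelated TypeErrors
            # raised inside func still reach the user unchanged.
            func_name = getattr(func, "__name__", repr(func))
            raise TypeError(
                f"The function `{func_name}` passed to `DataFrame.apply` "
                "should expect a `Series` as the argument. To apply a "
                "function that receives a single item at a time use "
                "`DataFrame.applymap`."
            ) from exc
    return results
```

With this sketch, `toy_apply({'col1': ['1', '2', '3']}, int)` and the same call with `math.log` both raise the more helpful message, while a function that accepts the series object runs normally. Chaining with `from exc` keeps the original cast error available for debugging.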
@datapythonista makes sense! I will try to implement something similar and tell you how it goes.
@martinagvilas, I haven't figured out a way to go about this yet.