To consider: mentioning circumstances of unexpected change of data.frame behaviour

Question

adamryczkowski opened this issue 8 years ago · 7 comments

Many people already know that there are several incompatible APIs for the concept of a "data.frame": the original data.frame, the data.table and the tibble.
Most of us know that various library functions may return different flavours of data.frame (e.g. readxl::read_excel returns a tibble, although it doesn't advertise that in the docs).
But I bet the bug in this innocent code will come as a surprise to many of us:

Assume filename is a string that contains the name of an Excel file full of numbers; the first column contains integers.

library(readxl)
df <- readxl::read_excel(filename) # Returns a data.frame with a sneaky little tibble class included...
# But the user may never have used tibbles before and assumes it is a plain data.frame (what else could it be?).

indices <- df[, 1] # This is a numeric vector if df is treated as a plain data.frame.
selected <- df[indices, 2] # We simply select the needed rows. This works if indices is numeric.

# Now do some random and innocent computation, like...
# ...calling
Hmisc::label

# Now try to run the same two lines again
indices <- df[, 1] # This time indices will be a tibble.
selected <- df[indices, 2] # This line fails miserably, because indices is no longer a vector of integers but a one-element list of integers (a tibble).

If you think this issue is worth mentioning in your book, let me know. I will try to find time to write about it.
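
One way to sidestep the surprise described above is to extract columns with `[[` instead of `[, j]`, since `[[` returns the column itself for both plain data.frames and tibbles. The following is only a minimal sketch: the object `df` is a hand-built stand-in for what read_excel returns (an assumption, so that the example runs without readxl or an Excel file), and the column names `id` and `value` are hypothetical.

```r
# Hypothetical stand-in for the result of readxl::read_excel():
# a data.frame that also carries the tibble classes.
df <- structure(
  data.frame(id = c(3L, 1L, 2L), value = c(10, 20, 30)),
  class = c("tbl_df", "tbl", "data.frame")
)

# `[[` returns the column as a vector, whether or not the tibble
# methods happen to be loaded in the session.
indices <- df[[1]]

# Index the second column with that vector instead of using df[indices, 2],
# so the result does not depend on which `[` method gets dispatched.
selected <- df[[2]][indices]
```

Written this way, the two lines give the same result before and after the tibble namespace is loaded.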

Answer 1 · 2016-11-05T11:07:55.000Z

I think this is worth mentioning. I've had failures where data produced with dplyr, carrying the tibble class, were used in the @data slot of a SpatialPolygonsDataFrame in sp when dplyr was not loaded. Definitely worth flagging the danger. cc @edzer who may be interested in how sf will handle tibbles. Guess: well, based on this: https://github.com/edzer/sfr/blob/19be7fc3ec1edcb2333b24099c2b90f5d3df4816/R/tidy.R

In any case, yes a PR on the subject would be appreciated!

Answer 2 · 2016-11-05T12:13:52.000Z

Just to mention that the book has now been copyedited, so we're not allowed to make any further changes to the "official" version.

Answer 3 · 2016-11-05T12:32:29.000Z

Good point - not this version of it anyway, there may be future versions.

Answer 4 · 2016-11-05T13:44:19.000Z

Is this an example case where formal classes work better than S3? Or should package readxl have attached tibble?
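
For context on the S3 side of this question, here is a small sketch of how S3 dispatch falls back to the next class in the class vector when no method for the first class is visible, and how the result changes once such a method appears. This is presumably the mechanism behind the behaviour change in the original report, but that is an assumption. The generic `describe` and the class `special_df` are hypothetical names made up for this illustration.

```r
# A hypothetical generic with a method for plain data.frames only.
describe <- function(x) UseMethod("describe")
describe.data.frame <- function(x) "handled as data.frame"

# An object with an extra class in front of "data.frame",
# similar in spirit to a tibble.
obj <- structure(data.frame(a = 1:3), class = c("special_df", "data.frame"))

# No method for special_df exists yet, so dispatch falls through
# to the data.frame method.
before <- describe(obj)

# Once a method for the extra class becomes visible (for a real package,
# this happens when its namespace is loaded), the same call behaves differently.
describe.special_df <- function(x) "handled as special_df"
after <- describe(obj)
```

The object itself never changes between the two calls; only the set of visible methods does.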

Answer 5 · 2016-11-05T13:45:19.000Z

Good questions, but I don't have the answers...

Answer 6 · 2016-11-05T13:52:41.000Z

I've already filed a bug report about not mentioning the return type from the readxl function.

I think the only good solution is to actually return a pure data.frame from the readxl function. It doesn't change anything performance-wise, because (unlike e.g. the data.table) the binary representation of the 'tbl' class looks like a vanilla data.frame anyway, with just an extra class thrown into its class attribute.
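
The point about the extra class can be illustrated with a short sketch. It assumes, as the comment above suggests, that the tibble classes are simply prepended to the class attribute; the object `tagged` is built by hand here rather than by readxl, so it is a stand-in and not a real read_excel result.

```r
# A plain data.frame and a copy that carries the extra tibble classes.
plain <- data.frame(x = 1:3, y = c("a", "b", "c"))
tagged <- plain
class(tagged) <- c("tbl_df", "tbl", "data.frame")

# Without the class attribute, both objects are the same underlying list.
same_payload <- identical(unclass(tagged), unclass(plain))

# Resetting the class attribute turns the tagged copy back into an
# ordinary data.frame; no data is copied or converted.
class(tagged) <- "data.frame"
same_object <- identical(tagged, plain)
```

In user code, calling `as.data.frame()` on the result of read_excel should have the same effect of dropping the extra classes.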

Answer 7 · 2016-11-06T22:26:59.000Z

readxl should have imported tibble. That was my mistake.