csgillespie/efficientR

To consider: mentioning circumstances of unexpected change of data.frame behaviour

adamryczkowski opened this issue · 7 comments

Many people already know, that there are many incompatible APIs to the concept of "data.frame": the original data.frame, the data.table and the tibble.
Most of us know, that various library function may return different flavours of data.frame (e.g. readxl::read_excel returns tibble although it doesn't advertise that in the docs).
But I bet bug in this innocent code will come as a surprise for many of us:

Assume filename is a string that contains filename of Excel full of numbers; first row contain integers.

library(readxl)
df<-readxl::read_excel(filename) #Returns a data.frame with sneaky little tibble class included... 
#But user may never used tibbles before, and assumes it is a data.frame (what else can it be?). 

indices<-df[,1] #This will be numeric if df is interpreted as data.frame.
df<-df[indices,2] #We simply select needed rows. This works if indices is a numeric.

#Now do some random and innocent computation, like...
#...calling
Hmisc::label

#Now try to run the code again
indices<-df[,1] #This time indices will be a tibble.
df<-df[indices,2] #This line fails miserably, because indices are no longer integers, but one-element list of integers (a tibble).

If you think this issue is worth mentioning in your book - let me know. I will try to find time to write about it.

I think this is worth mentioning. I've had fails where data produced with dplyr as tibble class are used in the @data slot of a SpatialPolygonsDataFrame in sp when dplyr is not loaded. Definitely worth flagging the danger. cc @edzer who may be interested in how sf will handle tibbles. Guess: well, based on this: https://github.com/edzer/sfr/blob/19be7fc3ec1edcb2333b24099c2b90f5d3df4816/R/tidy.R

In any case, yes a PR on the subject would be appreciated!

Just to mention, that the book has now been copyedited, so we're not allowed to make any further changes to the "official" version.

Good point - not this version of it anyway, there may be future versions.

edzer commented

Is this an example case where formal classes work better than S3? Or should package readxl have attached tibble?

Good questions, but I have not the answers...

I've already filed a bug report about not mentioning the return type from the readxl function.

I think the only good solution is to actually return pure data.frame from the readxl function. It doesn't change anything performance-wise, because (unlike e.g. the data.table) the binary representation of the 'tbl' class looks like it is a vanilla data.frame anyway with just an extra class thrown into its class attributes.

readxl should have imported tibble. That was my mistake.