To consider: mentioning circumstances of unexpected change of data.frame behaviour
adamryczkowski opened this issue · 7 comments
Many people already know, that there are many incompatible APIs to the concept of "data.frame": the original data.frame
, the data.table
and the tibble
.
Most of us know, that various library function may return different flavours of data.frame (e.g. readxl::read_excel
returns tibble
although it doesn't advertise that in the docs).
But I bet bug in this innocent code will come as a surprise for many of us:
Assume filename is a string that contains filename of Excel full of numbers; first row contain integers.
library(readxl)
df<-readxl::read_excel(filename) #Returns a data.frame with sneaky little tibble class included...
#But user may never used tibbles before, and assumes it is a data.frame (what else can it be?).
indices<-df[,1] #This will be numeric if df is interpreted as data.frame.
df<-df[indices,2] #We simply select needed rows. This works if indices is a numeric.
#Now do some random and innocent computation, like...
#...calling
Hmisc::label
#Now try to run the code again
indices<-df[,1] #This time indices will be a tibble.
df<-df[indices,2] #This line fails miserably, because indices are no longer integers, but one-element list of integers (a tibble).
If you think this issue is worth mentioning in your book - let me know. I will try to find time to write about it.
I think this is worth mentioning. I've had fails where data produced with dplyr as tibble
class are used in the @data
slot of a SpatialPolygonsDataFrame
in sp when dplyr is not loaded. Definitely worth flagging the danger. cc @edzer who may be interested in how sf will handle tibbles
. Guess: well, based on this: https://github.com/edzer/sfr/blob/19be7fc3ec1edcb2333b24099c2b90f5d3df4816/R/tidy.R
In any case, yes a PR on the subject would be appreciated!
Just to mention, that the book has now been copyedited, so we're not allowed to make any further changes to the "official" version.
Good point - not this version of it anyway, there may be future versions.
Is this an example case where formal classes work better than S3? Or should package readxl
have attached tibble
?
Good questions, but I have not the answers...
I've already filed a bug report about not mentioning the return type from the readxl
function.
I think the only good solution is to actually return pure data.frame
from the readxl
function. It doesn't change anything performance-wise, because (unlike e.g. the data.table
) the binary representation of the 'tbl' class looks like it is a vanilla data.frame anyway with just an extra class thrown into its class
attributes.
readxl should have imported tibble. That was my mistake.