nhs-r-community/NHSRdatasets

convert datasets to tibbles

Closed this issue · 6 comments

In keeping with tidy data, should we convert the datasets to tibbles? The ae_attendances dataset is already a tibble, but the other two are not.

Converting to a tibble shouldn't cause any issues, as a tibble is still a data.frame (its class vector is a superset of the data.frame class), but it vastly improves (IMO) the EDA phase by not cluttering the screen when printing the data.

@Lextuga007 / @chrismainey what are your thoughts?

I have no problems with that. I'm not familiar with using tibbles, so should I just convert the final data frame to a tibble at the end of construction?

Hmmm... I'm maybe not as 'tidyverse' as many, and I don't think this is a huge priority, but I see @Lextuga007 as the owner of this dataset, so it's up to you. I don't explicitly use tibbles (although you do if you use dplyr), so I'm happy either way, but they may be more confusing for new R users who may not see why they both are and are not data.frames.

It would just be a case of passing the data frame to the as_tibble() function. I would actually disagree that tibbles are more confusing to new R users. If they are learning the tidyverse they will be used to tibbles, and will probably just think of them as data frames. But when you get a data frame that blows past the console view size because it prints the first 1000 rows and all columns, I would say that can cause confusion! Tibbles are also far safer in terms of subsetting (see https://blog.rstudio.com/2016/03/24/tibble-1-0-0/).
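For anyone following along, here is a minimal sketch of the conversion and of the subsetting difference mentioned above. The `df` object is a made-up stand-in, not one of the package's actual datasets, and the column names are purely illustrative.

```r
library(tibble)

# Hypothetical stand-in for a dataset that is still a plain data.frame
df <- data.frame(
  org_code    = c("A", "B", "C"),
  attendances = c(100, 200, 300)
)

# The conversion itself is a single call
tbl <- as_tibble(df)

class(df)   # "data.frame"
class(tbl)  # "tbl_df" "tbl" "data.frame"

# Selecting one column with [ : a data.frame drops to a bare vector,
# while a tibble always stays a tibble
is.data.frame(df[, "attendances"])   # FALSE
is.data.frame(tbl[, "attendances"])  # TRUE

# Partial matching with $ : a data.frame silently matches "attendances",
# while a tibble returns NULL and raises a warning
df$att
tbl$att
```

Code that relies on the drop-to-vector or partial-matching behaviour would need checking after the switch, which is probably the main thing to watch for.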

Yeah, I see your point. I find the opposite for personal use, and I'd normally talk to new users about data.frames as 'special' lists of vectors of the same length, then present tibble and data.table as 'modern extensions.' I often want to see more rows and columns than tibble prints, and it irritates me, but I've no objection to changing to tibbles. They will presumably be slightly larger in size, but that shouldn't be any sort of issue. It's just a philosophy thing for me. If it was about speed and storage, I'd go data.table, but if we are preaching tidyverse (which is no bad thing) tibble is the way. I'm already old fashioned it seems ;-) .

So file size shouldn't really be an issue: the difference between a tibble and a data.frame is that a tibble is a data.frame with two additional classes (tbl_df and tbl), and I think it stores the column types as well. This article does a really good job of explaining the benefits and disadvantages.
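A quick way to check the class and size point for yourself, as a rough sketch only. The `df` here is invented for illustration, and the sizes reported will depend on the data, so no expected numbers are given.

```r
library(tibble)

# Hypothetical data.frame, used only to compare the two representations
df  <- data.frame(id = 1:1000, value = sqrt(1:1000))
tbl <- as_tibble(df)

# The only classes the tibble adds on top of data.frame
setdiff(class(tbl), class(df))  # "tbl_df" "tbl"

# A tibble still passes any check that expects a data.frame
inherits(tbl, "data.frame")     # TRUE

# Compare in-memory sizes; the columns themselves are stored the same way
object.size(df)
object.size(tbl)
```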

In terms of printing more rows, use the n and width arguments of the tibble print method. Or use glimpse(). Or View() in RStudio. :-)
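Roughly, those options look like this. It is only a sketch: it uses the built-in mtcars data rather than anything from this package, and View() is left as a comment because it only makes sense in an interactive session.

```r
library(tibble)

# mtcars ships with base R; used here purely as example data
tbl <- as_tibble(mtcars)

# Default print: only the first 10 rows and the columns that fit the console
print(tbl)

# Ask for more rows explicitly
print(tbl, n = 20)

# Print every row and every column, data.frame style
print(tbl, n = Inf, width = Inf)

# One line per column, showing its type and first few values
glimpse(tbl)

# In RStudio (interactive only):
# View(tbl)
```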

I think this may actually need to be expanded into a separate issue for a style guide for datasets. It would be useful to have consistency across every dataset provided by the package, the downside being that we would be imposing some rules on contributions.

Cool. I'm happy to go with tibbles. Will do that now, as I've got 10 mins spare.