allisonhorst/palmerpenguins

Include csv files in repository?

eddelbuettel opened this issue · 9 comments

Thanks for your efforts in providing this new data set as a standard. I just cloned the repo and noticed one thing missing that I wanted to use for an example: a stored csv.

One thing frequently shown when teaching data wrangling is remote download from a URL, just as you do here in your data-raw/ directory. And while the package is nicely set up according to CRAN packaging standards and cleanly provides its data, it only provides it to R users of the package which is more limiting than it could be and excludes other users.

Would you consider also writing the data as a csv file so that it could be slurped with a remote csv read? This would offer two benefits not currently covered. One is more minor: you can "standardize" on a file name by using one, so it will always be palmerpenguins.csv rather than some variant. Two, more importantly, you do not close the door to data science users not starting from an R package.
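As a rough sketch of what such a remote csv read could look like in R: `read.csv()` accepts either a local path or a URL. The URL below is an assumption (a guessed raw-file path and branch, not something confirmed in this thread), and the two-row table is a made-up stand-in so the snippet runs without network access.

```r
# Assumed raw-file URL; the path and branch are guesses, adjust as needed.
csv_url <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

# Hypothetical helper: read.csv() takes a file path or a URL,
# because R opens either one as a connection.
read_penguins_csv <- function(src) {
  read.csv(src, stringsAsFactors = FALSE)
}

# Offline stand-in: write a tiny illustrative table to a temp file and read it back.
tmp <- tempfile(fileext = ".csv")
writeLines(c("species,bill_length_mm", "Adelie,39.1", "Gentoo,NA"), tmp)
demo <- read_penguins_csv(tmp)
str(demo)

# With network access, the same helper would be called on the URL:
# penguins <- read_penguins_csv(csv_url)
```

The same one-call pattern is what makes a fixed, predictable file name useful: every tutorial can point at the same location.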

Disk space is reasonably cheap, and the vignettes/ directory alone is 3 MB. The csv export of the data set I just made (for a demo use) clocks in at 14 KB, or less than 1/2 of a percent of that. So we'd have the space, and I think we'd lose nothing by also offering a downloadable csv. I am more ambivalent about how best to ship it in a package. The data set is so small that I would probably include it as a csv, but given that the whole LazyLoad machinery is set up there is no reason to change this. But having a download target csv would be a nice net gain for some users not currently reached. Thanks for your consideration.

Come to think of it, there is at least one more reason. E.g. when we prepare Debian packages from CRAN packages we have to explain for each binary file (that is, each .rda) where the source comes from. I did a quick check, and I appear to currently have around 138 Debian packages I maintain unpacked on my box containing a NAMESPACE file (as a quick proxy for a CRAN package), and 51 of them require such an explanation file! Here is an example for viridislite. Now, downstream packaging is not normative for CRAN or other best practices, but ... it would still be nice to have a csv for that reason alone. At least for some downstream packagers :)

A csv would also make it readily available for our friends in the Python and Julia world.

btw: I'm already using it in the video I uploaded for my useR2020muc talk 👍

> A csv would also make it readily available for our friends in the Python and Julia world.

That is what I had in mind when I wrote "it only provides it to R users of the package which is more limiting than it could be and excludes other users" above. Or to R users who are trying to remotely slurp a csv file which has long been supported by R's Connections API.

Thanks @eddelbuettel & @markvanderloo, we agree & will be adding the csv shortly.

Forgot to mention that should you need or want it I'd be happy to send a one-line PR to add the export to csv to the processing file....
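A one-line export of this kind might look roughly like the sketch below. The data frame and the output path are hypothetical stand-ins; in the actual processing file the real `penguins` object and a path inside the repository would be used instead.

```r
# Stand-in for the processed data frame (illustrative values only).
penguins_demo <- data.frame(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, NA),
  stringsAsFactors = FALSE
)

# Hypothetical output location; a temp directory keeps the sketch side-effect free.
out_file <- file.path(tempdir(), "penguins.csv")

# The one-line export: row.names = FALSE avoids an extra unnamed index column.
write.csv(penguins_demo, out_file, row.names = FALSE)

# Round-trip check: reading the file back should reproduce the data frame.
round_trip <- read.csv(out_file, stringsAsFactors = FALSE)
print(all.equal(penguins_demo, round_trip))
```

Missing values are written as `NA` by default, which `read.csv()` reads back as missing, so the round trip preserves them.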

Hi, I was using the CSV file a few days ago to make a basic ML example in multiple programming languages. It would be great if you could put it back up! Thanks!

Hi,

This file is back here now to stay: https://github.com/allisonhorst/palmerpenguins/blob/master/data-raw/penguins.csv

Thank you!
Alison/Allison

For anyone else who wondered where the csv file had moved to, it is now here: https://github.com/allisonhorst/palmerpenguins/tree/master/inst/extdata

Not yet on my box, but I trust that once updated, it will be:

 R> system.file("extdata", "penguins.csv", package="palmerpenguins")                 
 [1] "" 
 R>  
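Since `system.file()` returns an empty string when the file (or the package) is not installed, a script can check for that and fall back to another location. A minimal sketch, where the helper name and the fallback URL are assumptions rather than anything provided by the package:

```r
# Hypothetical helper: prefer the csv shipped in the installed package,
# otherwise return a caller-supplied fallback (for example a remote URL).
penguins_csv_source <- function(fallback) {
  path <- system.file("extdata", "penguins.csv", package = "palmerpenguins")
  if (nzchar(path)) path else fallback
}

# Assumed raw-file URL; the path and branch are guesses, adjust as needed.
fallback_url <- "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"

src <- penguins_csv_source(fallback_url)
print(src)
# penguins <- read.csv(src)  # needs network access when the fallback is used
```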