Subsetting (and similar operations) lose attributes
Opened this issue · 7 comments
Since the default underlying data structure is a data.frame (or data table), subsetting my data drops all of the attributes that .csvy provides.
When generating the structure, could you perhaps change the class so that attributes are preserved with something like what is done here: http://stackoverflow.com/questions/10404224/how-to-delete-a-row-from-a-data-frame-without-losing-the-attributes
Ideally, the class would only be changed if attributes were needed. For instance, if there are no file-level attributes, it would stay as a normal data.frame, and if there are no column-level attributes, it would stay as a normal vector.
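To make the request concrete, here is a minimal sketch of what a class-based approach could look like for a single column. The class name `keep_attr` and the helper of the same name are hypothetical, chosen only for illustration; they are not part of csvy, and this is not necessarily the code from the linked answer.

```r
# Hypothetical helper: tag a column so that `[` keeps its attributes.
# "keep_attr" is an illustrative class name, not something csvy defines.
keep_attr <- function(x) {
  if (!inherits(x, "keep_attr")) {
    oldClass(x) <- c("keep_attr", oldClass(x))
  }
  x
}

# Subset with the next method, then copy the custom attributes back.
`[.keep_attr` <- function(x, ...) {
  out <- NextMethod()
  old <- attributes(x)
  # names/dim/dimnames depend on which elements were kept, so leave them alone
  old <- old[setdiff(names(old), c("names", "dim", "dimnames"))]
  for (nm in names(old)) attr(out, nm) <- old[[nm]]
  out
}

x <- keep_attr(structure(c(5.1, 4.9, 4.7, 4.6, 5), thing = "good"))
attributes(x[1:3])

# Base data.frame row selection calls `[` on each column, so it benefits too.
df <- data.frame(id = 1:5)
df$Sepal.Length <- x
attr(df[1:2, ]$Sepal.Length, "thing")
```

Because the class is only added by calling the helper, columns without attributes could stay as plain vectors, which is the behaviour asked for above.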
I have mixed feelings about how to handle this. This is a general issue with data.frames as an object class and I don't want csvy to have custom classes just to handle this. I'll leave this open in case anyone has a really good solution.
For what it's worth, dplyr functions seem to preserve attributes. Using a custom iris_test.csvy file with this header:
#---
#class: data.frame
#fields:
#- name: Sepal.Length
#  class: numeric
#  thing: 'good'
#- name: Sepal.Width
#  class: numeric
#  thing: 'ok'
#- name: Petal.Length
#  class: numeric
#  thing: 'bad'
#- name: Petal.Width
#  class: numeric
#  thing: 'ok'
#- name: Species
#  class: factor
#  levels:
#  - setosa
#  - versicolor
#  - virginica
#---
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"
...I get the following results using dplyr functions:
import::from("csvy", "read_csvy", "write_csvy")
import::from("magrittr", "%>%")
import::from("tibble", "as_tibble")
import::from("dplyr", "slice", "filter")
z <- read_csvy("iris_test.csvy") %>% as_tibble()
str(z)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: atomic 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# ..- attr(*, "thing")= chr "good"
# $ Sepal.Width : atomic 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# ..- attr(*, "thing")= chr "ok"
# $ Petal.Length: atomic 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# ..- attr(*, "thing")= chr "bad"
# $ Petal.Width : atomic 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# ..- attr(*, "thing")= chr "ok"
# $ Species : atomic setosa setosa setosa setosa ...
# ..- attr(*, "levels")= chr "setosa" "versicolor" "virginica"
str(z %>% slice(1:5))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5 obs. of 5 variables:
# $ Sepal.Length: atomic 5.1 4.9 4.7 4.6 5
# ..- attr(*, "thing")= chr "good"
# $ Sepal.Width : atomic 3.5 3 3.2 3.1 3.6
# ..- attr(*, "thing")= chr "ok"
# $ Petal.Length: atomic 1.4 1.4 1.3 1.5 1.4
# ..- attr(*, "thing")= chr "bad"
# $ Petal.Width : atomic 0.2 0.2 0.2 0.2 0.2
# ..- attr(*, "thing")= chr "ok"
# $ Species : atomic setosa setosa setosa setosa ...
# ..- attr(*, "levels")= chr "setosa" "versicolor" "virginica"
str(z %>% filter(Sepal.Length > 5, Sepal.Width > 4))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 5 variables:
# $ Sepal.Length: atomic 5.7 5.2 5.5
# ..- attr(*, "thing")= chr "good"
# $ Sepal.Width : atomic 4.4 4.1 4.2
# ..- attr(*, "thing")= chr "ok"
# $ Petal.Length: atomic 1.5 1.5 1.4
# ..- attr(*, "thing")= chr "bad"
# $ Petal.Width : atomic 0.4 0.1 0.2
# ..- attr(*, "thing")= chr "ok"
# $ Species : atomic setosa setosa setosa
# ..- attr(*, "levels")= chr "setosa" "versicolor" "virginica"
Also, subsetting a tbl_df using [ and [[ seems to preserve attributes unless you're selecting rows.
str(z[, 1])
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 1 variable:
# $ Sepal.Length: atomic 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# ..- attr(*, "thing")= chr "good"
str(z[[1]])
# atomic [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# - attr(*, "thing")= chr "good"
str(z[1])
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 150 obs. of 1 variable:
# $ Sepal.Length: atomic 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# ..- attr(*, "thing")= chr "good"
Selecting rows with [ drops the attributes.
str(z[1, ])
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 5 variables:
# $ Sepal.Length: num 5.1
# $ Sepal.Width : num 3.5
# $ Petal.Length: num 1.4
# $ Petal.Width : num 0.2
# $ Species : chr "setosa"
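As a stopgap for the row-selection case, the attributes can be copied back from the original columns after subsetting. The sketch below uses a plain base data.frame and a hypothetical helper, `restore_col_attrs()`, which is not part of csvy, dplyr, or tibble.

```r
# Hypothetical helper: copy column attributes from `orig_df` onto the
# matching columns of `sub_df` (for example, after selecting rows).
restore_col_attrs <- function(sub_df, orig_df) {
  for (nm in intersect(names(sub_df), names(orig_df))) {
    a <- attributes(orig_df[[nm]])
    # skip attributes whose value depends on which elements were kept
    a <- a[setdiff(names(a), c("names", "dim", "dimnames"))]
    for (k in names(a)) attr(sub_df[[nm]], k) <- a[[k]]
  }
  sub_df
}

df <- data.frame(Sepal.Length = c(5.1, 4.9, 4.7))
attr(df$Sepal.Length, "thing") <- "good"

rows <- df[1, , drop = FALSE]          # row selection drops "thing"
fixed <- restore_col_attrs(rows, df)   # copy it back from the original
attr(fixed$Sepal.Length, "thing")
```

This only helps when the original object is still around, so it is a workaround rather than a fix for the underlying behaviour.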
Sorry, one more: Further subsetting of the vectors will drop the attributes:
str(z[[1]])
# atomic [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# - attr(*, "thing")= chr "good"
str(z[[1]][1:5])
# num [1:5] 5.1 4.9 4.7 4.6 5
Haven't played with it, but it looks like the sticky package was designed to solve this exact problem.
I've recently started working with the sticky package for similar issues in other data. I was referring to the row-drop loss of attributes.
At least for my use cases, I think of the result of loading a .csvy file as different from a data.frame, because the reason for having .csvy in my use case is "I want to load data with column attributes". The "with column attributes" part is the important part to me.
My preference would be to use the sticky package to preserve column attributes through subsetting within this package. @leeper, does that seem reasonable to you?
@billdenney For your use case (which I frequently see as well), perhaps you could have a look at a new package I've started developing -- metar -- which tries to provide a more comprehensive set of tools for working with data frame metadata. It's heavily inspired by csvy, but makes a few fundamentally different design decisions that I didn't want to force on csvy as an outsider (for instance, more reliance on the tidyverse packages, particularly purrr and rlang).
Note that data.table doesn't drop attributes on subsets either.
str(d1)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
..- attr(*, "levels")= chr "setosa" "versicolor" "virginica"
- attr(*, "profile")= chr "tabular-data-package"
- attr(*, "name")= chr "iris"
str(d1[1:5])
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
..- attr(*, "levels")= chr "setosa" "versicolor" "virginica"
str(as.data.table(d1)[1:5])
Classes ‘data.table’ and 'data.frame': 5 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2
$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
..- attr(*, "levels")= chr "setosa" "versicolor" "virginica"
- attr(*, "profile")= chr "tabular-data-package"
- attr(*, "name")= chr "iris"
- attr(*, ".internal.selfref")=<externalptr>
I'm having a similar issue with a package I'm developing, and I'm wondering whether attributes are the right place to store metadata and information about columns, given that they seem easy to lose (short of creating a new class).
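For comparison, one alternative to attributes is to keep the column metadata in a separate lookup table keyed by column name, so that subsetting the data never touches it. This is only a sketch of that idea; `col_meta` and `get_meta()` are hypothetical names, and the `thing` values are taken from the example header earlier in the thread.

```r
# Column metadata stored next to the data rather than on it.
col_meta <- data.frame(
  name  = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  thing = c("good", "ok", "bad", "ok"),
  stringsAsFactors = FALSE
)

# Hypothetical accessor: look up one metadata field for some columns.
get_meta <- function(meta, column, field) {
  meta[[field]][match(column, meta$name)]
}

# Any subset of the data can still be matched to its metadata by name.
d <- iris[1:5, c("Sepal.Length", "Sepal.Width")]
get_meta(col_meta, names(d), "thing")
```

The trade-off is that the data and the metadata have to be passed around together, and renaming a column means updating the table as well.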