leeper/csvy

Subsetting (and similar operations) lose attributes

Opened this issue · 7 comments

Since the default underlying data structure is a data.frame (or data.table), subsetting my data drops all the attributes that .csvy provides.

When generating the structure, could you perhaps change the class so that attributes are preserved with something like what is done here: http://stackoverflow.com/questions/10404224/how-to-delete-a-row-from-a-data-frame-without-losing-the-attributes

Ideally, the class would only be changed if attributes were needed. For instance, if there are no file-level attributes, it would stay as a normal data.frame, and if there are no column-level attributes, it would stay as a normal vector.
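
For illustration, here is a minimal base R sketch of the kind of class change proposed above. The class name `csvy_col` and the helper `as_csvy_col()` are hypothetical and not part of csvy; the idea is only that a custom `[` method can copy attributes back after subsetting, and that the class is added only when a vector actually carries extra attributes.

```r
# Hypothetical helper: tag a vector with a class only if it has extra attributes.
as_csvy_col <- function(x) {
  standard <- c("names", "class", "levels", "dim", "dimnames")
  extra <- setdiff(names(attributes(x)), standard)
  if (length(extra) > 0 && !inherits(x, "csvy_col")) {
    class(x) <- c("csvy_col", oldClass(x))
  }
  x
}

# Hypothetical `[` method: subset as usual, then restore the attributes
# that do not depend on the length of the vector.
`[.csvy_col` <- function(x, ...) {
  out <- NextMethod("[")
  keep <- attributes(x)
  keep <- keep[setdiff(names(keep), c("names", "dim", "dimnames"))]
  for (nm in names(keep)) attr(out, nm) <- keep[[nm]]
  out
}

x <- as_csvy_col(structure(c(5.1, 4.9, 4.7), thing = "good"))
attr(x[1:2], "thing")  # "good"

# A vector without extra attributes stays a plain vector.
plain <- as_csvy_col(c(1, 2, 3))
inherits(plain, "csvy_col")  # FALSE
```

Whether adding a class like this is acceptable is the design question for the package; the sketch only shows that the mechanism itself is small.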

I have mixed feelings about how to handle this. This is a general issue with data.frames as an object class and I don't want csvy to have custom classes just to handle this. I'll leave this open in case anyone has a really good solution.

For what it's worth, dplyr functions seem to preserve attributes. Using a custom iris_test.csvy file with this header:

#---
#class: data.frame
#fields:
#- name: Sepal.Length
#  class: numeric
#  thing: 'good'
#- name: Sepal.Width
#  class: numeric
#  thing: 'ok'
#- name: Petal.Length
#  class: numeric
#  thing: 'bad'
#- name: Petal.Width
#  class: numeric
#  thing: 'ok'
#- name: Species
#  class: factor
#  levels:
#  - setosa
#  - versicolor
#  - virginica
#--- 
"Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
5.1,3.5,1.4,0.2,"setosa"
4.9,3,1.4,0.2,"setosa"
4.7,3.2,1.3,0.2,"setosa"
4.6,3.1,1.5,0.2,"setosa"
5,3.6,1.4,0.2,"setosa"

...I get the following results using dplyr functions:

import::from("csvy", "read_csvy", "write_csvy")
import::from("magrittr", "%>%")
import::from("tibble", "as_tibble")
import::from("dplyr", "slice", "filter")

z <- read_csvy("iris_test.csvy") %>% as_tibble()
str(z)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	150 obs. of  5 variables:
#  $ Sepal.Length: atomic  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#   ..- attr(*, "thing")= chr "good"
#  $ Sepal.Width : atomic  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#   ..- attr(*, "thing")= chr "ok"
#  $ Petal.Length: atomic  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#   ..- attr(*, "thing")= chr "bad"
#  $ Petal.Width : atomic  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#   ..- attr(*, "thing")= chr "ok"
#  $ Species     : atomic  setosa setosa setosa setosa ...
#   ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"
str(z %>% slice(1:5))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	5 obs. of  5 variables:
#  $ Sepal.Length: atomic  5.1 4.9 4.7 4.6 5
#   ..- attr(*, "thing")= chr "good"
#  $ Sepal.Width : atomic  3.5 3 3.2 3.1 3.6
#   ..- attr(*, "thing")= chr "ok"
#  $ Petal.Length: atomic  1.4 1.4 1.3 1.5 1.4
#   ..- attr(*, "thing")= chr "bad"
#  $ Petal.Width : atomic  0.2 0.2 0.2 0.2 0.2
#   ..- attr(*, "thing")= chr "ok"
#  $ Species     : atomic  setosa setosa setosa setosa ...
#   ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"
str(z %>% filter(Sepal.Length > 5, Sepal.Width > 4))
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3 obs. of  5 variables:
#  $ Sepal.Length: atomic  5.7 5.2 5.5
#   ..- attr(*, "thing")= chr "good"
#  $ Sepal.Width : atomic  4.4 4.1 4.2
#   ..- attr(*, "thing")= chr "ok"
#  $ Petal.Length: atomic  1.5 1.5 1.4
#   ..- attr(*, "thing")= chr "bad"
#  $ Petal.Width : atomic  0.4 0.1 0.2
#   ..- attr(*, "thing")= chr "ok"
#  $ Species     : atomic  setosa setosa setosa
#   ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"

Also, subsetting tbl_df using [ and [[ seems to preserve attributes unless you're selecting rows.

str(z[, 1])
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	150 obs. of  1 variable:
#  $ Sepal.Length: atomic  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#   ..- attr(*, "thing")= chr "good"
str(z[[1]])
#  atomic [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  - attr(*, "thing")= chr "good"
str(z[1])
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	150 obs. of  1 variable:
#  $ Sepal.Length: atomic  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#   ..- attr(*, "thing")= chr "good"

Selecting rows with [ drops the attributes.

str(z[1, ])
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	1 obs. of  5 variables:
#  $ Sepal.Length: num 5.1
#  $ Sepal.Width : num 3.5
#  $ Petal.Length: num 1.4
#  $ Petal.Width : num 0.2
#  $ Species     : chr "setosa"
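
One possible workaround for the row case, sketched in base R: subset first, then copy the attributes back from the original object. The helper name `subset_rows_keep_attrs()` is hypothetical (it is not provided by csvy or tibble), and the example uses a plain data.frame rather than the tbl_df above.

```r
# Hypothetical helper: select rows, then restore column-level and
# frame-level attributes from the original data frame.
subset_rows_keep_attrs <- function(df, i) {
  out <- df[i, , drop = FALSE]
  for (nm in names(df)) {
    old <- attributes(df[[nm]])
    # names/dim/dimnames describe the old length, so leave those to `[`.
    old <- old[setdiff(names(old), c("names", "dim", "dimnames"))]
    for (a in names(old)) attr(out[[nm]], a) <- old[[a]]
  }
  top <- attributes(df)
  # names, row.names and class are already set correctly by `[`.
  top <- top[setdiff(names(top), c("names", "row.names", "class"))]
  for (a in names(top)) attr(out, a) <- top[[a]]
  out
}

df <- data.frame(a = c(5.1, 4.9, 4.7), b = c(3.5, 3, 3.2))
attr(df$a, "thing") <- "good"
attr(df, "name") <- "iris"

res <- subset_rows_keep_attrs(df, 1:2)
attr(res$a, "thing")  # "good"
attr(res, "name")     # "iris"
```

This has to be called explicitly in place of `[`, so it does not help code that subsets the data frame in the usual way.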

Sorry, one more: Further subsetting of the vectors will drop the attributes:

str(z[[1]])
#  atomic [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  - attr(*, "thing")= chr "good"
str(z[[1]][1:5])
#  num [1:5] 5.1 4.9 4.7 4.6 5

I haven't played with it, but it looks like the sticky package was designed to solve this exact problem.

I've recently started working with the sticky package for similar issues in other data. To clarify, I was referring to the loss of attributes when rows are dropped.

At least for my use cases, I think of the result of loading a .csvy file as different from a data.frame, because my reason for using .csvy is "I want to load data with column attributes". The "with column attributes" part is what matters to me.

My preference would be to use the sticky package to preserve column attributes through subsetting within this package. @leeper, does that seem reasonable to you?

@billdenney For your use case (which I frequently see as well), perhaps you could have a look at a new package I've started developing -- metar -- which tries to provide a more comprehensive set of tools for working with data frame metadata. It's heavily inspired by csvy, but makes a few fundamentally different design decisions that I didn't want to force on csvy as an outsider (for instance, more reliance on the tidyverse packages, particularly purrr and rlang).

Note that data.table doesn't drop attributes on subsets either.

str(d1)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
  ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"
 - attr(*, "profile")= chr "tabular-data-package"
 - attr(*, "name")= chr "iris"
str(d1[1:5])
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
  ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"
str(as.data.table(d1)[1:5])
Classes ‘data.table’ and 'data.frame':	5 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
  ..- attr(*, "levels")= chr  "setosa" "versicolor" "virginica"
 - attr(*, "profile")= chr "tabular-data-package"
 - attr(*, "name")= chr "iris"
 - attr(*, ".internal.selfref")=<externalptr> 

I'm having a similar issue with a package I'm developing, and I'm wondering whether attributes are the right place to store metadata and information about columns, given that they seem easy to lose (short of creating a new class).
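
As a sketch of one alternative to attributes, the column metadata could live in a separate lookup table keyed by column name, so that subsetting the data cannot lose it. The layout and the helper `col_meta()` below are hypothetical, not something csvy does.

```r
# Data and metadata are kept as two separate objects.
dat <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7),
  Sepal.Width  = c(3.5, 3, 3.2)
)
meta <- data.frame(
  name  = c("Sepal.Length", "Sepal.Width"),
  thing = c("good", "ok"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up one metadata field for the given columns.
col_meta <- function(meta, column, field) {
  meta[[field]][match(column, meta$name)]
}

# Subsetting the data does not touch the metadata table.
sub <- dat[1:2, ]
col_meta(meta, names(sub), "thing")  # "good" "ok"
```

The trade-off is that the two objects have to be passed around together and kept in sync when columns are renamed or dropped.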