insightsengineering/random.cdisc.data

adding snapshots for randomly picked rows (in a way to be constant and representative)

Closed this issue · 6 comments

This would be useful imo so to see instantly what has been changed and avoiding involuntary changes to other columns. At the moment, it is a bit more complicated to review the changes between versions. What do you think @edelarua? Can I do this before completing the review of your lubridate PR #250?

@Melkiades I have been thinking about reproducibility as well and I think we will first have to implement a seed for each sample call as we did in chevron for this to be of use, since right now any changes/added variables in a dataset will change the values in any subsequent columns even if those columns are not altered (if I understand the seed issue correctly). If you think this would be useful you can open another issue for this. Let me know if you want me to work on any of these issues today as well! I do think these snapshots would be nice to have.

You could also implement the snapshots first for the purpose of reviewing the lubridate PR and then we can add in the seeds after if that’s easier for you.

I do not remember exactly why we did that... I did some tests, and that is rarely what we want:

> set.seed(1); v <- letters[1:3]
> tibble(rep(NA, 3)) %>% 
+     mutate(a = rnorm(3)) %>% 
+     mutate(b = rnorm(3)) %>% 
+     mutate(vs = sample(v)) %>% 
+     mutate(vs2 = sample(v))
# A tibble: 3 × 5
  `rep(NA, 3)`      a      b vs    vs2  
  <lgl>         <dbl>  <dbl> <chr> <chr>
1 NA           -0.626  1.60  a     b    
2 NA            0.184  0.330 c     c    
3 NA           -0.836 -0.820 b     a    
> set.seed(1); v <- letters[1:3]
> tibble(rep(NA, 3)) %>% 
+     mutate(a = rnorm(3)) %>% 
+     mutate(b = rnorm(3)) %>% 
+     mutate(vs = sample(v)) %>% 
+     mutate(vs2 = sample(v))
# A tibble: 3 × 5
  `rep(NA, 3)`      a      b vs    vs2  
  <lgl>         <dbl>  <dbl> <chr> <chr>
1 NA           -0.626  1.60  a     b    
2 NA            0.184  0.330 c     c    
3 NA           -0.836 -0.820 b     a    
> set.seed(1); v <- letters[1:3]
> tibble(rep(NA, 3)) %>% 
+     mutate(a = rnorm(3)) %>% 
+     mutate(b = {set.seed(1); rnorm(3)}) %>% 
+     mutate(vs = {set.seed(1); sample(v)}) %>% 
+     mutate(vs2 = {set.seed(1); sample(v)})
# A tibble: 3 × 5
  `rep(NA, 3)`      a      b vs    vs2  
  <lgl>         <dbl>  <dbl> <chr> <chr>
1 NA           -0.626 -0.626 a     a    
2 NA            0.184  0.184 b     b    
3 NA           -0.836 -0.836 c     c    

Even if the latter is what we want, the first column (aka b) can be just copied as it would be done for vs2. If the sampling order is what is to be preserved across columns, then it would be enough to store that beforehand and sort it, right?

@Melkiades we would set a different seed for each column since we still want random data. The problem is more that if we have a dataset already with column AVAL, and then decide to add a column before this one, then all the values of AVAL will be different:

eg.

set.seed(1)
df <- data.frame(AVAL = sample(1:5, 5))
df
  AVAL
1    2
2    5
3    4
4    3
5    1

becomes:

set.seed(1)
df <- data.frame(STRATA = sample(LETTERS[1:5], 5), AVAL = sample(1:5, 5))
df
  STRATA AVAL
1      B    5
2      E    4
3      D    2
4      C    3
5      A    1

Whereas if we set the seed for AVAL directly, we can preserve the values:

set.seed(1)
df <- data.frame(AVAL = {set.seed(101); sample(1:5, 5)})
df
  AVAL
1    2
2    1
3    3
4    5
5    4

becomes:

set.seed(1)
df <- data.frame(
    STRATA = {set.seed(102); sample(LETTERS[1:5], 5)}, 
    AVAL = {set.seed(101); sample(1:5, 5)}
)
df
  STRATA AVAL
1      C    2
2      B    1
3      E    3
4      D    5
5      A    4

I see what you mean. Now the fact that new columns are added after the others make sense. I think it is still better not set seed every call of sample or rnorm but to have the disclaimer that new columns must be added on the right. What do you think?

We could do that and explain briefly. The snapshots will be useful for new variables, but not necessarily for refactoring (like the implementation of lubridate) since we should expect many columns to change due to the behavior of the seed.

Yep, now it is clear. Thanks Emily! I suggest waiting for my review and your PR to be merged then. I will diff on the relevant columns manually sigh