r-lib/slider

feature request: variable window size

I'd like to be able to pass a dataframe column as the window size, i.e. a per-row .before and .after (so the window covers 2 * .before + 1 rows when .after equals .before). This would allow a more flexible approach to the rolling window linear model that I want to fit, where the number of datapoints used for the regression can be manually tweaked over time.

Is this currently possible with some syntax that I missed?

library(tidyverse)
library(slider)

dat <- tibble(a = 1:10, b = 11:20, c = rnorm(10),
              w = c(1, 1, 2, 2, 2, 1, 1, 1, 1, 1))

dat %>%
  mutate(roll = slide(dat, ~ lm(a ~ c, data = .x), .before = w, .after = w))
#> Error: Problem with `mutate()` column `roll`.
#> ℹ `roll = slide(dat, ~lm(a ~ c, data = .x), .before = w, .after = w)`.
#> ✖ `.before` must have size 1, not 10.

Created on 2021-08-02 by the reprex package (v2.0.0)

From working with tidyverse functions, this is the behaviour I expected:

  • if length(.before) == 1, recycle without a warning
  • if length(.before) > 1 & length(.before) != nrow(dat), recycle with a warning
  • if length(.before) == nrow(dat), use the per-row values without a warning

It looks like you could use hop(), which allows you to pass .starts and .stops manually:

library(tidyverse)
library(slider)

dat <- tibble(a = 1:10, b = 11:20, c = rnorm(10),
              w = c(1, 1, 2, 2, 2, 1, 1, 1, 1, 1))

# Looks like you are requesting a window including:
# - the current value
# - `w` before the current value
# - `w` after the current value
# dat %>% mutate(roll = slide(dat, ~ lm(a ~ c, data = .x), .before = w, .after = w))

dat <- dat %>%
  mutate(
    starts = row_number() - w,
    stops = row_number() + w
  ) 

head(dat)
#> # A tibble: 6 x 6
#>       a     b       c     w starts stops
#>   <int> <int>   <dbl> <dbl>  <dbl> <dbl>
#> 1     1    11  0.0207     1      0     2
#> 2     2    12  0.821      1      1     3
#> 3     3    13  0.493      2      1     5
#> 4     4    14 -0.782      2      2     6
#> 5     5    15 -0.0876     2      3     7
#> 6     6    16  0.0400     1      5     7

# Look at the indices that will be generated to slice `dat` with
hop(seq_len(nrow(dat)), dat$starts, dat$stops, identity)
#> [[1]]
#> [1] 1 2
#> 
#> [[2]]
#> [1] 1 2 3
#> 
#> [[3]]
#> [1] 1 2 3 4 5
#> 
#> [[4]]
#> [1] 2 3 4 5 6
#> 
#> [[5]]
#> [1] 3 4 5 6 7
#> 
#> [[6]]
#> [1] 5 6 7
#> 
#> [[7]]
#> [1] 6 7 8
#> 
#> [[8]]
#> [1] 7 8 9
#> 
#> [[9]]
#> [1]  8  9 10
#> 
#> [[10]]
#> [1]  9 10

mutate(dat, roll = hop(cur_data(), starts, stops, ~ lm(a ~ c, data = .x)))
#> # A tibble: 10 x 7
#>        a     b       c     w starts stops roll  
#>    <int> <int>   <dbl> <dbl>  <dbl> <dbl> <list>
#>  1     1    11  0.0207     1      0     2 <lm>  
#>  2     2    12  0.821      1      1     3 <lm>  
#>  3     3    13  0.493      2      1     5 <lm>  
#>  4     4    14 -0.782      2      2     6 <lm>  
#>  5     5    15 -0.0876     2      3     7 <lm>  
#>  6     6    16  0.0400     1      5     7 <lm>  
#>  7     7    17  0.158      1      6     8 <lm>  
#>  8     8    18 -1.14       1      7     9 <lm>  
#>  9     9    19  0.164      1      8    10 <lm>  
#> 10    10    20  0.235      1      9    11 <lm>
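If only one number per window is needed rather than the full model object, a variant of the call above could use hop_vec(), which returns an atomic vector instead of a list. This is a minimal sketch, assuming the same starts/stops construction as above; the seed, the plain data.frame, and the choice to extract the slope of c are illustrative assumptions, not part of the original request.

```r
library(slider)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dat <- data.frame(
  a = 1:10,
  c = rnorm(10),
  w = c(1, 1, 2, 2, 2, 1, 1, 1, 1, 1)
)

# Per-row window boundaries: `w` rows before and after the current row
starts <- seq_len(nrow(dat)) - dat$w
stops <- seq_len(nrow(dat)) + dat$w

# Fit the regression in each window and keep only the slope of `c`
slopes <- hop_vec(
  dat, starts, stops,
  ~ coef(lm(a ~ c, data = .x))[["c"]],
  .ptype = double()
)

length(slopes)
```

The result has one element per row of dat, so it can be stored as an ordinary numeric column.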

I think the reason I don't allow .before and .after of length > 1 in slide() and slide_index() is that I want to minimize the risk of generating non-ascending indices. This is especially important for slide_index(), which requires ascending indices in its internal algorithm. For example, this would be problematic:

index <- c(1, 1)
before <- c(0, 1)
after <- c(1, 1)

starts <- index - before
stops <- index + after

# so the `starts` of c(1, 0) end up being in decreasing order, which isn't
# allowed for `slide_index()`
starts
#> [1] 1 0
stops
#> [1] 2 2
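A quick way to check hand-built boundaries for this problem is base R's is.unsorted(). This is only a sketch of a possible pre-check using the vectors from the example above; it is not meant to describe what slider does internally.

```r
index <- c(1, 1)
before <- c(0, 1)
after <- c(1, 1)

starts <- index - before
stops <- index + after

# TRUE means the vector is not in ascending order
is.unsorted(starts)
#> [1] TRUE
is.unsorted(stops)
#> [1] FALSE
```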

Thanks a lot for the fast and elaborate reply!

With slide() I could use .complete to control whether the lm is also calculated when the window is not complete. Is calculating on incomplete windows also the default in hop()?

I think this resolves the issue I filed. I hope you don't mind me asking further questions about my use-case below. Please let me know if I should file a new issue instead, or post my question on Stack Overflow or somewhere similar.

I'm running into serious crashing issues, likely due to missing data and my not knowing how to handle it. It might just be that I'm running out of memory, as I'm applying a rolling regression on a grouped (3 groups) tibble with >20k rows and 207 columns, with many missing values in the desired x and y variables. The goal is to apply a rolling calibration of standard measurements to the samples that are interspersed with the standards. If that's okay, I'll report back here with a reprex once I can narrow the crash down to a reproducible example and find the time to make one. Feel free to close the issue if you want.

You can't specify .complete with hop() because you have full control over the start and stop indices (i.e. they aren't generated on the fly like with slide() and slide_index()), so it wouldn't make sense to declare that some are complete and others are not. You can try to filter out the ones that don't meet your "complete" criteria before calling hop() if you need that. Or your .f function could return NULL if the data that it receives doesn't meet some minimum size criteria (e.g. if there aren't 10 rows of data, don't run the regression).
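Both options could look roughly like the sketch below. It assumes the starts/stops construction from earlier in the thread; fit_if_enough() and its min_rows argument are hypothetical names introduced here for illustration, and the threshold of 3 rows is arbitrary.

```r
library(slider)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dat <- data.frame(
  a = 1:10,
  c = rnorm(10),
  w = c(1, 1, 2, 2, 2, 1, 1, 1, 1, 1)
)
starts <- seq_len(nrow(dat)) - dat$w
stops <- seq_len(nrow(dat)) + dat$w

# Option 1: keep only windows that lie fully inside the data, then call hop()
complete <- starts >= 1 & stops <= nrow(dat)
fits_complete <- hop(dat, starts[complete], stops[complete],
                     ~ lm(a ~ c, data = .x))

# Option 2: let the function return NULL when the window is too small
# (hypothetical helper; `min_rows` is an illustrative threshold)
fit_if_enough <- function(d, min_rows) {
  if (nrow(d) < min_rows) {
    return(NULL)
  }
  lm(a ~ c, data = d)
}
fits <- hop(dat, starts, stops, fit_if_enough, min_rows = 3)
```

Option 1 returns a shorter list (one element per complete window), while option 2 keeps one element per row and marks the skipped windows with NULL, which is easier to store back in a list-column.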

I doubt that the crash is slider-specific; it may just be that your function can't handle the missing data very well. If you can come up with a smallish reprex that suggests that slider is the problem, feel free to open another issue.