DavisVaughan/ivs

`iv_between()` (and friends) variant that returns a vector the size of `haystack`


Like https://stackoverflow.com/questions/74874828/identify-intervals-where-a-given-vector-of-dates-occurs

i.e. you want to return TRUE if the interval haystack[i] contains any element of the needles vector. Another way to say it: if any of the needles fall between the bounds of haystack[i], return TRUE.

Right now it is the reverse: if needles[i] falls within any of the haystack intervals, return TRUE, which gives us something the size of needles.
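As a small illustration of the current behavior, here is a sketch with made-up numeric inputs (the values are assumptions chosen for the example, not taken from the Stack Overflow question). The result has one element per needle, regardless of how many haystack intervals there are.

```r
library(ivs)

# Two right-open intervals: [1, 3) and [5, 6)
haystack <- iv(c(1, 5), c(3, 6))

# Four needles, so the result has four elements
needles <- c(2, 3, 5, 10)

# TRUE where needles[i] falls inside at least one haystack interval
res <- iv_between(needles, haystack)
res
#> [1]  TRUE FALSE  TRUE FALSE
```

Note that 3 is not matched because the upper bound of [1, 3) is exclusive, while 5 is matched because the lower bound of [5, 6) is inclusive.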

This seems somewhat straightforward though. Use iv_locate_between() to get all matches, dropping unmatched needles with no_match = "drop", and then use the $haystack location column to identify the intervals that surround at least one date.

We should check whether this works in all cases, in particular with duplicates in both inputs and with missing values. Still, I think it is promising enough that we might not need a new variant. Maybe we can just add an example. That is better than yet another family of functions.

df1 <- data.frame(
  diveno = c(1, 2, 3, 4, 5),
  start = c("2018-08-01 08:20:40", "2018-08-01 08:40:50", "2018-08-01 10:01:00", "2018-08-01 15:45:30", "2018-08-01 17:06:00"),
  fin = c("2018-08-01 08:39:20", "2018-08-01 08:53:40", "2018-08-01 10:16:30", "2018-08-01 15:58:20", "2018-08-01 17:18:20")
)

df1$start <- as.POSIXct(df1$start, format = "%Y-%m-%d %H:%M:%S", tz = "CET")
df1$fin <- as.POSIXct(df1$fin, format = "%Y-%m-%d %H:%M:%S", tz = "CET")

df2 <- data.frame(date = c("2018-08-01 08:30:00", "2018-08-01 15:47:00", "2018-08-02 17:10:00"))
df2$date <- as.POSIXct(df2$date, format = "%Y-%m-%d %H:%M:%S", tz = "CET")

df1
#>   diveno               start                 fin
#> 1      1 2018-08-01 08:20:40 2018-08-01 08:39:20
#> 2      2 2018-08-01 08:40:50 2018-08-01 08:53:40
#> 3      3 2018-08-01 10:01:00 2018-08-01 10:16:30
#> 4      4 2018-08-01 15:45:30 2018-08-01 15:58:20
#> 5      5 2018-08-01 17:06:00 2018-08-01 17:18:20
df2
#>                  date
#> 1 2018-08-01 08:30:00
#> 2 2018-08-01 15:47:00
#> 3 2018-08-02 17:10:00

locs <- ivs::iv_locate_between(
  needles = df2$date, 
  haystack = ivs::iv(df1$start, df1$fin), 
  no_match = "drop"
)

df1$surrounds <- FALSE
df1$surrounds[locs$haystack] <- TRUE
df1
#>   diveno               start                 fin surrounds
#> 1      1 2018-08-01 08:20:40 2018-08-01 08:39:20      TRUE
#> 2      2 2018-08-01 08:40:50 2018-08-01 08:53:40     FALSE
#> 3      3 2018-08-01 10:01:00 2018-08-01 10:16:30     FALSE
#> 4      4 2018-08-01 15:45:30 2018-08-01 15:58:20      TRUE
#> 5      5 2018-08-01 17:06:00 2018-08-01 17:18:20     FALSE

Created on 2022-12-21 with reprex v2.0.2.9000
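To probe the duplicates concern, the same idea can be wrapped in a small helper and run on inputs with repeated needles and repeated intervals. This is only a sketch: `haystack_contains_any()` is a hypothetical name, not an ivs function, and the numeric inputs are made up. Missing values are not covered here.

```r
library(ivs)

# Hypothetical helper (not part of ivs): returns a logical vector the size of
# `haystack`, TRUE where haystack[i] contains at least one needle.
haystack_contains_any <- function(needles, haystack) {
  locs <- iv_locate_between(needles, haystack, no_match = "drop")
  out <- logical(vctrs::vec_size(haystack))
  # Repeated locations are harmless, they just set the same element again
  out[locs$haystack] <- TRUE
  out
}

# Duplicated needle (2) and duplicated interval [1, 3)
needles <- c(2, 2, 10)
haystack <- iv(c(1, 1, 5), c(3, 3, 6))

res <- haystack_contains_any(needles, haystack)
res
#> [1]  TRUE  TRUE FALSE
```

Each needle equal to 2 matches both copies of [1, 3), so the location column contains repeats, but the final result still has exactly one element per haystack interval.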

Another one
https://stackoverflow.com/questions/75463306/count-the-number-of-timestamps-in-a-given-vector-that-fall-within-an-interval-in

Ideally it would look like this

library(dplyr)
library(ivs)
library(lubridate)

table <- tibble( 
  start = as.Date(c("2022-08-02", "2022-10-06", "2023-01-11")), 
  end = as.Date(c("2022-08-04", "2023-02-06", "2023-02-04"))
)

events <- c(
  ymd("2022-08-07"), 
  ymd("2022-10-17"), 
  ymd("2023-01-17"), 
  ymd("2023-02-02")
)

table %>%
  mutate(range = iv(start, end), .keep = "unused") %>%
  mutate(count = iv_count_between(range, events))
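In the meantime, the per-interval counts can probably be obtained from iv_locate_between() plus base R's tabulate(). The following is a sketch under that assumption, reusing the dates from the example above; the `ranges` and `counts` names are just for illustration.

```r
library(ivs)

ranges <- iv(
  as.Date(c("2022-08-02", "2022-10-06", "2023-01-11")),
  as.Date(c("2022-08-04", "2023-02-06", "2023-02-04"))
)

events <- as.Date(c("2022-08-07", "2022-10-17", "2023-01-17", "2023-02-02"))

# One row per (event, range) match; events that fall in no range are dropped
locs <- iv_locate_between(events, ranges, no_match = "drop")

# Count how often each range location appears, including zero counts
counts <- tabulate(locs$haystack, nbins = vctrs::vec_size(ranges))
counts
#> [1] 0 3 2
```

Setting `nbins` to the number of ranges keeps the result the size of the haystack even when the last ranges have no matches.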

Maybe the iv_*_between() functions just require that one of the two inputs should be an iv?

Deprecate iv_between() family in favor of two families like:

  • iv_within()
  • iv_contains()

Where iv_within(needles, haystack) replaces iv_between() but needles is allowed to be a vector or an iv.

And iv_contains() is new but allows the same thing for haystack.

Both are nice because they are named after the type options in iv_overlaps(type =)


iv_within() needs to use the comparisons c(>=, <) for vector needles and c(>=, <=) for iv needles, since intervals are right-open.
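The difference between the two sets of comparisons can be sketched in base R. The helpers below are hypothetical and exist only to spell out the boundary rules for right-open intervals [start, end); they are not how ivs implements anything.

```r
# Hypothetical helpers, for illustration only (not ivs API).

# Vector needle within [start, end): start <= needle < end
point_within <- function(needle, start, end) {
  needle >= start & needle < end
}

# iv needle [needle_start, needle_end) within [start, end):
# start <= needle_start and needle_end <= end
interval_within <- function(needle_start, needle_end, start, end) {
  needle_start >= start & needle_end <= end
}

# Points 1, 2, 3 against [1, 3): the upper bound itself is excluded
pts <- point_within(c(1, 2, 3), start = 1, end = 3)
pts
#> [1]  TRUE  TRUE FALSE

# Intervals [1, 3) and [2, 4) against [1, 3): equal upper bounds are allowed
ivl <- interval_within(c(1, 2), c(3, 4), start = 1, end = 3)
ivl
#> [1]  TRUE FALSE
```

The upper comparison is `<=` for iv needles because the needle's own upper bound is exclusive, so a needle ending exactly at the haystack's upper bound still lies entirely inside it.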