Feature request: add downsampling and upsampling
jjesusfilho opened this issue · 10 comments
I love this package. It is very helpful for preparing data for mixed-model analysis and for cross-validation. It would be great if we could also balance nested data by downsampling or upsampling before running the model. Anyway, thank you for the great work you have done.
Hi @jjesusfilho
Thanks a lot for your comment and request :)
It would definitely make sense to add these features to the package. It should be reasonably easy to implement as separate functions.
Will try to find time for it, though it could be a couple of months.
Please let me know if you have other ideas :)
@jjesusfilho
It seems like caret has functions for this: https://rdrr.io/cran/caret/man/downSample.html
I haven't tried it, but it seems like it should do what you want, right?
Otherwise, let me know!
@jjesusfilho
If you have time, please check out the new functions balance(), upsample() and downsample() and give me some comments on them before I add them to the manual, etc. :)
balance() is the main function; it uses up- and downsampling to fix the group sizes to either a specific number (e.g. 3 rows per group) or to the min, max, mean or median group size in the dataset.
upsample() is a wrapper with size="max", and downsample() is a wrapper with size="min".
Please let me know if this is the kind of thing you wanted, and whether you have ideas for improvement :)
Btw. they work in dplyr pipelines as well, so you can do something like the following:
```r
data %>%
  balance("min", "condition") %>%
  fold(...)
```
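For intuition, the core of the size = "min" (downsampling) behavior can be sketched in base R. This is only an illustrative sketch under my own naming, not the package's actual implementation:

```r
# Illustrative sketch (NOT groupdata2's internals): downsample every
# category to the size of the smallest category.
downsample_sketch <- function(data, cat_col) {
  sizes <- table(data[[cat_col]])
  target <- min(sizes)  # size = "min"
  keep <- unlist(lapply(names(sizes), function(lvl) {
    idx <- which(data[[cat_col]] == lvl)
    idx[sample.int(length(idx), target)]  # random 'target' rows per category
  }))
  data[keep, , drop = FALSE]
}

df <- data.frame(condition = c("a", "a", "a", "b", "b"), score = 1:5)
balanced <- downsample_sketch(df, "condition")
table(balanced$condition)  # 2 rows per condition
```

The size = "max" case works the same way, except rows are added (sampled with replacement) until every category reaches the largest category's size.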
@LudvigOlsen, thank you very much for quickly addressing the request. When I suggested the features it was because I found that your package is better suited than caret to deal with clustered data.
The caret package has these functions and other great preprocessing ones, but it doesn't work with mixed models. The purpose of downsampling or upsampling in caret is to balance the data based on the categorical response variable before running the model.
Your partition function is great because we can split the data based on both the response variable (cat_col) and the grouping variable (id_col). I thought that your package could have functions to also balance the response variable while respecting the grouping variable (id_col). It seems to me that these new functions don't do that.
Usually I do something like this:
```r
df <- data.frame(
  grouping = c(1, 2, 2, 3, 3, 3),
  response = c("no", "yes", "yes", "yes", "yes", "yes"),
  x1 = c("a", "b", "c", "b", "b", "a"),
  x2 = c(1.4, 5.4, 3.2, 4, 6, 2)
)

# All rows belonging to the smallest response class
df_1 <- df %>%
  dplyr::count(response) %>%
  dplyr::arrange(n) %>%
  dplyr::slice(1) %>%
  dplyr::left_join(df) %>%
  dplyr::select(-n)

# Sample grouping IDs from the remaining (majority-class) rows
sampling <- df %>%
  dplyr::anti_join(df_1) %>%
  dplyr::select(grouping) %>%
  dplyr::pull(grouping) %>%
  sample(nrow(df_1))

# Keep all rows of the sampled groups and combine with the minority class
df_2 <- df[which(is.element(df$grouping, sampling)), ]
balanced_df <- dplyr::bind_rows(df_1, df_2)
```
@jjesusfilho
I did think about balancing on the id_col, but couldn't figure out the best way to do so (didn't try that hard though). I like the idea, and it seems that there are multiple ways to do it, which could be useful in different scenarios.
It seems that your way removes some levels of the grouping variable entirely and leaves the others untouched. Another way could be to make sure that an equal number of responses is removed from each group.
In situations where you need the entire recording of a group - e.g. if it is a participant and you want to use the time/trial column in the model - your way would be a good choice.
With upsampling I also see the usefulness of a method that doesn't sample with replacement but simply repeats the data (say you have to increase a group 4.3 times: you would repeat the data 4 times and then sample the remaining 0.3). Not sure I would use it myself, but I might as well add the possibility to do it.
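That repeat-then-sample variant could be sketched in base R like this (illustrative only; the function name and the explicit target argument are made up for the example):

```r
# Sketch: upsample a group's rows to 'target' rows by repeating the
# whole group floor(target / n) times and randomly sampling the
# fractional remainder, instead of sampling with replacement.
upsample_repeat <- function(group_rows, target) {
  n <- nrow(group_rows)
  n_full <- target %/% n  # whole repetitions of the group
  n_rest <- target %% n   # leftover rows to sample without replacement
  full <- group_rows[rep(seq_len(n), n_full), , drop = FALSE]
  rest <- group_rows[sample.int(n, n_rest), , drop = FALSE]
  rbind(full, rest)
}

g <- data.frame(id = 1:3, x = c(10, 20, 30))
out <- upsample_repeat(g, 13)  # 13 / 3 ≈ 4.3: repeat 4 times, sample 1
nrow(out)  # 13
```

Every original row appears at least floor(target / n) times, so no row is over-represented by more than one extra copy.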
Let me think about it for some time. I like the idea I implemented in balance(), where it's not necessarily either up- or downsampling, but a mixture. And in that case, finding the best implementation of various potential methods might take a day or two :)
@LudvigOlsen, my feature request was based on the need to keep the entire recording of a group as you said. This is because I work with judicial decisions on lawsuits, so I can get rid of some decisions as long as I keep all the nested variables in the remaining ones.
Court decisions in criminal cases are very imbalanced (lots of "no"s and few "yes"s), so balancing is crucial: most models use a default threshold of 0.50 and end up classifying cases into the most frequent class, while researchers are usually more interested in predicting the least frequent class. As a result, accuracy and the true positive rate suffer. I use both upsampling and downsampling, though I prefer downsampling.
After reading your responses, I realized that balancing multilevel data is harder than I initially thought. Anyway, I am glad that you got excited about addressing this issue because it will help a lot of people challenged with this kind of data.
@jjesusfilho
Thanks for motivating the method with some context. It really helps me understand the needs.
In your case, would you prefer:
- Downsampling to the same number of IDs. I.e., if there are 3 IDs in the smallest group, downsampling would simply pick 3 IDs (with all their rows) from each group.
- Downsampling that respects the IDs but takes into account the number of rows per ID and tries to match the total number of rows in the smallest group. This could mean that some groups have more or fewer IDs than the smallest group. (Not sure how to implement this one yet, but conceptually.)
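The first option could be sketched in base R roughly as follows (illustrative only; the function name is made up, and IDs are assumed not to be shared across categories):

```r
# Sketch: downsample each category to the number of IDs in the
# smallest category, keeping ALL rows of each picked ID intact.
downsample_n_ids <- function(data, cat_col, id_col) {
  ids_per_cat <- tapply(data[[id_col]], data[[cat_col]], unique)
  target <- min(lengths(ids_per_cat))  # ID count in the smallest category
  keep <- unlist(lapply(ids_per_cat, function(ids) {
    ids[sample.int(length(ids), target)]  # pick 'target' whole IDs
  }))
  data[data[[id_col]] %in% keep, , drop = FALSE]
}

df <- data.frame(grouping = c(1, 2, 2, 3, 3, 3),
                 response = c("no", "yes", "yes", "yes", "yes", "yes"))
res <- downsample_n_ids(df, "response", "grouping")
# Each response class now contains exactly one grouping ID,
# though their row counts may still differ.
```

Note that the categories end up with the same number of IDs but possibly different numbers of rows, which is exactly the trade-off between the two options above.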
@jjesusfilho
Please check out the added id_methods in balance() :)
I think the "n_ids" method is the one you're using. Otherwise, let me know :)
I might add some more methods at some point.
So far so good. I used balance() with my dataset and both up- and downsampling worked like a charm. Thank you very much for working this out. I will promote your package in my R community.
Cheers.
@jjesusfilho
Great! :)
I will find time to add it to the documentation.
I have some more things I want to take a look at before uploading it to CRAN, so it might only be in the GitHub version for a while.