Stratified sampling improvements for continuous variables and rare groups
The current implementation only works for variables that can reasonably be handled with a basic pandas groupby. In the future, we need to establish other schemes for stratifying continuous variables, and to address the current behavior where rare groups are by default assigned to the split with the highest proportion (usually train).
Good discussion here https://scottclowe.com/2016-03-19-stratified-regression-partitions/
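As a rough sketch of one possible scheme (not this package's API): bin the continuous variable into quantiles, split each bin at the cumulative split fractions, and draw a random split for each row of a rare group so that it does not always land in train. The function name `stratified_split` and the parameters `n_bins` and `min_group_size` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def stratified_split(values, fractions, n_bins=10, min_group_size=5, seed=0):
    """Assign each row to a split, stratifying on quantile bins of `values`.

    Hypothetical helper for illustration; `fractions` maps split name to
    its target proportion, e.g. {"train": 0.8, "valid": 0.1, "test": 0.1}.
    """
    rng = np.random.default_rng(seed)
    values = pd.Series(values)
    names = list(fractions)
    probs = np.array([fractions[name] for name in names], dtype=float)
    probs = probs / probs.sum()

    # Quantile bins turn the continuous variable into groups of similar size.
    bins = pd.qcut(values, q=n_bins, labels=False, duplicates="drop")

    assignment = np.empty(len(values), dtype=object)
    # `.indices` maps each bin label to the positions of its rows.
    for positions in values.groupby(bins).indices.values():
        positions = rng.permutation(positions)
        if len(positions) < min_group_size:
            # Rare group: draw a split per row in proportion to `fractions`,
            # instead of sending every row to the largest split.
            assignment[positions] = rng.choice(names, size=len(positions), p=probs)
        else:
            # Cut the shuffled group at the cumulative fractions.
            cuts = np.round(np.cumsum(probs)[:-1] * len(positions)).astype(int)
            for name, part in zip(names, np.split(positions, cuts)):
                assignment[part] = name
    return pd.Series(assignment, index=values.index, name="split")
```

Calling `stratified_split(y, {"train": 0.8, "valid": 0.1, "test": 0.1})` returns a Series of split names aligned with `y`. Random assignment of rare groups is only one option; rows with missing values are left unassigned in this sketch.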
Related: scikit-learn/scikit-learn#4757
Hi @zkurtz ! Thanks for making us aware, we really appreciate all contributions and it's something we'll look into some more :) Out of curiosity, how did you come across this issue?
It was actually by accident, either by Google search or GitHub search ... the first time I landed on this page I was mistakenly under the impression that it was a scikit-learn issue. Looks like an interesting project!
Thanks for dropping by! Feel free to take a look around as this package can help reduce a lot of the boilerplate code that comes from starting up deep learning workloads :)