Stratified sampling improvements for continuous variables and rare groups

Question

Opened this issue 5 years ago · 5 comments

The current implementation only works for variables that can reasonably be handled with a basic pandas groupby. In the future, we need to establish other schemes for stratifying continuous variables, and to reconsider the current behavior in which rare groups are by default assigned to the split with the highest proportion (usually train).
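
To make the two gaps concrete, here is a minimal sketch of one possible approach, not this package's actual API. It assumes continuous variables are stratified by quantile-binning them first, and that groups below a size threshold are assigned to splits at random, weighted by the split fractions, rather than always landing in the largest split. The names `bin_continuous`, `stratified_assign`, and `rare_threshold` are hypothetical and used only for illustration.

```python
import numpy as np
import pandas as pd


def bin_continuous(values, n_bins=10):
    """Turn a continuous variable into quantile-bin labels usable as strata."""
    # duplicates="drop" merges bins whose edges coincide (heavily tied data).
    return pd.qcut(values, q=n_bins, labels=False, duplicates="drop")


def stratified_assign(strata, fractions=(0.8, 0.1, 0.1), rare_threshold=5, seed=0):
    """Return an integer split index (0, 1, ...) for every row.

    Hypothetical helper: `rare_threshold` is an assumed knob, not an
    existing option of this package.
    """
    rng = np.random.default_rng(seed)
    fractions = np.asarray(fractions, dtype=float)
    fractions = fractions / fractions.sum()

    # Positional integer code per row; one code per distinct stratum.
    codes, _ = pd.factorize(pd.Series(strata))
    assignment = np.empty(len(codes), dtype=int)

    for code in np.unique(codes):
        # Shuffle the row positions belonging to this stratum.
        members = rng.permutation(np.flatnonzero(codes == code))
        if len(members) < rare_threshold:
            # Rare group: draw a split per row with probability equal to
            # the split fraction, so it does not always end up in train.
            assignment[members] = rng.choice(
                len(fractions), size=len(members), p=fractions
            )
        else:
            # Regular group: cut the shuffled rows at the cumulative fractions.
            cuts = np.round(np.cumsum(fractions)[:-1] * len(members)).astype(int)
            for split_id, part in enumerate(np.split(members, cuts)):
                assignment[part] = split_id
    return assignment
```

For a continuous target `y`, `stratified_assign(bin_continuous(y))` would give each quantile bin roughly the requested train/validation/test proportions. The number of bins and the rare-group threshold are judgment calls that depend on the dataset size.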

Answer 1 · 2019-06-17T16:44:17.000Z

Good discussion here https://scottclowe.com/2016-03-19-stratified-regression-partitions/

Answer 2 · 2020-03-02T03:40:25.000Z

Related: scikit-learn/scikit-learn#4757

Answer 3 · 2020-03-04T01:54:03.000Z

Hi @zkurtz! Thanks for making us aware. We really appreciate all contributions, and this is something we'll look into some more :) Out of curiosity, how did you come across this issue?

Answer 4 · 2020-03-04T13:13:11.000Z

It was actually by accident, either by Google search or GitHub search ... the first time I landed on this page I was mistakenly under the impression that it was a scikit-learn issue. Looks like an interesting project!

Answer 5 · 2020-03-04T15:25:00.000Z

Thanks for dropping by! Feel free to take a look around as this package can help reduce a lot of the boilerplate code that comes from starting up deep learning workloads :)