Design synthetic drift utilities
tms-bananaquit opened this issue · 2 comments
tms-bananaquit commented
Section 3 of Souza et al. 2020 gives a good summary of potential approaches.
The "floor" would be to include examples that use these methods to inject drift.
The "ceiling" would be to develop standalone utilities that make some of this work easier. Even if we don't build them, candidates such as pipeline-like objects are worth noting in case the code base reaches a point where replicating the examples "by hand" becomes annoying.
indialindsay commented
Comments from gitlab issue:
- Page 20 has a list. Having a utility with a couple dials to moderate these approaches in reasonable ways, e.g. "proportion of labels flip-flopped" might be useful.
- Convert ARFF format to something we can use? https://sites.google.com/view/uspdsrepository
- Forest covertype seems like a candidate for temporospatial data we could use as a test.
- General approaches for ensembling models of different ages?
- Conceptually, building an incremental learner on top of a "batch learner," so that the same tools and approach can be used, at least temporarily, in a context which is now recognized as streaming.
- Need to read through section 4 more thoroughly.
- Could consider adding random walk / Brownian noise as described in this paper. It is intended for time series data, but we should be able to adapt it for streaming / batch data.
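To make the "dials" idea and the random-walk idea above a bit more concrete, here is a minimal sketch using only the standard library. The function names (`flip_labels`, `add_random_walk`) and their parameters are hypothetical and not taken from Souza et al. or from any existing code in this repo. The random walk is a plain cumulative Gaussian walk, which may differ from what the referenced paper describes.

```python
import random


def flip_labels(labels, proportion, classes, seed=None):
    """Return a copy of `labels` with round(proportion * n) entries
    reassigned to a different class (hypothetical helper).

    Assumes `classes` holds at least two distinct values.
    """
    rng = random.Random(seed)
    n_flip = round(proportion * len(labels))
    # Pick distinct positions so exactly n_flip labels change.
    flip_idx = rng.sample(range(len(labels)), n_flip)
    flipped = list(labels)
    for i in flip_idx:
        alternatives = [c for c in classes if c != flipped[i]]
        flipped[i] = rng.choice(alternatives)
    return flipped


def add_random_walk(values, step_std, seed=None):
    """Add a cumulative Gaussian random walk to a numeric sequence
    (hypothetical helper; `step_std` is the per-step standard deviation).
    """
    rng = random.Random(seed)
    offset = 0.0
    drifted = []
    for v in values:
        offset += rng.gauss(0.0, step_std)  # the walk accumulates over time
        drifted.append(v + offset)
    return drifted


labels = [0, 1] * 50
noisy_labels = flip_labels(labels, proportion=0.25, classes=[0, 1], seed=0)
drifting_feature = add_random_walk([1.0] * 100, step_std=0.1, seed=0)
```

Here `proportion` and `step_std` are the "dials"; both functions return new lists and leave their inputs untouched, which should make them easy to chain.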
anmol-srivastava-mitre commented
Adding a note to myself about using toolz or the @curry decorator so that any function I develop can be passed along as part of a pipeline. (Just an idea):
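    from toolz import pipe  # pipe(data, f, g) computes g(f(data))

    def join_class_function(data):
        ...

    # swap_class_function would be defined the same way.
    list_of_fns = [join_class_function, swap_class_function]

    while not finished:
        data = pipe(data, *list_of_fns)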
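For reference, the same idea can be sketched without the toolz dependency. This is only a stand-in: `functools.partial` plays the role of `@curry` (binding the "dials" up front), and the small `pipe` helper below mirrors the argument order of `toolz.pipe`. The functions `join_classes` and `swap_classes` and their parameters are hypothetical placeholders, not the eventual implementations.

```python
from functools import partial, reduce


def pipe(data, *funcs):
    """Thread `data` through `funcs` left to right (data first, then functions)."""
    return reduce(lambda acc, fn: fn(acc), funcs, data)


def join_classes(labels, source, target):
    """Hypothetical drift step: merge class `source` into class `target`."""
    return [target if y == source else y for y in labels]


def swap_classes(labels, a, b):
    """Hypothetical drift step: exchange the labels of classes `a` and `b`."""
    return [b if y == a else a if y == b else y for y in labels]


# Bind the parameters first so every step takes only the data.
steps = [
    partial(join_classes, source=2, target=1),
    partial(swap_classes, a=0, b=1),
]

labels = [0, 1, 2, 2, 0]
drifted = pipe(labels, *steps)
print(drifted)  # [1, 0, 0, 0, 1]
```

With toolz installed, decorating the step functions with `@curry` and calling them with keyword arguments only would give the same kind of single-argument steps.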