Implement Ensemble for Streaming Detectors
Anmol-Srivastava opened this issue · 3 comments
Anmol-Srivastava commented
Task
Implement an ensemble, inheriting from Ensemble and StreamingDetector, which is meant for the streaming case.
Description TBD. Stems from #77
Random notes:
class StreamingEnsemble(Ensemble, StreamingDetector):
    def update(self):
        step_detector_1_forward()  # can use pipeline idea scrapped earlier
        if detector_1.reached_burn_in_point:
            pass  # do something
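A minimal, self-contained sketch of the notes above, to make the burn-in idea concrete. Everything here is a stand-in: ToyDetector, ToyStreamingEnsemble, the burn_in and threshold parameters, and the reached_burn_in_point property are hypothetical names for illustration and are not the project's Ensemble or StreamingDetector API. It assumes detectors that have not finished burn-in are simply left out of the vote, which is only one possible policy.

```python
class ToyDetector:
    """Hypothetical stand-in for a streaming detector with a burn-in period."""

    def __init__(self, burn_in, threshold):
        self.burn_in = burn_in
        self.threshold = threshold
        self.samples_seen = 0
        self.drift_state = None

    @property
    def reached_burn_in_point(self):
        return self.samples_seen >= self.burn_in

    def update(self, value):
        self.samples_seen += 1
        # Only raise alarms once the burn-in period is over.
        if self.reached_burn_in_point and value > self.threshold:
            self.drift_state = "drift"
        else:
            self.drift_state = None


class ToyStreamingEnsemble:
    """Hypothetical ensemble: steps each detector forward, then votes."""

    def __init__(self, detectors):
        self.detectors = detectors  # dict of name -> detector
        self.drift_state = None

    def update(self, value):
        for det in self.detectors.values():
            det.update(value)
        # Assumption: detectors still in burn-in do not take part in the vote.
        ready = [d for d in self.detectors.values() if d.reached_burn_in_point]
        votes = sum(d.drift_state == "drift" for d in ready)
        self.drift_state = "drift" if ready and votes > len(ready) / 2 else None


ensemble = ToyStreamingEnsemble(
    {
        "fast": ToyDetector(burn_in=2, threshold=5),
        "slow": ToyDetector(burn_in=4, threshold=5),
    }
)
states = []
for value in [1, 9, 9, 1, 9]:
    ensemble.update(value)
    states.append(ensemble.drift_state)
```

With these toy settings the ensemble alarms on the second sample using only the "fast" detector, since "slow" is still in burn-in. Whether that behavior is desirable is exactly the open "out of sync" question.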
Anmol-Srivastava commented
Here's the current look:
import numpy as np
df = fetch_rainfall_data()[['temperature', 'dew_point', 'sea_level_pressure']]
y_trues = fetch_rainfall_data().rain
k1 = KdqTreeStreaming(window_size=10, bootstrap_samples=1) # only works with >1 X features
p1 = PCACD(window_size=10, sample_period=0.1) # have to set step to integer
d1 = DDM(n_threshold=10)
e1 = EDDM(n_threshold=10)
l1 = LinearFourRates(burn_in=2, num_mc=1)
m1 = MD3 # TODO - no idea how to include this one
s1 = STEPD(window_size=10)
a1 = ADWINOutcome() # default threshold sizes are too large
detectors = {
    'k1': k1, 'p1': p1,  # data drift
    'd1': d1, 'e1': e1, 'l1': l1, 's1': s1, 'a1': a1,  # concept drift
    # change detection
}
s = StreamingEnsemble(detectors, 'simple-majority')
for i in range(21):
    s.update(X=df.iloc[[i]], y_true=y_trues[i], y_pred=np.random.randint(0, 2))
It runs! The following issues remain for me to look at:
- add MD3 to this example (can cheat off of concept drift examples)
- use a large enough # of rows such that burn-ins, thresholds come into play (and sketch logic to handle that)
- should change detectors be in here? many are univariate, which may mean some internal parsing in the ensemble code, since we're passing in entire rows above?
Tagging @tms-bananaquit for initial thoughts
tms-bananaquit commented
- Definitely need to think about MD3, but I'm not sure we should commit to fixing it with this issue. One problem is that it's supposed to loop calling give_oracle_label rather than update once it hits a warning state, so if we do handle it in the streaming ensemble, it's sort of a commitment to a particular structure for semi-supervised detectors which we may have to back out later. On the set_reference problem, at least, that only needs to be explicitly called after init.
- I think the extra logic on the burn-in and similar may need to come in with fancier options for the evaluators, at least to encompass the two approaches noted in issue 25, which are a) "if any detector enters a drift state, reset and retrain all detectors" and b) "wait for a second detector to confirm before having the ensemble alarm".
  - Neither of these specifically accounts for "out of sync" burn-in. I think having these on comparable resolutions is something we could impose on the "outside" as part of testing, but I'm not sure about a solution right now, or whether it's even desirable to bake one into e.g. validation.
- Re: change detectors, I think you already solved this problem in the Ensemble superclass, with the columns argument. There might be some way to make it more convenient to have null entries where it's not applicable, and fork on the input:
  - PCACD and KdqTreeStreaming probably would respond to null by operating on the entire X
  - concept drift detectors just ignore X entirely, so presumably non-null columns entries would also be ignored?
- I think the note on the diversity of drift detectors from #25 is something we can incorporate into examples, rather than explicitly prohibiting the user from combining certain detectors one way or the other.
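The two evaluator approaches from issue 25 mentioned above could be sketched roughly as follows. This is an illustration under assumptions, not the project's evaluator API: the function names, the min_votes parameter, and the 'any' and 'confirmed' registry keys are hypothetical, and only the 'simple-majority' key appears in the example earlier in this thread. The sketch assumes each detector's state has already been reduced to a boolean "in drift" flag, and it leaves the reset-and-retrain step of approach (a) to the caller.

```python
def any_detector(drift_flags):
    """Approach (a): alarm as soon as any single detector reports drift."""
    return any(drift_flags.values())


def confirmed(drift_flags, min_votes=2):
    """Approach (b): alarm only once at least `min_votes` detectors agree."""
    return sum(drift_flags.values()) >= min_votes


def simple_majority(drift_flags):
    """Alarm when more than half of the detectors report drift."""
    return sum(drift_flags.values()) > len(drift_flags) / 2


# Hypothetical registry, so an ensemble could look evaluators up by name.
EVALUATORS = {
    "any": any_detector,
    "confirmed": confirmed,
    "simple-majority": simple_majority,
}

# One detector out of three is in drift.
flags = {"k1": True, "d1": False, "s1": False}
results = {name: evaluate(flags) for name, evaluate in EVALUATORS.items()}
```

Here only the "any" evaluator alarms. Neither function knows about burn-in, so an out-of-sync detector would simply contribute a False flag until it is ready.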
Anmol-Srivastava commented
@tms-bananaquit Ok, so I've decided then to ignore MD3, and add the evaluators (I may discuss those with you separately) without thinking much of burn-in for this issue.
Need a bit to wrap my head around the change detectors bit, working on that now.
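For the change detectors question, one way to read the "null entries in columns" suggestion is sketched below. It is a guess at the shape of the logic, not how the Ensemble superclass actually handles its columns argument: select_input, column_selectors, and the "c1" detector name are hypothetical, and a plain dict stands in for a one-row DataFrame to keep the example self-contained.

```python
def select_input(row, columns=None):
    """Return the part of `row` that a detector should see.

    A null entry (columns=None) means "use the whole row"; otherwise only
    the listed columns are kept.
    """
    if columns is None:
        return dict(row)
    return {name: row[name] for name in columns}


# Hypothetical per-detector column selections.
column_selectors = {
    "p1": None,             # multivariate detector: operates on all of X
    "c1": ["temperature"],  # univariate change detector: a single column
}

row = {"temperature": 21.5, "dew_point": 12.0, "sea_level_pressure": 1013.2}
routed = {name: select_input(row, cols) for name, cols in column_selectors.items()}
```

Concept drift detectors would not need an entry at all under this reading, since they ignore X and only consume y_true and y_pred.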