jdurbin/wekaMine

Support incremental runs.


Need support for incremental cluster runs. For example, I would like to do a quick pass over a data set where I build only one type of model, say linear models. Then I would like to go back over that data set to consider a wider range of algorithms/parameters. Currently I have to craft one config file, do the run, then craft an orthogonal config file containing only the new stuff. At minimum I'd like to be able to add stuff to the config file incrementally and have wekaMine figure out what has already been evaluated and what has not.
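A minimal sketch of the "figure out what has already been evaluated" step, assuming each experiment in the config can be represented as a dict of algorithm name plus parameters. The dict layout and the names `experiment_key` and `pending_experiments` are hypothetical and not part of wekaMine; the real config format would need its own parser.

```python
import json


def experiment_key(spec):
    """Return a canonical string key for one experiment spec.

    Sorting the keys means two specs with the same settings in a
    different order map to the same key.
    """
    return json.dumps(spec, sort_keys=True)


def pending_experiments(config_specs, completed_keys):
    """Return the specs from the config that have no saved result yet.

    config_specs: list of dicts read from the (grown) config file.
    completed_keys: iterable of keys saved from earlier runs.
    Duplicate specs inside the config are only scheduled once.
    """
    seen = set(completed_keys)
    pending = []
    for spec in config_specs:
        key = experiment_key(spec)
        if key not in seen:
            seen.add(key)
            pending.append(spec)
    return pending
```

With something like this, the config file only ever grows, and each run evaluates just the entries whose keys are missing from the saved results.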

Note: This conflicts with the feature of putting an eval loop on the outside of the model selection loop.

If we have this improved model evaluation:

for each CV fold Ci
    for each model
        evaluate over CV folds of Ci -> ROClist
    best model = argmax(ROClist) -> best model list
best overall model = argmax(best model list)
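The loop above could be sketched roughly as follows. This is an illustration only: `select_models` and the `evaluate(model, fold)` callback are hypothetical names, and the callback is assumed to return a single ROC score for a model on one outer fold (for instance from an inner cross-validation).

```python
def select_models(outer_folds, models, evaluate):
    """Pick the best model per outer fold, then the best overall.

    evaluate(model, fold) is assumed to return one ROC score.
    Returns (best_per_fold, best_overall), where each entry is a
    (fold_index, model, score) tuple.
    """
    best_per_fold = []
    for fold_index, fold in enumerate(outer_folds):
        # ROC score for every candidate model on this fold.
        roc_list = [(evaluate(model, fold), model) for model in models]
        best_score, best_model = max(roc_list, key=lambda pair: pair[0])
        best_per_fold.append((fold_index, best_model, best_score))
    # Best entry across all per-fold winners.
    best_overall = max(best_per_fold, key=lambda entry: entry[2])
    return best_per_fold, best_overall
```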

Maybe it's not too difficult. If we save the CV folds Ci, we could do a run where we evaluate each of, say, the linear models over each CV fold Ci. The results are put into a best model list that is annotated with the CV fold number i. If we want to build and use these intermediate models, we just pull out the best model over all CV folds so far. If we add some more models to the evaluation, we also run those over each CV fold and add their results to the best model list.

generate CV folds Ci -> save
for each CV fold Ci
    for each model in set1
        evaluate over CV fold Ci -> roclist_set1
    best temporary model = argmax(roclist_set1) -> besttemporarymodellist  // this would be a mere intermediate result.
bestoveralltemporarymodel = argmax(besttemporarymodellist)

// Now add set2 models and repeat...
for each CV fold Ci
    for each model in set2
        evaluate over CV fold Ci -> roclist_set2
    roclist = roclist_set1 + roclist_set2
    best model = argmax(roclist) -> bestmodellist
bestoverall model = argmax(bestmodellist)
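A rough sketch of the two-pass scheme above, assuming the per-fold ROC scores are persisted to a JSON file between runs and that the saved CV folds can be referred to by id. The file layout and the names `load_scores`, `incremental_run`, and `evaluate(model, fold_id)` are all hypothetical; none of this is existing wekaMine code.

```python
import json
import os


def load_scores(path):
    """Load saved scores as {fold_id: {model_name: roc}}, or {} if none."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return json.load(handle)


def incremental_run(path, fold_ids, models, evaluate):
    """Evaluate only (fold, model) pairs that have no saved score yet.

    evaluate(model, fold_id) is assumed to return one ROC score on the
    saved fold with that id. Returns (best_per_fold, best_overall),
    where each best is a (model_name, roc) tuple.
    """
    scores = load_scores(path)
    for fold_id in fold_ids:
        # JSON object keys are strings, so fold ids are stored as str.
        fold_scores = scores.setdefault(str(fold_id), {})
        for model in models:
            if model not in fold_scores:
                fold_scores[model] = evaluate(model, fold_id)
    with open(path, "w") as handle:
        json.dump(scores, handle, indent=2)
    # Merged roclist (earlier sets plus new set) decides each fold's winner.
    best_per_fold = {
        fold: max(fold_scores.items(), key=lambda item: item[1])
        for fold, fold_scores in scores.items()
    }
    best_overall = max(best_per_fold.values(), key=lambda item: item[1])
    return best_per_fold, best_overall
```

Calling `incremental_run` first with the set1 models and later with set1 plus set2 would reuse the saved set1 scores and only evaluate the set2 models, which is the behavior the pseudocode describes.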

My gut feeling is that this should be a separate script as it would clutter up the basic wekaMine script too much.