Example/Help with Dealing with MultiLabel + Stacked Classifier Case
DMTSource opened this issue · 7 comments
The docs show an example and explanation of working with stacked classifiers and how to use attr_dict + predict_proba to avoid overfitting. I am attempting to implement this use of predict_proba for a multilabel classification model, but I am facing some challenges with this upgrade.
In the docs, the issue is fixed with the drop_first_col lambda, which handles the predict_proba output for single-label inference. I have created a similar lambda that does the same task for each class (see the example predict_proba output below).
multilabel_proba_reduced = Lambda(lambda prpr: np.array([prpr[i][:, 1:].flatten() for i in range(n_classes)]).T)
Which returns (n_samples, n_classes) from the predict_proba operation.
After I got the model to work in training, the prediction step now throws errors. I think I'm close, but my solution, similar to the doc example's 'drop_first_col' lambda, is hidden inside the overridden 'fit_predict' function. This is why I am guessing things break at predict time: the operation is not in the graph outside of fitting. When I attempted to fix this with a lambda as in the example, I ran into many issues trying to get things right for the training step, and went in circles.
To illustrate the primary difference, we get the 2 probabilities for each class, so a y sample takes the form:
y_test[0] == [1 1 0]
y_test[1] == [0 1 0]
Then the output from predict_proba takes on the form:
[
 array([[ 0.46147748, 0.53852252],
        [ 0.46147748, 0.53852252],
        [ 0.52721207, 0.47278793]]),
 array([[ 0.55917461, 0.44082539],
        [ 0.44082539, 0.55917461],
        [ 0.50852903, 0.49147097]])
]
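To make the shapes concrete, here is a minimal sketch that applies the reduction from the lambda above to the sample output shown. The variable names (`proba_per_label`, `reduced`, `reduced_alt`) are only illustrative, and the `np.column_stack` form is offered as an equivalent alternative, not as what the docs prescribe.

```python
import numpy as np

# predict_proba output copied from the example above: one (n_samples, 2)
# array per label, each row holding [P(label=0), P(label=1)].
proba_per_label = [
    np.array([[0.46147748, 0.53852252],
              [0.46147748, 0.53852252],
              [0.52721207, 0.47278793]]),
    np.array([[0.55917461, 0.44082539],
              [0.44082539, 0.55917461],
              [0.50852903, 0.49147097]]),
]
n_labels = len(proba_per_label)

# Same reduction as the lambda: keep only the second column of each array,
# then transpose so rows are samples and columns are labels.
reduced = np.array(
    [proba_per_label[i][:, 1:].flatten() for i in range(n_labels)]
).T

# Equivalent form without the flatten/transpose round trip.
reduced_alt = np.column_stack([p[:, 1:] for p in proba_per_label])

print(reduced.shape)  # (3, 2), i.e. (n_samples, n_labels)
```

Dropping one of the two columns loses no information here, since each row of a binary predict_proba array sums to 1.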
Describe the solution you'd like
I have a script that runs but crashes at the predict step after training; the error is shared in a comment below the code. Any suggestions for getting past my confusion about tying together the first- and second-level classifiers would be very much appreciated!
PLEASE SEE THE FULL EXAMPLE CODE HERE
https://gist.github.com/DMTSource/368b09e2c7f780f1355606f6e716d197
Terminal error here:
https://gist.github.com/DMTSource/368b09e2c7f780f1355606f6e716d197#gistcomment-3677803
There is a lot going on in this script, and it will take me a bit to look at it in detail. At first glance I don't even understand why the fit step is succeeding in the first place. I have my hands a bit full at the moment; I'll get back to you, possibly over the weekend.
Sorry about the script; it does appear to be a bit of a mess. I will try to clean it up and rephrase some things to hopefully save you some time:
In the above script I had to throw in a bunch of column stacks to get the outputs of RandomForestClassifier to work, due to errors with the multilabel output. But as you touched on, magically getting fit to work in this way did not really solve the problem.
Please ignore the first script. Here is the same process/attempt, but applied as closely as I could to the "Stacked classifiers (standard protocol)" example so it's easier to follow what I am trying to do, i.e. work with multilabel outputs via the predict_proba route.
https://gist.github.com/DMTSource/26b6a386a6ba54f23d0ae0a9d22ddbfa
I think my big mistake (as mentioned before) is that instead of using a Lambda function in the graph, I do the multilabel extraction (leaving out 1 of the 2 values that sum to 1) inside the 'fit_predict' operation. This probably means that when it's time to predict (where the crash occurs) there is no operation in the graph to perform this reduction, and we see a shape error of some kind. But I am having trouble using Lambda in this manner, as the first layer of classifiers then complains about its outputs being the wrong shape due to the multilabel output.
Thank you for simplifying the script!
Yes, you are right. The output of fit_predict (a single array) and the output of predict (a list of arrays) are not consistent. During fit, fit_predict (with the custom stacking of multi-output probabilities) is used, so ColumnStack does not complain because it only receives a single array. But during predict, predict_proba is used instead, and that returns a list of several arrays, which conflicts with the expected number of outputs (just one).
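The mismatch can be seen with scikit-learn alone. The following is a small sketch on made-up random data (the dataset, sizes, and variable names are assumptions for illustration, and baikal is not involved): for a multi-output target, predict_proba returns a list with one array per label, while the stacked form is a single 2-D array.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up multilabel data: 100 samples, 5 features, 2 binary labels.
rng = np.random.RandomState(0)
X = rng.random_sample((100, 5))
y = rng.randint(0, 2, (100, 2))

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# What the step produces at predict time: a list of per-label arrays.
proba = clf.predict_proba(X)
print(type(proba).__name__, len(proba))

# What the custom fit_predict produced at fit time: one stacked array.
stacked = np.column_stack([p[:, 1:] for p in proba])
print(type(stacked).__name__, stacked.shape)
```

A downstream step that was wired for one array therefore sees a different number of outputs at predict time than it saw during fit.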
I guess there are other ways if you play around somehow with Lambdas or ColumnStacks, but I think the easiest way is to override the API of RandomForestClassifier:
import numpy as np
import sklearn.ensemble
import sklearn.linear_model
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split
from baikal import Input, Model, make_step
from baikal.plot import plot_model
from baikal.steps import ColumnStack

def stack_multioutput_proba(mop):
    # mop: list with one (n_samples, 2) probability array per label
    return np.column_stack([c[:, 1:] for c in mop])  # or :-1 if you prefer to drop the last


def predict_proba_stacked(self, X):
    # NOTE: plain super() does not work because this function is defined outside a class body
    mop = super(RandomForestClassifier, self).predict_proba(X)
    return stack_multioutput_proba(mop)


def fit_predict(self, X, y):
    self.fit(X, y)
    cvp = cross_val_predict(self, X, y, method="predict_proba")  # note that this is NOT predict_proba_stacked
    return stack_multioutput_proba(cvp)

attr_dict = {"fit_predict": fit_predict, "predict_proba_stacked": predict_proba_stacked}
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier, attr_dict)
ExtraTreesClassifier = make_step(sklearn.ensemble.ExtraTreesClassifier)
# ------- Random multilabel dataset
np.random.set_state(np.random.RandomState(0).get_state())
X = np.random.random((1000, 50))  # feature array
y_p = np.random.randint(0, 2, (1000, 2))  # multilabel array, e.g. sample [1 1] means both classes detected
X_train, X_test, y_train, y_test = train_test_split(
    X, y_p, test_size=0.2, random_state=0
)
# ------- Build model
x = Input()
y_t = Input()
y_p1 = RandomForestClassifier(random_state=0)(x, y_t, compute_func="predict_proba_stacked")
y_p2 = RandomForestClassifier(random_state=0)(x, y_t, compute_func="predict_proba_stacked")
stacked_features = ColumnStack()([y_p1, y_p2])
y_p = ExtraTreesClassifier(random_state=0)(stacked_features, y_t)
model = Model(x, y_p, y_t)
plot_model(model, filename="stacked_classifiers_standard.png", dpi=96)
# ------- Train model
model.fit(X_train, y_train)
# ------- Evaluate model
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print("F1 score on train data:", f1_score(y_train, y_train_pred, average=None))
print("F1 score on test data:", f1_score(y_test, y_test_pred, average=None))
Essentially, override the original API of RandomForestClassifier with something that is more easily handled by your application. In this case, note that instead of overriding predict_proba I created a separate predict_proba_stacked that stacks the outputs. This is because cross_val_predict (a function native to scikit-learn) expects the original predict_proba, which gives a list of outputs.
Also note that since I'm using super, perhaps it would be easier and more readable to just use the sub-classing style (inheriting from Step and the classifier class) instead of using make_step.
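As a rough illustration of why sub-classing reads more cleanly, here is a sketch that uses plain scikit-learn only. The class name `StackedProbaForest` is hypothetical, and the baikal Step base class is deliberately left out, so this shows the zero-argument super() call rather than a ready-made baikal step.

```python
import numpy as np
import sklearn.ensemble


class StackedProbaForest(sklearn.ensemble.RandomForestClassifier):
    """Hypothetical subclass; a baikal step would also inherit from Step."""

    def predict_proba_stacked(self, X):
        # Inside a class body, zero-argument super() resolves correctly.
        per_label = super().predict_proba(X)
        # Keep the positive-class column of each label's probabilities.
        return np.column_stack([p[:, 1:] for p in per_label])


# Made-up multilabel data, for illustration only.
rng = np.random.RandomState(0)
X = rng.random_sample((100, 5))
y = rng.randint(0, 2, (100, 2))

clf = StackedProbaForest(n_estimators=10, random_state=0).fit(X, y)
stacked = clf.predict_proba_stacked(X)
print(stacked.shape)
```

The original predict_proba stays untouched on the subclass, so cross_val_predict with method="predict_proba" would still receive the list of per-label arrays it expects.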
By the way, note that steps accept an n_outputs argument that is meant precisely for these cases. That argument allows you to specify the number of outputs you expect from the step (2 in this example). I haven't tried it, but if you specify n_outputs=2, you should be able to do it without overriding predict_proba and without stacking the outputs within fit_predict. fit_predict could just return the list of arrays, just like predict_proba, and the column stacking could then be done with Lambda and ColumnStack steps.
If either of the above solutions works and helps you achieve what you want, it would be nice to add a new example of this use case :)
For completeness, here is the same model implemented using n_outputs and Lambda steps. You can confirm that it produces the same results as the script above. On second thought, this style seems simpler.
import numpy as np
import sklearn.ensemble
import sklearn.linear_model
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split
from baikal import Input, Model, make_step
from baikal.plot import plot_model
from baikal.steps import ColumnStack, Lambda

def fit_predict(self, X, y):
    self.fit(X, y)
    # Returns a list with one probability array per label, like predict_proba
    return cross_val_predict(self, X, y, method="predict_proba")

attr_dict = {"fit_predict": fit_predict}
RandomForestClassifier = make_step(sklearn.ensemble.RandomForestClassifier, attr_dict)
ExtraTreesClassifier = make_step(sklearn.ensemble.ExtraTreesClassifier)
# ------- Random multilabel dataset
np.random.set_state(np.random.RandomState(0).get_state())
n_outputs = 2  # number of labels, i.e. number of arrays returned by predict_proba
X = np.random.random((1000, 50))  # feature array
# Each label is binary (0 or 1), so the upper bound is 2 regardless of n_outputs
y_p = np.random.randint(0, 2, (1000, n_outputs))  # multilabel array, e.g. sample [1 1] means both classes detected
X_train, X_test, y_train, y_test = train_test_split(
    X, y_p, test_size=0.2, random_state=0
)
# ------- Build model
# The model is built similarly as the naive case. The difference is that during fit
# baikal will detect and use the fit_predict method above.
x = Input()
y_t = Input()
y_p1 = RandomForestClassifier(random_state=0, n_outputs=n_outputs)(x, y_t, compute_func="predict_proba")
y_p2 = RandomForestClassifier(random_state=0, n_outputs=n_outputs)(x, y_t, compute_func="predict_proba")
stack_multioutput_proba = Lambda(lambda mop: np.column_stack([c[:, 1:] for c in mop]))
y_p1 = stack_multioutput_proba(y_p1)
y_p2 = stack_multioutput_proba(y_p2)
stacked_features = ColumnStack()([y_p1, y_p2])
y_p = ExtraTreesClassifier(random_state=0)(stacked_features, y_t)
model = Model(x, y_p, y_t)
plot_model(model, filename="stacked_classifiers_standard_2.png", dpi=96)
# ------- Train model
model.fit(X_train, y_train)
# ------- Evaluate model
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
print("F1 score on train data:", f1_score(y_train, y_train_pred, average=None))
print("F1 score on test data:", f1_score(y_test, y_test_pred, average=None))
The solutions above should solve the issue so I'll close this. If the issue is not solved feel free to reopen.