Develop AdaBoost.R2
martin0258 opened this issue · 9 comments
Scope
This issue is for developing AdaBoost.R2.
Task
- Algorithm Pseudocode (from paper)
- Class Design
- Pseudocode Routine Design
- Developer Testing (consider related test in sklearn)
- @martin0258, debug differences between actual/expected predictions of lm, ⏰ Wed 22:00
- @martin0258, debug differences between actual/expected predictions of rpart
- @martin0258, check why there are 4 estimators when using linear regression in sklearn
- @martin0258, evaluate possibility of porting DecisionTreeRegressor to R
- @martin0258, add weighted sampling
- @martin0258, add square and exponential loss functions
- @martin0258, debug the warning from predict.lm
- @martin0258, consider using test data from paper
- @martin0258, add test on UCI concrete data, ⏰ Mon 12:00
- @martin0258, read WEKA ARFF data format, ⏰ Mon 14:00
- @martin0258, check M5P implementation in R, ⏰ Mon 12:00
- @martin0258, check training error, ⏰ Tue 15:00
- @martin0258, tune base learner (nnet) performance, ⏰ Tue 15:00
- @martin0258, check sklearn AdaBoost.R2 performance on UCI datasets
- @martin0258, add verbose (whether to print messages)
- @martin0258, add 25 instances from target data set into training data
Task Detail
debug differences between actual/expected predictions of lm
The power of base estimator is the key.
import numpy as np
from numpy.testing import assert_array_equal
from sklearn.ensemble import AdaBoostRegressor
from sklearn import linear_model
# Toy sample
X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
y_regr = [-1, -1, -1, 1, 1, 1]
T = [[-1, -1], [2, 2], [3, 2]]
y_t_regr = [-1, 1, 1]
base_estimator = linear_model.LinearRegression()
clf = AdaBoostRegressor(random_state=0, base_estimator=base_estimator)
clf.fit(X, y_regr)
prediction = clf.predict(T)
print('num of estimators:', len(clf.estimators_))
assert_array_equal(clf.predict(T), y_t_regr)
"""
Outputs of code above
-------------------------------
num of estimators: 4
(mismatch 100.0%)
x: array([-0.99999997, 1.99999994, 2.99999991])
y: array([-1, 1, 1])
"""
The average loss of the 1st iteration should be >= 0.5, so why are there 4 estimators in the end?
# expected avg loss of 1st iteration
base_estimator = linear_model.LinearRegression()
base_estimator.fit(X, y_regr)
error = np.abs(base_estimator.predict(X) - y_regr)
error_vector = error / error.max()
sample_weight = np.array([ 1./ 6 for i in range(len(X))])
avg_loss = (error_vector * sample_weight).sum()
print('avg loss:', avg_loss) # 0.55555555555555625
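The check above uses the linear loss. As a rough sketch of how the square and exponential losses from the task list would feed the same stopping check, assuming the usual AdaBoost.R2 definitions (each absolute error is scaled by the largest absolute error before the loss is applied). The names `average_loss` and `should_stop` are hypothetical helpers, not part of sklearn, and the predictions are made up:

```python
import numpy as np

def average_loss(y_true, y_pred, sample_weight, loss="linear"):
    """Weighted average loss of one boosting iteration (hypothetical helper)."""
    abs_err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    max_err = abs_err.max()
    if max_err == 0:
        return 0.0  # perfect fit, nothing to scale
    scaled = abs_err / max_err  # each value now lies in [0, 1]
    if loss == "linear":
        per_sample = scaled
    elif loss == "square":
        per_sample = scaled ** 2
    elif loss == "exponential":
        per_sample = 1.0 - np.exp(-scaled)
    else:
        raise ValueError("unknown loss: %r" % loss)
    w = np.asarray(sample_weight, dtype=float)
    return float(np.sum(w / w.sum() * per_sample))

def should_stop(avg_loss):
    """AdaBoost.R2 terminates early once the average loss reaches 0.5."""
    return avg_loss >= 0.5

# Made-up predictions, only to compare the three losses on the same errors.
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 0.5, 0.0, 0.0]
weights = [0.25] * 4
for name in ("linear", "square", "exponential"):
    value = average_loss(y_true, y_pred, weights, loss=name)
    print(name, value, should_stop(value))
```

For scaled errors strictly between 0 and 1 the square and exponential losses are smaller than the linear loss, so the choice of loss can change when early termination triggers.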
Task Detail
check why there are 4 estimators when using linear regression in sklearn
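Because sklearn does weighted sampling with replacement to build each training set, instead of training the base estimator with sample weights! The 1st estimator is therefore fit on a resampled set rather than on the full data, so its average loss need not match the value computed above.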
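A minimal sketch of what weighted sampling with replacement looks like, assuming NumPy. `weighted_bootstrap` is an illustrative name, not sklearn's internal routine:

```python
import numpy as np

def weighted_bootstrap(X, y, sample_weight, rng):
    """Draw len(X) rows with replacement, with probability proportional to weight."""
    X = np.asarray(X)
    y = np.asarray(y)
    p = np.asarray(sample_weight, dtype=float)
    p = p / p.sum()  # normalize weights into a probability distribution
    idx = rng.choice(len(X), size=len(X), replace=True, p=p)
    return X[idx], y[idx], idx

# Toy sample from the test above, with uniform weights as in the 1st iteration.
X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
y = [-1, -1, -1, 1, 1, 1]
rng = np.random.default_rng(0)
X_boot, y_boot, idx = weighted_bootstrap(X, y, np.ones(len(X)), rng)
print(idx)  # some rows typically repeat and others are left out
```

Even with uniform weights the resampled set usually differs from the original data, which is one way the 1st iteration can end up with a different loss than a fit on all six rows.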
Task Detail
add weighted sampling
Let's add a parameter that lets the user choose between:
- weighted sampling
- model built-in support for sample weighting
We assume a model supports sample weighting if it has a parameter named weights (e.g., lm, rpart, nnet). If a model does not have that parameter, we'll fall back to weighted sampling.
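The selection rule above can be sketched as follows. This is only an illustration in Python of the "has a weights parameter" check, with hypothetical names (`supports_weights`, `choose_weighting`, and the two toy fit functions); the real implementation would inspect the R model function instead:

```python
import inspect

def supports_weights(fit_func, param_name="weights"):
    """True if fit_func declares a parameter called param_name.

    Note: a function that only accepts **kwargs is reported as unsupported.
    """
    try:
        params = inspect.signature(fit_func).parameters
    except (TypeError, ValueError):
        return False  # signature not introspectable
    return param_name in params

def choose_weighting(fit_func, prefer_builtin=True, param_name="weights"):
    """Return 'builtin' when requested and supported, otherwise 'sampling'."""
    if prefer_builtin and supports_weights(fit_func, param_name):
        return "builtin"
    return "sampling"

# Hypothetical fit functions standing in for base learners.
def fit_with_weights(X, y, weights=None):
    return "model"

def fit_plain(X, y):
    return "model"

print(choose_weighting(fit_with_weights))                        # builtin
print(choose_weighting(fit_plain))                               # sampling
print(choose_weighting(fit_with_weights, prefer_builtin=False))  # sampling
```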
Task Detail
debug the warning from predict.lm
The command and its output, including the warning message, are shown below:
source('C:/Users/Martin Ku/Projects/GitHub/ntu-research/src/adaboostR2.R')
Early termination at iteration 4 because avg loss >= 0.5
Error: prediction not equal to test.y
Mean relative difference: 1
In addition: Warning message:
In predict.lm(predictor, data) :
prediction from a rank-deficient fit may be misleading
Keywords: collinearity, rank deficiency
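One way to test for the rank deficiency the warning points to is to compare the rank of the design matrix (with an intercept column) to its number of columns. A minimal sketch assuming NumPy; `is_rank_deficient` is a hypothetical helper, and the resampled rows are made up to show how a sample with few distinct rows could produce the condition. They are not taken from the actual run, so this is a possible cause to check, not a diagnosis:

```python
import numpy as np

def is_rank_deficient(X):
    """True if [1, X] has fewer independent columns than coefficients to fit."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # add intercept column
    return bool(np.linalg.matrix_rank(design) < design.shape[1])

# Toy sample from the sklearn test: 6 rows, 2 features.
X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
print(is_rank_deficient(X))

# Made-up resample that happens to contain only two distinct rows.
resampled = [X[i] for i in (0, 0, 1, 1, 0, 1)]
print(is_rank_deficient(resampled))
```

With only two distinct rows the three coefficients (intercept plus two slopes) cannot be determined uniquely, which is the situation a rank-deficient fit describes.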
Task Detail
add test on UCI concrete data
We need to use the divided data (split into 1 target and 2 source concepts) from the author's website.
We omit all cases that have missing values (via na.omit).
Task Detail
check sklearn AdaBoost.R2 performance on UCI datasets
- Read ARFF data using Python
- How to handle missing values in new-autompg1.arff?
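A minimal sketch of both steps, under the assumption that the file is a dense ARFF file with numeric attributes only and that missing values are written as `?` (the standard ARFF marker). The function names are hypothetical and the sample text is made up, not the contents of new-autompg1.arff. Dropping incomplete rows mirrors what na.omit does on the R side; `scipy.io.arff.loadarff` is an existing alternative for the reading step:

```python
def read_numeric_arff(lines):
    """Parse a simple dense, all-numeric ARFF file; '?' becomes None.

    Limitations of this sketch: no quoted names, no nominal/string
    attributes, no sparse rows.
    """
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                names.append(line.split()[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = [v.strip() for v in line.split(",")]
        rows.append([None if v == "?" else float(v) for v in values])
    return names, rows

def drop_incomplete(rows):
    """Keep only rows without missing values (same idea as na.omit)."""
    return [row for row in rows if None not in row]

SAMPLE = """\
% made-up example data
@relation toy
@attribute mpg numeric
@attribute horsepower numeric
@data
18.0,130
25.0,?
31.5,65
"""

names, rows = read_numeric_arff(SAMPLE.splitlines())
complete = drop_incomplete(rows)
print(names, len(rows), len(complete))
```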
Task Detail
add 25 instances from target data set into training data
Q: Which 25 instances to pick?
A: (1) randomly, (2) first 25 instances ✅
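Both options from the answer can be sketched as below. `pick_target_instances` is a hypothetical helper, and returning the remaining target rows separately is an assumption about how they would be used afterwards (for example, held out for evaluation):

```python
import random

def pick_target_instances(rows, n=25, how="first", seed=None):
    """Split target rows into n picked rows and the remaining rows."""
    if n > len(rows):
        raise ValueError("not enough target instances")
    if how == "first":
        chosen = list(range(n))
    elif how == "random":
        chosen = sorted(random.Random(seed).sample(range(len(rows)), n))
    else:
        raise ValueError("how must be 'first' or 'random'")
    chosen_set = set(chosen)
    picked = [rows[i] for i in chosen]
    rest = [row for i, row in enumerate(rows) if i not in chosen_set]
    return picked, rest

# Stand-in target data: 100 row ids instead of real instances.
target = list(range(100))
picked, rest = pick_target_instances(target, n=25, how="first")
print(len(picked), len(rest))
```

Taking the first 25 instances is deterministic, so repeated runs train on the same target rows; the random option needs a fixed seed to be reproducible.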
Task Detail
tune base learner (nnet) performance
Q: What to tune?
A: Based on the Examples section of its documentation, we should tune the 4 parameters below:
- size
- rang (recommendation: rang * max(|x|) is about 1, but what is |x|?)
  - It has randomness, so we need to compute RMSE over 30 runs (as in the paper).
- decay (what is weight decay?)
- maxit
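Two of the points above can be sketched in code. This assumes that |x| means the absolute values of the input features, so max(|x|) is the largest absolute input value; that reading is an assumption, not something the nnet documentation quoted here confirms. `suggest_rang`, `mean_rmse_over_runs`, and `fit_predict` are hypothetical names; `fit_predict(seed)` stands in for "train nnet with this seed and return test predictions":

```python
import math
import random
import statistics

def suggest_rang(X):
    """Pick rang so that rang * max(|x|) is about 1 (assumed reading of |x|)."""
    max_abs = max(abs(v) for row in X for v in row)
    if max_abs == 0:
        raise ValueError("all inputs are zero")
    return 1.0 / max_abs

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mean_rmse_over_runs(fit_predict, y_test, n_runs=30):
    """Average RMSE over n_runs differently seeded trainings."""
    scores = [rmse(y_test, fit_predict(seed)) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy stand-in for a randomly initialized learner: truth plus seeded noise.
y_test = [1.0, 2.0, 3.0, 4.0]

def noisy_fit_predict(seed):
    rng = random.Random(seed)
    return [y + rng.gauss(0.0, 0.1) for y in y_test]

print(suggest_rang([[10.0, -200.0], [3.0, 4.0]]))
print(mean_rmse_over_runs(noisy_fit_predict, y_test))
```

Reporting the standard deviation alongside the mean shows how much of the difference between two parameter settings is just initialization noise.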
Q: What do the initial/final values in the nnet trace messages mean? (Shouldn't the iterations read iter 10, 20, 30?)
# weights: 64
initial value 833385.353669
iter 10 value 148498.517369
iter 10 value 148498.517369
iter 10 value 148498.517369
final value 148498.517369
converged