Develop AdaBoost.R2

Question

Develop AdaBoost.R2

martin0258 opened this issue 11 years ago · 9 comments

Answer 1 · 2014-02-26T12:35:25.000Z

Task Detail

debug differences between actual/expected predictions of lm

The power of the base estimator is the key.

import numpy as np
from numpy.testing import assert_array_equal
from sklearn.ensemble import AdaBoostRegressor
from sklearn import linear_model

# Toy sample
X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
y_regr = [-1, -1, -1, 1, 1, 1]
T = [[-1, -1], [2, 2], [3, 2]]
y_t_regr = [-1, 1, 1]

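# Note: scikit-learn 1.2 renamed the base_estimator argument to estimator.
base_estimator = linear_model.LinearRegression()
clf = AdaBoostRegressor(random_state=0, base_estimator=base_estimator)
clf.fit(X, y_regr)
prediction = clf.predict(T)
print('num of estimators: ', len(clf.estimators_))
# In the run recorded below, this assertion fails (100% mismatch).
assert_array_equal(prediction, y_t_regr)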

"""
Outputs of code above
-------------------------------
num of estimators:  4

(mismatch 100.0%)
 x: array([-0.99999997,  1.99999994,  2.99999991])
 y: array([-1,  1,  1])
"""

The avg loss of the 1st iteration should be >= 0.5, which should stop boosting right away, so why are there 4 estimators in the end?

# expected avg loss of 1st iteration
base_estimator = linear_model.LinearRegression()
base_estimator.fit(X, y_regr)
error = np.abs(base_estimator.predict(X) - y_regr)
error_vector = error / error.max()  # linear loss, scaled into [0, 1]
sample_weight = np.full(len(X), 1.0 / len(X))  # uniform initial weights
avg_loss = (error_vector * sample_weight).sum()
print('avg loss:', avg_loss)  # about 0.5556, i.e. >= 0.5
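
For reference, the stopping rule and the weight update of one boosting round can be sketched as below. This is a minimal sketch of AdaBoost.R2 with the linear loss, written from the algorithm's published description and not taken from sklearn or from the R script in this issue; the function name adaboost_r2_round is hypothetical, and the error values are chosen to reproduce the 0.5556 average loss computed above.

```python
def adaboost_r2_round(abs_errors, weights):
    """One round of AdaBoost.R2 bookkeeping with the linear loss.

    abs_errors: absolute prediction errors on the training set.
    weights:    current sample weights (assumed to sum to 1).
    Returns (avg_loss, new_weights). new_weights is None when boosting
    should stop (perfect fit, zero weighted loss, or avg_loss >= 0.5).
    """
    max_error = max(abs_errors)
    if max_error == 0:
        return 0.0, None
    # Linear loss: each error scaled into [0, 1] by the largest error.
    losses = [e / max_error for e in abs_errors]
    avg_loss = sum(l * w for l, w in zip(losses, weights))
    if avg_loss == 0.0 or avg_loss >= 0.5:
        return avg_loss, None
    # beta < 1, so well-predicted samples (small loss) are down-weighted.
    beta = avg_loss / (1.0 - avg_loss)
    raw = [w * beta ** (1.0 - l) for l, w in zip(losses, weights)]
    total = sum(raw)
    return avg_loss, [r / total for r in raw]


# Errors proportional to 1:3:1:3:1:1 with uniform weights.
avg_loss, new_weights = adaboost_r2_round([1, 3, 1, 3, 1, 1], [1.0 / 6] * 6)
print(avg_loss, new_weights)
```

With these inputs the average loss is 5/9, so the sketch returns None for the weights, which is the early termination expected in the 1st iteration.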

Answer 2 · 2014-02-27T14:53:54.000Z

Task Detail

check why there are 4 estimators when using linear regression in sklearn

Because sklearn resamples the training set with replacement according to the sample weights (weighted sampling) instead of training the base estimator with sample weights!
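
A minimal sketch of what weighted sampling with replacement means here, using only the standard library. The helper name weighted_bootstrap and the seed argument are hypothetical; this is an illustration of the idea, not sklearn's code.

```python
import random


def weighted_bootstrap(rows, targets, weights, seed=0):
    """Draw len(rows) (row, target) pairs with replacement.

    Each pair is picked with probability proportional to its weight,
    so heavily weighted samples tend to appear several times.
    """
    rng = random.Random(seed)
    n = len(rows)
    picked = rng.choices(range(n), weights=weights, k=n)
    return [rows[i] for i in picked], [targets[i] for i in picked]


# Toy sample from the first comment, with uniform initial weights.
X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
y = [-1, -1, -1, 1, 1, 1]
X_boot, y_boot = weighted_bootstrap(X, y, [1.0 / 6] * 6)
print(X_boot, y_boot)
```

Even with uniform weights the resample usually repeats some rows and omits others, so a base learner fit on it can differ from one fit on all six points. That would explain an average loss below 0.5 in the 1st iteration.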

Answer 3 · 2014-02-28T09:19:04.000Z

Task Detail

add weighted sampling

Let's add a parameter that lets the user choose between:

weighted sampling
model built-in support for sample weighting

We assume a model supports sample weighting if its fitting function has a parameter named weights (e.g., lm, rpart, nnet). If a model does not have that parameter, we fall back to weighted sampling.
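
The fallback rule can be sketched as below. This is a Python analogue of the idea, not the R implementation: the names accepts_weights, choose_weighting_mode, fit_with_weights and fit_plain are hypothetical, and the check simply inspects the fitting function's signature.

```python
import inspect


def accepts_weights(fit_function, param_name="weights"):
    """Return True if fit_function declares a parameter called param_name."""
    return param_name in inspect.signature(fit_function).parameters


def choose_weighting_mode(fit_function, requested="builtin", param_name="weights"):
    """Return "builtin" or "sampling".

    "builtin" is only honoured when the model can take sample weights;
    otherwise we fall back to weighted sampling.
    """
    if requested not in ("builtin", "sampling"):
        raise ValueError("requested must be 'builtin' or 'sampling'")
    if requested == "builtin" and not accepts_weights(fit_function, param_name):
        return "sampling"
    return requested


# Hypothetical fitting functions, used only to exercise the rule.
def fit_with_weights(x, y, weights=None):
    return None


def fit_plain(x, y):
    return None


print(choose_weighting_mode(fit_with_weights))
print(choose_weighting_mode(fit_plain))
```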

Answer 4 · 2014-02-28T10:13:40.000Z

Task Detail

debug the warning from predict.lm

The command and the resulting warning message are shown below:

source('C:/Users/Martin Ku/Projects/GitHub/ntu-research/src/adaboostR2.R')

Early termination at iteration 4 because avg loss >= 0.5
Error: prediction not equal to test.y
Mean relative difference: 1
In addition: Warning message:
In predict.lm(predictor, data) :
prediction from a rank-deficient fit may be misleading

Keywords: collinearity, rank deficiency
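
One plausible cause, stated here as an assumption and not confirmed in this thread: when a training set is resampled with replacement from very few points, the resample can contain too few distinct (or only collinear) rows for a linear model with an intercept. The sketch below checks that with a small Gaussian elimination; the helper names matrix_rank and is_rank_deficient are hypothetical.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination with pivoting."""
    m = [[float(v) for v in r] for r in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == len(m):
            break
        # Use the remaining row with the largest entry in this column as pivot.
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def is_rank_deficient(x_rows):
    """True if [1, x1, x2, ...] has fewer independent columns than coefficients."""
    design = [[1.0] + list(r) for r in x_rows]
    return matrix_rank(design) < len(design[0])


full = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
# A resample that happens to contain only two distinct points.
resample = [[-1, -1], [1, 1], [-1, -1], [1, 1], [1, 1], [-1, -1]]
print(is_rank_deficient(full), is_rank_deficient(resample))
```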

Answer 5 · 2014-02-28T15:59:29.000Z

Task Detail

add test on UCI concrete data

We need to use the divided data (split into 1 target and 2 source concepts) from the author's website.
We omit all cases having missing values (via na.omit).

Answer 6 · 2014-03-03T04:10:50.000Z

Task Detail

check M5P implementation in R

Answer 7 · 2014-03-04T00:04:06.000Z

Task Detail

check sklearn AdaBoost.R2 performance on UCI datasets

Read ARFF data using Python
How to handle missing values in new-autompg1.arff?
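
A minimal sketch for both items, using only the standard library. It assumes a dense ARFF file with numeric attributes only and unquoted attribute names, treats ? as a missing value, and then drops incomplete rows (mirroring na.omit on the R side). The function names and the sample attributes are hypothetical; scipy.io.arff.loadarff is an existing reader that may be a better fit for real files.

```python
def read_numeric_arff(text):
    """Parse a small dense, numeric-only ARFF document.

    Returns (attribute_names, rows); missing values ('?') become None.
    """
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or comment
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                names.append(line.split()[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        rows.append([None if v.strip() == "?" else float(v)
                     for v in line.split(",")])
    return names, rows


def drop_incomplete(rows):
    """Keep only rows without missing values."""
    return [r for r in rows if None not in r]


# Hypothetical sample, not taken from new-autompg1.arff.
SAMPLE = """% toy file
@relation toy
@attribute horsepower numeric
@attribute mpg numeric
@data
130,18
?,25
95,24
"""

names, rows = read_numeric_arff(SAMPLE)
complete = drop_incomplete(rows)
print(names, len(rows), len(complete))
```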

Answer 8 · 2014-03-10T13:56:52.000Z

Task Detail

add 25 instances from target data set into training data

Q: Which 25 instances to pick?
A: (1) randomly, (2) first 25 instances ✅

Answer 9 · 2014-03-11T00:56:29.000Z

Task Detail

tune base learner (nnet) performance

Q: What to tune?
A: Based on the Examples section of its documentation, we should tune the 4 parameters below:

size
rang (recommendation: rang * max(|x|) is about 1, but what is |x|?)
- Initial weights are random, so we need to compute RMSE over 30 runs (as in the paper).
decay (what is weight decay?)
maxit
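
The tuning loop can be sketched as a grid search that averages RMSE over repeated runs. This is a sketch under assumptions: the function tune and the evaluate callback are hypothetical, and evaluate(params, seed) stands for training nnet once with those settings and returning the RMSE of that run. The grid values and the toy evaluate function below are placeholders, not recommendations.

```python
import itertools
import statistics


def tune(grid, evaluate, n_runs=30):
    """Return (best_params, best_mean_rmse) over all grid combinations.

    grid:     dict mapping parameter name -> list of candidate values.
    evaluate: callable(params, seed) -> RMSE of a single training run.
    n_runs:   runs per combination, to average out random initial weights.
    """
    keys = sorted(grid)
    best = None
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        scores = [evaluate(params, seed) for seed in range(n_runs)]
        mean_rmse = statistics.mean(scores)
        if best is None or mean_rmse < best[1]:
            best = (params, mean_rmse)
    return best


# Placeholder grid over the 4 parameters listed above.
grid = {
    "size": [2, 5, 10],
    "rang": [0.1, 0.5],
    "decay": [0.0, 0.01],
    "maxit": [100, 500],
}


def toy_evaluate(params, seed):
    # Stand-in for a real training run: deterministic and seed-independent.
    return abs(params["size"] - 5) + params["decay"] + 1.0 / params["maxit"]


best_params, best_rmse = tune(grid, toy_evaluate, n_runs=3)
print(best_params, best_rmse)
```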

Q: What do the initial/final values in the nnet trace message mean? (Shouldn't the iterations be 10, 20, 30?)

# weights:  64
initial  value 833385.353669 
iter  10 value 148498.517369
iter  10 value 148498.517369
iter  10 value 148498.517369
final  value 148498.517369 
converged
