martin0258/university-research

Develop AdaBoost.R2

martin0258 opened this issue · 9 comments

Scope

This issue is for developing AdaBoost.R2.

Task

  • Algorithm Pseudocode (from paper)
  • Class Design
  • Pseudocode Routine Design
  • Developer Testing (consider related test in sklearn)
  • @martin0258, debug differences between actual/expected predictions of lm, ⏰ Wed 22:00
  • @martin0258, debug differences between actual/expected predictions of rpart
  • @martin0258, check why there are 4 estimators when using linear regression in sklearn
  • @martin0258, evaluate possibility of porting DecisionTreeRegressor to R
  • @martin0258, add weighted sampling
  • @martin0258, add square and exponential loss functions
  • @martin0258, debug the warning from predict.lm
  • @martin0258, consider using test data from paper
  • @martin0258, add test on UCI concrete data, ⏰ Mon 12:00
  • @martin0258, read WEKA ARFF data format, ⏰ Mon 14:00
  • @martin0258, check M5P implementation in R, ⏰ Mon 12:00
  • @martin0258, check training error, ⏰ Tue 15:00
  • @martin0258, tune base learner (nnet) performance, ⏰ Tue 15:00
  • @martin0258, check sklearn AdaBoost.R2 performance on UCI datasets
  • @martin0258, add verbose (whether to print message)
  • @martin0258, add 25 instances from target data set into training data

Task Detail

debug differences between actual/expected predictions of lm

The power of the base estimator is the key.

import numpy as np
from numpy.testing import assert_array_equal
from sklearn.ensemble import AdaBoostRegressor
from sklearn import linear_model

# Toy sample
X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
y_regr = [-1, -1, -1, 1, 1, 1]
T = [[-1, -1], [2, 2], [3, 2]]
y_t_regr = [-1, 1, 1]

# Note: the base_estimator argument was renamed to estimator in scikit-learn 1.2.
base_estimator = linear_model.LinearRegression()
clf = AdaBoostRegressor(random_state=0, base_estimator=base_estimator)
clf.fit(X, y_regr)
prediction = clf.predict(T)
print('num of estimators: ', len(clf.estimators_))
assert_array_equal(prediction, y_t_regr)

"""
Outputs of code above
-------------------------------
num of estimators:  4

(mismatch 100.0%)
 x: array([-0.99999997,  1.99999994,  2.99999991])
 y: array([-1,  1,  1])
"""

The average loss of the 1st iteration should be >= 0.5, which would stop boosting immediately. So why are there 4 estimators in the end?

# expected avg loss of 1st iteration
base_estimator = linear_model.LinearRegression()
base_estimator.fit(X, y_regr)
error = np.abs(base_estimator.predict(X) - y_regr)
error_vector = error / error.max()
sample_weight = np.full(len(X), 1.0 / len(X))  # uniform initial weights
avg_loss = (error_vector * sample_weight).sum()
print('avg loss:', avg_loss)  # 0.55555555555555625
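
For reference, the per-iteration quantities behind that check can be sketched as follows. This is a minimal sketch assuming the standard AdaBoost.R2 formulation (Drucker, 1997): losses scaled by the largest absolute error, beta = avg_loss / (1 - avg_loss), and weights multiplied by beta^(1 - loss). The function names r2_losses and r2_update are hypothetical and are not part of sklearn or of this repo.

```python
import numpy as np


def r2_losses(y_true, y_pred, kind="linear"):
    """Per-sample AdaBoost.R2 loss in [0, 1] (hypothetical helper)."""
    err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    largest = err.max()
    if largest == 0:  # perfect fit: every loss is zero
        return np.zeros_like(err)
    scaled = err / largest
    if kind == "linear":
        return scaled
    if kind == "square":
        return scaled ** 2
    if kind == "exponential":
        return 1.0 - np.exp(-scaled)
    raise ValueError("unknown loss: %s" % kind)


def r2_update(weights, losses):
    """Return (avg_loss, beta, new_weights); beta is None when boosting stops."""
    weights = np.asarray(weights, dtype=float)
    avg_loss = float(np.sum(weights * losses))
    # Stop on a perfect fit (nothing to reweight) or when avg loss >= 0.5.
    if avg_loss <= 0.0 or avg_loss >= 0.5:
        return avg_loss, None, weights
    beta = avg_loss / (1.0 - avg_loss)
    new_weights = weights * beta ** (1.0 - losses)  # well-fitted samples shrink most
    return avg_loss, beta, new_weights / new_weights.sum()


uniform = np.full(4, 0.25)
losses = r2_losses([0, 0, 0, 0], [0, 0, 0, 1], kind="linear")
avg_loss, beta, new_weights = r2_update(uniform, losses)
```

In this toy call only the last sample is mispredicted, so its weight grows relative to the others after normalization.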

Task Detail

check why there are 4 estimators when using linear regression in sklearn

Because sklearn fits each base estimator on a weighted bootstrap sample (sampling with replacement) instead of training it with sample weights!

Task Detail

add weighted sampling

Let's add a parameter that lets the user choose between:

  • weighted sampling
  • model built-in support for sample weighting

We assume a model supports sample weighting if its fitting function has a parameter named weights (e.g., lm, rpart, nnet). If a model does not have that parameter, we'll fall back to weighted sampling.
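
The same dispatch can be sketched in Python for comparison with sklearn. This is only an analogue of the R design above, under stated assumptions: the helper name fit_base_learner and its use_sampling flag are hypothetical, and support for weighting is detected with sklearn's has_fit_parameter on the sample_weight argument rather than on an R weights parameter.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.validation import has_fit_parameter


def fit_base_learner(estimator, X, y, weights, use_sampling=False, rng=None):
    """Fit a copy of estimator using sample weights or a weighted bootstrap."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # must sum to 1 to act as probabilities
    model = clone(estimator)
    if use_sampling or not has_fit_parameter(model, "sample_weight"):
        rng = np.random.default_rng() if rng is None else rng
        # Weighted sampling with replacement: heavy samples are drawn more often.
        idx = rng.choice(len(X), size=len(X), replace=True, p=weights)
        model.fit(X[idx], y[idx])
        return model, "sampling"
    model.fit(X, y, sample_weight=weights)
    return model, "weights"


X = [[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]]
y = [-1, -1, -1, 1, 1, 1]
w = np.ones(len(X))

_, mode_lm = fit_base_learner(LinearRegression(), X, y, w)
_, mode_forced = fit_base_learner(
    LinearRegression(), X, y, w, use_sampling=True, rng=np.random.default_rng(0)
)
# KNeighborsRegressor.fit takes no sample_weight, so it falls back to sampling.
_, mode_knn = fit_base_learner(
    KNeighborsRegressor(n_neighbors=1), X, y, w, rng=np.random.default_rng(0)
)
```

Returning the chosen mode alongside the model makes it easy to log which path was taken for each base learner.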

Task Detail

debug the warning from predict.lm

The command and the resulting warning message are shown below:

source('C:/Users/Martin Ku/Projects/GitHub/ntu-research/src/adaboostR2.R')

Early termination at iteration 4 because avg loss >= 0.5
Error: prediction not equal to test.y
Mean relative difference: 1
In addition: Warning message:
In predict.lm(predictor, data) :
prediction from a rank-deficient fit may be misleading

Keywords: collinearity, rank deficiency
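
One way a rank-deficient fit can arise on such a small data set is a resample that contains too few distinct rows. The sketch below illustrates that with the toy data from the sklearn test. It is an assumption that this is what happened in the R run, and design_rank is a hypothetical helper, not code from adaboostR2.R.

```python
import numpy as np


def design_rank(X):
    """Return (rank, number of coefficients) for a linear model with intercept."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # intercept column + features
    return int(np.linalg.matrix_rank(design)), design.shape[1]


X = np.array([[-2, -1], [-1, -1], [-1, -2], [1, 1], [1, 2], [2, 1]])

full_rank = design_rank(X)

# A bootstrap draw that happens to pick only two distinct points.
resample_idx = np.array([0, 0, 0, 3, 3, 3])
resampled_rank = design_rank(X[resample_idx])
```

When the rank is smaller than the number of coefficients, the least-squares solution is not unique, which is the situation the predict.lm warning refers to.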

Task Detail

add test on UCI concrete data

We need to use the divided data (split into 1 target and 2 source concepts) from the author's website.
We omit all cases having missing values (via na.omit).

Task Detail

check M5P implementation in R

Task Detail

check sklearn AdaBoost.R2 performance on UCI datasets

  • Read arff data using python
  • How to handle missing values in new-autompg1.arff?
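
A minimal sketch of both points, assuming SciPy is available: scipy.io.arff.loadarff reads ARFF text, and missing numeric values written as ? come back as NaN, so incomplete rows can be dropped in the same spirit as na.omit. The inline ARFF text and its attribute names are made up for illustration and are not the contents of new-autompg1.arff.

```python
import io

import numpy as np
from scipy.io import arff

# Made-up ARFF content; the real file would be opened from disk instead.
ARFF_TEXT = """@relation toy
@attribute cylinders numeric
@attribute horsepower numeric
@attribute mpg numeric
@data
8,130,18
4,?,25
4,95,24
"""

data, meta = arff.loadarff(io.StringIO(ARFF_TEXT))

# Stack the numeric columns of the record array into a plain float matrix.
table = np.column_stack([data[name] for name in meta.names()]).astype(float)

# Keep only rows without any missing value (similar to na.omit in R).
complete = table[~np.isnan(table).any(axis=1)]
```

Dropping rows is only one option. Imputing the missing values would keep more data, but it would no longer match the na.omit handling used on the R side.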

Task Detail

add 25 instances from target data set into training data

Q: Which 25 instances to pick?
A: (1) randomly, (2) first 25 instances ✅

Task Detail

tune base learner (nnet) performance

Q: What to tune?
A: Based on the Examples section of its documentation, we should tune the 4 parameters below:

  • size
  • rang (recommendation: rang * max(| x |) is about 1, but what is | x |?)
    • It has randomness, so we need to compute RMSE over 30 runs (as in the paper).
  • decay (what is weight decay?)
  • maxit
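
The rang recommendation and the repeated-run evaluation can be sketched as below. This is a sketch under two assumptions: | x | is read as the absolute value of the input features, and fit_predict is a hypothetical callable that trains a seeded learner and returns test predictions. Neither helper is part of nnet.

```python
import numpy as np


def suggest_rang(X):
    """Choose rang so that rang * max(|x|) is about 1 (assumed reading of the doc)."""
    max_abs = float(np.max(np.abs(np.asarray(X, dtype=float))))
    return 1.0 / max_abs


def rmse_over_runs(fit_predict, X_train, y_train, X_test, y_test, n_runs=30):
    """Mean and standard deviation of test RMSE over n_runs seeded runs."""
    y_test = np.asarray(y_test, dtype=float)
    scores = []
    for seed in range(n_runs):
        pred = np.asarray(fit_predict(X_train, y_train, X_test, seed), dtype=float)
        scores.append(np.sqrt(np.mean((pred - y_test) ** 2)))
    return float(np.mean(scores)), float(np.std(scores))


def zero_predictor(X_train, y_train, X_test, seed):
    """Placeholder learner that always predicts 0, used only to exercise the loop."""
    return np.zeros(len(X_test))


rang = suggest_rang([[-4.0, 2.0], [1.0, 3.0]])
mean_rmse, std_rmse = rmse_over_runs(
    zero_predictor, [[0.0], [1.0]], [0.0, 1.0], [[2.0], [3.0]], [3.0, 4.0]
)
```

Reporting the standard deviation next to the mean shows how much of the difference between two parameter settings is just initialization noise.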

Q: What do the initial/final values in the nnet trace message mean? (Shouldn't the iterations be 10, 20, 30?)

# weights:  64
initial  value 833385.353669 
iter  10 value 148498.517369
iter  10 value 148498.517369
iter  10 value 148498.517369
final  value 148498.517369 
converged