cui-unige/mcc4mcc

Review bad performance of learning algorithms

saucisson opened this issue · 9 comments

For @mencattini , in branch issue-7.

Based on the master code, I get:

INFO:root:Algorithm: svm
INFO:root:  Min     : 0.47989789406509253
INFO:root:  Max     : 0.5175494575622208
INFO:root:  Mean    : 0.4987236758136567
INFO:root:  Median  : 0.4987236758136567
INFO:root:  Stdev   : 0.02662367587109529
INFO:root:  Variance: 0.0007088201168891415
INFO:root:Learning using algorithm: 'ada boost'.
100%|███████████████████████████████████████████| 2/2 [00:00<00:00,  2.51it/s]
INFO:root:Algorithm: ada boost
INFO:root:  Min     : 0.4639438417358009
INFO:root:  Max     : 0.46713465220165923
INFO:root:  Mean    : 0.46553924696873006
INFO:root:  Median  : 0.46553924696873006
INFO:root:  Stdev   : 0.002256243717889439
INFO:root:  Variance: 5.090635714515559e-06
INFO:root:Learning using algorithm: 'linear-svm'.
100%|███████████████████████████████████████████| 2/2 [00:18<00:00,  9.34s/it]
INFO:root:Algorithm: linear-svm
INFO:root:  Min     : 0.4269304403318443
INFO:root:  Max     : 0.4524569240587109
INFO:root:  Mean    : 0.4396936821952776
INFO:root:  Median  : 0.4396936821952776
INFO:root:  Stdev   : 0.018049949743115436
INFO:root:  Variance: 0.00032580068572899296
INFO:root:Learning using algorithm: 'decision-tree'.
100%|███████████████████████████████████████████| 2/2 [00:00<00:00, 61.98it/s]
INFO:root:Algorithm: decision-tree
INFO:root:  Min     : 0.553286534779834
INFO:root:  Max     : 0.5666879387364391
INFO:root:  Mean    : 0.5599872367581366
INFO:root:  Median  : 0.5599872367581366
INFO:root:  Stdev   : 0.00947622361513566
INFO:root:  Variance: 8.979881400405476e-05
INFO:root:Learning using algorithm: 'random-forest'.
100%|███████████████████████████████████████████| 2/2 [00:00<00:00,  5.88it/s]
INFO:root:Algorithm: random-forest
INFO:root:  Min     : 0.5398851308232291
INFO:root:  Max     : 0.5500957243139758
INFO:root:  Mean    : 0.5449904275686024
INFO:root:  Median  : 0.5449904275686024
INFO:root:  Stdev   : 0.007219979897246221
INFO:root:  Variance: 5.2128109716639555e-05
INFO:root:Learning using algorithm: 'neural-network'.
100%|███████████████████████████████████████████| 2/2 [00:13<00:00,  6.60s/it]
INFO:root:Algorithm: neural-network
INFO:root:  Min     : 0.47032546266751757
INFO:root:  Max     : 0.47415443522654754
INFO:root:  Mean    : 0.4722399489470326
INFO:root:  Median  : 0.4722399489470326
INFO:root:  Stdev   : 0.0027074924614673033
INFO:root:  Variance: 7.330515428902277e-06

We need to brainstorm about the data. I changed the way we split the sets, now using train_test_split() from sklearn.model_selection (see the sketch below).
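Here is a minimal sketch of what that split looks like, assuming a pandas DataFrame `df` whose feature columns are listed in `X_COLUMNS` and whose target is a `Tool` column (these names are illustrative, not the actual mcc4mcc schema):

```python
from sklearn.model_selection import train_test_split

X = df[X_COLUMNS]          # problem description features
y = df["Tool"]             # tool that won the entry

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # hold out 20% of the entries for evaluation
    stratify=y,         # keep the tool proportions identical in both sets
    random_state=0,     # reproducible split
)
```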

The most used tool accounts for 2881 of the 6266 entries in the whole data set, a ratio of 45.978%.

First idea:

  • We remove this tool from the machine learning to reduce the class imbalance. The future mcc tool will always run this tool. So if the data keep the same proportions, we get 45-46% accuracy without any computation.
  • We apply machine learning on the remaining classes. This version gives me around 50%.

It means that the probability that our future tool selects the right tool is:
0.45 * 1 + 0.50 * 0.55 = 0.725
where 0.45 is the probability that the right tool is the most used one (which we always run, hence the factor 1), and 0.50 is the probability of guessing right on the remaining data, which covers 1 - 0.45 = 0.55 of the entries.

The pseudo-code could be (a runnable sketch follows these lists):

training:

  • remove the majority class
  • train the classifier on the n-1 remaining classes

test:

  • receive a problem description (annotated in cross-validation, unannotated in real use)
  • apply the classifier
  • get a class c
  • run both the majority class and the class c on the problem (here a class is equivalent to a tool)
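As a hedged illustration of this first idea, here is a small sketch; `df`, `X_COLUMNS`, the `Tool` column and the choice of a decision tree are placeholders, not code from the repository:

```python
from sklearn.tree import DecisionTreeClassifier

# The class (tool) that dominates the data set (~46% of the entries).
majority_tool = df["Tool"].value_counts().idxmax()

# Training: drop the majority class and learn on the n-1 remaining classes.
rest = df[df["Tool"] != majority_tool]
classifier = DecisionTreeClassifier()
classifier.fit(rest[X_COLUMNS], rest["Tool"])

def tools_to_run(problem_description):
    """Return the two tools mcc would have to run for one problem."""
    predicted = classifier.predict([problem_description])[0]
    # The majority tool is always run as well, since it alone covers ~45-46%.
    return [majority_tool, predicted]
```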

Second idea:
Train the classifier in another way (see the second sketch below):

  • Do a one-vs-all classification with the majority class, i.e. first guess whether it is the majority class or not.
    • If yes, we have our class.
    • Else, it is one of the remaining classes.
  • Train a second classifier for the rest.
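A corresponding sketch of the second idea, again with placeholder names and an arbitrary choice of classifier:

```python
from sklearn.ensemble import RandomForestClassifier

majority_tool = df["Tool"].value_counts().idxmax()

# Stage 1: one-vs-all on the majority class ("is it the majority tool or not?").
is_majority = (df["Tool"] == majority_tool).astype(int)
stage1 = RandomForestClassifier()
stage1.fit(df[X_COLUMNS], is_majority)

# Stage 2: a classifier trained only on the remaining tools.
rest = df[df["Tool"] != majority_tool]
stage2 = RandomForestClassifier()
stage2.fit(rest[X_COLUMNS], rest["Tool"])

def predict_tool(problem_description):
    if stage1.predict([problem_description])[0] == 1:
        return majority_tool
    return stage2.predict([problem_description])[0]
```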

The first idea is immediately usable but less robust than the other.

@saucisson
Over- and under-sampling do not improve the results.

The first idea gives me 74.8% over 100 iterations, but mcc will have to run two tools.
The second idea is not as good, with 56.4% over 100 iterations, which is close to the other algorithms.

Which direction do you want to go?
Do we keep only the data of the best tools?

Don't know yet...

Have you tried with the --duplicates=false option? It may change the ratio.

I think the next step is to implement custom algorithm.score functions, to better know whether the chosen tool is good, relatively good, or bad. The idea is to put the identifiers of the entry (model, instance, tool, ...) within the df, but drop them during learning.

Then, the score function could use this information to compute whether the obtained tool is good or not, by comparing it with the sorted lists of tools given during the "Analyzing known data" step. A rough sketch follows.
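Something along these lines, assuming a dict `known_rankings` that maps (model, instance) to the list of tools sorted from best to worst, as produced by the "Analyzing known data" step (all names here are hypothetical):

```python
ID_COLUMNS = ["Model", "Instance"]

def custom_score(algorithm, test_df):
    """Fraction of entries for which the predicted tool is among the best ones."""
    identifiers = test_df[ID_COLUMNS]
    features = test_df.drop(columns=ID_COLUMNS + ["Tool"])  # dropped during learning too
    predictions = algorithm.predict(features)
    good = 0
    for (_, ids), predicted in zip(identifiers.iterrows(), predictions):
        ranking = known_rankings[(ids["Model"], ids["Instance"])]
        # Count a prediction as "good" when the chosen tool is, say, in the top 3.
        if predicted in ranking[:3]:
            good += 1
    return good / len(test_df)
```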

I will create an issue-10 branch for these changes as soon as I have merged issue-7 into master and solved #11.

One idea would be to feed the learning algorithms with the full data, and then apply the scoring of the mcc to each examination/instance.
This would give a score for each algorithm, and we could use the best one (a sketch of this reading is below).
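One possible reading of this, as a sketch only: fit every algorithm, apply a hypothetical `mcc_score(entry, tool)` function to each examination/instance, and keep the algorithm with the best average. `algorithms`, `mcc_score` and the DataFrames are placeholders.

```python
scores = {}
for name, algorithm in algorithms.items():
    algorithm.fit(train_df[X_COLUMNS], train_df["Tool"])
    predictions = algorithm.predict(test_df[X_COLUMNS])
    scores[name] = sum(
        mcc_score(entry, tool)
        for (_, entry), tool in zip(test_df.iterrows(), predictions)
    ) / len(test_df)

best_algorithm = max(scores, key=scores.get)
```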

Not sure I understand.
I had an idea too: instead of guessing the tool, we could try to guess the rank.
The X would be:

  • the problem description (still encoded as {-1, 0, +1})
  • the tool

The Y:

  • the rank

We learn on the whole data and try to guess the rank for a given problem and a given tool.
For the future mcc tool, we will receive a problem, artificially create n different instances, one per tool, and just return the tool that reaches the best rank (see the sketch below).
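A sketch of this rank-based variant, with an arbitrary regressor and a one-hot encoding of the tool; the `Rank` column, `X_COLUMNS` and the assumption that a lower rank is better are mine, not the actual mcc4mcc encoding:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

TOOLS = sorted(df["Tool"].unique())

# X: the problem description ({-1, 0, +1} features) plus one column per tool.
X = pd.concat(
    [df[X_COLUMNS], pd.get_dummies(df["Tool"], prefix="tool")],
    axis=1,
)
y = df["Rank"]

ranker = RandomForestRegressor()
ranker.fit(X, y)

def best_tool(problem_description):
    """Build one artificial row per tool and return the tool with the best rank."""
    rows = []
    for tool in TOOLS:
        row = dict(problem_description)           # the {-1, 0, +1} features
        for other in TOOLS:
            row["tool_" + other] = 1 if other == tool else 0
        rows.append(row)
    candidates = pd.DataFrame(rows, columns=X.columns).fillna(0)
    ranks = ranker.predict(candidates)
    return TOOLS[ranks.argmin()]                  # assuming rank 1 is the best
```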

I'm not sure this is clear, so ask if you don't understand.

They are now efficient, congratulations!