anhnongdan/Spark1.6_Problems

Benchmarking Kmeans Clustering - pyspark tiny/small


57744 rows with 21 features: ~3 min 30 s
(4 clusters, as below)

2018-01-15 12:00:44.462798
Cluster Centers: 
[ 0.49377442  1.61198229  1.59184684  0.77210211  2.51198229  2.12154207
  0.50455848  1.67856213  1.45303464  0.48958062  1.65298255  1.48598593
  0.45939047  1.49434749  1.28111487  0.49796822  1.63568638  1.48470956
  0.4823131   1.63777025  1.47809325]
[  3.40468909  13.05657492  11.47018349   5.29892966  19.16233435
  16.08639144   3.63634047  14.47120285  11.31931702   3.62793068
  14.45438328  11.38786952   3.20846075  12.85066259   9.26554536
   3.75050968  14.30733945  11.32492355   3.65468909  14.02879715
  10.98521916]
[ 1.51369223  5.43314186  5.07681005  2.41250334  8.69489714  7.30610473
  1.5918381   5.85526316  4.91597649  1.50374032  5.83275448  4.9159097
  1.40074806  5.20264494  4.13572001  1.56071333  5.66090035  4.87897408
  1.53840502  5.71179535  4.91270371]
[  7.8209607   34.7860262   24.60262009  10.9628821   46.1419214
  34.09388646   7.93449782  37.69868996  25.20960699   8.26200873
  37.569869    26.21615721   7.5349345   32.52401747  20.48471616
   8.06768559  39.56550218  25.6069869    7.99344978  38.66157205
  25.01091703]
2018-01-15 12:04:15.743430
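
For reference, a run like the one above could be timed roughly as follows. This is only a sketch, not the script that produced these numbers: `timed_fit` and `kmeans_benchmark` are hypothetical helpers, the column names `raw` and `scaled` are made up, and `df` / `feature_cols` are assumed to be an existing Spark DataFrame and its list of numeric column names. The pyspark classes used (`KMeans`, `VectorAssembler`, `StandardScaler`) are from `pyspark.ml`.

```python
from datetime import datetime


def timed_fit(estimator, df):
    """Fit `estimator` on `df` and print wall-clock timestamps around the fit."""
    start = datetime.now()
    print("start:", start)
    model = estimator.fit(df)
    done = datetime.now()
    print("done fitting:", done)
    return model, (done - start).total_seconds()


def kmeans_benchmark(df, feature_cols, k, scale=False):
    """Hypothetical driver: assemble features, optionally scale, then fit KMeans."""
    # Imported here so timed_fit stays usable without a Spark installation.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    # Pack the numeric columns into a single vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw")
    data = assembler.transform(df)
    features_col = "raw"

    if scale:
        # Scale each feature to unit standard deviation (no mean centering).
        scaler = StandardScaler(inputCol="raw", outputCol="scaled",
                                withMean=False, withStd=True)
        data = scaler.fit(data).transform(data)
        features_col = "scaled"

    kmeans = KMeans(k=k, featuresCol=features_col, seed=1)
    model, seconds = timed_fit(kmeans, data)

    print("Cluster Centers: ")
    for center in model.clusterCenters():
        print(center)
    return model, seconds
```

Calling `kmeans_benchmark(df, feature_cols, k=4)` would then print two timestamps and one center per cluster. Note that only the fit itself is timed here, not the loading or feature assembly.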

35306 rows with 168 features (without feature scaling):
4 clusters, ~18 min

2018-01-15 13:40:00.799205
Cluster Centers: 
[ 0.13166421  0.08689249  0.07157585  0.09307806  0.14786451  0.41030928
  1.24918999  2.01354934  2.37437408  2.37083947  2.25891016  2.11340206
  1.53343152  1.53343152  2.00147275  2.2910162   2.50751105  2.69337261
  2.3808542   2.05125184  1.69131075  1.13784978  0.58615611  0.2730486
  0.13873343  0.0892489   0.07687776  0.10427099  0.15905744  0.46185567
  1.22297496  1.97496318  2.28276878  2.25920471  2.20913108  1.90544919
  1.42974963  1.40677467  1.81207658  2.11958763  2.25920471  2.52665685
  2.37938144  2.15022091  1.60824742  1.03328424  0.47216495  0.25773196
  0.12459499  0.08571429  0.07157585  0.09955817  0.13696613  0.44683358
  1.32459499  2.14167894  2.57702504  2.69484536  2.6005891   2.34050074
  1.54226804  1.68630339  2.26686303  2.66921944  2.70986745  2.765243
  2.48011782  2.06833579  1.65891016  1.0167894   0.50927835  0.29690722
  0.15346097  0.09131075  0.09513991  0.09631811  0.15405007  0.42916053
  1.28865979  2.16023564  2.59263623  2.69631811  2.58232695  2.20382916
  1.53932253  1.73932253  2.29661267  2.55817378  2.78556701  2.69690722
  2.5808542   2.17231222  1.67393225  1.10368189  0.49631811  0.23446244
  0.13166421  0.0916053   0.06715758  0.08041237  0.13814433  0.38291605
  1.14344624  1.94226804  2.34756996  2.34963181  2.26008837  1.93991163
  1.29779087  1.54874816  2.01266568  2.40765832  2.5437408   2.39528719
  1.8005891   1.4017673   0.98733432  1.00559647  0.56553756  0.21561119
  0.11988218  0.08335788  0.07805596  0.09808542  0.15139912  0.44123711
  1.3622975   2.21649485  2.65891016  2.72017673  2.61472754  2.25596465
  1.49366716  1.66539028  2.23976436  2.62385862  2.72842415  2.73019146
  2.47540501  2.17407953  1.65891016  1.03446244  0.52400589  0.23033873
  0.11310751  0.07835052  0.05891016  0.09572901  0.14756996  0.40147275
  1.29955817  2.23946981  2.57319588  2.58645066  2.57584683  2.22680412
  1.49513991  1.66804124  2.27658321  2.54874816  2.71281296  2.63946981
  2.42091311  2.07452135  1.66067747  0.95670103  0.48100147  0.22209131]
....
[ 0.02418858  0.01438121  0.0118744   0.01811945  0.02907028  0.09657841
  0.30429237  0.40003518  0.40913889  0.39537338  0.40372944  0.41261325
  0.31555106  0.28371009  0.32997625  0.36806227  0.43842906  0.53214883
  0.56887149  0.51965872  0.39145923  0.25195708  0.1322016   0.05246724
  0.02388073  0.01644824  0.01240215  0.01825139  0.03122526  0.10634181
  0.2972557   0.43535051  0.4730847   0.47836221  0.46468467  0.41934207
  0.31986103  0.30578767  0.35588002  0.39862785  0.44669716  0.50703668
  0.56099921  0.50140734  0.40223415  0.23885126  0.11047586  0.04419914
  0.02062626  0.01438121  0.01341367  0.01710793  0.02779488  0.10172399
  0.3192893   0.40434515  0.41731903  0.41643944  0.43073269  0.4233002
  0.31216466  0.30033424  0.35662767  0.4048729   0.4652564   0.50598118
  0.52432052  0.47906588  0.38895241  0.22759258  0.1124989   0.04732166
  0.02045035  0.01218225  0.01284194  0.01490896  0.02599173  0.09389568
  0.29765151  0.40210221  0.42000176  0.41789076  0.41454833  0.42206878
  0.315683    0.3006421   0.35878265  0.40601636  0.46041868  0.52440848
  0.54908083  0.49124813  0.37109684  0.23181458  0.10427478  0.04112059
  0.01886709  0.01161052  0.00927962  0.0129299   0.02154983  0.09231243
  0.29347348  0.36577535  0.37180051  0.36612719  0.36700677  0.36771044
  0.27262732  0.27306711  0.32368722  0.37426335  0.43944058  0.48997273
  0.44102384  0.36243293  0.26022517  0.23656434  0.11636907  0.05026827
  0.02247339  0.01345765  0.01095083  0.0142053   0.02594775  0.10185592
  0.31379189  0.40214619  0.41533996  0.41894626  0.40790747  0.40408127
  0.30033424  0.27975196  0.34070719  0.39757235  0.4417275   0.49507432
  0.5430117   0.50312253  0.38684141  0.23159469  0.10977219  0.0440672
  0.02176972  0.01394142  0.01262204  0.0150409   0.02704723  0.09921717
  0.29963057  0.38794089  0.40764359  0.4120855   0.3999912   0.41947401
  0.30904213  0.29734365  0.34079514  0.38033248  0.4502155   0.50540945
  0.53065353  0.49256751  0.37751781  0.23005541  0.11065177  0.04723371]
2018-01-15 13:57:49.923602

K-means clustering is pretty computationally intensive:

[screenshot, 2018-01-15 at 1:51:31 pm]

Additional cases:
44786 records with 168 features: 15 min (3 clusters)
44786 records with 168 features: 14 min (3 clusters)

-> The number of clusters significantly affects run time.
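
As a rough way to compare the cases above: each Lloyd-style K-means iteration computes one distance per (row, center) pair over all features, so per-iteration work grows with rows × features × clusters. The sketch below just does that arithmetic for the runs reported so far. The function name is made up, and since iteration counts are not recorded in these notes, this is a size proxy rather than a run-time prediction.

```python
def distance_terms_per_iteration(rows, features, clusters):
    """Per-iteration work proxy: one squared difference per row, feature and center."""
    return rows * features * clusters


# (description, rows, features, clusters) taken from the runs reported above.
CASES = [
    ("57744 rows, 21 features, 4 clusters", 57744, 21, 4),
    ("35306 rows, 168 features, 4 clusters", 35306, 168, 4),
    ("44786 rows, 168 features, 3 clusters", 44786, 168, 3),
]

for description, rows, features, clusters in CASES:
    work = distance_terms_per_iteration(rows, features, clusters)
    print("%s: %d terms per iteration" % (description, work))
```

By this measure the two 168-feature cases are of similar size, and both are several times larger than the 21-feature case.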

On Spark, small kernel:
~ With all 168 dimensions
With standard scaling, 46596 records and 3 clusters:
start: 2018-01-16 23:20:28.826417
done fitting: 2018-01-16 23:37:05.285336

No scaling, 46596 records and 5 clusters:
start: 2018-01-16 23:42:52.857258
done fitting: 2018-01-16 23:56:47.832736

3 clusters:
start: 2018-01-17 00:03:22.591677
done fitting: 2018-01-17 00:18:29.722218
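
To turn the three timestamp pairs above into durations, something like the following works. It is a small standard-library sketch: the timestamps are copied from the log lines, while `RUNS`, `elapsed_seconds` and `format_mmss` are names made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (label, start, done) copied from the log lines above.
RUNS = [
    ("standard scaling, 3 clusters",
     "2018-01-16 23:20:28.826417", "2018-01-16 23:37:05.285336"),
    ("no scaling, 5 clusters",
     "2018-01-16 23:42:52.857258", "2018-01-16 23:56:47.832736"),
    ("3 clusters",
     "2018-01-17 00:03:22.591677", "2018-01-17 00:18:29.722218"),
]


def elapsed_seconds(start, done):
    """Seconds between two timestamps in datetime's default string format."""
    delta = datetime.strptime(done, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def format_mmss(seconds):
    """Render a duration as minutes:seconds, rounded to the nearest second."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return "%d:%02d" % (minutes, secs)


for label, start, done in RUNS:
    print("%s: %s" % (label, format_mmss(elapsed_seconds(start, done))))
```

For these three runs the durations come out at about 16:36, 13:55 and 15:07 respectively.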