Benchmarking Kmeans Clustering - pyspark tiny/small
Closed this issue · 4 comments
anhnongdan commented
57744 rows with 21 features: about 3 min 30 s
(4 clusters, centers below)
2018-01-15 12:00:44.462798
Cluster Centers:
[ 0.49377442 1.61198229 1.59184684 0.77210211 2.51198229 2.12154207
0.50455848 1.67856213 1.45303464 0.48958062 1.65298255 1.48598593
0.45939047 1.49434749 1.28111487 0.49796822 1.63568638 1.48470956
0.4823131 1.63777025 1.47809325]
[ 3.40468909 13.05657492 11.47018349 5.29892966 19.16233435
16.08639144 3.63634047 14.47120285 11.31931702 3.62793068
14.45438328 11.38786952 3.20846075 12.85066259 9.26554536
3.75050968 14.30733945 11.32492355 3.65468909 14.02879715
10.98521916]
[ 1.51369223 5.43314186 5.07681005 2.41250334 8.69489714 7.30610473
1.5918381 5.85526316 4.91597649 1.50374032 5.83275448 4.9159097
1.40074806 5.20264494 4.13572001 1.56071333 5.66090035 4.87897408
1.53840502 5.71179535 4.91270371]
[ 7.8209607 34.7860262 24.60262009 10.9628821 46.1419214
34.09388646 7.93449782 37.69868996 25.20960699 8.26200873
37.569869 26.21615721 7.5349345 32.52401747 20.48471616
8.06768559 39.56550218 25.6069869 7.99344978 38.66157205
25.01091703]
2018-01-15 12:04:15.743430
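The two timestamps around the cluster centers suggest the fit was bracketed by `datetime.now()` prints. The thread does not show the code that was actually run, so the following is only a minimal sketch of how such a measurement could be taken with `pyspark.ml`. The helper names `timed` and `fit_kmeans`, the column names `features` and `scaled`, and the `seed` value are all assumptions made for illustration.

```python
from datetime import datetime


def timed(fn):
    """Run fn() and return (result, elapsed), where elapsed is a timedelta."""
    start = datetime.now()
    result = fn()
    return result, datetime.now() - start


def fit_kmeans(df, feature_cols, k, scale=False, seed=1):
    """Fit a pyspark.ml KMeans model on the numeric columns of df.

    Hypothetical helper: the column names and the seed are assumptions,
    not taken from the original benchmark.
    """
    # Imported lazily so the timing helper can be used without Spark installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    # KMeans expects a single vector column, so assemble the features first.
    data = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)
    features_col = "features"

    if scale:
        # Optional standard scaling (zero mean, unit variance per feature).
        scaler = StandardScaler(inputCol="features", outputCol="scaled",
                                withMean=True, withStd=True)
        data = scaler.fit(data).transform(data)
        features_col = "scaled"

    return KMeans(k=k, seed=seed, featuresCol=features_col).fit(data)


# Usage, assuming `df` is an existing Spark DataFrame of numeric columns:
#   model, elapsed = timed(lambda: fit_kmeans(df, df.columns, k=4))
#   print("Cluster Centers:")
#   for center in model.clusterCenters():
#       print(center)
#   print("elapsed:", elapsed)
```

Because Spark evaluates lazily, a measurement taken this way also includes reading and assembling the input unless the DataFrame was cached beforehand, which may account for part of the run times reported here.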
anhnongdan commented
35306 rows with 168 features (without feature scaling):
4 clusters: ~18 min
2018-01-15 13:40:00.799205
Cluster Centers:
[ 0.13166421 0.08689249 0.07157585 0.09307806 0.14786451 0.41030928
1.24918999 2.01354934 2.37437408 2.37083947 2.25891016 2.11340206
1.53343152 1.53343152 2.00147275 2.2910162 2.50751105 2.69337261
2.3808542 2.05125184 1.69131075 1.13784978 0.58615611 0.2730486
0.13873343 0.0892489 0.07687776 0.10427099 0.15905744 0.46185567
1.22297496 1.97496318 2.28276878 2.25920471 2.20913108 1.90544919
1.42974963 1.40677467 1.81207658 2.11958763 2.25920471 2.52665685
2.37938144 2.15022091 1.60824742 1.03328424 0.47216495 0.25773196
0.12459499 0.08571429 0.07157585 0.09955817 0.13696613 0.44683358
1.32459499 2.14167894 2.57702504 2.69484536 2.6005891 2.34050074
1.54226804 1.68630339 2.26686303 2.66921944 2.70986745 2.765243
2.48011782 2.06833579 1.65891016 1.0167894 0.50927835 0.29690722
0.15346097 0.09131075 0.09513991 0.09631811 0.15405007 0.42916053
1.28865979 2.16023564 2.59263623 2.69631811 2.58232695 2.20382916
1.53932253 1.73932253 2.29661267 2.55817378 2.78556701 2.69690722
2.5808542 2.17231222 1.67393225 1.10368189 0.49631811 0.23446244
0.13166421 0.0916053 0.06715758 0.08041237 0.13814433 0.38291605
1.14344624 1.94226804 2.34756996 2.34963181 2.26008837 1.93991163
1.29779087 1.54874816 2.01266568 2.40765832 2.5437408 2.39528719
1.8005891 1.4017673 0.98733432 1.00559647 0.56553756 0.21561119
0.11988218 0.08335788 0.07805596 0.09808542 0.15139912 0.44123711
1.3622975 2.21649485 2.65891016 2.72017673 2.61472754 2.25596465
1.49366716 1.66539028 2.23976436 2.62385862 2.72842415 2.73019146
2.47540501 2.17407953 1.65891016 1.03446244 0.52400589 0.23033873
0.11310751 0.07835052 0.05891016 0.09572901 0.14756996 0.40147275
1.29955817 2.23946981 2.57319588 2.58645066 2.57584683 2.22680412
1.49513991 1.66804124 2.27658321 2.54874816 2.71281296 2.63946981
2.42091311 2.07452135 1.66067747 0.95670103 0.48100147 0.22209131]
....
[ 0.02418858 0.01438121 0.0118744 0.01811945 0.02907028 0.09657841
0.30429237 0.40003518 0.40913889 0.39537338 0.40372944 0.41261325
0.31555106 0.28371009 0.32997625 0.36806227 0.43842906 0.53214883
0.56887149 0.51965872 0.39145923 0.25195708 0.1322016 0.05246724
0.02388073 0.01644824 0.01240215 0.01825139 0.03122526 0.10634181
0.2972557 0.43535051 0.4730847 0.47836221 0.46468467 0.41934207
0.31986103 0.30578767 0.35588002 0.39862785 0.44669716 0.50703668
0.56099921 0.50140734 0.40223415 0.23885126 0.11047586 0.04419914
0.02062626 0.01438121 0.01341367 0.01710793 0.02779488 0.10172399
0.3192893 0.40434515 0.41731903 0.41643944 0.43073269 0.4233002
0.31216466 0.30033424 0.35662767 0.4048729 0.4652564 0.50598118
0.52432052 0.47906588 0.38895241 0.22759258 0.1124989 0.04732166
0.02045035 0.01218225 0.01284194 0.01490896 0.02599173 0.09389568
0.29765151 0.40210221 0.42000176 0.41789076 0.41454833 0.42206878
0.315683 0.3006421 0.35878265 0.40601636 0.46041868 0.52440848
0.54908083 0.49124813 0.37109684 0.23181458 0.10427478 0.04112059
0.01886709 0.01161052 0.00927962 0.0129299 0.02154983 0.09231243
0.29347348 0.36577535 0.37180051 0.36612719 0.36700677 0.36771044
0.27262732 0.27306711 0.32368722 0.37426335 0.43944058 0.48997273
0.44102384 0.36243293 0.26022517 0.23656434 0.11636907 0.05026827
0.02247339 0.01345765 0.01095083 0.0142053 0.02594775 0.10185592
0.31379189 0.40214619 0.41533996 0.41894626 0.40790747 0.40408127
0.30033424 0.27975196 0.34070719 0.39757235 0.4417275 0.49507432
0.5430117 0.50312253 0.38684141 0.23159469 0.10977219 0.0440672
0.02176972 0.01394142 0.01262204 0.0150409 0.02704723 0.09921717
0.29963057 0.38794089 0.40764359 0.4120855 0.3999912 0.41947401
0.30904213 0.29734365 0.34079514 0.38033248 0.4502155 0.50540945
0.53065353 0.49256751 0.37751781 0.23005541 0.11065177 0.04723371]
2018-01-15 13:57:49.923602
anhnongdan commented
Additional cases:
44786 records with 168 features: 15 min (3 clusters)
44786 records with 168 features: 14 min (3 clusters)
-> The number of clusters significantly affects run time: 3 clusters on 44786 records finished faster than 4 clusters on 35306 records (~18 min).
anhnongdan commented
On Spark, small kernel:
~ With all 168 dimensions
With standard scaling, 46596 records and 3 clusters:
start: 2018-01-16 23:20:28.826417
done fitting: 2018-01-16 23:37:05.285336
No scaling, 46596 records and 5 clusters:
start: 2018-01-16 23:42:52.857258
done fitting: 2018-01-16 23:56:47.832736
3 clusters:
start: 2018-01-17 00:03:22.591677
done fitting: 2018-01-17 00:18:29.722218
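The start and end timestamps above can be turned into elapsed times to make the three runs easier to compare. This is a small sketch using only the Python standard library. The timestamps are copied from this comment; the names `RUNS`, `elapsed_seconds` and `format_mmss` are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# (label, start, done fitting), copied from the timestamps in this comment.
RUNS = [
    ("standard scaling, 3 clusters",
     "2018-01-16 23:20:28.826417", "2018-01-16 23:37:05.285336"),
    ("no scaling, 5 clusters",
     "2018-01-16 23:42:52.857258", "2018-01-16 23:56:47.832736"),
    ("3 clusters",
     "2018-01-17 00:03:22.591677", "2018-01-17 00:18:29.722218"),
]


def elapsed_seconds(start, end):
    """Return the number of seconds between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


def format_mmss(seconds):
    """Format a duration as minutes:seconds, truncating fractional seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"


for label, start, end in RUNS:
    print(f"{label}: {format_mmss(elapsed_seconds(start, end))}")
```

With the timestamps above this gives roughly 16:36 for the scaled 3-cluster run, 13:54 for the unscaled 5-cluster run, and 15:07 for the last 3-cluster run.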