Python-DP-Means-Clustering ========================== Comparing DP-means and K-means clustering algorithms "cluster.py" has implementations of k-means and dp-means clustering algorithms. Implementations were intended to be straight-forward, understandable and give full output for diagnostics, rather than optimized implmentations. For more information on the dp-means, see Revisiting k-means: New Algorithms via Bayesian Nonparametrics at http://arxiv.org/abs/1111.0352/ CLUSTERING ========== > ./cluster.py -h Usage: cluster.py [options] Options: -h, --help show this help message and exit -k CLUSTERS, --kmeans-clusters=CLUSTERS If present, use kmeans with number of clusters specified -l LAM, --lamda=LAM If preset, use dpmeans with lambda parameters specified -x XVAL, --cross-validate=XVAL Number of records to hold out for cross validations. Data will be random-ordered for you. -s, --cross-validate-stop Stop when cross-validation error rises. > cat input/c3_s20_f2.csv | ./cluster.py -k2 Tolerance reached at step 8 Iterations completed: 8 Final error: 2.994711 elapsed time: 6.582022 ms > cat input/c3_s20_f2.csv | ./cluster.py -k2 Tolerance reached at step 3 Iterations completed: 3 Final error: 2.994711 elapsed time: 2.923965 ms > cat input/c3_s20_f2.csv | ./cluster.py -l2 Tolerance reached at step 4 Iterations completed: 4 Final error: 0.520926 elapsed time: 5.388021 ms > cat input/c3_s20_f2.csv | ./cluster.py -l5 Tolerance reached at step 2 Iterations completed: 2 Final error: 2.994711 elapsed time: 2.490044 ms To plot errors for the last run (dp-means in this case), use "plotResult.r" This script reads ./output/results.csv and ./output/error.csv. > ./plotResult.r Loading required package: methods Loading required package: grid V1 V2 V3 V4 cluster Min. :-4.997 Min. :-4.11117 Min. :0.000 Iter-0 :65 0:103 1st Qu.:-3.413 1st Qu.:-3.01148 1st Qu.:0.000 Iter-1 :65 1: 90 Median :-2.562 Median :-2.34123 Median :2.000 Iter-2 :65 2: 59 Mean :-1.606 Mean :-1.68030 Mean :1.918 Iter-3 :65 3: 12 3rd Qu.: 1.425 3rd Qu.:-0.01388 3rd Qu.:4.000 Iter-4 :65 4:126 Max. : 2.224 Max. : 2.00472 Max. :4.000 Iter-Final:65 V1 V2 Min. :0 Min. :0.5209 1st Qu.:1 1st Qu.:0.5209 Median :2 Median :0.5388 Mean :2 Mean :0.6489 3rd Qu.:3 3rd Qu.:0.5777 Max. :4 Max. :1.0860 See training output images created in ./img/iters.png and ./img/error.png OPTIMAL DP-MEANS ================ Finds the optimal value of lambda only from data. cat input/c4_s300_f2.csv | ./DPopt.py ... Final error: 18.510049 Final cross-validation error: 18.223702 Tolerance reached at step 6 Iterations completed: 6 Final error: 14.329098 Final cross-validation error: 14.262425 lambda: 5.48775 with error: 14.26242 Code holds back 20% of data for training optimization. There are no parameters to set unless you anticipate more than the default max number of clusters (set in code). CREATE TEST DATA ================ > ./createTestData.py -h Usage: createTestData.py [options] Options: -h, --help show this help message and exit -s SAMPLE, --sample-size=SAMPLE Sample size per cluster -f FEATURES, --features=FEATURES Number of features -c CLUSTERS, --clusters=CLUSTERS Sample size -o OVERLAP, --overlap=OVERLAP 0 - distinct, 1 - scale = sig > ./createTestData.py -s6 -c1 -f3 2.80484810546906,-5.107337369680055,1.7687444192348534 4.045291632153071,-4.955840347993885,1.5936351799326172 3.503220395140305,-5.008280722637208,1.5695863487866264 3.2134872837791812,-4.809839458886229,1.3158740999089755 3.8383496901618197,-4.745260338782687,1.74511375801971 3.3736868708580805,-5.2559718245077045,1.4113521104252063 CLUSTERING TESTS ================ Example test run on data set with 3 features, 100 points per cluster, with 4 clusters. > ./test.py | tee output/test.all.csv | grep -v Inter > output/test.csv ... Tolerance reached at step 1 Iterations completed: 1 Final error: 0.091798 Tolerance reached at step 1 Iterations completed: 1 Final error: 0.091798 Tolerance reached at step 1 Iterations completed: 1 Final error: 0.091798 Tolerance reached at step 1 Iterations completed: 1 Final error: 0.091798 Tolerance reached at step 1 Iterations completed: 1 Final error: 0.091798 Tolerance reached at step 1 Iterations completed: 1 Final error: 0.091798 ... > ./test.py -h Usage: test.py [options] Options: -h, --help show this help message and exit -f FILE, --file=FILE Input file name -i ITER, --iterations=ITER Iterations to use in searching for min error. Default 20. Plot the test results, > ./plotTest.r Loading required package: methods Loading required package: grid V1 V2 V3 V4 dp-means:12 Min. : 0.5774 Min. : 1.046 Min. : 345.6 k-means :12 1st Qu.: 2.7424 1st Qu.: 2.722 1st Qu.: 1438.7 Median : 4.8094 Median : 3.758 Median : 4243.7 Mean : 5.1264 Mean : 7.159 Mean : 4779.7 3rd Qu.: 6.9462 3rd Qu.: 6.242 3rd Qu.: 5946.4 Max. :12.0000 Max. :33.695 Max. :12496.4 method dp-means:12 k-means :12 See ./img/test_errors.png and ./img/test_times.png for comparative error and times for k-means and dp-means. NOTE: lambda is chosen based on relevant scale of the data. In this example, the data set was created to fall between -5 and 5, so the range is 10. The maximum lambda is there 10, while the smallest lambda could be chosen as the smallest expected cluster size.