ML4JavaDevs

Apache License 2.0

ML4JavaDevs LinkedIn group: https://www.linkedin.com/groups/8667046

Initial tool requirements for learning objectives:

  1. In this iteration we will create simple Java tools, each typically comprising a single class with a main method.
  2. We will keep utility classes to a minimum.
  3. We will adopt the simplest coding style possible, so the code lends itself to easy reading.
  4. We will refrain from using frameworks for logging and configuration to keep readability up (so use System.out.println instead of log4j or SLF4J, and the Properties class bundled with the standard JDK instead of the Spring framework). A minimal sketch of this style follows the list.
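
To make the style concrete, here is a minimal sketch of the kind of tool we have in mind; tool.properties is a hypothetical file (e.g. containing k=3), and everything used comes from the standard JDK:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class StyleExample {
    public static void main(String[] args) throws IOException {
        // Configuration via the JDK's own Properties -- no Spring needed.
        // tool.properties is a hypothetical file sitting next to the class.
        Properties config = new Properties();
        try (FileInputStream in = new FileInputStream("tool.properties")) {
            config.load(in);
        }
        int k = Integer.parseInt(config.getProperty("k", "2"));

        // Output via plain System.out.println -- no log4j or SLF4J needed.
        System.out.println("Running with k = " + k);
    }
}
```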

What we will be coding and learning:

  1. Our first learning problem (Intro2MLTutorial1KMeans.java) -- a runnable sketch follows this list
  • Create a 2D synthetic data set with two Gaussians
  • Run K-Means Clustering animation
  • Show how the centroids start out
  • Show how the centroids drift apart
  • What did the machine learn?
  • Unpack the algorithm and discuss how it works
  • Discuss the concept of distance and how it relates to the concept of similarity
  • Ask the question: is there something out of place -- do some points belong to the wrong cluster?
  2. Our second learning problem (Intro2MLTutorial2KMeansDistanceProblem.java)
  • Create the same 2D Gaussians, except multiply one dimension by 1,000 (say, measurements in inches vs. yards)
  • Run K-Means Clustering animation
  • What just happened?
  • Let's revisit the Pythagorean theorem
  • Let's normalize both dimensions (use a simple normalization -- consider the Z-score; a sketch follows this list)
  • Run K-Means Clustering animation again
  3. Our third learning problem (Intro2MLTutorial2KMeansWhatIsK.java)
  • Create 2D Gaussians as before, except this time create 5 of them
  • Run K-Means Clustering animation with K=2
  • What just happened?
  • Run K-Means Clustering animation with K=3
  • What just happened?
  • Run K-Means Clustering animation with K=5
  • What just happened?
  • Run K-Means Clustering animation with K=7
  • What just happened?
  4. Our fourth learning problem (Intro2MLTutorial2KMeansVsIris.java)
  • Load the Iris dataset (a plain-JDK loading sketch follows this list)
  • Briefly discuss the Iris dataset
  • Visualize the Iris dataset
  • Point out the number of dimensions
  • Run a K-means on it
  • Discuss results
  5. Our fifth learning problem -- high-dimensional cubes are spiky (Intro2MLSpikyCubes.java)
  • Let's discuss what an N-dimensional cube looks like
  • Let's generate 2 N-dimensional Gaussians where N > 20
  • Apply K-Means
  • What happened?
  • As a bonus exercise, compare the percentage of the volume contained in the corners of a 3-dimensional cube vs. a 20-dimensional cube (a Monte Carlo sketch follows this list)
  • Revisit the notion of distance
  • Discuss why higher dimensions may be a problem
  • Discussion
  • What was the machine learning?
  • What is the model?
  • What do you think about Box’s famous quote: "All models are wrong, but some are useful."
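
For the first problem, here is a minimal sketch of the shape Intro2MLTutorial1KMeans.java could take -- the class name KMeansSketch, the cluster centers, the seed, and the fixed iteration count are illustrative assumptions, not the repository's actual code. It generates two 2D Gaussian blobs, runs Lloyd's assignment/update loop, and prints the centroids after each pass so you can watch them drift apart:

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);

        // Two synthetic 2D Gaussian clusters, 100 points each.
        double[][] points = new double[200][2];
        for (int i = 0; i < 100; i++) {
            points[i][0] = 2.0 + rnd.nextGaussian();       // blob A around (2, 2)
            points[i][1] = 2.0 + rnd.nextGaussian();
            points[100 + i][0] = 8.0 + rnd.nextGaussian(); // blob B around (8, 8)
            points[100 + i][1] = 8.0 + rnd.nextGaussian();
        }

        // Forgy-style initialization: pick k random data points as centroids.
        int k = 2;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[rnd.nextInt(points.length)].clone();
        }

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < 20; iter++) { // fixed pass count keeps the sketch simple
            // Assignment step: each point joins its nearest centroid
            // (squared Euclidean distance -- the Pythagorean theorem again).
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - centroids[c][0];
                    double dy = points[i][1] - centroids[c][1];
                    if (dx * dx + dy * dy < bestDist) {
                        bestDist = dx * dx + dy * dy;
                        best = c;
                    }
                }
                assignment[i] = best;
            }

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
            System.out.println("Iteration " + iter + ": " + Arrays.deepToString(centroids));
        }
    }
}
```

Printing the centroids each pass stands in for the animation; changing k and generating five blobs instead of two reproduces the third problem's K=2/3/5/7 experiments.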
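
For the second problem, the reason one dimension scaled by 1,000 breaks clustering is that squared Euclidean distance sums squared per-axis differences, so the large axis drowns out the other by a factor of about 1,000^2. A minimal Z-score normalization sketch, assuming the data sits in a plain double[][] with one row per point:

```java
public class ZScoreSketch {
    // Z-score normalize each column in place: (x - mean) / stdDev.
    static void zScore(double[][] data) {
        for (int d = 0; d < data[0].length; d++) {
            double mean = 0;
            for (double[] row : data) mean += row[d];
            mean /= data.length;

            double variance = 0;
            for (double[] row : data) variance += (row[d] - mean) * (row[d] - mean);
            double stdDev = Math.sqrt(variance / data.length);

            for (double[] row : data) row[d] = (row[d] - mean) / stdDev;
        }
    }

    public static void main(String[] args) {
        // Same shape, wildly different scales (think yards vs. inches).
        double[][] data = { { 1.0, 1000.0 }, { 2.0, 2000.0 }, { 3.0, 3000.0 } };
        zScore(data);
        // After normalization both columns are identical, so neither
        // dimension dominates the distance calculation any more.
        for (double[] row : data) {
            System.out.println(row[0] + ", " + row[1]);
        }
    }
}
```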
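
For the fourth problem, a sketch of loading the Iris data with nothing but the JDK. It assumes a local file named iris.csv with no header row and four comma-separated measurements followed by the species name on each line (the file name and layout are assumptions; the Smile repository linked under Supporting material ships clustering datasets in a similar spirit):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class IrisLoaderSketch {
    public static void main(String[] args) throws IOException {
        List<double[]> flowers = new ArrayList<>();
        List<String> species = new ArrayList<>();

        // Assumed layout: sepal length, sepal width, petal length,
        // petal width, species -- one flower per line, no header.
        try (BufferedReader reader = new BufferedReader(new FileReader("iris.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) continue;
                String[] parts = line.split(",");
                double[] features = new double[4];
                for (int i = 0; i < 4; i++) {
                    features[i] = Double.parseDouble(parts[i].trim());
                }
                flowers.add(features);
                species.add(parts[4].trim());
            }
        }

        System.out.println("Loaded " + flowers.size() + " flowers with "
                + flowers.get(0).length + " dimensions each.");
        // From here the K-Means loop from the first sketch applies unchanged,
        // except that the distance runs over four dimensions instead of two.
    }
}
```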
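
For the fifth problem's bonus exercise, a Monte Carlo sketch that estimates how much of a cube's volume sits "in the corners", i.e. outside the inscribed sphere. In 3 dimensions that fraction is 1 - pi/6, roughly 48%; in 20 dimensions it is essentially 100%, which is why nearly all of a high-dimensional cube is spike and Euclidean distance starts behaving strangely there:

```java
import java.util.Random;

public class SpikyCubeSketch {
    // Fraction of the cube [-1, 1]^dims lying outside the inscribed
    // unit sphere, estimated by uniform random sampling.
    static double cornerFraction(int dims, int samples, Random rnd) {
        int outside = 0;
        for (int s = 0; s < samples; s++) {
            double squaredNorm = 0;
            for (int d = 0; d < dims; d++) {
                double x = rnd.nextDouble() * 2 - 1; // uniform in [-1, 1]
                squaredNorm += x * x;
            }
            if (squaredNorm > 1.0) outside++; // farther than radius 1 from the center
        }
        return (double) outside / samples;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int samples = 1_000_000;
        System.out.println("3D  corner fraction: " + cornerFraction(3, samples, rnd));
        System.out.println("20D corner fraction: " + cornerFraction(20, samples, rnd));
    }
}
```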

Supporting material:

  1. An Introduction to Statistical Learning -- http://www-bcf.usc.edu/~gareth/ISL/ -- the full book is available for free (PDF format)
  2. Lecture 13.2 — Clustering | KMeans Algorithm — [ Machine Learning | Andrew Ng ] -- https://www.youtube.com/watch?v=hDmNF9JG3lo
  3. Clustering datasets bundled with the Smile machine learning library -- https://github.com/haifengl/smile/tree/master/shell/src/universal/data/clustering

Academic references:

  1. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (eds L. M. Le Cam & J. Neyman), 1, pp. 281–297. Berkeley, CA: University of California Press. -- https://bit.ly/2LeL0Lp
  2. Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R. and Wu, A. Y. (2002). An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence. -- https://bit.ly/2KzdZIY
  3. Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. -- https://stanford.io/2rT1IrV
  4. Peterson, A. D., Ghosh, A. P. and Maitra, R. (2010). A systematic evaluation of different methods for initializing the K-means clustering algorithm.
  5. Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21, 768–769.
  6. Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics 28, 100–108. -- https://bit.ly/2hDNuFB
  7. Lloyd, S. P. (1957, 1982). Least squares quantization in PCM. Technical note, Bell Laboratories; published in 1982 in IEEE Transactions on Information Theory 28, 128–137. -- https://bit.ly/2KD64dE