Repository for a clustering task
- input - Demo.bin
- task - non-hierarchical clustering of a 3D dataset
- output - Python 3 compatible code
Many file formats start with a magic number. The function get_magic_number shows that 'Demo.bin' starts with PK\x03, which is the beginning of the ZIP signature (PK\x03\x04), so I unzipped Demo.bin and got text data. The method get_df_from_directory_files then loads the extracted files into a pandas DataFrame. The unzipping, DataFrame-building, and magic-number functions are in the data_extraction.py file.
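The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from data_extraction.py: the names read_magic_number and extract_if_zip are hypothetical, and comparing against the full four-byte signature is an assumption.

```python
import zipfile
from pathlib import Path

# Local file header signature that starts a non-empty ZIP archive.
ZIP_SIGNATURE = b"PK\x03\x04"


def read_magic_number(path, length=4):
    """Return the first `length` bytes of the file at `path` (hypothetical helper)."""
    with open(path, "rb") as handle:
        return handle.read(length)


def extract_if_zip(path, target_dir):
    """Extract `path` into `target_dir` if it starts with the ZIP signature.

    Returns the list of extracted member names, or an empty list when the
    file does not look like a ZIP archive.
    """
    if read_magic_number(path, len(ZIP_SIGNATURE)) != ZIP_SIGNATURE:
        return []
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Checking the leading bytes instead of the file extension is what lets a file named Demo.bin be recognised as an archive.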
- Use pd.DataFrame.hist() to inspect the distribution of each column.
- Remove outliers with the method data_preprocessing.remove_outliers.
- Use sklearn methods for min-max normalization.
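A possible shape for the outlier removal and normalization steps is sketched below. The rule used inside data_preprocessing.remove_outliers is not described here, so the interquartile-range (IQR) filter is an assumption, as are the function names drop_iqr_outliers and minmax_scale and the column names passed to them.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def drop_iqr_outliers(df, columns, k=1.5):
    """Keep rows whose values in `columns` lie within [Q1 - k*IQR, Q3 + k*IQR].

    Assumed rule for illustration; the repository's own method may differ.
    """
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # A row is kept only if every selected column is inside its bounds.
    within = ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)
    return df[within]


def minmax_scale(df, columns):
    """Return a copy of `df` with `columns` rescaled to the range [0, 1]."""
    scaled = df.copy()
    scaled[columns] = MinMaxScaler().fit_transform(df[columns])
    return scaled
```

Calling df.hist(bins=30) before and after drop_iqr_outliers (with matplotlib installed) gives a quick visual check that the long tails are gone.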
I tried two approaches for K-means:
- with one-hot encoding, because of the third column in the data
- without one-hot encoding
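The two variants could look roughly like this. It is a sketch under assumptions: the column names x, y and z are hypothetical, the third column is assumed to take a small number of discrete values, and build_features and cluster are illustrative helpers, not functions from this repository.

```python
import pandas as pd
from sklearn.cluster import KMeans


def build_features(df, one_hot_column=None):
    """Return the feature matrix, optionally one-hot encoding one column."""
    if one_hot_column is None:
        return df.to_numpy(dtype=float)
    # Each distinct value of the column becomes its own 0/1 feature.
    encoded = pd.get_dummies(df, columns=[one_hot_column], dtype=float)
    return encoded.to_numpy(dtype=float)


def cluster(df, n_clusters, one_hot_column=None, seed=0):
    """Run K-means on the (optionally one-hot encoded) features and return labels."""
    features = build_features(df, one_hot_column)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(features)
```

With these helpers, cluster(df, 6) treats z as a numeric coordinate, while cluster(df, 6, one_hot_column="z") treats each z level as a separate category.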
To choose the number of clusters I used silhouette analysis.
The optimal number of clusters I found is 6,
because in this case the classes are visible at different levels of the z-axis.
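Silhouette analysis for choosing the number of clusters can be sketched as below, using sklearn's silhouette_score. The helper names silhouette_by_k and best_k, the candidate range of k, and the fixed random seed are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, seed=0):
    """Fit K-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


def best_k(scores):
    """Return the k with the highest mean silhouette score."""
    return max(scores, key=scores.get)
```

A typical call would be best_k(silhouette_by_k(features, range(2, 11))), where features is the normalized data; the silhouette score is only defined for at least two clusters, so the range starts at 2.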