/tada

TAbular DAta soft clustering

New plan, January 2017

Brainstorming

  • Deal with nested clusters, maybe by adding an extra dimension (maybe similar to SVM?).
  • Merge close clusters (e.g. mayHighC and novHighC); maybe we should also take nested clusters into account.
  • Maybe we can use the average membership of each cluster. We can also take the same training sample and predict it with the clustering to see whether two clusters are close to each other.

Assumptions

  • We are dealing with numerical data only.

Algorithm

  1. From the training data, set the cluster centers the hard k-means way (with membership being 0 or 1), since in the beginning each training set belongs to a single type.
  2. Classify the column of interest from its data using FCM (fuzzy c-means).
  3. Compute the average of the membership matrix for each cluster and consider it the membership of this column.
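
The three steps above can be sketched in plain Python as follows. This is only a minimal illustration under several assumptions that are not stated in the plan: the data is one-dimensional, the distance is the absolute difference, the fuzzifier is m = 2, and the function names and the "temperature"/"year" columns are made up for the example.

```python
from statistics import mean


def hard_centers(training):
    """Step 1: return (labels, centers), one center per training column.

    Every value in a training column has membership 1 in its own cluster
    and 0 elsewhere, so the center is simply the column mean.
    """
    labels = sorted(training)
    return labels, [mean(training[label]) for label in labels]


def fcm_memberships(value, centers, m=2.0):
    """Step 2: standard FCM membership of one value for fixed centers."""
    distances = [abs(value - c) for c in centers]
    # A value sitting exactly on one or more centers belongs fully to them.
    if 0.0 in distances:
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    weights = [d ** (-2.0 / (m - 1.0)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def column_membership(values, centers, m=2.0):
    """Step 3: average the per-value memberships, one score per cluster."""
    rows = [fcm_memberships(v, centers, m) for v in values]
    return [mean(column) for column in zip(*rows)]


# Hypothetical training columns and a hypothetical column of interest.
training = {"temperature": [10.0, 12.0, 14.0], "year": [1990.0, 2000.0, 2010.0]}
labels, centers = hard_centers(training)
scores = column_membership([11.0, 13.0, 12.5], centers)
print(dict(zip(labels, scores)))
```

The averaged scores sum to 1 across clusters, so they can be read as a soft assignment of the whole column; whether the average is the right summary is still one of the open points listed below.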

Things to consider

  • Think about the use of statistical tests.
  • Check whether the average is a good measure.
  • Use a test bed for comparison.
  • Why should it perform better?
  • What about the relation between columns?
  • What about the subclasses?
  • Learning from new data sources (e.g. if the error is too high, or the column is far from any of the clusters, we add it as a new cluster).
  • Allow human intervention to correct the classification.
  • Verify the case where similar columns coexist in the same dataset, e.g. city_name_english, city_name_spanish.

** The items below are not part of the new idea **

Detailed Plan

  • Learn different statistical tests.
  • Check out this paper, "14.3 Are Two Distributions Different?", which seems to be really relevant.
  • Implement statistical tests.
  • Use them as features in K-means and visualize them.
  • Use soft clustering and Gaussian mixture models to define the clusters.
  • Label the clusters.
  • Compare our algorithm with the gold standard.

High-level Plan

  1. Work first with numerical data: for each column I'll create n features, where each feature is some kind of statistical test, e.g. t-test, Kolmogorov–Smirnov, etc.
  2. Include categorical data (see the papers "Approximation Algorithms for K-Modes Clustering" and "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values").
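
As an illustration of what one such per-column feature in step 1 might look like, here is a small sketch of the two-sample Kolmogorov–Smirnov statistic in plain Python. The function name is hypothetical, and only the statistic D (the largest gap between the two empirical CDFs) is computed, not a p-value.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # The ECDFs only change at observed values, so checking those is enough.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        fb = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d


# Hypothetical columns: identical, partially overlapping, and disjoint.
print(ks_statistic([1, 2, 3], [1, 2, 3]))
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
print(ks_statistic([1, 2, 3], [10, 11, 12]))
```

A value near 0 suggests the two columns follow similar distributions and a value near 1 suggests they barely overlap, which is what would make it usable as a feature alongside the other tests.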

Assumptions

  • There are no similar columns in the training set, e.g. city_name_english, city_name_french

Ideas

  • Run K-means multiple times and obtain the probabilities (Juan Manuel's idea).
  • Use PCA on the collection of tests to reduce the dimensionality.
  • Add further (dummy) data to the training set if the user classified one column in the data set to be the same as one column in the training set that is not very similar.
  • Maybe use the central limit theorem for the training samples in case we are using the Standard Error of the Mean.
  • Use Cohen's d (the distance between the population mean and the sample mean in units of the sample standard deviation S) as a feature. The problem is that it is significantly affected by outliers, so we can eliminate them (or the top 10% and bottom 10%).
  • Use t-values instead of z-values to compute the standard deviation, so we can save time. It is also possible to take something like 10 samples, each of size 100.
  • Maybe try to use R^2 (how correlated the data is) as a feature.
  • For the probability, we can use the corresponding probability of the z/t score, but I'm not sure whether the distribution has to be normal for that.
  • Enhancing Cluster Labeling Using Wikipedia.
  • It would be nice to include something from the paper "On the Appropriateness of Statistical Tests in Machine Learning" to justify the use of statistical tests.
  • Use Continuous Reinforcement Learning to learn the types corrected manually by the domain expert. I have no idea how this would be done, but it is an interesting approach.
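
The Cohen's d idea above, with the top and bottom 10% removed, could look roughly like this. It is a sketch only: it follows the definition given in the note (sample mean minus population mean, divided by the sample standard deviation), the function names are hypothetical, and the population mean is assumed to be known from the training column.

```python
from statistics import mean, stdev


def trim(values, fraction=0.10):
    """Drop the lowest and highest `fraction` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered


def cohens_d(sample, population_mean, trim_fraction=0.10):
    """Trimmed sample mean minus population mean, in units of sample std."""
    kept = trim(sample, trim_fraction)
    s = stdev(kept)
    if s == 0:
        raise ValueError("sample has zero spread after trimming")
    return (mean(kept) - population_mean) / s


# Hypothetical sample with one extreme outlier (1000).
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
print(cohens_d(sample, population_mean=5.5))
print(cohens_d(sample, population_mean=5.5, trim_fraction=0.0))
```

With trimming, the outlier no longer pulls the sample mean away from 5.5; without it, both the mean and the standard deviation are dominated by the single extreme value.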

Open Questions

  • What about sources that are neither RDF nor CSV (e.g. databases)?
  • Should we distinguish between temperatures of cities and temperatures in the cosmos domain? (I will for now.)

Tests to be checked