TAbular DAta soft clustering

new plan January, 2017

Mind Storming

Dealing with nested clusters: maybe add an extra dimension (maybe similar to SVM?).
Merge close clusters (e.g. mayHighC and novHighC); maybe we should also take nested clusters into account.
Maybe we can use the average membership of each cluster. We can also take the same training sample and predict it with the clustering to see whether two clusters are close to each other.

Assumptions

We are dealing with numerical data only.

Algorithm

From the training data, set the cluster centers the hard k-means way (with memberships being 0 or 1), since at the beginning each training set belongs to a single type.
Classify the column of interest from its data using fuzzy c-means (FCM).
Compute the average of the membership matrix for each cluster, and consider it the membership of this column.
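
The three steps above can be sketched as follows. This is only a minimal illustration, assuming one-dimensional numerical columns, the standard FCM membership formula with fuzzifier m = 2, and made-up type labels (height_cm, age_years); none of these names come from the project itself.

```python
from statistics import mean

def fcm_memberships(value, centers, m=2.0):
    """Fuzzy c-means membership of one value to each center (fuzzifier m > 1)."""
    dists = [abs(value - c) for c in centers]
    # A value sitting exactly on a center belongs fully to that center
    # (shared equally if several centers coincide).
    if any(d == 0 for d in dists):
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

def column_membership(column, training, m=2.0):
    """Average FCM membership of a column to each training type."""
    labels = list(training)
    # Step 1: each training column holds a single type, so its hard
    # k-means center is simply the mean of its values.
    centers = [mean(training[label]) for label in labels]
    # Step 2: fuzzy membership of every value in the column of interest.
    rows = [fcm_memberships(v, centers, m) for v in column]
    # Step 3: average the membership matrix per cluster.
    return {label: mean(r[i] for r in rows) for i, label in enumerate(labels)}

# Hypothetical training types and an unlabeled column.
training = {"height_cm": [160.0, 170.0, 180.0], "age_years": [20.0, 30.0, 40.0]}
scores = column_membership([165.0, 175.0, 172.0], training)
```

Here scores maps each training type to the averaged membership of the new column, and the values sum to 1. Whether the plain average is the right aggregate is still an open point.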

Things to consider

Think about the use of statistical tests.
Check whether the average is a good measure.
Use a test bed for comparison.
Why it should perform better.
What about the relations between columns?
What about subclasses?
Learning from new data sources, e.g. if the error is too high, or the column is far from every cluster, we add a new cluster.
Allow human intervention to correct the classification.
Verify the case where similar columns co-exist in the same dataset, e.g. city_name_english, city_name_spanish.
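
The "add a new cluster when the column is far from every cluster" idea could look roughly like this. It is a sketch under assumptions: a column is summarised by its mean, "far" means an absolute distance above a hypothetical max_distance tolerance, and the unknown_N labels are placeholders that a human would rename later.

```python
from statistics import mean

def assign_or_add(column, centers, max_distance):
    """Return the label of the nearest center, or register a new cluster.

    `centers` maps label -> center value and is updated in place.
    `max_distance` is a hypothetical tolerance that needs tuning on real data.
    """
    col_center = mean(column)
    nearest = min(centers, key=lambda label: abs(centers[label] - col_center), default=None)
    if nearest is not None and abs(centers[nearest] - col_center) <= max_distance:
        return nearest
    # Too far from every known cluster: learn it as a new one.
    new_label = f"unknown_{len(centers)}"
    centers[new_label] = col_center
    return new_label

centers = {"a": 10.0, "b": 50.0}
first = assign_or_add([9.0, 11.0], centers, max_distance=5.0)
second = assign_or_add([100.0, 102.0], centers, max_distance=5.0)
```

After the second call, centers holds a third entry for the unseen column, which is where the human correction step could plug in.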

The below are not part of the new idea

Detailed Plan

Learn different statistical tests.
Check out this paper, "14.3 Are Two Distributions Different?", which seems to be really relevant.
Implement statistical tests.
Use them as features in k-means and visualize them.
Use soft clustering and a Gaussian mixture model to define the clusters.
Label the clusters.
Compare our algorithm with the gold standard.

Highlevel Plan

Work first with numerical data: for each column I'll create n features, where each feature is some kind of statistical test, e.g. t-test, Kolmogorov–Smirnov, etc.
Include categorical data (see the papers "Approximation Algorithms for K-Modes Clustering" and "Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values").
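
A rough sketch of the "statistical tests as features" step for numerical columns. The notes do not say what each column is tested against, so this assumes every column is compared with a fixed list of reference columns; the function names are illustrative only.

```python
from bisect import bisect_right
from math import sqrt
from statistics import mean, variance

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def welch_t(a, b):
    """Welch's t statistic (unequal variances).

    Needs at least 2 values per sample and fails if both samples are constant.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def column_features(column, references):
    """Feature vector for one column: [KS, |t|] against each reference column."""
    features = []
    for ref in references:
        features.append(ks_statistic(column, ref))
        features.append(abs(welch_t(column, ref)))
    return features

refs = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
features = column_features([1.0, 2.0, 3.0], refs)
```

The resulting vectors (one per column) are what would then be fed to k-means or a soft clustering method.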

Assumptions

There are no similar columns in the training set, e.g. city_name_english, city_name_french

Ideas

Run k-means multiple times and obtain the probabilities (Juan Manuel's idea).
Use PCA on the collection of test features to reduce the dimensionality.
Add further (dummy) data to the training set if the user classifies a column in the data set as the same as a column in the training set that is not very similar.
Maybe use the central limit theorem for the training samples in case we are using the Standard Error of the Mean.
Use Cohen's d (the distance between the population mean and the sample mean, in units of the sample standard deviation S) as a feature. The problem is that it is significantly affected by outliers, so we can eliminate them (or drop the max 10% and min 10%).
Use t-values to compute the standard deviation instead of z, so we can save time. It is also possible to take something like 10 samples, each of size 100.
Maybe try to use R^2 (how correlated the data is) as a feature.
For the probability, we can use the corresponding probability of the z/t score, but I'm not sure whether the distribution has to be normal for that.
Enhancing Cluster Labeling Using Wikipedia.
It would be nice to include something from the paper "On the Appropriateness of Statistical Tests in Machine Learning" to justify the use of statistical tests.
Use Continuous Reinforcement Learning to learn the types corrected manually by the domain expert. I have no idea how this would be done, but it is an interesting approach.
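
For the Cohen's d idea above, a minimal sketch with the 10% trimming on each side. It assumes the population mean is known (e.g. taken from the training column) and uses the trimmed sample for both the mean and the standard deviation; the sample values are made up.

```python
from statistics import mean, stdev

def trim(values, fraction=0.10):
    """Drop the lowest and highest `fraction` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered

def cohens_d(sample, population_mean, fraction=0.10):
    """(trimmed sample mean - population mean) / trimmed sample standard deviation."""
    kept = trim(sample, fraction)
    return (mean(kept) - population_mean) / stdev(kept)

# Made-up sample with one obvious outlier (500.0).
sample = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 11.0, 9.0, 10.0, 500.0]
d_trimmed = cohens_d(sample, population_mean=10.0)
d_raw = cohens_d(sample, population_mean=10.0, fraction=0.0)
```

With the outlier kept (fraction=0.0) both the mean and the standard deviation are dominated by the single extreme value, which is the problem the trimming is meant to avoid.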

Open Questions

What about sources that are neither RDF nor CSV, e.g. databases?
Should we distinguish between temperature of cities and temperatures in the cosmos domain? (I will for now).

runzbuzz/tada

Tests to be checked
