This project identifies missing-data patterns at different percentages of missing values (1%, 2%, 5%, 10%). The idea comes from a survey-analysis research paper that uses the same data-imputation method to complete surveys.
The workflow is: identify the missing-data pattern, use an unsupervised algorithm (Self-Organizing Map) to fill in the missing values, then tune the algorithm's parameters to get the best accuracy score for each data set. The evaluation measure depends on the data set. There are 16 numerical, 4 categorical and 4 mixed data sets; AE is used for categorical data and NRMS for numerical data.
Imputed data is considered good quality when NRMS < 0.1 and AE < 1.
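As an illustration of the NRMS threshold, here is a minimal Python sketch. It assumes NRMS means the Frobenius norm of the difference between the imputed and true data divided by the Frobenius norm of the true data; the project itself does not spell out the formula, and the function name `nrms` and the sample values are hypothetical.

```python
import math

def nrms(imputed, truth):
    """Assumed definition: ||imputed - truth||_F / ||truth||_F."""
    # Sum of squared differences over every cell of the two tables.
    num = sum((a - b) ** 2
              for row_i, row_t in zip(imputed, truth)
              for a, b in zip(row_i, row_t))
    # Sum of squares of the true values, used for normalisation.
    den = sum(b ** 2 for row_t in truth for b in row_t)
    if den == 0:
        raise ValueError("true data is all zeros; NRMS is undefined")
    return math.sqrt(num / den)

# Hypothetical example: one imputed cell (4.5) differs from the truth (4.0).
truth = [[1.0, 2.0], [3.0, 4.0]]
imputed = [[1.0, 2.0], [3.0, 4.5]]
score = nrms(imputed, truth)
print(score, score < 0.1)
```

With this definition, a score below 0.1 means the total imputation error is under 10% of the magnitude of the true data.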
Programming language: MATLAB R2018a (MathWorks)
Software toolkit: Orange (data mining, machine learning)
There are 24 real, modified datasets provided by the professor. Within the group, our goal is to identify the missing-data pattern of 19 different types of dataset at missing rates of 1%, 5%, 10% and 20% using data mining, and to find plausible values with the help of the unsupervised algorithm.
We chose SOM (Self-Organizing Map) to fill in the values through clustering and gradient descent, and identified the number of iterations that gives the best model accuracy by comparing against the true data set.
The model uses a Gaussian neighbourhood function, computed from the map positions of the BMU (neuron u) and neuron v. Two parameters are tuned during training for a better time profile and faster optimization:
[1] Initial width (w)
[2] Learning rate (α)
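The role of the initial width and the learning rate can be sketched in Python as follows. This is an illustration under assumptions, not the project's MATLAB code: the exponential decay schedule and the names `decay` and `gaussian_neighbourhood` are hypothetical choices; only the Gaussian shape of the neighbourhood comes from the description above.

```python
import math

def decay(initial, s, limit):
    """Assumed schedule: shrink a parameter exponentially as iteration s approaches the limit."""
    return initial * math.exp(-s / limit)

def gaussian_neighbourhood(u, v, width):
    """Gaussian of the squared grid distance between BMU position u and node position v."""
    d2 = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    return math.exp(-d2 / (2.0 * width ** 2))

# Hypothetical settings: initial width 1.0, initial learning rate 0.5, 100 iterations.
for s in (0, 50, 99):
    width = decay(1.0, s, 100)
    alpha = decay(0.5, s, 100)
    # Influence on a node one grid step away from the BMU.
    print(s, round(width, 3), round(alpha, 3),
          gaussian_neighbourhood((0, 0), (0, 1), width))
```

A larger initial width lets each input move many nodes early on, while the decay narrows the update to the BMU's immediate neighbours later, which is one common way to trade training time against map quality.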
These are the variables needed: s is the current iteration; λ is the iteration limit; t is the index of the target input data vector in the input data set D; D(t) is a target input data vector; v is the index of a node in the map; W_v is the current weight vector of node v; u is the index of the best matching unit (BMU) in the map; θ(u, v, s) is a restraint due to distance from the BMU, usually called the neighbourhood function; and α(s) is a learning restraint due to iteration progress.
- Randomize the node weight vectors in the map.
- Randomly pick an input vector D(t).
- Traverse each node in the map.
- Use the Euclidean distance formula to measure the similarity between the input vector and each node's weight vector.
- Track the node that produces the smallest distance; this node is the best matching unit (BMU).
- Update the weight vectors of the nodes in the neighbourhood of the BMU (including the BMU itself) by pulling them closer to the input vector.
- Increase s and repeat from step 2 while s < λ.
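The steps above can be sketched in plain Python. This is a minimal illustration, not the project's MATLAB implementation. The following are assumptions: the exponential decay of the width and learning rate, the rule that skips missing components (`None`) when measuring distance, the `impute` helper that copies missing values from the BMU, and all function names and sample data.

```python
import math
import random

def bmu_index(weights, x):
    """Index of the node closest to x (Euclidean); missing components (None) are skipped."""
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x) if xi is not None)
    return min(range(len(weights)), key=lambda v: dist2(weights[v]))

def train_som(data, rows, cols, limit, width0=1.0, alpha0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    # Step 1: randomize the node weight vectors.
    weights = [[rng.random() for _ in range(dim)] for _ in coords]
    for s in range(limit):                       # step 7: repeat while s < limit
        x = rng.choice(data)                     # step 2: pick a random input vector
        u = bmu_index(weights, x)                # steps 3-5: find the BMU
        width = width0 * math.exp(-s / limit)    # assumed decay schedule
        alpha = alpha0 * math.exp(-s / limit)    # assumed decay schedule
        for v, w in enumerate(weights):          # step 6: pull neighbours toward x
            d2 = (coords[u][0] - coords[v][0]) ** 2 + (coords[u][1] - coords[v][1]) ** 2
            theta = math.exp(-d2 / (2.0 * width ** 2))
            for i, xi in enumerate(x):
                if xi is not None:
                    # W_v(s+1) = W_v(s) + theta * alpha * (D(t) - W_v(s))
                    w[i] += theta * alpha * (xi - w[i])
    return weights

def impute(weights, x):
    """Fill missing components of x with the matching components of its BMU."""
    u = bmu_index(weights, x)
    return [weights[u][i] if xi is None else xi for i, xi in enumerate(x)]

# Hypothetical data: two small clusters, values already scaled to [0, 1].
data = [[0.10, 0.10], [0.12, 0.08], [0.90, 0.90], [0.88, 0.92]]
som = train_som(data, rows=2, cols=2, limit=500)
filled = impute(som, [0.9, None])
print(filled)
```

Matching on the observed components only lets the same distance routine serve both training and imputation, so a record with gaps can still be assigned to a node and completed from that node's weights.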
1] Space complexity: handling large datasets together with the iterative process exhausted the memory available in the local environment.
2] Data cleaning and pattern recognition were complex tasks.
3] Computational complexity: T = O(NC) = O(S^2). As the vector and matrix sizes increase, the complexity increases because it depends on the number of rows and columns. In terms of rows and columns it is O(m*n).
- Achieved NRMS values near 0.1 and AE below 1, with overall accuracy between 75 and 90 percent.
- Successfully identified that SOM effectiveness decreases as the error increases.
- Project Report
- Project presentation demo
- Evaluation table
- Code