The repository contains various Big Data algorithms which we implemented as part of our 'Algorithms and Optimization for Big Data' course at the School of Engineering and Applied Science, Ahmedabad University.
List:
This code consists of the implementation of simple linear regression and multiple linear regression using gradient descent on batch data. Dataset used: Sklearn Boston dataset.
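A minimal sketch of the batch gradient-descent update (illustrative only; `load_boston` has been removed from recent scikit-learn releases, so a synthetic regression dataset stands in for the Boston data, and the learning rate and epoch count are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston data (load_boston was removed in scikit-learn 1.2).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)
X = np.hstack([np.ones((X.shape[0], 1)), X])      # bias column

theta = np.zeros(X.shape[1])
lr, n_epochs = 0.01, 500                          # illustrative hyperparameters
for _ in range(n_epochs):
    residual = X @ theta - y                      # predictions minus targets
    gradient = X.T @ residual / len(y)            # batch gradient of the mean squared error
    theta -= lr * gradient

print("learned coefficients:", theta[:4], "...")
```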
This code consists of the implementation of simple linear regression and multiple linear regression using gradient descent on stream data. The data is streamed using a Python generator (the yield keyword). Dataset used: Sklearn Boston dataset.
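A sketch of the streaming variant, where a generator yields one observation at a time and the weights are updated per sample (synthetic data again stands in for Boston; the learning rate is an assumption):

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

def stream(X, y):
    """Yield one (features, target) pair at a time to mimic streaming data."""
    for xi, yi in zip(X, y):
        yield np.hstack([1.0, xi]), yi            # prepend a bias term

theta = np.zeros(X.shape[1] + 1)
lr = 1e-4                                         # illustrative learning rate
for xi, yi in stream(X, y):
    error = xi @ theta - yi
    theta -= lr * error * xi                      # per-sample gradient update
```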
The normal equation is an analytical approach to linear regression with a least-squares cost function. It performs the minimization in closed form, without resorting to an iterative algorithm. Dataset used: Sklearn Boston dataset.
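The closed-form solution is theta = (X^T X)^{-1} X^T y; a brief sketch (with a pseudo-inverse for numerical robustness, and synthetic data standing in for Boston):

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])     # add bias column

# Normal equation: theta = (X^T X)^{-1} X^T y (pinv used for numerical robustness)
theta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
print(theta[:4])
```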
IMSR (Incremental Mathematical Stream Regression) is an online regression technique for streaming Big Data. In the streaming setting, the regression model must be updated continuously as new data arrives, yet scanning the entire data set multiple times is impossible because of its volume. This technique therefore extends traditional linear regression so that the model can be updated optimally and efficiently. Dataset used: Sklearn Boston dataset.
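The exact IMSR update rule is given by the referenced technique; the sketch below only illustrates the general idea of maintaining the sufficient statistics X^T X and X^T y incrementally, so the model can be refreshed per chunk without rescanning old data (chunk size and data are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
d = X.shape[1] + 1
XtX = np.zeros((d, d))                            # running X^T X
Xty = np.zeros(d)                                 # running X^T y

def chunks(X, y, size=50):
    """Yield the data in chunks to mimic a stream."""
    for i in range(0, len(y), size):
        yield X[i:i + size], y[i:i + size]

for Xc, yc in chunks(X, y):
    Xc = np.hstack([np.ones((len(yc), 1)), Xc])
    XtX += Xc.T @ Xc                              # accumulate sufficient statistics
    Xty += Xc.T @ yc
    theta = np.linalg.pinv(XtX) @ Xty             # model refreshed after each chunk
```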
ASR (Approximate Stream Regression) is another online regression technique for streaming Big Data. Like IMSR, it addresses the shortcoming of traditional linear regression by updating the regression model optimally and efficiently. Dataset used: Sklearn Boston dataset.
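The precise ASR update is defined by the referenced technique; purely as an illustrative stand-in, the sketch below uses a recursive least-squares update with a forgetting factor, so that newer stream data gradually outweighs older data (all constants are assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
d = X.shape[1] + 1
theta = np.zeros(d)
P = np.eye(d) * 1e3                               # inverse-covariance estimate (large = uninformative prior)
lam = 0.99                                        # forgetting factor; < 1 favours recent samples

for xi, yi in zip(X, y):
    xi = np.hstack([1.0, xi])                     # bias term
    k = P @ xi / (lam + xi @ P @ xi)              # gain vector
    theta += k * (yi - xi @ theta)                # correct the model by the prediction error
    P = (P - np.outer(k, xi @ P)) / lam           # update the inverse covariance
```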
Collaborative filtering is a recommendation approach in which the rating of user u for item i is predicted from the ratings given to item i by other like-minded users. This is an implementation of SGD for matrix factorization as used in the collaborative filtering approach. The file creates a sparse random ratings matrix whose rows correspond to users and whose columns correspond to items. Dataset used: randomly generated user-item sparse matrix.
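A compact sketch of SGD matrix factorization on a randomly generated sparse ratings matrix (matrix dimensions, sparsity and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 80, 10
# Random ratings in 1..5, kept with probability 0.1 (zero means "not rated").
R = rng.integers(1, 6, size=(n_users, n_items)) * (rng.random((n_users, n_items)) < 0.1)
observed = np.argwhere(R > 0)                     # indices of the known ratings

P = rng.normal(scale=0.1, size=(n_users, k))      # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))      # item factors
lr, reg = 0.01, 0.05

for _ in range(30):                               # SGD epochs
    rng.shuffle(observed)
    for u, i in observed:
        pu = P[u].copy()
        err = R[u, i] - pu @ Q[i]                 # error on a single observed rating
        P[u] += lr * (err * Q[i] - reg * pu)      # regularized gradient steps
        Q[i] += lr * (err * pu - reg * Q[i])
```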
6. Collaborative Filtering - Streaming Distributed Stochastic Gradient Descent for Matrix Factorization
This is an implementation of the paper Parallel Collaborative Filtering for Streaming Data. It is a collaborative filtering (recommendation) approach in which the streaming big data is distributed among different workers. The file creates a sparse random ratings matrix whose rows correspond to users and whose columns correspond to items, and then applies the collaborative filtering algorithm from the paper to the streamed matrix data. Dataset used: randomly generated user-item sparse matrix.
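The paper's algorithm is more involved; the sketch below only illustrates the general shape of distributed SGD on a stream: ratings arrive in chunks, each chunk is partitioned into disjoint user/item blocks, and the blocks are updated independently (here sequentially, standing in for parallel workers). Block assignment, chunking and hyperparameters are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k, W = 100, 80, 10, 4           # W stands in for the number of workers
P = rng.normal(scale=0.1, size=(n_users, k))      # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))      # item factors
lr, reg = 0.01, 0.05
u_block = lambda u: u % W                         # hypothetical user-block assignment
i_block = lambda i: i % W                         # hypothetical item-block assignment

def rating_stream(n_chunks=20, chunk_size=200):
    """Yield chunks of (user, item, rating) triples to mimic a rating stream."""
    for _ in range(n_chunks):
        u = rng.integers(0, n_users, chunk_size)
        i = rng.integers(0, n_items, chunk_size)
        r = rng.integers(1, 6, chunk_size).astype(float)
        yield list(zip(u, i, r))

for chunk in rating_stream():
    for s in range(W):                            # one stratum = W mutually disjoint blocks
        for w in range(W):                        # each block could run on its own worker
            for u, i, r in chunk:
                if u_block(u) == w and i_block(i) == (w + s) % W:
                    pu = P[u].copy()
                    err = r - pu @ Q[i]
                    P[u] += lr * (err * Q[i] - reg * pu)
                    Q[i] += lr * (err * pu - reg * Q[i])
```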
The STREAM framework is based on the k-medians clustering methodology. The core idea is to break the stream into chunks, each of which is of manageable size and fits into main memory; each chunk is clustered, and the retained weighted cluster centers are then clustered again to obtain the final centers. Dataset used: Sklearn iris dataset.
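A simplified sketch of the chunk-and-recluster idea, using scikit-learn's KMeans as a stand-in for the k-medians subroutine (chunk size and k are assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
k, chunk_size = 3, 50
centers, weights = [], []

# Phase 1: cluster each in-memory chunk and keep only its weighted centers.
for start in range(0, len(X), chunk_size):
    chunk = X[start:start + chunk_size]
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(chunk)
    centers.append(km.cluster_centers_)
    weights.append(np.bincount(km.labels_, minlength=k))

# Phase 2: cluster the retained centers, weighted by how many points each represents.
centers = np.vstack(centers)
weights = np.concatenate(weights)
final = KMeans(n_clusters=k, n_init=10, random_state=0).fit(centers, sample_weight=weights)
print(final.cluster_centers_)
```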
This code consists of the implementation of two clustering algorithms - K-means and K-medoids. Dataset used: Sklearn iris dataset.
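A brief sketch of both algorithms on the iris data: K-means via scikit-learn, and K-medoids via a simple alternating (Voronoi-style) iteration in which medoids are always actual data points. This is illustrative, not an optimized PAM implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

X = load_iris().data
k = 3

# K-means via scikit-learn.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# K-medoids: alternate assignment and medoid update for a fixed number of iterations.
rng = np.random.default_rng(0)
medoids = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    labels = cdist(X, medoids).argmin(axis=1)     # assign each point to its nearest medoid
    for j in range(k):
        members = X[labels == j]
        if len(members):
            # pick the member minimizing total distance to the rest of its cluster
            medoids[j] = members[cdist(members, members).sum(axis=1).argmin()]

print("k-means centers:\n", kmeans.cluster_centers_)
print("k-medoid points:\n", medoids)
```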
This is an implementation of the algorithm from the paper A Framework for Clustering Evolving Data Streams (CluStream). The paper discusses a fundamentally different philosophy for data stream clustering, guided by application-centered requirements. Dataset used: Sklearn digits dataset.
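CluStream maintains micro-clusters online and clusters them offline. The sketch below is a heavily simplified illustration of that two-phase idea on the digits data, not the full framework (no time decay, pyramidal snapshots, or micro-cluster deletion/merging; the number of micro-clusters is an assumption):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X = load_digits().data
q, k = 50, 10                                     # number of micro-clusters and final clusters

# Online phase: maintain micro-cluster feature vectors (count, linear sum, squared sum).
micro = [{"n": 1, "ls": x.copy(), "ss": x * x} for x in X[:q]]
for x in X[q:]:
    centroids = np.array([m["ls"] / m["n"] for m in micro])
    j = np.linalg.norm(centroids - x, axis=1).argmin()   # nearest micro-cluster absorbs the point
    micro[j]["n"] += 1
    micro[j]["ls"] += x
    micro[j]["ss"] += x * x

# Offline phase: cluster the micro-cluster centroids, weighted by their point counts.
centroids = np.array([m["ls"] / m["n"] for m in micro])
weights = np.array([m["n"] for m in micro])
macro = KMeans(n_clusters=k, n_init=10, random_state=0).fit(centroids, sample_weight=weights)
print(macro.cluster_centers_.shape)
```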
In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm and is typically used in the machine learning and natural language processing domains. The tree is formed by splitting nodes on attributes, using information gain to identify the relative importance of features (attributes) in the dataset.
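A compact recursive ID3 sketch for categorical attributes (the toy weather-style data is made up purely for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def id3(rows, labels, attrs):
    """rows: list of dicts mapping attribute name -> categorical value."""
    if len(set(labels)) == 1:
        return labels[0]                          # pure node: return the class
    if not attrs:
        return Counter(labels).most_common(1)[0][0]   # majority class

    def gain(a):                                  # information gain of splitting on attribute a
        remainder = 0.0
        for v in set(r[a] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            remainder += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - remainder

    best = max(attrs, key=gain)                   # split on the highest-gain attribute
    tree = {best: {}}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[best][v] = id3(sub_rows, sub_labels, [a for a in attrs if a != best])
    return tree

rows = [{"outlook": o, "windy": w} for o, w in
        [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes"), ("overcast", "no")]]
labels = ["no", "no", "yes", "no", "yes"]
print(id3(rows, labels, ["outlook", "windy"]))
```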
Classification and Regression Trees, or CART for short, is a term introduced by Leo Breiman to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems. This algorithm uses the Gini index to identify the relative importance of features in the dataset and thereby overcomes major limitations of the ID3 algorithm.
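A minimal example of a Gini-based tree using scikit-learn's DecisionTreeClassifier (the repository's own code may build the tree from scratch; the iris data and depth limit here are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="gini" selects splits by Gini impurity, as in CART.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```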
Hoeffding Tree, or VFDT, is the standard decision tree algorithm for data stream classification. VFDT uses the Hoeffding bound to decide the minimum number of arriving instances needed to reach a certain level of confidence when splitting a node; this confidence level determines how close the attribute chosen by VFDT is to the attribute that a batch decision tree learner would choose. Dataset used: data file given in the same folder.
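The split test is easy to sketch: with value range R, confidence 1 - delta, and n instances seen at a leaf, the Hoeffding bound is epsilon = sqrt(R^2 ln(1/delta) / (2n)), and a split is made once the gain of the best attribute exceeds that of the runner-up by more than epsilon. The numbers below are purely illustrative:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the observed mean with prob. 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

best_gain, second_gain = 0.42, 0.30               # hypothetical information gains of top two attributes
n_seen, delta = 200, 1e-7                         # instances seen at this leaf, and 1 - confidence
eps = hoeffding_bound(value_range=math.log2(2), delta=delta, n=n_seen)   # R = log2(#classes) for info gain
print("epsilon:", eps, "-> split" if best_gain - second_gain > eps else "-> wait for more data")
```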
Jeet H. Shah, Mihir Kanjaria, Muskan Matwani