This is the course web page for Advanced Data Analytics, taught at IIT Bhilai, India, in the Monsoon Semester of 2021.
Course Instructor: Dr. Gagan Raj Gupta
- Getting data is becoming easier day by day, but we have too much to analyze (e.g. web, transactional data, text)
- Data has errors of various types (missing values, incorrect values, etc.), is incomplete and is hard to clean (e.g. user reviews/ratings, distorted images)
- Data is usually high-dimensional, involving a lot of columns or features (e.g. text, images, videos, graphs)
- Data usually has complex correlations, so i.i.d. assumptions often do not hold (e.g. graph data, time-series data)
- Data is incomplete (matrix completion, compressed sensing, signal re-construction)
- Data is being generated at a great speed and it is too expensive to store all of it (e.g. user or machine transactions, queries)
- Data (packets) on the network is encrypted
We are often asked to answer difficult questions from this messy data.
We have to make decisions, often in real time.
In this course, we learn how this is done and solve real-life problems that we are interested in.
- Equip students with the mathematical toolkit (linear algebra, statistics, optimization) needed for understanding and implementing the important data analysis and ML algorithms
- Explain new paradigms of algorithm design for handling complex datasets, including streaming algorithms
- Explore robust and state-of-the-art (SOTA) techniques in large-scale ML
- Introduce Graph Neural Networks (GNNs) and their applications to Knowledge Graphs and Bio-informatics
- Introduce Dynamic Time Warping (DTW), Matrix Profile and related techniques for analyzing complex time-series data (clustering, anomaly detection, pattern mining, prediction)
- Provide hands-on experience to students in analyzing datasets in diverse fields (NLP, Image/Video, Graphs, Networks, Bio-informatics, Finance)
- Basic knowledge of Python (most assignments will be based on Python)
- Knowledge of basic computer science principles and skills
- Math
- Linear Algebra (matrix factorization, eigenvalues, column and row spaces, norms)
- Probability theory (conditional probability, Bayes' rule, concentration inequalities, distributions, Gaussian, multivariate)
- Basic Data Structures, Algorithms and Asymptotic Analysis (graphs, heaps, lists, dynamic programming)
- Calculus (Multi-variate)
If you don't meet one or more pre-requisites, be prepared to spend extra time before or during the course learning them.
# | Week | Topics covered in class | Text Book Reference |
---|---|---|---|
1 | Aug 2 | Math of Data: Algebraic, Geometric and Statistical Views; High-dimensional geometry: Curse of dimensionality, Gaussian Annulus theorem, Volume of unit ball, orthogonal directions | FoDS Chapter 2 |
2 | Aug 9 | Projection Techniques: Auto-encoder view, Best fit subspaces; PCA (maximize variance), Variants of PCA, Eigenfaces; SVD and applications, Power-iteration methods; Random projections: JL Lemma; Linear Regression as projection to column space using Normal Equations, Pseudo-inverse and QR methods; Data visualization | FoDS Chapter 3, MML Chapter 10 |
3 | Aug 16 | Locality Sensitive Hashing (LSH): Shingling, Min-hash, LSH, tradeoff with r and b; LSH for other metrics (Cosine, Euclidean); Compressed sensing: Solving Under-determined system of linear equations using Convex Optimization (L1 norm), Sparsity, Incoherence, Restricted Isometry Property (sparse vectors) | MMDS Chapter 3, Reference Papers |
4 | Aug 23 | Streaming Data Analytics: Limitations of Random Sampling, Reservoir Sampling, Sliding Window Queries, DGIM algorithm, Recent Itemsets, Bloom Filters, Count Distinct, Frequency Estimation (MG, Space Saving, Count-Min), Moment Estimation | MMDS Chapter 4, SSBD |
5 | Aug 31 | Intro to Machine Learning Algorithms: Examples of supervised, unsupervised and reinforcement learning, Requirements of good ML, Feature Extraction, Learning with Prototypes, K nearest neighbors | CIML Ch2,3 |
6 | Sep 13 | Linear Models, Loss functions for Regression and Classification, Perceptron, Average Perceptron, Online ML, SVM (Hard Margin and Soft margin), Decision Trees, Information and Entropy | MMDS Ch12 |
7 | Sep 20 | Ensemble Methods, Bagging, Boosting: AdaBoost, Gradient Boosting, Optimization Formulations of ML problems, Convexity, Computing Gradients, Gradient Descent and Variants | MMDS Ch12, MML Ch5,7 |
8 | Sep 27 | Neural Networks: Learning non-linear boundaries, Neural Nets as encoders or feature learners, Learning multiple patterns, Designing Neural Networks: Interconnections, Wide or Deep?, Activation, Loss, Auto-diff, Backprop examples, Preventing Overfitting, CNN Architecture, ResNet | MMDS Ch13, MML Ch5,7, Reading |
9 | Oct 4 | Graphs, Degree, Degree Distribution, Distance, Diameter, Radius, Connected Components, Centrality Measures, PageRank | MMDS Ch 5 |
10 | Oct 18 | Graph Embeddings, Random Walk Techniques, Biased Random Walks: DFS vs BFS | GRL |
11 | Nov 1 | Text Embeddings: Node2vec, GNN: GraphSAGE, Knowledge Graphs | |

# | Date | Topic | Paper | Resources |
---|---|---|---|---|
1 | Nov 2 | Graph Convolutional Networks | Semi-Supervised Classification with Graph Convolutional Networks | Blog by Kipf |
2 | Nov 8 | Scalable Graph Neural Networks | Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks | Video, Repo |
3 | Nov 8 | GNN for Map Distances | Graph Neural Networks for Traffic Forecasting | GNN 4 Traffic, DeepMind Blog |
4 | Nov 9 | Time Series Distance Metrics | Making Time-series Classification More Accurate Using Learned Constraints | Wiki |
5 | Nov 9 | GNN application | Modeling polypharmacy side effects with graph convolutional networks | |
6 | Nov 15 | Knowledge Graph Embedding | RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space | Video |
7 | Nov 15 | GNN Scaling | GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings | |
8 | Nov 16 | Matrix Profile 1 | Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets | |
9 | Nov 16 | Matrix Profile 2 | Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining | |
10 | Nov 22 | Matrix Profile 3 | Matrix Profile XXI: MERLIN: Parameter-Free Discovery of Arbitrary Length Anomalies in Massive Time Series Archives | |
11 | Nov 22 | Object Detection | Meta-DETR: Image-Level Few-Shot Object Detection with Inter-Class Correlation Exploitation | |
12 | Nov 23 | GAN | Generative Adversarial Networks | |
13 | Nov 23 | Variational Auto-encoders | An Introduction to Variational Autoencoders | |
14 | Nov 29 | Image classification | CNN-RNN: A Unified Framework for Multi-label Image Classification | |
15 | Nov 29 | Sequence Modeling | Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling | |

- Lectures (Mon 2:30 p.m., Tue 4:00 p.m.) on https://iitbhilai.webex.com/meet/b204
- Extra class on Sat, Aug 28 (11 a.m.): check the link on the WhatsApp group/email
- Course Textbook
  - Mining of Massive Datasets (MMDS): http://www.mmds.org/
  - Mathematics for Machine Learning (MML): https://mml-book.github.io/
- Reference Materials
  - Foundations of Data Science (FoDS): https://www.cs.cornell.edu/jeh/book.pdf
  - A Course in Machine Learning (CIML): http://ciml.info/
  - Graph Representation Learning (GRL): https://www.cs.mcgill.ca/~wlh/grl_book/
  - Small Summaries of Big Data (SSBD): http://dimacs.rutgers.edu/~graham/ssbd.html
- Research Papers (will be added to the folder)
- Handouts (short notes on various important topics)
- Sample code snippets will be posted in the Handouts section for students to practice data analysis programming
- Useful datasets will also be provided for practice
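
As a flavour of the kind of practice snippet mentioned above, here is a minimal sketch of reservoir sampling, one of the Week 4 streaming topics. It is not taken from the course handouts: the function name `reservoir_sample`, its parameters and the toy stream are illustrative assumptions. It keeps a uniform random sample of `k` items from a stream whose length is not known in advance, using only O(k) memory.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return up to k items drawn uniformly at random from `stream`.

    Hypothetical helper for illustration: a single pass over the data,
    storing at most k items at any time.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1).
            # randint is inclusive at both ends, so j is uniform on 0..i.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Toy stream: the integers 0..999, sampled with a fixed seed
    # so that repeated runs give the same sample.
    sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
    print(sample)
```

If the stream has fewer than `k` items, the function simply returns all of them. Passing a seeded `random.Random` makes runs reproducible, which is convenient when comparing against the random-sampling limitations discussed in class.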