
Course Webpage for DS503 being taught at IIT Bhilai

Primary language: Jupyter Notebook. License: MIT.

DS-503 (Advanced Data Analytics)

This is the course web page for Advanced Data Analytics, taught at IIT Bhilai, India, in the Monsoon Semester of 2021.
Course Instructor: Dr. Gagan Raj Gupta

Motivation

  • Getting data is becoming easier day by day, but we have too much to analyze (e.g. web, transactional data, text)
  • Data has errors of various types (missing values, incorrect values, etc.), is incomplete and is hard to clean (e.g. user reviews/ratings, distorted images)
  • Data is usually high-dimensional, involving a lot of columns or features (e.g. text, images, videos, graphs)
  • Data usually has complex correlations and i.i.d. assumptions don't always work very well (e.g. graph data, time-series data)
  • Data is incomplete (matrix completion, compressed sensing, signal re-construction)
  • Data is being generated at a great speed and it is too expensive to store all of it (e.g. user or machine transactions, queries)
  • Data (packets) on the network is encrypted

We are often asked to answer difficult questions from this messy data.

We have to make decisions (often in real-time).

In this course, we want to learn how that is being done and solve real-life problems that we are interested in.

Course Objectives

  • Equip students with the mathematical toolkit (linear algebra, statistics, optimization) needed for understanding and implementing the important data analysis and ML algorithms
  • Explain new paradigms of algorithm design for handling complex datasets including streaming algorithms
  • Explore robust and state-of-the-art (SOTA) techniques in large-scale ML
  • Introduce Graph Neural Networks (GNNs) and their applications to Knowledge Graphs and Bio-informatics
  • Introduce DTW, Matrix Profile and related techniques for analyzing complex time-series data (clustering, anomaly detection, pattern mining, prediction)
  • Provide hands-on experience to students in analyzing datasets in diverse fields (NLP, Image/Video, Graphs, Networks, Bio-informatics, Finance)

Pre-requisites

  • Basic knowledge of Python (most assignments will be based on Python)
  • Knowledge of basic computer science principles and skills
  • Math
    • Linear Algebra (matrix factorization, eigenvalues, column and row spaces, norms)
    • Probability theory (conditional probability, Bayes' rule, concentration inequalities, distributions, Gaussian, multivariate)
    • Basic Data Structures, Algorithms and Asymptotic Analysis (graphs, heaps, lists, dynamic programming)
    • Calculus (Multi-variate)

If you don't meet one or more pre-requisites, be prepared to spend more time before or during the course in learning them.

Detailed Course Schedule

| # | Week | Topics covered in class | Text Book Reference |
|---|------|-------------------------|---------------------|
| 1 | Aug 2 | Math of Data: Algebraic, Geometric and Statistical Views; High-dimensional geometry: Curse of dimensionality, Gaussian Annulus theorem, Volume of unit ball, orthogonal directions | FoDS Chapter 2 |
| 2 | Aug 9 | Projection Techniques: Auto-encoder view, Best fit subspaces; PCA (maximize variance), Variants of PCA, Eigenfaces; SVD and applications, Power-iteration methods; Random projections: JL Lemma; Linear Regression as projection to column space using Normal Equations, Pseudo-inverse and QR methods; Data visualization | FoDS Chapter 3, MML Chapter 10 |
| 3 | Aug 16 | Locality Sensitive Hashing (LSH): Shingling, Min-hash, LSH, tradeoff with r and b; LSH for other metrics (Cosine, Euclidean); Compressed sensing: Solving under-determined systems of linear equations using Convex Optimization (L1 norm), Sparsity, Incoherence, Restricted Isometry Property (sparse vectors) | MMDS Chapter 3, Reference Papers |
| 4 | Aug 23 | Streaming Data Analytics: Limitations of Random Sampling, Reservoir Sampling, Sliding Window Queries, DGIM algorithm, Recent Itemsets, Bloom Filters, Count Distinct, Frequency Estimation (MG, Space Saving, Count-Min), Moment Estimation | MMDS Chapter 4, SSBD |
| 5 | Aug 31 | Intro to Machine Learning Algorithms: Examples of supervised, unsupervised and reinforcement learning, Requirements of good ML, Feature Extraction, Learning with Prototypes, K nearest neighbors | CIML Ch2,3 |
| 6 | Sep 13 | Linear Models, Loss functions for Regression and Classification, Perceptron, Average Perceptron, Online ML, SVM (Hard Margin and Soft Margin), Decision Trees, Information and Entropy | MMDS Ch12 |
| 7 | Sep 20 | Ensemble Methods, Bagging, Boosting: AdaBoost, Gradient Boosting, Optimization Formulations of ML problems, Convexity, Computing Gradients, Gradient Descent and Variants | MMDS Ch12, MML Ch5,7 |
| 8 | Sep 27 | Neural Networks: Learning non-linear boundaries, Neural Nets as encoders or feature learners, Learning multiple patterns, Designing Neural Networks: Interconnections, Wide or Deep?, Activation, Loss, Auto-diff, Backprop examples, Preventing Overfitting, CNN Architecture, ResNet | MMDS Ch13, MML Ch5,7, Reading |
| 9 | Oct 4 | Graphs, Degree, Degree Distribution, Distance, Diameter, Radius, Connected Components, Centrality Measures, PageRank | MMDS Ch 5 |
| 10 | Oct 18 | Graph Embeddings, Random Walk Techniques, Biased Random Walks: DFS vs BFS | GRL |
| 11 | Nov 1 | Text Embeddings: Node2vec, GNN: GraphSAGE, Knowledge Graphs | |

| Date | Topic | Paper | # | Resources |
|------|-------|-------|---|-----------|
| Nov 2 | Graph Convolutional Networks | Semi-Supervised Classification with Graph Convolutional Networks | 1 | Blog by Kipf |
| Nov 8 | Scalable Graph Neural Networks | Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks | 2 | Video, Repo |
| Nov 8 | GNN for Map Distances | Graph Neural Networks for Traffic Forecasting | 3 | GNN 4 Traffic, DeepMind Blog |
| Nov 9 | Time Series Distance Metrics | Making Time-series Classification More Accurate Using Learned Constraints | 4 | Wiki |
| Nov 9 | GNN application | Modeling polypharmacy side effects with graph convolutional networks | 5 | |
| Nov 15 | Knowledge Graph Embedding | RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space | 6 | Video |
| Nov 15 | GNN Scaling | GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings | 7 | |
| Nov 16 | Matrix Profile 1 | Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets | 8 | |
| Nov 16 | Matrix Profile 2 | Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining | 9 | |
| Nov 22 | Matrix Profile 3 | Matrix Profile XXI: MERLIN: Parameter-Free Discovery of Arbitrary Length Anomalies in Massive Time Series Archives | 10 | |
| Nov 22 | Object Detection | Meta-DETR: Image-Level Few-Shot Object Detection with Inter-Class Correlation Exploitation | 11 | |
| Nov 23 | GAN | Generative Adversarial Networks | 12 | |
| Nov 23 | Variational Auto-encoders | An Introduction to Variational Autoencoders | 13 | |
| Nov 29 | Image classification | CNN-RNN: A Unified Framework for Multi-label Image Classification | 14 | |
| Nov 29 | Sequence Modeling | Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling | 15 | |
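
To give a flavour of the streaming techniques listed for Week 4, here is a minimal Python sketch of reservoir sampling, which keeps a uniform random sample of k items from a stream without storing the whole stream. It is an illustration only and not taken from the course material; the function name `reservoir_sample` and its `seed` parameter are assumptions made for this example.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper for illustration: uses O(k) memory and a single
    pass, so it also works when the stream is too large to store.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 values from a stream of one million integers.
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

Every item ends up in the sample with the same probability k/n, where n is the stream length, even though n is not known in advance.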

Meeting Times

Books/References/Practice materials

Similar Courses