Lecturer: Hossein Hajiabolhassan
The Webpage of the Course: Algorithms For Data Science
Data Science Center, Shahid Beheshti University
- Main TextBooks
- Slides and Papers
- Lecture 1: Introduction to Data Science
- Lecture 2: Toolkit Lab: Jupyter NoteBook
- Lecture 3: Toolkit Lab: Git & GitHub
- Lecture 4: Introduction to Data Mining
- Lecture 5: MapReduce and the New Software Stack
- Lecture 6: Link Analysis
- Lecture 7: Toolkit Lab: Orange & Weka
- Lecture 8: Representative-Based Clustering
- Lecture 9: Hierarchical Clustering
- Lecture 10: Density-Based Clustering
- Lecture 11: Spectral and Graph Clustering
- Lecture 12: Clustering Validation
- Lecture 13: Probabilistic Classification
- Lecture 14: Decision Tree Classifier
- Class Time and Location
- Grading
- Prerequisites
- Account
- Academic Honor Code
- Questions
- Miscellaneous
Main TextBooks:
- Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
  Reading: Chapter 1, Chapter 2 (Sections 2.1, 2.2, & 2.3), and Chapter 5
- Data Mining and Analysis: Fundamental Concepts and Algorithms by Mohammed J. Zaki and Wagner Meira Jr.
  Reading: Chapters 13, 14, 15 (Section 15.1), 16, 17, 18, and 19
Recommended Slides & Papers:
Lecture 1: Introduction to Data Science
Required Reading:
- Slide: Introduction to Data Science by Zico Kolter
- Slide: Introduction to Data Science by Kevin Markham
- Paper: Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work
Lecture 2: Toolkit Lab: Jupyter NoteBook
Required Reading:
- Slide: Practical Data Science: Jupyter NoteBook Lab by Zico Kolter
Lecture 3: Toolkit Lab: Git & GitHub
Required Reading:
- Slide: An Introduction to Git by Politecnico di Torino
- Slide: GIT for Beginners by Anthony Baire
Lecture 4: Introduction to Data Mining (a back-of-the-envelope example follows this reading list)
Required Reading:
- Chapter 1 of Mining of Massive Datasets
- Slide: Introduction to Data Mining by U Kang
- Slide: Bonferroni’s Principle by Irene Finocchi
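The Bonferroni's Principle reading argues that, before flagging "interesting" patterns in massive data, you should estimate how many such patterns pure chance alone would produce. Below is a minimal back-of-the-envelope calculation in Python, loosely following the hotel-visits running example of Chapter 1 of Mining of Massive Datasets; the specific numbers are illustrative assumptions.

```python
# Back-of-the-envelope illustration of Bonferroni's principle, loosely following
# the hotel-visits example of Chapter 1 of Mining of Massive Datasets.
from math import comb

people = 10**9     # assumed number of people being tracked
hotels = 10**5     # assumed number of hotels
days = 1000        # assumed observation period
p_visit = 0.01     # assumed probability a person visits some hotel on a given day

# Probability that two given people are at the same hotel on one given day.
p_same_hotel_one_day = p_visit * p_visit / hotels
# Probability that the same pair meets at a hotel on two given distinct days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

# Expected number of "suspicious" pairs that arise purely by chance.
expected_by_chance = comb(people, 2) * comb(days, 2) * p_same_hotel_two_days
print(f"Expected coincidental pairs: {expected_by_chance:,.0f}")  # roughly 250,000
```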
Lecture 5: MapReduce and the New Software Stack (a word-count sketch follows this reading list)
Required Reading:
- Chapter 2 of Mining of Massive Datasets
- Slide of Sections 2.1 & 2.2 (Distributed File Systems & MapReduce): Introduction & Mapreduce by Jure Leskovec
- Slide of Section 2.3 (Algorithms Using MapReduce): Relational Algebra with MapReduce by Damiano Carra
- Slide: MapReduce by Paul Krzyzanowski
- Slide: Introduction to Database Systems (Relational Algebra) by Werner Nutt
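To make the programming model of Chapter 2 concrete, here is a single-machine word-count sketch in plain Python; it simulates the map, shuffle/group, and reduce phases and is not meant to reflect any particular Hadoop or Spark API.

```python
# A tiny, single-machine simulation of the MapReduce word-count example from
# Chapter 2 of Mining of Massive Datasets (no Hadoop/Spark involved).
from collections import defaultdict

def map_fn(document):
    """Map task: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce task: sum all counts associated with one key."""
    return (word, sum(counts))

def map_reduce(documents):
    # Map phase
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    # Shuffle/group phase: collect all values for each key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(map_reduce(docs))   # {'the': 3, 'quick': 2, ...}
```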
Lecture 6: Link Analysis (a PageRank sketch follows this reading list)
Required Reading:
- Chapter 5 of Mining of Massive Datasets
- Slide of Sections 5.1, 5.2 (PageRank, Efficient Computation of PageRank): Analysis of Large Graphs 1
- Slide of Sections 5.3-5.5 (Topic-Sensitive PageRank, Link Spam, Hubs and Authorities): Analysis of Large Graphs 2
- Slide: The Linear Algebra Aspects of PageRank by Ilse Ipsen
Additional Reading:
- Paper: A Survey on Proximity Measures for Social Networks by Sara Cohen, Benny Kimelfeld, Georgia Koutrika
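As a rough companion to Sections 5.1 and 5.2, the sketch below runs power iteration with taxation (random teleports) on a tiny, made-up four-page web; it is an illustration of the idea, not production PageRank code.

```python
# Power-iteration PageRank with taxation (beta), in the spirit of Sections
# 5.1-5.2 of Mining of Massive Datasets; the 4-page example graph is made up.
import numpy as np

def pagerank(M, beta=0.85, tol=1e-10, max_iter=100):
    """M is column-stochastic: M[i, j] = 1/out-degree(j) if page j links to page i."""
    n = M.shape[0]
    v = np.full(n, 1.0 / n)                  # start from the uniform distribution
    teleport = np.full(n, (1.0 - beta) / n)  # random-jump (taxation) term
    for _ in range(max_iter):
        v_next = beta * M @ v + teleport
        if np.abs(v_next - v).sum() < tol:   # stop when the L1 change is tiny
            return v_next
        v = v_next
    return v

# Toy web graph: page 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, 3 -> {0, 2}
M = np.array([[0.0, 0.0, 1.0, 0.5],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
print(pagerank(M))   # page 3 has no in-links, so it gets the lowest rank
```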
Lecture 7: Toolkit Lab: Orange & Weka
Required Reading:
Additional Reading:
- Weka: Data Mining with Weka
  Free online courses on data mining with machine learning techniques in Weka. You can also register for the course via the FutureLearn education platform.
Lecture 8: Representative-Based Clustering (a k-means sketch follows this reading list)
Required Reading:
- Chapter 13 of Data Mining & Analysis
  Exercises 13.5: Q2, Q4, Q6, & Q7
- Slides (Representative-Based Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
- Slide: Clustering by Matt Dickenson
- Slide: Introduction to Machine Learning (Clustering and EM) by Barnabás Póczos & Aarti Singh
- Tutorial: The Expectation Maximization Algorithm by Sean Borman
- Tutorial: What is Bayesian Statistics? by John W Stevens
Additional Reading:
- Slide: Tutorial on Estimation and Multivariate Gaussians by Shubhendu Trivedi
- Slide: Mixture Model by Jing Gao
- Paper: Fast Exact k-Means, k-Medians and Bregman Divergence Clustering in 1D
- Paper: k-Means Requires Exponentially Many Iterations Even in the Plane by Andrea Vattani
- Book: Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
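For a concrete feel for representative-based clustering, here is a bare-bones Lloyd's k-means sketch in Python/NumPy on synthetic two-blob data; the initialization, stopping rule, and toy data are all arbitrary choices for illustration.

```python
# A bare-bones Lloyd's k-means, to make the algorithm in Chapter 13 of
# Data Mining & Analysis concrete; the toy two-blob data is synthetic.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial means
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly (0, 0) and (5, 5)
```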
Lecture 9: Hierarchical Clustering (an agglomerative-clustering sketch follows this reading list)
Required Reading:
- Chapter 14 of Data Mining & Analysis
  Exercises 14.4: Q4
- Slides (Hierarchical Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
- Slide: Hierarchical Clustering by Jonathan Taylor
- Slide: Data Structures (Heap) by Wing-Kai Hon
Additional Reading:
- Slide: Hierarchical Clustering for Gene Expression Data Analysis by Giorgio Valentini
- Slide: Hierarchical Clustering by Jing Gao
- Slide: Binary Heaps
- A Short Note: Proof for the Complexity of Building a Heap by Hu Ding
- Lecture: Finding Meaningful Clusters in Data by Sanjoy Dasgupta
- Paper: An Impossibility Theorem for Clustering by Jon Kleinberg
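The sketch below runs a deliberately naive agglomerative clustering with single linkage on five made-up 1-D points; real implementations (as discussed in the heap-related readings above) use priority queues or library routines such as scipy.cluster.hierarchy.

```python
# A naive O(n^3) agglomerative clustering with single linkage, illustrating the
# procedure in Chapter 14 of Data Mining & Analysis; the five 1-D points are arbitrary.
import numpy as np

def single_link_agglomerative(X, num_clusters):
    clusters = [[i] for i in range(len(X))]          # start with singleton clusters
    while len(clusters) > num_clusters:
        best = (None, None, np.inf)
        # Find the pair of clusters with the smallest single-link distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])              # merge the closest pair
        del clusters[b]
    return clusters

X = np.array([[0.0], [0.2], [0.3], [5.0], [5.1]])
print(single_link_agglomerative(X, num_clusters=2))  # [[0, 1, 2], [3, 4]]
```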
Lecture 10: Density-Based Clustering (a DBSCAN sketch follows this reading list)
Required Reading:
- Chapter 15 of Data Mining & Analysis
- Slides of Section 15.1 (Density-based Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
- Slide: Spatial Database Systems by Ralf Hartmut Güting
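A compact DBSCAN-style sketch, mirroring the density-based idea of Section 15.1; the eps and minpts values and the toy data are arbitrary illustrative choices.

```python
# A compact DBSCAN, in the spirit of the density-based algorithm in Section 15.1
# of Data Mining & Analysis; parameters and toy data are chosen arbitrarily.
import numpy as np

def dbscan(X, eps=0.5, minpts=3):
    n = len(X)
    labels = np.full(n, -1)          # -1 means noise / unassigned
    neighbors = [np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)
                 for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < minpts:
            continue                 # already assigned, or not a core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:              # expand the cluster by density-reachability
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= minpts:   # j is itself a core point
                    frontier.extend(neighbors[j])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2)), [[10.0, 10.0]]])
print(dbscan(X))                     # two dense clusters; the far-away point stays -1 (noise)
```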
Lecture 11: Spectral and Graph Clustering (a spectral-partitioning sketch follows this reading list)
Required Reading:
- Chapter 16 of Data Mining & Analysis
  Exercises 16.5: Q2, Q3, & Q6
- Slides (Spectral and Graph Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
- Slide: Spectral Clustering by Andrew Rosenberg
- Slide: Introduction to Spectral Clustering by Vasileios Zografos and Klas Nordberg
Additional Reading:
- Slide: Spectral Methods by Jing Gao
- Tutorial: A Tutorial on Spectral Clustering by Ulrike von Luxburg
- Tutorial: Matrix Differentiation by Randal J. Barnes
- Lecture: Spectral Methods by Sanjoy Dasgupta
- Paper: Positive Semidefinite Matrices and Variational Characterizations of Eigenvalues by Wing-Kin Ma
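The following sketch performs a spectral bi-partition using the sign of the Fiedler vector of the unnormalized graph Laplacian; the six-node "two triangles joined by one edge" graph is invented purely for illustration.

```python
# Minimal spectral bi-partitioning via the Fiedler vector of the unnormalized
# graph Laplacian, to make Chapter 16 of Data Mining & Analysis concrete.
import numpy as np

# Adjacency matrix: nodes 0-2 form a triangle, nodes 3-5 form a triangle,
# and a single edge (2, 3) connects the two groups.
A = np.zeros((6, 6))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # unnormalized graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]             # eigenvector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)  # split nodes by the sign of the Fiedler vector
print(labels)                       # separates nodes {0, 1, 2} from {3, 4, 5}
```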
Lecture 12: Clustering Validation (an NMI sketch follows this reading list)
Required Reading:
- Chapter 17 of Data Mining & Analysis
- Slides of Section 17.1 (Clustering Validation): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
- Slide: Clustering Analysis by Enza Messina
- Slide: Information Theory by Jossy Sayir
- Slide: Normalized Mutual Information: Estimating Clustering Quality by Bilal Ahmed
Additional Reading:
- Slide: Clustering Evaluation (II) by Andrew Rosenberg
- Slide: Evaluation (I) by Andrew Rosenberg
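As one concrete external validation measure from Chapter 17, the sketch below computes normalized mutual information between two flat clusterings, using the geometric-mean normalization; the two label vectors are toy inputs.

```python
# Normalized mutual information (NMI) between two flat clusterings, one of the
# external validation measures in Chapter 17 of Data Mining & Analysis.
import numpy as np

def nmi(labels_a, labels_b):
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(labels_a)
    clusters_a, clusters_b = np.unique(labels_a), np.unique(labels_b)
    # Contingency table: joint distribution of the two clusterings.
    p_ab = np.array([[np.sum((labels_a == a) & (labels_b == b)) / n
                      for b in clusters_b] for a in clusters_a])
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    mi = entropy(p_a) + entropy(p_b) - entropy(p_ab.ravel())  # I(A;B) = H(A)+H(B)-H(A,B)
    return mi / np.sqrt(entropy(p_a) * entropy(p_b))          # normalize by sqrt(H(A)H(B))

truth      = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clustering = [0, 0, 1, 1, 1, 1, 2, 2, 2]
print(round(nmi(truth, clustering), 3))   # about 0.79; identical labelings would give 1.0
```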
Lecture 13: Probabilistic Classification (a naive Bayes sketch follows this reading list)
Required Reading:
- Chapter 18 of Data Mining & Analysis
- Slides (Probabilistic Classification): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
- Slide: Naïve Bayes Classifier by Eamonn Keogh
Additional Reading:
- Slide: Bayes Nets for Representing and Reasoning About Uncertainty by Andrew W. Moore
- Slide: A Tutorial on Bayesian Networks by Weng-Keen Wong
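A tiny categorical naive Bayes classifier with Laplace smoothing, to illustrate the probabilistic classification idea of Chapter 18 (the chapter also covers the Gaussian case for numeric attributes); the weather-style training data is made up.

```python
# A tiny categorical naive Bayes classifier with add-alpha (Laplace) smoothing,
# in the spirit of Chapter 18 of Data Mining & Analysis; the toy data is invented.
from collections import Counter, defaultdict
import math

def train(X, y, alpha=1.0):
    """Estimate log P(class) and log P(feature value | class) with add-alpha smoothing."""
    n_features = len(X[0])
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)        # (class, feature index) -> value counts
    values = [set() for _ in range(n_features)]
    for xi, yi in zip(X, y):
        for f, v in enumerate(xi):
            value_counts[(yi, f)][v] += 1
            values[f].add(v)
    priors = {c: math.log(cnt / len(y)) for c, cnt in class_counts.items()}
    def log_likelihood(c, f, v):
        return math.log((value_counts[(c, f)][v] + alpha) /
                        (class_counts[c] + alpha * len(values[f])))
    return priors, log_likelihood

def predict(x, priors, log_likelihood):
    # Pick the class maximizing log P(class) + sum of log P(value | class).
    scores = {c: priors[c] + sum(log_likelihood(c, f, v) for f, v in enumerate(x))
              for c in priors}
    return max(scores, key=scores.get)

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
y = ["no", "no", "yes", "yes"]
priors, ll = train(X, y)
print(predict(("rainy", "mild"), priors, ll))   # "yes"
```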
Lecture 14: Decision Tree Classifier (an information-gain sketch follows this reading list)
Required Reading:
- Chapter 19 of Data Mining & Analysis
- Slides (Decision Tree Classifier): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
- Slide: Information Gain by Linda Shapiro
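The information-gain computation behind decision-tree splitting, shown for a single hand-picked candidate split; the labels are arbitrary toy data.

```python
# Entropy and information gain for one candidate split, the quantity behind the
# decision-tree criterion in Chapter 19 of Data Mining & Analysis.
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_c p_c log2 p_c over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Reduction in entropy from splitting `labels` into `left` and `right`."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

parent = ["+", "+", "+", "+", "-", "-", "-", "-"]
left, right = ["+", "+", "+", "-"], ["+", "-", "-", "-"]
print(round(information_gain(parent, left, right), 3))   # a modest gain of about 0.189
```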
- Practical Data Science by Zico Kolter
- Course: Data Mining by U Kang
- Crash Course in Spark by Daniel Templeton
- Statistical Data Mining Tutorials by Andrew W. Moore
Class Time and Location:
Saturday and Monday, 08:00-09:30 AM (Fall 2018), Room 208.
Grading:
- Homework – 15%
  Homework will consist of mathematical problems and/or programming assignments.
- Midterm – 35%
- Endterm – 50%
Midterm Examination: Monday 1397/09/12, 08:00-10:00
Final Examination: Sunday 1397/10/16, 08:30-10:30
Prerequisites:
General mathematical sophistication and a solid understanding of Algorithms, Linear Algebra, and Probability Theory at the advanced undergraduate or beginning graduate level, or equivalent.
- Video: Professor Gilbert Strang's Video Lectures on linear algebra.
- Learn Probability and Statistics Through Interactive Visualizations: Seeing Theory was created by Daniel Kunin while an undergraduate at Brown University. The goal of this website is to make statistics more accessible through interactive visualizations (designed using Mike Bostock’s JavaScript library D3.js).
- Statistics and Probability: This website provides training and tools to help you solve statistics problems quickly, easily, and accurately - without having to ask anyone for help.
- Jupyter NoteBooks: Introduction to Statistics by Bargava
- Video: Professor John Tsitsiklis's Video Lectures on Applied Probability.
- Video: Professor Krishna Jagannathan's Video Lectures on Probability Theory.
Have a look at some project reports by Kaggle or Stanford students (CS224N, CS224D) to get some general inspiration.
Account:
It is necessary to have a GitHub account to share your projects. GitHub offers both free accounts and plans for private repositories. GitHub is like the hammer in your toolbox, so you need to have it!
Academic Honor Code:
Honesty and integrity are vital elements of academic work. All of your submitted assignments must be entirely your own (or your own group's).
We will follow the standard approach of the Department of Mathematical Sciences:
- You can get help, but you MUST acknowledge that help on the work you hand in.
- Failure to acknowledge your sources is a violation of the Honor Code.
- You can talk to others about the algorithm(s) to be used to solve a homework problem, as long as you then mention their name(s) on the work you submit.
- You should not use or look at others' code when writing your own: you can talk to people, but you must write your own solution/code.
Questions:
I will hold office hours for this course on Mondays, 09:30 AM to 12:00 PM. If this time is not convenient, email me at hhaji@sbu.ac.ir or talk to me after class.