CS 145: Introduction to Data Mining

Instructor: Yizhou Sun

  • Lecture Time: Tuesday/Thursday 10-11:50am
  • Classroom: WG Young CS24
  • Office hours: Monday 2-3 and Tuesday 4:15-5:00 @ Zoom

TAs:

  • Zongyue Qin (qinzongyue at cs.ucla.edu), office hours: Monday 9-11am @ BH 3551 (row M)
  • Yewen Wang (wyw10804@gmail.com; please check Yewen's Email Policy before emailing her), office hours: Wednesday 9-10am @ BH 3551 Conference Room, 10-11am @ Zoom
  • Shichang Zhang (myfirstname@cs.ucla.edu), office hours: Friday 10am-12pm @ BH 3551 Conference Room (email me if you can't find the place)

Course Description

This course introduces basic concepts, algorithms, and techniques of data mining on different types of datasets, including (1) vector data, (2) set data, (3) sequence data, (4) text data, and (5) graph data. The class project involves hands-on practice of mining useful knowledge from large data sets. This is an undergraduate-level computer science course, but it may also interest students from other disciplines who need to understand, develop, and use data mining techniques to analyze large amounts of data.

Prerequisites

  • You are expected to have background knowledge in data structures, algorithms, basic linear algebra, and basic statistics.
  • You will also need to be familiar with at least one programming language and have programming experience.

Learning Objectives

  • Know what data mining is and learn the basic algorithms
  • Develop skills to apply data mining algorithms to solve real-world applications
  • Gain initial experience in conducting research on data mining

Grading

  • Homework: 30%
  • Midterm exam: 20%
  • Final exam: 15%
  • Course project: 25%
  • Participation: 10%
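
As a rough illustration of how the weights above combine, here is a minimal sketch. It assumes each component is scored on a 0-100 scale and ignores late penalties; the function name and the sample scores are hypothetical and not part of the course policy.

```python
# Weights taken from the grading breakdown above (percent of the final score).
WEIGHTS = {
    "homework": 30,
    "midterm": 20,
    "final": 15,
    "project": 25,
    "participation": 10,
}


def weighted_score(scores):
    """Combine per-component scores (each assumed to be on a 0-100 scale)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    # Each weight is a percentage, so divide the weighted sum by 100.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {
    "homework": 90,
    "midterm": 80,
    "final": 70,
    "project": 85,
    "participation": 100,
}
print(weighted_score(example))
```

Because homework and the course project together carry 55% of the score, steady work across the quarter counts for more than either exam alone.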

*All deadlines are at 11:59 PM (midnight) on the due date.

*Late submission policy: you will get original score * , if you are t hours late.

*No copying or sharing of homework!

  • You can discuss general challenges and ideas with others.
  • Suspicious cases will be reported to The Office of the Dean of Students.

Q & A

  • We will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TAs, and myself. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza.
  • Sign up for Piazza here: piazza.com/ucla/fall2021/cs145
  • Tip: Answering other students' questions will increase your participation score.

Academic Integrity Policy

"With its status as a world-class research institution, it is critical that the University uphold the highest standards of integrity both inside and outside the classroom. As a student and member of the UCLA community, you are expected to demonstrate integrity in all of your academic endeavors. Accordingly, when accusations of academic dishonesty occur, The Office of the Dean of Students is charged with investigating and adjudicating suspected violations. Academic dishonesty, includes, but is not limited to, cheating, fabrication, plagiarism, multiple submissions or facilitating academic misconduct." For more information, please refer to the guidance.

Tentative Schedule

*Book refers to: Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition.

| Week | Date | Topic | Further Reading | Discussion Session | Homework | Course Project |
| Week 0 | 9/23 | Introduction [Slides] and Know Your Data [Slides] | | Week0 Slides | | |
| Week 1 | 9/28 | Linear Regression [Slides] | https://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf | | | |
| Week 1 | 9/30 | Logistic Regression [Slides] | https://cs229.stanford.edu/notes2021fall/cs229-notes1.pdf | Week 1 Slides | HW1 Released | |
| Week 2 | 10/5 | Tree-based Models [Slides] | | | | |
| Week 2 | 10/7 | Neural Networks [Slides] | | Week 2 Slides | HW1 Due (10/6 11:59pm), HW2 Released | |
| Week 3 | 10/12 | Continue with Neural Networks | | | | |
| Week 3 | 10/14 | Practical Issues of Classification [Slides] and K-Means [Slides] | Book Chapter 8.5; Book Chapter 10.1-10.4 | Week 3 Slides | | |
| Week 4 | 10/19 | Mixture Models [Slides] and Practical Issues of Clustering [Slides] | | | HW2 Due (10/18 11:59pm), HW3 Released | |
| Week 4 | 10/21 | Text Data: Naive Bayes [Slides] | http://www.ccs.neu.edu/home/yzsun/classes/2014Fall_CS6220/Slides/NB.pdf | Week 4 Slides | | |
| Week 5 | 10/26 | Text Data: Topic Models [Slides] | | | HW3 Due (10/25 11:59pm), HW4 Released | |
| Week 5 | 10/28 | Time Series Data [Slides] | https://online.stat.psu.edu/stat510 | Week 5 Slides | | |
| Week 6 | 11/2 | Continue with Time Series | | | | |
| Week 6 | 11/4 | Midterm Exam | | Week 6 Slides | HW4 Due | 11/7 Midterm Report Due |
| Week 7 | 11/9 | Set Data: Frequent Pattern Mining and Association Rules [Slides] | Book Chapter 6 | | | |
| Week 7 | 11/11 | Veterans Day holiday (No Class) | | | | |
| Week 8 | 11/16 | Set Data: Frequent Pattern Mining and Association Rules (same as above) | Book Chapter 6 | | | |
| Week 8 | 11/18 | Set Data: Frequent Pattern Mining and Association Rules (same as above) | Book Chapter 6 | | HW5 Due (11/18 11:59pm), HW6 Released | |
| Week 9 | 11/23 | Sequence Data: Sequential Pattern Mining [Slides] | Book Chapter 8 | Week 8 Slides | | |
| Week 9 | 11/25 | Thanksgiving holiday (No Class) | | | | |
| Week 10 | 11/30 | Graph Data: Random Walk [Slides], Classification and Clustering [Slides] | | Week10 Slides | | |
| Week 10 | 12/2 | Bias, Privacy, and Ethics [Slides] | | | | 12/5 Kaggle Submission Stop |
| Week 11 | 12/9 | Final Exam | | | | 12/10 Final Report Due |