DSCI 553: Foundations and Applications of Data Mining

University of Southern California, Fall 2023

Notice: This repository may be closed at the start of the new semester, so if you find this code useful, please make sure to back it up in advance. I would appreciate it if you could give me a Star. ✨

Overview

This repository contains my coursework and projects for DSCI 553, a comprehensive data mining course taught by Professor Wei-Min Shen. The course focuses on the algorithms and techniques used for analyzing massive datasets, with an emphasis on system building using Apache Spark.

Course Objectives

Understand the fundamental algorithms of data mining and machine learning for large-scale data analysis.
Gain practical experience with big data technologies, particularly Spark.
Develop skills in designing and implementing scalable systems for real-world data mining problems.

Key Topics Covered

Introduction to Data Mining and MapReduce
Large-Scale File Systems and Big Data Technologies
Frequent Itemsets and Association Rules
Mining of Massive Datasets including Streaming Data
Recommendation Systems and Collaborative Filtering
Analysis of Massive Graphs and Social Networks
Link Analysis and Web Advertising
Advanced Topics in Clustering and Classification

Assignments

Some important hints may be contained in the README files, so be sure to read them!

This section provides an overview of the six homework assignments and the final project, all implemented using PySpark. Each assignment was designed to reinforce the concepts covered in the lectures and readings.

Assignment 1: Data Exploration with Spark (7.2/7.7) - Introduction to basic Spark operations and data handling.
Assignment 2: SON Algorithm Implementation (7/7.7) - Building and optimizing the SON algorithm for market basket analysis.
Assignment 3: Recommendation Systems (7/7.7) - Developing a recommendation system using collaborative filtering and matrix factorization techniques.
Assignment 4: Graph Analysis with GraphFrames (7/7.7) - Implementing community detection using the GraphFrames package for Spark.
Assignment 5: Streaming Algorithms (7/7.7) - Exploring streaming algorithms like Bloom Filtering and Flajolet-Martin on simulated data streams.
Assignment 6: Clustering with the BFR Algorithm (7/7.7) - Applying the BFR algorithm to cluster large synthetic datasets efficiently.
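
As a rough illustration of the kind of streaming technique covered in Assignment 5, here is a minimal Bloom filter sketch in plain Python. It is not the code from this repository: the class name, the bit-array size, the number of hash functions, and the use of salted SHA-256 are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        # num_bits and num_hashes are illustrative defaults, not course values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash with an index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        # Mark every position derived from the item as seen.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))


# Simulate a short stream of hypothetical user IDs.
seen = BloomFilter()
for user_id in ["u1", "u2", "u3"]:
    seen.add(user_id)
```

Calling `seen.might_contain("u1")` returns `True` for any ID that was added. For an ID that was never added, the answer is usually `False` but can occasionally be `True`, which is the false-positive trade-off the filter accepts in exchange for constant memory.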

Final Project

The final project enhanced the recommendation system developed in Assignment 3, with a focus on improving prediction accuracy and computational efficiency. The project was part of a competition; see the project documentation for details.

Competition Project (8/8) - Advanced work on the recommendation system, including detailed objectives, implementation strategies, and performance outcomes. Highlights include the integration of multiple data sources from Yelp and the application of hybrid models to enhance recommendation accuracy.
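
One common way to build a hybrid recommender is to blend the predictions of two models with a fixed weight. The sketch below shows that idea only; it is not necessarily the approach used in this project. The function name, the `alpha` weight, and the choice of a collaborative-filtering score plus a model-based score are assumptions, and the clamp reflects Yelp's 1 to 5 star rating scale.

```python
def hybrid_prediction(cf_score: float, model_score: float, alpha: float = 0.5) -> float:
    """Blend two rating predictions; alpha is the weight on cf_score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # Weighted average of the two hypothetical model outputs.
    blended = alpha * cf_score + (1.0 - alpha) * model_score
    # Keep the result inside the 1-5 star range.
    return min(5.0, max(1.0, blended))


# Example with made-up scores for a single user-business pair.
example = hybrid_prediction(4.0, 3.0, alpha=0.5)
```

In practice the weight would be tuned on a validation set, for instance by picking the `alpha` that gives the lowest RMSE.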

Technologies Used

Apache Spark: Utilized for large-scale data processing.
Python: Primary programming language for implementing data mining algorithms.
Scala: Additional implementations to complement PySpark scripts.

Resources

Please refer to the resource folder for setup instructions.

KathleenLeeLi/USC-DSCI-553-Fall23