University of Southern California FALL 2023
Notice: This repository may be closed at the start of the new semester, so if you find the code useful, please back it up in advance. A star would also be appreciated. ✨
This repository contains my coursework and projects for DSCI 553, a comprehensive data mining course taught by Professor Wei-Min Shen. The course focuses on the algorithms and techniques used for analyzing massive datasets, with an emphasis on system building using Apache Spark.
- Understand the fundamental algorithms of data mining and machine learning for large-scale data analysis.
- Gain practical experience with big data technologies, particularly Spark.
- Develop skills in designing and implementing scalable systems for real-world data mining problems.
- Introduction to Data Mining and MapReduce
- Large-Scale File Systems and Big Data Technologies
- Frequent Itemsets and Association Rules
- Mining of Massive Datasets including Streaming Data
- Recommendation Systems and Collaborative Filtering
- Analysis of Massive Graphs and Social Networks
- Link Analysis and Web Advertising
- Advanced Topics in Clustering and Classification
Some important hints may be contained in the README files, so don't miss them!
This section provides an overview of the six homework assignments and the final project, all implemented using PySpark. Each assignment was designed to reinforce the concepts covered in the lectures and readings.
- Assignment 1: Data Exploration with Spark (7.2/7.7) - Introduction to basic Spark operations and data handling.
- Assignment 2: SON Algorithm Implementation (7/7.7) - Building and optimizing the SON algorithm for market basket analysis.
- Assignment 3: Recommendation Systems (7/7.7) - Developing a recommendation system using collaborative filtering and matrix factorization techniques.
- Assignment 4: Graph Analysis with GraphFrames (7/7.7) - Implementing community detection using the GraphFrames package for Spark.
- Assignment 5: Streaming Algorithms (7/7.7) - Exploring streaming algorithms such as Bloom filtering and Flajolet-Martin on simulated data streams.
- Assignment 6: Clustering with the BFR Algorithm (7/7.7) - Applying the BFR clustering algorithm to synthetic datasets to cluster large volumes of data efficiently.
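To give a flavor of the streaming techniques named in Assignment 5, here is a minimal Bloom filter sketch in plain Python. It is not the assignment's solution: the class name `BloomFilter`, the method names, and the use of salted SHA-256 digests as hash functions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter over strings (hypothetical helper, not the course's code)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position: wasteful, but easy to read.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        # Assumption: simulate k hash functions by salting SHA-256 with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit position the item hashes to.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not seen"; True means "possibly seen".
        return all(self.bits[pos] for pos in self._positions(item))
```

A Bloom filter never reports a false negative, so any item passed to `add` will always satisfy `might_contain`. False positives are possible, and their rate depends on the number of bits, the number of hash functions, and how many items have been added.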
The final project enhanced the recommendation system developed in Assignment 3, focusing on improving prediction accuracy and computational efficiency. The project was part of a competition; see the project documentation for details.
- Competition Project (8/8) - Advanced work on the recommendation system, including detailed objectives, implementation strategies, and performance outcomes. Highlights include the integration of multiple data sources from Yelp and the application of hybrid models to enhance recommendation accuracy.
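The overview above does not say which hybrid scheme the project used. One common option is a weighted hybrid, which blends the outputs of two recommenders. The sketch below assumes that approach; the function name `blend_predictions`, the weight `alpha`, and the choice of a collaborative filtering score plus a model-based score are hypothetical. The default clamp range reflects Yelp's 1 to 5 star scale.

```python
def blend_predictions(
    cf_score: float,
    model_score: float,
    alpha: float = 0.5,
    low: float = 1.0,
    high: float = 5.0,
) -> float:
    """Weighted hybrid of two rating predictions (illustrative sketch only)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # alpha weights the collaborative filtering score; the rest goes to the model.
    blended = alpha * cf_score + (1.0 - alpha) * model_score
    # Clamp to the valid rating range so out-of-range inputs cannot leak through.
    return max(low, min(high, blended))
```

In practice the weight would be tuned on a validation set rather than fixed by hand.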
- Apache Spark: Utilized for large-scale data processing.
- Python: Primary programming language for implementing data mining algorithms.
- Scala: Additional implementations to complement PySpark scripts.
Please refer to the resource folder for setup instructions.
Other Useful Links: