USC-DSCI-553-Fall23

This repository covers DSCI 553, a course on data mining, which is a crucial skill for analyzing massive datasets. The course explores algorithms for uncovering patterns in data, with a practical emphasis on applying data mining techniques to real-world problems.

Primary language: Python. License: MIT.

DSCI 553: Foundations and Applications of Data Mining

University of Southern California FALL 2023

Notice: This repository may be closed at the start of the new semester, so if you find this code useful, please make sure to back it up in advance. I would appreciate it if you could give me a Star. ✨

Overview

This repository contains my coursework and projects for DSCI 553, a comprehensive data mining course taught by Professor Wei-Min Shen. The course focuses on the algorithms and techniques used for analyzing massive datasets, with an emphasis on system building using Apache Spark.

Course Objectives

  • Understand the fundamental algorithms of data mining and machine learning for large-scale data analysis.
  • Gain practical experience with big data technologies, particularly Spark.
  • Develop skills in designing and implementing scalable systems for real-world data mining problems.

Key Topics Covered

  1. Introduction to Data Mining and MapReduce
  2. Large-Scale File Systems and Big Data Technologies
  3. Frequent Itemsets and Association Rules
  4. Mining of Massive Datasets including Streaming Data
  5. Recommendation Systems and Collaborative Filtering
  6. Analysis of Massive Graphs and Social Networks
  7. Link Analysis and Web Advertising
  8. Advanced Topics in Clustering and Classification
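As a small illustration of topic 3 (frequent itemsets), here is a minimal sketch of an A-Priori-style two-pass pair count. It is not taken from the course assignments: it uses plain Python rather than PySpark so that it runs without a cluster, and the function name `frequent_pairs`, the sample baskets, and the support threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, support):
    """Return item pairs that appear in at least `support` baskets."""
    # Pass 1: count single items (once per basket) and keep the frequent ones.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= support}

    # Pass 2: count only pairs whose members are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: n for pair, n in pair_counts.items() if n >= support}


# Hypothetical sample data, not from the course datasets.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs", "bread"],
]
result = frequent_pairs(baskets, support=3)
print(result)
```

Pruning infrequent single items before counting pairs is what keeps the second pass small. On a large dataset, the same two passes would typically be distributed across partitions instead of run in a single loop.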

Assignments

Some important hints may be contained in the README files, so don't miss them!

This section provides an overview of the six homework assignments and the final project, all implemented using PySpark. Each assignment was designed to reinforce the concepts covered in the lectures and readings.

Final Project

The final project involved enhancing the recommendation system developed in Assignment 3, focusing on improving prediction accuracy and computational efficiency. The project was part of a competition; details are available in the project documentation.

  • Competition Project (8/8) - Advanced work on the recommendation system, including detailed objectives, implementation strategies, and performance outcomes. Highlights include the integration of multiple data sources from Yelp and the application of hybrid models to enhance recommendation accuracy.
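To make the idea of a hybrid model concrete, here is a minimal sketch of one common approach: a weighted blend of two predictors' ratings, scored with RMSE. It is not the project's actual implementation. The function names `blend` and `rmse`, the weight `alpha`, and all the ratings are assumptions made for the example.

```python
import math


def blend(cf_pred, model_pred, alpha=0.5):
    """Weighted average of a collaborative-filtering and a model-based rating."""
    return alpha * cf_pred + (1 - alpha) * model_pred


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical ratings, chosen only to show the mechanics.
cf_ratings = [4.0, 3.0, 5.0]      # e.g. from item-based collaborative filtering
model_ratings = [3.0, 4.0, 4.0]   # e.g. from a feature-based regressor
actual = [3.5, 3.5, 4.5]

blended = [blend(c, m, alpha=0.5) for c, m in zip(cf_ratings, model_ratings)]
print(rmse(cf_ratings, actual), rmse(blended, actual))
```

In practice, the weight would be tuned on a validation set rather than fixed at 0.5. A blend only helps when the two predictors make different kinds of errors.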

Technologies Used

  • Apache Spark: Utilized for large-scale data processing.
  • Python: Primary programming language for implementing data mining algorithms.
  • Scala: Additional implementations to complement PySpark scripts.

Resources

Please refer to the resource folder for setup instructions.

Other Useful Links: