Outlier-Detection

This project uses PySpark to detect outliers in, and cluster, a dataset of bid requests for online advertisements, using the K-means and Bisecting K-means algorithms.

Introduction

The code is written in Python, runs in a Jupyter Notebook, and uses PySpark for distributed data processing. It has two goals: group the bid requests into clusters with K-means and Bisecting K-means, and flag as outliers the requests that lie far from their closest cluster centroid.

Getting Started

Prerequisites

Before running the code, make sure you have the following:

Python and Jupyter Notebook installed on your system.
PySpark installed. You can install it using the following command:

pip install pyspark

Instructions

Open the Jupyter Notebook file named "FinalProjectA_318170917_322995358.ipynb".
Run the code cells in order. Each stage is explained alongside its code.
Data preprocessing and feature engineering run first.
The K-means and Bisecting K-means clustering algorithms are applied next, and outliers are then detected from the resulting centroids.
Finally, the clustering and outlier detection results are visualized.

Code Explanation

The code can be divided into the following sections:

Data Preparation: The data is read from the JSON files and preprocessed to flatten nested JSON structures.
Feature Selection: The relevant features for clustering and outlier detection are chosen.
One-Hot Encoding: Categorical features are one-hot encoded to be used in the clustering process.
Feature Scaling: The selected features are scaled with StandardScaler so that features with large numeric ranges do not dominate the distance calculations used in clustering.
K-means Clustering: The K-means algorithm is applied to cluster the data into groups.
Bisecting K-means Clustering: The Bisecting K-means algorithm is applied so that its clusters can be compared with the K-means results.
Outlier Detection: A data point is flagged as an outlier when its distance from the closest centroid is unusually large.
Results Visualization: PCA projects the features into two dimensions so that the clustering and outlier detection results can be plotted.

Results

The project provides two main results:

Clustering Results: The data is clustered into multiple groups using K-means and Bisecting K-means algorithms. The results are visualized using PCA to show the clusters in a 2D space.
Outlier Detection Results: Data points that lie far from their closest centroid are identified as outliers and visualized.
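
One way to turn the distance from the closest centroid into an outlier flag is sketched below in plain Python. The cutoff rule (mean distance plus two standard deviations), the function names, and the sample points are assumptions made for illustration; the notebook may choose its threshold differently.

```python
import math
import statistics


def closest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its nearest centroid."""
    return min(math.dist(point, centroid) for centroid in centroids)


def flag_outliers(points, centroids, n_std=2.0):
    """Flag points whose closest-centroid distance exceeds mean + n_std * std.

    The cutoff rule is an assumption for this sketch, not taken from the notebook.
    """
    distances = [closest_centroid_distance(p, centroids) for p in points]
    threshold = statistics.fmean(distances) + n_std * statistics.pstdev(distances)
    return [d > threshold for d in distances]


# Hypothetical 2-D centroids and points, invented for the example.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [
    (0.1, 0.2),
    (0.3, -0.1),
    (9.8, 10.1),
    (10.2, 9.9),
    (-0.2, 0.1),
    (10.1, 10.3),
    (5.0, 5.0),  # far from both centroids
]

flags = flag_outliers(points, centroids)
outliers = [p for p, is_outlier in zip(points, flags) if is_outlier]
print(outliers)
```

With these sample values only the point (5.0, 5.0) is flagged, since it sits roughly 7 units from either centroid while every other point is within about 0.3 units of one.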