Outlier detection and clustering on online-advertisement bid request data with PySpark, using the K-means and Bisecting K-means algorithms.
The code is written in Python and uses PySpark for distributed data processing. Its goal is to cluster a dataset of bid requests for online advertisements with K-means and Bisecting K-means, and to detect outliers in that dataset.
Before running the code, make sure you have the following:
- Python installed on your system.
- PySpark installed. You can install it using the following command:
pip install pyspark
To run the project:
- Open the Jupyter Notebook file named "FinalProjectA_318170917_322995358.ipynb".
- Execute the code cells in order. Each stage is explained in the notebook.
- Data preprocessing, feature engineering, and outlier detection run first.
- The K-means and Bisecting K-means clustering algorithms are applied next.
- Finally, the clustering and outlier detection results are visualized.
The code can be divided into the following sections:
- Data Preparation: The data is read from the JSON files and preprocessed to flatten nested JSON structures.
- Feature Selection: The relevant features for clustering and outlier detection are chosen.
- One-Hot Encoding: Categorical features are one-hot encoded to be used in the clustering process.
- Feature Scaling: The selected features are scaled using StandardScaler so that features on larger scales do not dominate the distance calculations used in clustering.
- K-means Clustering: The K-means algorithm is applied to cluster the data into groups.
- Bisecting K-means Clustering: The Bisecting K-means algorithm is applied so that its clusters can be compared with those from K-means.
- Outlier Detection: Outliers are detected based on each point's distance to its closest centroid.
- Results Visualization: The results of clustering and outlier detection are visualized using PCA.
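The preparation steps above (flattening, one-hot encoding, scaling) can be illustrated with a minimal plain-Python sketch. This is not the notebook's PySpark code: the field names (`device`, `os`, `bid`, `floor`) and the helper functions are hypothetical, and the scaling step assumes division by the sample standard deviation without centering.

```python
import math

def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts, e.g. {"a": {"b": 1}} becomes {"a_b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

def one_hot(values):
    """Return the sorted category list and one 0/1 vector per input value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = [
        [1.0 if index[v] == i else 0.0 for i in range(len(categories))]
        for v in values
    ]
    return categories, vectors

def scale_to_unit_std(column):
    """Divide each value by the sample standard deviation (no centering)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [x / std if std > 0 else 0.0 for x in column]

# Hypothetical bid requests; the real dataset's schema is not shown here.
records = [
    {"device": {"os": "android"}, "bid": {"floor": 0.10}},
    {"device": {"os": "ios"}, "bid": {"floor": 0.30}},
    {"device": {"os": "android"}, "bid": {"floor": 0.50}},
]

rows = [flatten(r) for r in records]
categories, os_vectors = one_hot([r["device_os"] for r in rows])
scaled_floor = scale_to_unit_std([r["bid_floor"] for r in rows])

# One feature vector per request: one-hot OS columns plus the scaled bid floor.
features = [vec + [s] for vec, s in zip(os_vectors, scaled_floor)]
```

Each resulting row in `features` is a numeric vector, which is the form a distance-based algorithm such as K-means needs.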
The project provides two main results:
- Clustering Results: The data is clustered into multiple groups using K-means and Bisecting K-means algorithms. The results are visualized using PCA to show the clusters in a 2D space.
- Outlier Detection Results: Outliers are detected based on the distance of data points from the closest centroid. The outliers are identified and visualized.
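The distance-to-centroid idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's PySpark implementation: the sample points, the fixed starting centroids, and the cutoff of mean plus two standard deviations are all hypothetical, since the threshold rule used in the project is not stated here.

```python
import math

def closest(point, centroids):
    """Return (index, distance) of the centroid nearest to the point."""
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(point, c)))
        for c in centroids
    ]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]

def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[closest(p, centroids)[0]].append(p)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def flag_outliers(points, centroids, k_std=2.0):
    """Flag points whose distance to the closest centroid exceeds mean + k_std * std."""
    distances = [closest(p, centroids)[1] for p in points]
    mean = sum(distances) / len(distances)
    std = math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))
    threshold = mean + k_std * std  # assumed cutoff, chosen for illustration
    return [p for p, d in zip(points, distances) if d > threshold]

# Two tight groups plus one point far from both (made-up data).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0),
]
centroids = kmeans(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
outliers = flag_outliers(points, centroids)
```

In this sketch the far point pulls its cluster's centroid toward itself, yet it still lies much farther from that centroid than any other point does, so it is the only one flagged. A different cutoff (for example a percentile of the distances) would change which points count as outliers.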