This Python project processes and analyzes a Spotify dataset to predict patterns using the K-Nearest Neighbors (KNN) algorithm. The program checks data integrity, preprocesses the dataset, encodes features, and applies KNN to predict outcomes based on provided inputs.
- Dataset integrity check and automatic correction.
- Data loading and preprocessing with encoding.
- Splitting dataset into training and testing sets.
- Combining and reshaping data for analysis.
- Utilization of K-Nearest Neighbors (KNN) for predictions.
- Detailed output of predictions and cosine similarity results.
- Advanced data retrieval and plotting of top results.
- Python 3.x
- Libraries: numpy, matplotlib, seaborn, sklearn, scipy, os
- A Spotify dataset file named correct_dataset.csv located in a data directory.
Clone this repository and ensure that all required Python libraries are installed by running:
pip install numpy matplotlib seaborn scikit-learn scipy
To use this program, follow these steps:
- Prepare the Dataset: Ensure the Spotify dataset file named correct_dataset.csv is located in the ../data/ directory relative to the script. If the dataset is not correct, the program will attempt to automatically fix it by referencing a file named spotify_dataset.csv.
- Run the Script: Execute the script in your Python environment using the command:
  python spotify_analyzer.py
- Follow the on-screen prompts to interact with the program.
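
As a rough illustration of the dataset check in the first step, the sketch below looks for correct_dataset.csv and, if it is missing, falls back to spotify_dataset.csv. The function name `ensure_dataset` and the copy-as-repair strategy are assumptions made for illustration; the actual script may validate and correct the file differently.

```python
from pathlib import Path
import shutil


def ensure_dataset(data_dir="../data"):
    """Return the path to correct_dataset.csv, creating it from the fallback if needed.

    Hypothetical helper: the real correction logic is not documented here,
    so this sketch simply copies spotify_dataset.csv when the expected
    file is absent.
    """
    data_dir = Path(data_dir)
    target = data_dir / "correct_dataset.csv"
    fallback = data_dir / "spotify_dataset.csv"

    if target.is_file():
        return target
    if not fallback.is_file():
        raise FileNotFoundError(f"Neither {target} nor {fallback} was found.")

    # Assumed repair step: reuse the raw file as the corrected dataset.
    shutil.copyfile(fallback, target)
    return target
```

Calling `ensure_dataset()` before loading the data would give the rest of the script a single, known-good path to work with.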
- Data Integrity Check: Initially, the program checks if the required dataset exists and is correct. If not, it calls a function to correct the dataset.
- Data Loading and Encoding: The dataset is loaded and encoded to transform raw data into a format suitable for machine learning.
- Training and Testing: The data is split into training and test sets, with 75% of the data used for training and the remaining 25% for testing.
- Prediction and Analysis: KNN is used to predict the outcomes based on the test dataset. Predictions and their accuracy are then printed out.
- Results Interpretation: The program allows users to input an ID to find related entries and prints a list of potential related IDs based on the predictions.
- Visualization: A bar chart of the top ten most successful songs is displayed, showing each song's success rate.
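
The encoding, splitting, and prediction steps above can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions, not the project's actual code: the data is synthetic, the column names (`genres`, `tempo`), the toy target, and the choice of `n_neighbors=5` are all hypothetical. Only the 75% training share comes from the description above.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the Spotify data (assumption: real columns differ).
rng = np.random.default_rng(0)
genres = rng.choice(["pop", "rock", "jazz"], size=200)  # categorical feature
tempo = rng.normal(120, 15, size=200)                   # numeric feature
labels = (tempo > 120).astype(int)                      # toy target

# Encode the categorical column as numbers so KNN can compute distances.
genre_codes = OrdinalEncoder().fit_transform(genres.reshape(-1, 1))
X = np.column_stack([genre_codes, tempo])

# Use 75% of the rows for training and the remaining 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.75, random_state=0
)

# Fit KNN on the training set and predict the held-out rows.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on the test set: {accuracy:.2f}")
```

Because KNN relies on distances, features on very different scales can dominate the result; scaling the inputs before fitting is a common refinement that this sketch leaves out.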