Decision-Trees-in-PySpark-Project

Project Overview

This project focuses on leveraging decision trees in PySpark for both classification and regression tasks. The project is divided into two main parts:

Theoretical Section:

Introduction to Decision Trees: Understanding the principles of decision trees: how they partition the feature space, how they are trained and make predictions, and their advantages and limitations.
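
To make the partitioning idea concrete, here is a minimal pure-Python sketch of how a classification tree might score candidate splits on one numeric feature using Gini impurity (the default impurity for Spark's `DecisionTreeClassifier`). The helper names `gini` and `best_threshold` and the six-row toy sample are hypothetical and exist only for illustration; Spark's own implementation works on binned features and is considerably more involved.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best binary split.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit impurity, threshold is None.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)
    return best


# Toy sample, made up for illustration (not taken from the project).
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]

threshold, impurity = best_threshold(petal_length, species)
print(f"split at petal_length <= {threshold}, weighted Gini = {impurity:.3f}")
```

A tree is grown by applying this kind of search recursively to each resulting partition until a stopping rule such as a maximum depth is reached.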

Practical Section:

a. Classification Trees:

Exploratory Data Analysis (EDA) of Iris Dataset:
- Perform EDA on the Iris dataset, including visualizations and correlation maps, then split the dataset into training and testing sets.
Classification Tree Model:
- Create a decision tree classification model using PySpark.
- Train the model on the training set and evaluate its performance on the test set.
- Assess the model using precision, accuracy, and a confusion matrix.
- Visualize the decision tree's predictions.

b. Regression Trees:

Random Number Dataset Generation:
- Generate a dataset of random numbers for regression purposes.
- Conduct EDA on the generated dataset, including visualizations, then split it into training and testing sets.
Regression Tree Model:
- Develop a decision tree regression model using PySpark.
- Train the model on the training set and evaluate its performance on the test set.
- Assess the model using regression error metrics such as RMSE, MAE, and R².
- Visualize the decision tree's predictions.

Feel free to explore the code, adapt it to different datasets, and experiment with various decision tree parameters to enhance model performance. Happy exploring!