Primary language: Jupyter Notebook. License: GNU General Public License v3.0 (GPL-3.0).

Decision-Trees-in-PySpark-Project

Project Overview

This project focuses on leveraging decision trees in PySpark for both classification and regression tasks. The project is divided into two main parts:

Theoretical Section:

  • Introduction to Decision Trees: The principles behind decision trees, how they partition the feature space, how they are trained and used for prediction, and their advantages and limitations.
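The partitioning idea can be illustrated without Spark. The following is a minimal sketch, not the notebook's code: it scores every candidate threshold on a single numeric feature by weighted Gini impurity and keeps the best one. The function names `gini` and `best_split` and the toy measurements are assumptions made up for illustration. Gini impurity is also the default `impurity` setting of PySpark's `DecisionTreeClassifier`.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature.

    If no split improves on the unsplit node, threshold is None.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        threshold = (lo + hi) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Impurity of the two children, weighted by their share of the samples
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data, made up for illustration: petal length (cm) and species label.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "virginica"]
threshold, impurity = best_split(petal_length, species)
print(threshold, round(impurity, 3))
```

A full tree repeats this search over all features and then recurses into each child until a stopping rule (such as a maximum depth) is reached.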

Practical Section:

a. Classification Trees:

  1. Exploratory Data Analysis (EDA) of Iris Dataset:

    • Perform EDA on the Iris dataset, including visualizations and correlation maps, then split the dataset into training and testing sets.
  2. Classification Tree Model:

    • Create a decision tree classification model using PySpark.
    • Train the model on the training set and evaluate its performance on the test set.
    • Assess the model using precision, accuracy, and a confusion matrix.
    • Visualize the decision tree's predictions.
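The classification steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the notebook's actual code: the column names (`sepal_length`, `sepal_width`, `petal_length`, `petal_width`, `species`), the 80/20 split, the seed, and `maxDepth=3` are all hypothetical choices, and `train_iris_tree` and `precision_and_accuracy` are names introduced here. The PySpark imports sit inside the function so that the plain-Python helper can be used without a Spark installation.

```python
def precision_and_accuracy(confusion):
    """Derive accuracy and per-class precision from a confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    accuracy = correct / total if total else 0.0
    precision = []
    for j in range(n_classes):
        predicted_j = sum(confusion[i][j] for i in range(n_classes))
        precision.append(confusion[j][j] / predicted_j if predicted_j else 0.0)
    return accuracy, precision


def train_iris_tree(df, seed=42):
    """Train and evaluate a decision tree on a Spark DataFrame of Iris rows.

    Assumes (hypothetically) four numeric feature columns and a string
    column named "species".
    """
    # Imported lazily so the helper above works without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    train, test = df.randomSplit([0.8, 0.2], seed=seed)

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="species", outputCol="label"),
        VectorAssembler(inputCols=features, outputCol="features"),
        DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=3),
    ])
    model = pipeline.fit(train)          # fit on the training set only
    predictions = model.transform(test)  # score the held-out test set

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction")
    accuracy = evaluator.evaluate(predictions, {evaluator.metricName: "accuracy"})
    precision = evaluator.evaluate(
        predictions, {evaluator.metricName: "weightedPrecision"})

    # Confusion matrix as counts per (label, prediction) pair
    confusion = (predictions.groupBy("label", "prediction")
                 .count().orderBy("label", "prediction"))
    return model, accuracy, precision, confusion
```

Given a DataFrame `iris_df` with those columns, `train_iris_tree(iris_df)` returns the fitted pipeline, two scalar scores, and a small DataFrame of confusion counts. `precision_and_accuracy` shows how the same metrics follow from a confusion matrix once it has been collected into nested lists.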

b. Regression Trees:

  1. Random Number Dataset Generation:

    • Generate a dataset of random numbers for regression purposes.
    • Conduct EDA on the generated dataset, including visualizations, then split it into training and testing sets.
  2. Regression Tree Model:

    • Develop a decision tree regression model using PySpark.
    • Train the model on the training set and evaluate its performance on the test set.
    • Assess the model using regression metrics such as RMSE, MAE, and R², since precision, accuracy, and confusion matrices apply only to classification.
    • Visualize the decision tree's predictions.
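The regression workflow might look like the sketch below. It is written under assumptions and does not reproduce the notebook: the generating rule `y = 3x + noise`, the column names `x` and `y`, the 80/20 split, and `maxDepth=4` are arbitrary illustrative choices, and the function names are hypothetical. Because the target is continuous, the sketch scores the model with RMSE, MAE, and R² through PySpark's `RegressionEvaluator`. `regression_metrics` recomputes the same quantities in plain Python as a cross-check.

```python
import math
import random


def make_random_rows(n=200, seed=0):
    """Generate (x, y) rows with y = 3x + Gaussian noise (an arbitrary rule)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        rows.append((x, 3.0 * x + rng.gauss(0.0, 1.0)))
    return rows


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
    return rmse, mae, r2


def train_regression_tree(spark, rows, seed=42):
    """Fit a DecisionTreeRegressor on (x, y) rows and score it on a test split."""
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import DecisionTreeRegressor

    df = spark.createDataFrame(rows, ["x", "y"])
    df = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
    train, test = df.randomSplit([0.8, 0.2], seed=seed)

    tree = DecisionTreeRegressor(featuresCol="features", labelCol="y", maxDepth=4)
    model = tree.fit(train)
    predictions = model.transform(test)

    evaluator = RegressionEvaluator(labelCol="y", predictionCol="prediction")
    scores = {
        name: evaluator.evaluate(predictions, {evaluator.metricName: name})
        for name in ("rmse", "mae", "r2")
    }
    return model, scores
```

With an active `SparkSession` named `spark`, `train_regression_tree(spark, make_random_rows())` returns the fitted model and a dictionary of the three scores. Changing `maxDepth` is a quick way to see how tree depth trades bias against variance on this kind of data.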

Feel free to explore the code, adapt it to different datasets, and experiment with various decision tree parameters to enhance model performance. Happy exploring!