ds-project-template: A Jupyter Notebook repository from basel-ay

Overview

This assessment provides hands-on experience with the data science (DS) project workflow: missing values, outlier handling, standardization, visualization, and APIs. It comprises two main components:

Python Project: Handles missing values, outliers, standardization, and visualization.
Flask App: Provides RESTful APIs to interact with the dataset.

The assessment is divided into three parts:

Data Cleaning and Preparation
Statistical Analysis
Data Visualization

Part 1: Data Cleaning and Preparation

Objective

Clean a dataset by handling missing values, treating outliers, and standardizing data formats.

Dataset

The dataset includes columns such as ID, Name, Date_of_Birth, Salary, and Department.

Steps

Handling Missing Values:
- Filled missing IDs with unique values.
- Imputed missing dates of birth using a default date or placeholder.
- Replaced missing salary values with the mean salary.
Outlier Treatment:
- Identified and removed or adjusted salary outliers.
- Corrected negative salary values.
Standardization:
- Converted all date formats to a consistent YYYY-MM-DD format.
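
The steps above can be sketched with pandas as follows. This is a minimal illustration, not the contents of data_cleaning.py: the function names, the sample rows, the `1900-01-01` placeholder date, the absolute-value fix for negative salaries, and the 1.5 × IQR clipping rule are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

PLACEHOLDER_DOB = "1900-01-01"  # assumed placeholder for missing or unparseable dates


def to_iso_date(value):
    """Return value as YYYY-MM-DD, or the placeholder if it cannot be parsed."""
    parsed = pd.to_datetime(value, errors="coerce")
    return PLACEHOLDER_DOB if pd.isna(parsed) else parsed.strftime("%Y-%m-%d")


def clean_employees(df):
    """Apply the cleaning steps to a copy of df (hypothetical helper)."""
    df = df.copy()

    # Missing IDs: continue numbering after the largest existing ID.
    missing_id = df["ID"].isna()
    if missing_id.any():
        next_id = 1 if missing_id.all() else int(df["ID"].max()) + 1
        df.loc[missing_id, "ID"] = np.arange(next_id, next_id + missing_id.sum())
    df["ID"] = df["ID"].astype(int)

    # Negative salaries: treated here as sign errors (assumption).
    df["Salary"] = df["Salary"].abs()

    # Outliers: clip to the 1.5 * IQR fences (assumption; removal is the alternative).
    q1, q3 = df["Salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["Salary"] = df["Salary"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # Missing salaries: impute with the mean of the treated values.
    df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

    # Dates: parse each value individually, then write YYYY-MM-DD strings.
    df["Date_of_Birth"] = df["Date_of_Birth"].apply(to_iso_date)
    return df


# Made-up sample rows using the column names listed in the Dataset section.
raw = pd.DataFrame({
    "ID": [1, 2, np.nan, 4, 5, 6, 7, 8],
    "Name": ["Ann", "Bob", "Cy", "Di", "Ed", "Flo", "Gus", "Hal"],
    "Date_of_Birth": ["1990-03-15", "March 5, 1985", "1992/07/04", None,
                      "1988-12-01", "not a date", "1995-01-20", "1979-06-30"],
    "Salary": [50000, 52000, 48000, -51000, np.nan, 49000, 53000, 1000000],
    "Department": ["HR", "IT", "IT", "HR", "Ops", "Ops", "IT", "HR"],
})
cleaned = clean_employees(raw)
```

In this sketch the salary mean is computed after the negative values and outliers have been treated, so a single extreme salary does not skew the imputed value.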

Code

The data cleaning process is implemented in data_cleaning.py.

Part 2: Statistical Analysis

Objective

Perform linear regression analysis to predict house prices based on features like size, number of bedrooms, and location.

Dataset

The dataset includes columns such as Size, Bedrooms, Location, and Price.

Steps

Data Preparation:
- Encoded the Location column using one-hot encoding.
- Split the data into training and testing sets.
Model Training:
- Trained a linear regression model on the training set.
Model Evaluation:
- Evaluated the model using Mean Absolute Error (MAE) and R² score.
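
A minimal scikit-learn sketch of these steps is shown below. It is not the code in regression_analysis.py: the data is synthetic, and the location names, price formula, 80/20 split, and random seeds are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (all values are made up).
rng = np.random.default_rng(0)
n = 200
houses = pd.DataFrame({
    "Size": rng.uniform(50, 250, n),
    "Bedrooms": rng.integers(1, 6, n),
    "Location": rng.choice(["Downtown", "Suburb", "Rural"], n),
})
premium = houses["Location"].map({"Downtown": 80000, "Suburb": 30000, "Rural": 0})
houses["Price"] = (1500 * houses["Size"] + 10000 * houses["Bedrooms"]
                   + premium + rng.normal(0, 5000, n))

# One-hot encode Location; drop_first avoids a redundant (collinear) column.
X = pd.get_dummies(houses[["Size", "Bedrooms", "Location"]],
                   columns=["Location"], drop_first=True, dtype=float)
y = houses["Price"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit on the training set, then score on the held-out set.
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE: {mae:.0f}  R2: {r2:.3f}")
```

MAE is reported in the same units as Price, which makes it easier to interpret than R² alone.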

Model Saving

The trained model is saved using joblib and can be reloaded for future use.
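
Saving and reloading with joblib might look like the sketch below. The file name `house_price_model.joblib`, the temporary directory, and the tiny training set are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up training set so the example is self-contained.
X = np.array([[50.0], [100.0], [150.0], [200.0]])
y = np.array([100000.0, 200000.0, 300000.0, 400000.0])
model = LinearRegression().fit(X, y)

# Hypothetical file name; a temporary directory keeps the example side-effect free.
model_path = os.path.join(tempfile.mkdtemp(), "house_price_model.joblib")
joblib.dump(model, model_path)

# Reload the model and confirm it produces the same predictions.
restored = joblib.load(model_path)
same_predictions = np.allclose(model.predict(X), restored.predict(X))
```

A model saved this way should be reloaded with compatible versions of scikit-learn and joblib.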

Code

The regression analysis and model saving are implemented in regression_analysis.py.

Part 3: Data Visualization

Objective

Create visualizations to explore and present the dataset's insights.

Dashboard

Tools Used: Power BI
Visualizations Included:
- Total sales over time
- Sales breakdown by product category
- Top-performing sales regions

Interactive Visualization

Tools Used: Plotly
Dataset: Stock prices
Features:
- Interactive line chart with hover information and filters for dynamic data exploration.

Code

The data visualization scripts are included in data_visualization.py.

How to Use

Prerequisites

Python 3.7+
Git
Ensure all necessary Python packages are installed (pandas, numpy, scikit-learn, joblib, plotly).

Running the Scripts:
- Run data_cleaning.py for data cleaning tasks.
- Run regression_analysis.py for statistical analysis and model training.
- Run data_visualization.py to generate visualizations.
Dashboard:
- Open the dashboard in Power BI.

Python Project Setup

Clone the Repository:

```
git clone https://github.com/basel-ay/ds-project-template.git
```

Create and Activate a Virtual Environment:

```
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

Install Dependencies:
```
pip install -r requirements.txt
```
Run the Python Application:
```
python app.py
```
