This assessment is designed to give hands-on experience with the data science (DS) project workflow (missing values, outlier handling, standardization, visualization, APIs, etc.). It comprises two main components:
- Python Project: Missing values, outlier handling, standardization, visualization.
- Flask App: Provides RESTful APIs to interact with the dataset.
The assessment is divided into three parts:
- Data Cleaning and Preparation
- Statistical Analysis
- Data Visualization
Clean a dataset by handling missing values, treating outliers, and standardizing data formats.
- The dataset includes columns such as `ID`, `Name`, `Date_of_Birth`, `Salary`, and `Department`.
- Handling Missing Values:
  - Filled missing IDs with unique values.
  - Imputed missing dates of birth using a default date or placeholder.
  - Replaced missing salary values with the mean salary.
- Outlier Treatment:
  - Identified and removed or adjusted salary outliers.
  - Corrected negative salary values.
- Standardization:
  - Converted all date formats to a consistent `YYYY-MM-DD` format.
The data cleaning process is implemented in `data_cleaning.py`.
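The cleaning steps above could be sketched roughly as follows. This is not the code in `data_cleaning.py`; it is a minimal illustration under several assumptions: the function name `clean_employees` is hypothetical, negative salaries are treated as sign errors, outliers are clipped to the 1.5 × IQR fences rather than removed, and `1900-01-01` stands in for the unspecified placeholder date.

```python
import numpy as np
import pandas as pd


def clean_employees(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of an employee table (illustrative sketch)."""
    df = df.copy()

    # Fill missing IDs with values above the current maximum so they stay unique.
    missing_id = df["ID"].isna()
    if missing_id.any():
        start = int(df["ID"].max()) + 1 if df["ID"].notna().any() else 1
        df.loc[missing_id, "ID"] = np.arange(start, start + int(missing_id.sum()))
    df["ID"] = df["ID"].astype(int)

    # Parse each date individually so mixed input formats are tolerated.
    # Missing or unparseable values become NaT, then the placeholder date
    # (assumption: 1900-01-01), and everything is written as YYYY-MM-DD.
    parsed = df["Date_of_Birth"].apply(lambda v: pd.to_datetime(v, errors="coerce"))
    dob = pd.to_datetime(parsed).fillna(pd.Timestamp("1900-01-01"))
    df["Date_of_Birth"] = dob.dt.strftime("%Y-%m-%d")

    # Assumption: a negative salary is a sign error, so take the absolute value.
    salary = df["Salary"].astype(float).abs()

    # Replace missing salaries with the mean of the observed values.
    salary = salary.fillna(salary.mean())

    # Assumption: "adjust" outliers by clipping them to the 1.5 * IQR fences.
    q1, q3 = salary.quantile([0.25, 0.75])
    iqr = q3 - q1
    df["Salary"] = salary.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    return df
```

Because the mean is computed before clipping, extreme salaries still influence the imputed value; using the median, or clipping first, would be a reasonable alternative if that matters for the data at hand.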
Perform linear regression analysis to predict house prices based on features like size, number of bedrooms, and location.
- The dataset includes columns such as `Size`, `Bedrooms`, `Location`, and `Price`.
- Data Preparation:
  - Encoded the `Location` column using one-hot encoding.
  - Split the data into training and testing sets.
- Model Training:
  - Trained a linear regression model on the training set.
- Model Evaluation:
  - Evaluated the model using Mean Absolute Error (MAE) and R² score.
  - The trained model is saved using `joblib` and can be reloaded for future use.
The regression analysis and model saving are implemented in `regression_analysis.py`.
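As a rough illustration of the pipeline described above (one-hot encoding, train/test split, linear regression, MAE and R², saving with `joblib`), a sketch might look like the following. The function name `train_price_model`, the 80/20 split, the fixed `random_state`, and the default file name `house_price_model.joblib` are all assumptions made for the example, not details taken from `regression_analysis.py`.

```python
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def train_price_model(df: pd.DataFrame, model_path: str = "house_price_model.joblib"):
    """Train, evaluate, and save a linear regression on house prices (sketch)."""
    # One-hot encode the categorical Location column; numeric columns pass through.
    X = pd.get_dummies(df[["Size", "Bedrooms", "Location"]], columns=["Location"])
    y = df["Price"]

    # Hold out 20% of the rows for evaluation (assumed ratio).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LinearRegression().fit(X_train, y_train)

    # Evaluate on the held-out rows only.
    predictions = model.predict(X_test)
    metrics = {
        "mae": mean_absolute_error(y_test, predictions),
        "r2": r2_score(y_test, predictions),
    }

    # Persist the fitted model so it can be reloaded with joblib.load(model_path).
    joblib.dump(model, model_path)
    return model, metrics
```

When a saved model is reloaded with `joblib.load`, new data has to be encoded into the same one-hot columns, in the same order, as the training frame; otherwise the predictions will not line up with the fitted coefficients.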
Create visualizations to explore and present the dataset's insights.
- Tools Used: Power BI
- Visualizations Included:
  - Total sales over time
  - Sales breakdown by product category
  - Top-performing sales regions
- Tools Used: Plotly
- Dataset: stock prices
- Features:
  - Interactive line chart with hover information and filters for dynamic data exploration.
The data visualization scripts are included in `data_visualization.py`.
- Python 3.7+
- Git
- Ensure all necessary Python packages are installed (`pandas`, `numpy`, `scikit-learn`, `joblib`, `plotly`).
- Running the Scripts:
  - Run `data_cleaning.py` for data cleaning tasks.
  - Run `regression_analysis.py` for statistical analysis and model training.
  - Run `data_visualization.py` to generate visualizations.
- Dashboard:
  - Access the dashboard through the specified data visualization tool.
- Clone the Repository:
  git clone https://github.com/basel-ay/ds-project-template.git
- Create and Activate a Virtual Environment:
  python -m venv venv
  source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install Dependencies:
  pip install -r requirements.txt
- Run the Python Application:
  python app.py
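This README does not list the endpoints exposed by the Flask app, so the following is only a guess at the general shape of a RESTful API over a tabular dataset. The routes `/api/records` and `/api/records/<id>`, the `create_app` factory, and the use of an `ID` column as the lookup key are all hypothetical and should not be read as the actual contents of `app.py`.

```python
import json

import pandas as pd
from flask import Flask, jsonify


def create_app(df: pd.DataFrame) -> Flask:
    """Build a small read-only API over a DataFrame (hypothetical routes)."""
    app = Flask(__name__)

    def to_records(frame: pd.DataFrame):
        # Round-trip through to_json so NumPy types become plain JSON values.
        return json.loads(frame.to_json(orient="records"))

    @app.route("/api/records", methods=["GET"])
    def list_records():
        # Return every row of the dataset as a JSON array.
        return jsonify(to_records(df))

    @app.route("/api/records/<int:record_id>", methods=["GET"])
    def get_record(record_id):
        # Look up a single row by its ID column; 404 if there is no match.
        match = df[df["ID"] == record_id]
        if match.empty:
            return jsonify({"error": "record not found"}), 404
        return jsonify(to_records(match)[0])

    return app
```

With an app built this way, `create_app(df).run(debug=True)` would start a local development server, and `create_app(df).test_client()` can be used to exercise the routes without one.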