/Job_and_Health_Analysis

Study of the relation between Reasons for Leaving a Job and Health Characteristics

Overview of the analysis:

This project assesses the reasons people leave their jobs and whether those reasons are related to different types of health characteristics.
Our initial objective was to study the health effects of the recent layoffs across many industries, especially IT. Because of limitations in the databases we researched, we changed the initial idea and work instead with all the reasons for leaving a job.

Description of the data:

To carry out this project, we used tables available on the Statistics Canada website.
The tables used are:

job_df

heath_df

The images below show the data in its raw form and identify which cleaning steps are needed before modeling.

job_dic health_dic

To build our models, we merge the two tables above on Year, Province, Sex and Age_group, which gives the table below.

final_dic
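The merge described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the rows, the `Value` column, the age-group labels and the choice of an inner join are all assumptions; only the four key names and the `J_L_R_Permanent_layoff` and `Indicators_Health` names come from this README.

```python
import pandas as pd

# Hypothetical, tiny stand-ins for the cleaned job and health tables.
job_df = pd.DataFrame({
    "Year": [2017, 2017, 2018],
    "Province": ["Ontario", "Quebec", "Ontario"],
    "Sex": ["Females", "Males", "Females"],
    "Age_group": ["15 to 29", "30 to 49", "15 to 29"],
    "J_L_R_Permanent_layoff": [10.0, 12.5, 9.0],
})
health_df = pd.DataFrame({
    "Year": [2017, 2017, 2019],
    "Province": ["Ontario", "Quebec", "Ontario"],
    "Sex": ["Females", "Males", "Females"],
    "Age_group": ["15 to 29", "30 to 49", "15 to 29"],
    "Indicators_Health": ["Diabetes", "Asthma", "Diabetes"],
    "Value": [1.2, 3.4, 1.1],  # assumed column name
})

# Merge on the four shared keys; an inner join (an assumption) keeps only
# the groups that appear in both tables.
keys = ["Year", "Province", "Sex", "Age_group"]
healthXjob_df = job_df.merge(health_df, on=keys, how="inner")
print(healthXjob_df)
```

With an inner join, rows whose key combination exists in only one table (2018 and 2019 in this toy data) are dropped, which is one way the usable range of years can end up narrower than either source table.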

As previously mentioned, because of the limitations of the available data, our analysis covers the years 2017 to 2021 (5 years).
Our final dataframe is:

healthXjob_df

Questions to answer with the project:

Based on the data presented above, the questions we intend to answer in this project are:

  • What are the main reasons why employees tend to leave their current job?
  • Is there any relation between reasons for leaving employment and health characteristics? If such a relation exists, what are the major correlations?
  • Do province, gender, and age group influence any specific health characteristics?

Results:

In the file LogisticRegression.ipynb we build a correlation matrix from a table in which each indicator becomes a column, so that we can analyze the correlation between the variables. The image below shows that none of the job variables (J_L_R...) has a strong correlation with the health variables.

correlation

The second heatmap below shows the correlations as percentages and confirms that no health variable has a correlation above 0.80 (80%) with any job variable.

correlationper
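As a rough sketch of the correlation step, the snippet below pivots a long table so that each indicator becomes a column, computes the correlation matrix, and looks for job-versus-health pairs above 0.80. The data, the `Indicator` and `Value` column names and the `H_Diabetes` name are hypothetical; only the `J_L_R` prefix and the 0.80 threshold come from the text.

```python
import pandas as pd

# Hypothetical long-format table: one row per year and indicator.
long_df = pd.DataFrame({
    "Year": [2017, 2017, 2018, 2018, 2019, 2019, 2020, 2020],
    "Indicator": ["J_L_R_Permanent_layoff", "H_Diabetes"] * 4,
    "Value": [10.0, 5.0, 12.0, 4.0, 11.0, 6.0, 20.0, 5.0],
})

# Turn each indicator into its own column (one row per Year here).
wide_df = long_df.pivot_table(index="Year", columns="Indicator", values="Value")

# Pearson correlation between every pair of columns.
corr = wide_df.corr()

# Split the columns into job variables and health variables by prefix.
job_cols = [c for c in corr.columns if c.startswith("J_L_R")]
health_cols = [c for c in corr.columns if c not in job_cols]

# Keep only the job-vs-health block and list pairs with |r| > 0.80.
cross = corr.loc[job_cols, health_cols]
strong_pairs = [
    (j, h, cross.loc[j, h])
    for j in job_cols
    for h in health_cols
    if abs(cross.loc[j, h]) > 0.80
]
print(cross)
print("Pairs above 0.80:", strong_pairs)
```

Restricting the check to the job-versus-health block avoids counting strong correlations among the job variables themselves, or among the health variables themselves, as evidence for the question being asked.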

Unsupervised Model

In the file Final_Project_Model.ipynb we start with an unsupervised model because the data has no target variable.
We use PCA to reduce the data to 3 principal components and, from these components, build an elbow curve to find the best value for K.

elbowcurve
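A minimal sketch of the PCA and elbow-curve steps with scikit-learn is shown below. The random feature matrix stands in for the encoded dataframe, and the scaling step and the range of K values are assumptions rather than details taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the encoded features (200 rows, 8 columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))

# Scale the features (an assumption), then reduce them to 3 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3, random_state=0)
X_pca = pca.fit_transform(X_scaled)

# Fit K-Means for a range of K values and record the inertia of each fit.
k_values = list(range(1, 11))
inertia = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X_pca)
    inertia.append(km.inertia_)

# Plotting inertia against k gives the elbow curve.
for k, value in zip(k_values, inertia):
    print(k, round(value, 2))
```

The "elbow" is the value of K after which adding more clusters reduces the inertia only slightly.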

For the K-Means model we chose K=5, which yields 5 predicted cluster classes. The 3D scatter plot below shows the PCA data and the clusters.

clustergraph
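Assigning the cluster classes could look like the sketch below. The random `pca_df` and its column names `PC1`, `PC2` and `PC3` are hypothetical placeholders for the real PCA output; only K=5 and the `Class` column name come from this README.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical PCA output: 150 rows, 3 principal components.
rng = np.random.default_rng(1)
pca_df = pd.DataFrame(rng.normal(size=(150, 3)), columns=["PC1", "PC2", "PC3"])

# Fit K-Means with K=5 and store the predicted cluster as the Class column.
model = KMeans(n_clusters=5, n_init=10, random_state=0)
pca_df["Class"] = model.fit_predict(pca_df[["PC1", "PC2", "PC3"]])

# Number of rows assigned to each class.
print(pca_df["Class"].value_counts().sort_index())
```

The resulting `Class` column can then be used to color a 3D scatter of the three components.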

With these classes, we create some hvplot scatter plots to check whether the classes are biased toward any of the main variables we want to evaluate.

Analysis Indicators_Health x J_L_R_Permanent_layoff

healthxPerman

Analysis Province x J_L_R_Permanent_layoff

ProvincexPerman ProvinceData

Analysis Sex x J_L_R_Permanent_layoff

sexxPeman sexData

Supervised Model

Logistic Regression

After creating the variable Class from the predicted clusters, we use it as the target variable to check how accurately a supervised model can classify the data into those classes.

Confusion Matrix:
LogisticRegressio_confusion
LR_accuracy
Classification Report:
LR_Class_report
Lr_confusion
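The logistic regression step can be sketched as follows, using the K-Means labels as the target. The synthetic features, the train/test split settings and `max_iter` are assumptions, and the scores obtained on this toy data say nothing about the results reported above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical features; the target is the cluster label, as in the text.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))
y = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Hold out 25% of the rows (the default), keeping the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy, confusion matrix (rows = true class, columns = predicted class)
# and the per-class precision/recall report.
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2, 3, 4])
report = classification_report(y_test, y_pred, zero_division=0)

print("Accuracy:", acc)
print(cm)
print(report)
```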

Random Forest Classifier

As the logistic regression model did not show a good result, we modeled the data using the Random Forest Classifier model.
rf_accuracy
Confusion Matrix:
rf_confusion
Classification Report:
rf_class_report

As we can see in the images above, this model reproduces the clustering classes very closely.

Below we list the variables with the greatest weight for the model in order of importance.

variables
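A ranking like this can be obtained from a fitted Random Forest through its `feature_importances_` attribute. The sketch below uses hypothetical feature names and a toy target that depends only on `feat_a`, so that the ranking is easy to verify; it is not the project's data or its tuned model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical features; the toy target depends only on feat_a.
rng = np.random.default_rng(3)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=feature_names)
y = (X["feat_a"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
acc = accuracy_score(y_test, rf.predict(X_test))

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print("Accuracy:", acc)
print(importances)
```

Impurity-based importances describe what the forest relied on to reproduce the target; they do not by themselves show a causal effect of a variable.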

Tableau Analysis

After the model was finalized, we used Tableau software as a tool to visualize our findings.

In the figure below we analyze the health indicators and the number of people who reported each one. The classification does not differentiate between health indicators, so in a future prediction we will not be able to tell which health indicator affects an individual just by looking at the predicted class.

Indicators_HealthbyClass

In this other dashboard, we analyze the average number of layoffs (Permanent Layoff and Temporary Layoff) over the years of the analysis, broken down by age group.
LayoffsbyAgeGroup

The first thing we can see is that layoffs peaked in 2020, followed by 2021, both years affected by the pandemic.
We can also see that those most affected by the layoffs were professionals aged 30 to 50, followed by professionals aged 15 to 29. In these cases the model managed to separate the age groups: classes 0 and 4 correspond to professionals under 50 years old and class 2 to those over 50 years old.

In the dashboard below, we analyze the classification by province, age group and gender.
ProvincebyAge_Sex_Class

We can check how the model classifies each of the groups:

  • Class 0: age groups 1 and 2 only, and only 3 provinces
  • Class 1: men only, all age groups, excluding the provinces of Ontario and Quebec
  • Class 2: age groups 3 and 4 (seniors), and only 3 provinces
  • Class 3: women only, all age groups, excluding the province of Ontario
  • Class 4: age groups 1 and 2 (young adults), provinces of Ontario and Quebec only
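Outside Tableau, the same kind of class profile can be checked with a cross-tabulation in pandas. The rows below are made up to mirror the pattern described above and are not the project's data; only the `Class`, `Sex` and `Province` column names follow this README.

```python
import pandas as pd

# Made-up rows that mimic the described pattern (not real results).
df = pd.DataFrame({
    "Class": [0, 0, 1, 1, 3, 3, 4, 4],
    "Sex": ["Males", "Females", "Males", "Males",
            "Females", "Females", "Males", "Females"],
    "Province": ["Alberta", "Alberta", "Manitoba", "Alberta",
                 "Quebec", "Manitoba", "Ontario", "Quebec"],
})

# Count rows per class and category; a zero means the class never contains it.
sex_by_class = pd.crosstab(df["Class"], df["Sex"])
province_by_class = pd.crosstab(df["Class"], df["Province"])

print(sex_by_class)
print(province_by_class)
```

A zero in a cell (for example, no `Females` in class 1 of this toy table) is what a statement such as "men only" corresponds to.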

Summary

Analyzing all the data above, we can see that our model, despite its good accuracy in predicting the class, cannot explain any specific variable. This may be related to our finding that the values of the health indicators are not correlated with the values of the reasons for leaving a job. Therefore, we cannot identify the main reasons that lead people to leave work.
The data cover 3 years before the start of the pandemic and 2 years after it began, so they may be biased by the many people who lost their jobs in the last 2 years of the analysis. In addition, health problems may only appear some time after people leave their jobs, so 2 years may be too short a period for the model to capture them without this pandemic bias.
As improvements for a future model, we would analyze the most recent data (2022), a year with many changes in the employment situation that we believe holds important information. Another improvement would be to select only the health indicators with significant values that really make sense for the analysis. Finally, we would analyze the values as percentages of the population, since the provinces have very different populations.