Data_Scientists_Salary_Prediction

A project to predict the salaries of data scientist jobs across America using data collected through web scraping.


Data Scientists Salary Prediction: Project Overview

  • Created a tool that predicts data scientist salaries (MAE ~ $11K) to help candidates negotiate job offers.
  • Used Selenium to scrape over 1,000 job postings from Glassdoor.
  • Engineered features from the Job Description text to flag in-demand skills (e.g. Python, AWS, Spark).
  • Tried Linear Regression, Lasso, and Random Forest models and optimized them with GridSearchCV.
  • Built a client-facing API using Flask.

Resources

Code and Packages

  • Python Version: 3.9.6
  • Packages: Selenium, Pandas, scikit-learn, NumPy, Flask, Requests, Pickle, Matplotlib
  • For web framework requirements: in the FlaskAPI directory, run pip install -r requirements.txt

Web Scraping

I modified the GitHub repo (above), since it wasn't up to date, to scrape over 1,000 jobs. For each job, I collected the following information (a sketch of the scraping loop follows the list):

  • Job title
  • Salary Estimate
  • Job Description
  • Rating
  • Company
  • Location
  • Company Headquarters
  • Company Size
  • Company Founded Date
  • Type of Ownership
  • Industry
  • Sector
  • Revenue
  • Competitors
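
A minimal sketch of what the Selenium loop might look like is below. The CSS selectors, the scrape_jobs helper name, and the pagination handling are all illustrative assumptions, not the project's actual code: Glassdoor changes its markup frequently, so the real scraper's selectors differ, and error handling plus the company-detail fields are omitted for brevity.

```python
# Illustrative sketch of the scraping loop; the selectors are placeholders, not the real ones.
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_jobs(keyword="data scientist", num_jobs=1000, pause=3.0):
    driver = webdriver.Chrome()
    driver.get(f"https://www.glassdoor.com/Job/jobs.htm?sc.keyword={keyword}")
    rows = []
    while len(rows) < num_jobs:
        time.sleep(pause)  # give the listings time to render
        for card in driver.find_elements(By.CSS_SELECTOR, "li[data-test='jobListing']"):
            card.click()
            time.sleep(pause)
            rows.append({
                "Job Title": driver.find_element(By.CSS_SELECTOR, "[data-test='job-title']").text,
                "Salary Estimate": driver.find_element(By.CSS_SELECTOR, "[data-test='detailSalary']").text,
                "Job Description": driver.find_element(By.CSS_SELECTOR, "[class*='jobDescription']").text,
                "Rating": driver.find_element(By.CSS_SELECTOR, "[data-test='detailRating']").text,
                "Company": driver.find_element(By.CSS_SELECTOR, "[data-test='employer-name']").text,
                "Location": driver.find_element(By.CSS_SELECTOR, "[data-test='emp-location']").text,
            })
            if len(rows) >= num_jobs:
                break
        driver.find_element(By.CSS_SELECTOR, "[data-test='pagination-next']").click()  # next results page
    driver.quit()
    return pd.DataFrame(rows)
```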

Data Cleaning

After scraping, the data needed to be cleaned before it could be used for modeling. I made the following changes and created the following variables (a condensed pandas sketch follows the list):

  • Dropped the rows with no job salaries
  • Parsed the salary by removing "Glassdoor est.", the dollar sign, and the letter 'K'.
  • Extracted per_hour and employer_provided flag columns from the Salary Estimate column
  • Extracted the min_salary, max_salary, and avg_salary, which will be our target column.
  • Parsed Company Name to remove the appended rating, into a new company_txt column
  • Parsed Location down to the two-letter state code, stored in a new job_state column
  • Checked whether the Location is the same as the Headquarters, to a new same_state column
  • Extracted an age column (years since the company was founded) from the Founded column
  • Made columns for if different skills were listed in the Job Description:
    • Python
    • Tableau
    • Excel
    • AWS
    • Spark
  • Columns for simplified job title and seniority
  • Column for description length
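
A condensed sketch of these transformations is below. The raw file name (glassdoor_jobs.csv), the exact salary string formats, and the column names are assumptions based on the field list above; the real cleaning script may differ in detail.

```python
# Condensed sketch of the cleaning steps; file name and exact salary formats are assumptions.
import pandas as pd

df = pd.read_csv("glassdoor_jobs.csv")

# Drop rows with no salary estimate (assuming missing values are stored as "-1")
df = df[df["Salary Estimate"] != "-1"]

# Flags for hourly and employer-provided salaries
df["per_hour"] = df["Salary Estimate"].str.contains("per hour", case=False).astype(int)
df["employer_provided"] = df["Salary Estimate"].str.contains("employer provided", case=False).astype(int)

# Strip "(Glassdoor est.)", "$" and "K", then split the range into min/max/avg
salary = (df["Salary Estimate"].str.split("(").str[0].str.lower()
          .str.replace("$", "", regex=False).str.replace("k", "", regex=False)
          .str.replace("per hour", "").str.replace("employer provided salary:", ""))
df["min_salary"] = salary.str.split("-").str[0].astype(float)
df["max_salary"] = salary.str.split("-").str[1].astype(float)
df["avg_salary"] = (df["min_salary"] + df["max_salary"]) / 2

# Company name without the appended rating, state code, same-state flag, company age
df["company_txt"] = df["Company"].astype(str).str.split("\n").str[0]
df["job_state"] = df["Location"].str.split(",").str[-1].str.strip()
df["same_state"] = (df["Location"] == df["Headquarters"]).astype(int)
df["age"] = df["Founded"].apply(lambda y: y if y < 1 else pd.Timestamp.now().year - y)

# Skill flags and description length from the Job Description text
for skill in ["python", "tableau", "excel", "aws", "spark"]:
    df[skill] = df["Job Description"].str.lower().str.contains(skill).astype(int)
df["desc_len"] = df["Job Description"].str.len()
```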

Exploratory Data Analysis

I looked at the distributions of the numerical data and the value counts of the categorical variables to get a sense of how the data are organized and which models might be suitable; a few of the summaries I ran are sketched below.

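Continuing from the cleaning sketch above (column names are still assumptions), these are the kinds of quick summaries involved:

```python
# Salary distribution plus value counts and a pivot of mean salary by state.
import matplotlib.pyplot as plt

df["avg_salary"].hist(bins=30)
plt.xlabel("avg_salary ($1,000s)")
plt.ylabel("count")
plt.show()

print(df["job_state"].value_counts().head(10))          # where most postings are located
print(df.pivot_table(index="job_state", values="avg_salary")
        .sort_values("avg_salary", ascending=False).head(10))  # mean salary by state
```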

Model Building

  • I selected the 22 most relevant features and then created dummy columns for the categorical features.
  • Note: when pulling out the target (y) column, I converted it into a 1-D array, as scikit-learn estimators expect.
  • I split the data into train and test sets with test_size=0.2.
  • I used statsmodels OLS to get information on the relevance of the individual columns. I then tried three different models (a condensed sketch of the pipeline follows the list):
  • Linear Regression – Baseline for the model
  • Lasso Regression – Because of the sparse data from the many categorical variables, I thought a regularized regression like Lasso would be effective.
  • Random Forest – Again, with the sparsity associated with the data, I thought that this would be a good fit.
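
A condensed sketch of this pipeline is below. The cleaned-data file name (eda_data.csv), the abbreviated feature list, and the hyperparameter grid are illustrative assumptions rather than the project's exact choices.

```python
# Sketch of the model-building step; feature list, file name, and grid are illustrative.
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("eda_data.csv")
features = ["avg_salary", "Rating", "Size", "Type of Ownership", "Industry", "Sector",
            "Revenue", "job_state", "same_state", "age", "python", "spark", "aws",
            "excel", "job_simp", "seniority", "desc_len"]
df_dum = pd.get_dummies(df[features])              # dummy columns for the categorical features
X = df_dum.drop("avg_salary", axis=1)
y = df_dum["avg_salary"].values                    # 1-D array, as scikit-learn expects

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Statsmodels OLS summary for a look at individual feature significance
print(sm.OLS(y, sm.add_constant(X.astype(float))).fit().summary())

# Baseline and regularized linear models, scored with (negative) mean absolute error
lm = LinearRegression()
print(cross_val_score(lm, X_train, y_train, scoring="neg_mean_absolute_error", cv=3).mean())
lasso = Lasso(alpha=0.1)
print(cross_val_score(lasso, X_train, y_train, scoring="neg_mean_absolute_error", cv=3).mean())

# Random forest tuned with GridSearchCV
rf = RandomForestRegressor(random_state=42)
params = {"n_estimators": range(10, 300, 50), "max_features": ["sqrt", "log2", None]}
gs = GridSearchCV(rf, params, scoring="neg_mean_absolute_error", cv=3)
gs.fit(X_train, y_train)
print(-gs.best_score_)
```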

Model performance

The Random Forest model far outperformed the other approaches on the test and validation sets.

MAE: mean absolute error (salary values are in $1,000s, so an MAE of 11.22 corresponds to roughly $11K)

  • Random Forest : MAE = 11.22
  • Linear Regression: MAE = 18.86
  • Lasso Regression: MAE = 19.67
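
Continuing the sketch from Model Building, the test-set comparison behind numbers like these would look roughly as follows:

```python
# Compare the fitted models on the held-out test set using mean absolute error.
from sklearn.metrics import mean_absolute_error

tpred_lm = lm.fit(X_train, y_train).predict(X_test)
tpred_lasso = lasso.fit(X_train, y_train).predict(X_test)
tpred_rf = gs.best_estimator_.predict(X_test)

for name, pred in [("Linear Regression", tpred_lm), ("Lasso", tpred_lasso), ("Random Forest", tpred_rf)]:
    print(name, mean_absolute_error(y_test, pred))
```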

Productionization

In this step, I built a Flask API endpoint hosted on a local web server. The endpoint takes in a request with a list of values from a job listing and returns an estimated salary. See the example file: API_request_example.ipynb
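
A minimal sketch of such an endpoint is below; the pickle path (models/model_file.p) and the /predict route are assumptions, and the real implementation lives in the FlaskAPI directory.

```python
# Minimal sketch of the client-facing Flask API; pickle path and route name are assumptions.
import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("models/model_file.p", "rb") as f:
    model = pickle.load(f)        # trained regressor saved during model building

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"input": [v1, v2, ...]} with values ordered like the training columns
    values = request.get_json()["input"]
    x = np.array(values).reshape(1, -1)
    estimate = float(model.predict(x)[0])
    return jsonify({"estimated_salary": estimate})

if __name__ == "__main__":
    app.run(debug=True)
```

The example notebook then only needs to POST such a JSON body with the Requests package and read the estimated_salary field from the response.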