Project 4: Web Scraping Job Postings

Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal wants you to

  • determine the industry factors that are most important in predicting the salary amounts for these data.

To limit the scope, your principal has suggested that you focus on data-related job postings, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by limiting your search to a single region.

Hint: Aggregators like Indeed.com regularly pool job postings from a variety of markets and industries.

Goal: Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer this question.
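As a rough illustration of the scraping step, the sketch below pulls title, location, and salary out of a small HTML snippet using only Python's standard library. The markup, the class names (`job`, `title`, `location`, `salary`), and the `JobParser` helper are all hypothetical; a real aggregator page will use different tags and classes, so inspect the page source (and the site's terms of use) before adapting it.

```python
from html.parser import HTMLParser

# Hypothetical markup: real aggregator pages use different tags and classes.
SAMPLE_HTML = """
<div class="job">
  <h2 class="title">Senior Data Scientist</h2>
  <span class="location">Austin, TX</span>
  <span class="salary">$120,000 a year</span>
</div>
<div class="job">
  <h2 class="title">Data Analyst</h2>
  <span class="location">Boston, MA</span>
</div>
"""

FIELDS = ("title", "location", "salary")


class JobParser(HTMLParser):
    """Collect one dict per <div class="job"> block (assumed structure)."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "job":
            # Start a new posting; fields that never appear stay None.
            self.jobs.append(dict.fromkeys(FIELDS))
        elif css_class in FIELDS and self.jobs:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self.jobs[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        # Stop capturing text once the current element closes.
        self._field = None


parser = JobParser()
parser.feed(SAMPLE_HTML)
jobs = parser.jobs
```

Here `jobs` is a list of dictionaries, one per posting, with `None` wherever a field (typically the salary) was missing from the page.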


Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the question described above.
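Part of the cleaning step is turning free-text salary strings into numbers. The sketch below is one possible approach, assuming yearly salaries written like "$80,000 - $100,000 a year"; the string format and the `parse_annual_salary` helper are assumptions for illustration, and real postings will need more cases (hourly rates, "up to" amounts, other currencies).

```python
import re


def parse_annual_salary(text):
    """Return the midpoint of a yearly salary string, or None if absent.

    Assumes strings such as "$80,000 - $100,000 a year" or "$95,000 a year".
    Hourly or monthly figures are skipped rather than converted.
    """
    if not text or "year" not in text.lower():
        return None
    # Grab every dollar amount, dropping the thousands separators.
    amounts = [
        float(match.replace(",", ""))
        for match in re.findall(r"\$([\d,]+(?:\.\d+)?)", text)
    ]
    if not amounts:
        return None
    # A single amount is returned as-is; a range is averaged.
    return sum(amounts) / len(amounts)
```

Returning `None` for missing or non-yearly salaries keeps those listings easy to filter out or set aside for later prediction.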

Factors that impact salary

To predict salary, the most appropriate approach would be a regression model. Here, instead, we only want to estimate which factors (like location, job title, job level, industry sector) lead to a high or low salary, so we work with a classification model. To do so, split the salaries into two groups, high and low, for example by using the median salary as the threshold (in principle you could choose any single or multiple split points).
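The median split described above can be sketched in a few lines. The salary values are made up for illustration, and `label_high_salary` is a hypothetical helper name; salaries exactly equal to the median are placed in the low group here, which is a choice rather than a requirement.

```python
from statistics import median


def label_high_salary(salaries):
    """Return (threshold, labels) where 1 = above the median, 0 = otherwise."""
    threshold = median(salaries)
    labels = [1 if salary > threshold else 0 for salary in salaries]
    return threshold, labels


# Made-up salaries, for illustration only.
example_salaries = [60000, 75000, 90000, 110000, 150000]
threshold, labels = label_high_salary(example_salaries)
```

Using the median keeps the two classes roughly balanced, which makes plain accuracy easier to interpret than it would be with a lopsided split.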

Use all the skills you have learned so far to build a predictive model. Whatever you decide to use, the most important thing is to justify your choices and interpret your results. Communication of your process is key. Note that most listings DO NOT come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

Directions:

  • Start by ONLY using the location as a feature.
  • Use at least two different classifiers you find suitable.
  • Remember that scaling your features might be necessary.
  • Display the coefficients/feature importances and write a short summary of what they mean.
  • Create a few new variables in your dataframe to represent interesting features of a job title (e.g. whether 'Senior' or 'Manager' is in the title).
  • Incorporate other text features from the title or summary that you believe will predict the salary.
  • Then build new classification models including also those features. Do they add any value?
  • Tune your models by testing parameter ranges, regularization strengths, etc. Discuss how that affects your models.
  • Discuss model coefficients or feature importances as applicable.
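The title-based variables mentioned in the directions above could be built along these lines. This is a minimal sketch: the keyword list and the `title_features` helper are illustrative assumptions, and abbreviations such as "Sr." would need extra handling.

```python
import re

# Illustrative keywords; extend this with whatever your data suggests.
KEYWORDS = ("senior", "manager", "analyst")


def title_features(title):
    """Map a job title to 0/1 indicator features, one per keyword."""
    # Whole-word matching so "manager" does not fire on unrelated substrings.
    words = set(re.findall(r"[a-z]+", title.lower()))
    return {f"title_has_{keyword}": int(keyword in words) for keyword in KEYWORDS}


features = title_features("Senior Data Scientist")
```

Each returned dictionary can become one row of new columns in your dataframe, ready to be added alongside the location feature.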

Model evaluation:

Your boss would rather incorrectly tell a client that a job pays a low salary than incorrectly tell them that it pays a high salary. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs.

  • Use cross-validation to evaluate your models.
  • Evaluate the accuracy, AUC, precision and recall of the models.
  • Plot the ROC and precision-recall curves for at least one of your models.
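One way to respect the boss's preference is to raise the probability threshold at which a posting is labeled "high salary", trading recall for precision on that class. The sketch below computes both metrics by hand at two thresholds; the labels and predicted probabilities are made up, and `precision_recall` is a hypothetical helper rather than a library function.

```python
def precision_recall(y_true, y_prob, threshold):
    """Precision and recall for the 'high salary' class (label 1)."""
    y_pred = [1 if prob >= threshold else 0 for prob in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for true, pred in pairs if true == 1 and pred == 1)
    fp = sum(1 for true, pred in pairs if true == 0 and pred == 1)
    fn = sum(1 for true, pred in pairs if true == 1 and pred == 0)
    # Report 0.0 when a metric is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels and model probabilities, for illustration only.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.65, 0.6, 0.55, 0.4, 0.3, 0.1]

default_metrics = precision_recall(y_true, y_prob, 0.5)
strict_metrics = precision_recall(y_true, y_prob, 0.7)
```

On this toy data, moving the threshold from 0.5 to 0.7 removes the one false "high" prediction but misses half of the truly high-salary postings, which is the tradeoff the precision-recall curve makes visible.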

Bonus:

  • Answer the salary discussion by using your model to explain the tradeoffs between detecting high versus low salary positions.
  • Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.
  • Obtain the ROC/precision-recall curves for the different models you studied (at least the tuned model of each category) and compare.

Summarize your results in an executive summary written for a non-technical audience.

  • Writeups should be 500 to 1000 words long and should define any technical terms, explain your approach, and cover any risks and limitations.

BONUS

Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.


Suggestions for Getting Started

  1. Collect data from Indeed.com (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
    • Select and parse data from at least 1000 job postings, potentially from multiple location searches.
  2. Find out what factors most directly impact salaries (e.g. title, location, department, etc.).
    • Test, validate, and describe your models. What factors predict salary category? How do your models perform?
  3. Discover which features have the greatest importance when determining a low versus high paying job.
    • Your boss is interested in which overall features hold the greatest significance.
    • HR is interested in which skills and keywords hold the greatest significance.
  4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
  5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low versus high salary positions.

Useful Resources