An Exploratory Data Analysis (EDA) data science project on the joblistings scraped from the Joblisting Webscraper project. Check out the app here: https://joblisting-cleaning-and-eda.streamlit.app/

Joblisting Cleaning & Exploratory Data Analysis (EDA)

A cleaning and Exploratory Data Analysis (EDA) data science project on the data science joblistings scraped from the Joblisting Webscraper project!

Motivation

My motivation for this project was three-pronged. Firstly, I wanted to apply what I have learned by performing exploratory data analysis. Secondly, I wanted to venture into the unknown! I wanted to dive deeper into EDA (which I had never really done before in my collection of Learning ML Projects). Lastly, the EDA project can provide a comprehensive exploration of job market demographics for data scientist job offerings. :)

Structure


Figure 1. Data science lifecycle.

This project is one step in a larger series of projects! To check out the other projects in this series:

  1. Joblisting-Webscraper
  2. Joblisting-Cleaning-EDA
  3. Joblisting-Modeling

About the structure of this repo:

  • csv stores the CSVs I generated for this project
  • diagrams stores my diagrams
  • img stores images from EDA, auto-EDA, and the banner for my app
  • input stores the dataset I scraped
  • pages stores the subpage for my app
  • sheets stores the spreadsheet I use to organize my EDA process
  • 1_📚_EDA_Report.py is the main page of my app
  • banner.ipynb is a short notebook with code that generated the images I used for my banner
  • cleaning.ipynb is my cleaning notebook and pipeline
  • eda.ipynb is my EDA notebook

Note: the package versions listed in requirements.txt may not exactly match the versions I used in my code. The exact versions matter less here, though: every library I used is listed.

Dataset

A little about the dataset: the data was webscraped from Glassdoor.com's job listings for data science jobs. I used my own webscraper for it! That can be found here: https://github.com/alckasoc/Joblisting-Webscraper. The dataset is small and can be found in this repo under input. As an alternative, I've also stored this on Kaggle publicly: https://www.kaggle.com/datasets/vincenttu/glassdoor-joblisting.

Difficulties

  • I struggled with structuring this project! There were so many things to include or think about that I spent a good deal of time on the infrastructure of my project. One example was figuring out how to structure the main dataframes and the edit log for convenience and readability. After a long mental discussion, I settled on a design (specified in the project).
  • One difficulty (as I am working on the project now) is the interpretation! I've interpreted different transformations on a df before, but as I write these chains of complex functions, I realize that interpretation quickly grows more complex! I've spent countless hours interpreting and stepping through transformations.
  • Another difficulty, which I encountered in data wrangling and cleaning, was deciding how to impute the data and how to interpret and create thresholds for keeping or deleting features and rows. Because there are many imputation methods and each one has to be considered in context, data cleaning took a bit of time!
  • I also had a number of technical difficulties with setting up code for my EDA and app, though these mostly came down to my inexperience. After a few hours of scratching my head and reading Stack Overflow posts, I eventually solved all of these technical difficulties!
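
To make the thresholding and imputation difficulty above more concrete, here is a minimal sketch of one possible approach, not the actual pipeline in cleaning.ipynb. The toy dataframe, its column names, the 0.5 missing-value threshold, and the helpers drop_sparse_columns and impute are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative, not the real dataset's.
df = pd.DataFrame({
    "salary": [100.0, np.nan, 120.0, 90.0, np.nan],
    "rating": [4.1, 3.8, np.nan, 4.5, 3.9],
    "founded": [np.nan, np.nan, np.nan, 1999.0, np.nan],
    "sector": ["Tech", None, "Finance", "Tech", "Tech"],
})


def drop_sparse_columns(frame, max_missing=0.5):
    """Drop columns whose fraction of missing values exceeds max_missing."""
    missing_fraction = frame.isna().mean()
    keep = missing_fraction[missing_fraction <= max_missing].index
    return frame[keep]


def impute(frame):
    """Fill numeric columns with the median and other columns with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# "founded" is 80% missing, so it is dropped before the rest is imputed.
cleaned = impute(drop_sparse_columns(df, max_missing=0.5))
print(cleaned)
```

Median and mode are only two of many imputation choices; which one fits depends on the feature and on how much data is missing, which is exactly why this step took time.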

What I Learned

  • Imbalanced Learn
    • Imbalanced-learn, whose name plays off of scikit-learn's, is a great library for tackling over/under sampling problems!
  • Pandas profiling/sweetviz/autoviz/dtale
    • Pandas profiling and the like are comprehensive tools for automatically analyzing dataframes (summary statistics and graphs).
  • Plotly
    • A high-level, visually stunning, interactive graphing library.
    • I learned a good chunk of plotly in order to make the heatmaps you see in my EDA!
  • SciPy
    • A library for scientific computation.
    • I learned a bit of inferential statistics through scipy in my EDA.
  • EDA structure
    • As messy as EDA can be, there needs to be some imposed structure for people to follow your line of thought!
  • Google Spreadsheets
  • Data Science Analysis
    • Analysis can take many forms, and it's often an umbrella term. Analyzing large quantities of data means effectively aggregating findings and results. In this project, I've split analyses into 2 categories: breadth and depth. Breadth analyses seek to get comfortable with the data and probe it. Depth analyses seek to answer specific questions and dive deeper into the data.
  • Asking the right questions!
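
As a small illustration of the kind of inferential statistics SciPy makes possible, the sketch below runs Welch's t-test with scipy.stats.ttest_ind. It is not taken from eda.ipynb: the two salary samples, the group names, and the 0.05 significance level are made-up assumptions.

```python
from scipy import stats

# Hypothetical salary samples (in thousands of USD) for two made-up groups.
remote_salaries = [120, 135, 128, 140, 132, 125]
onsite_salaries = [110, 118, 105, 122, 115, 108]

# Welch's t-test (equal_var=False) does not assume equal variances between groups.
result = stats.ttest_ind(remote_salaries, onsite_salaries, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

# Assumed significance level; reject the null of equal means if p is below it.
alpha = 0.05
reject_null = p_value < alpha

print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject null: {reject_null}")
```

A test like this belongs to the "depth" category of analysis: it answers one specific question (do the two groups differ in mean salary?) rather than probing the dataset broadly.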

Author Info

Contact me:

Gmail: tuvincent0106@gmail.com
LinkedIn: Vincent Tu
Kaggle: vincenttu

Thank you

I've written quite a few of these notes already, but I ought to write one more for the README. This wonderful project has been a rollercoaster, and I've enjoyed every turn, twist, and drop. I'd like to thank the internet for carrying me through all the countless hours of debugging! Catherine, thank you for helping with the banner. You've inspired me all throughout this project ❤️! And thank you for checking out my project and reading this! 😁👋