/data-literacy-research

In our research, we aim to discover how we can quantify data literacy expectations from job postings.

Primary language: Jupyter Notebook

Data Literacy Research

In this research project, I collaborate with Professor Sandra Cannon to measure data literacy expectations from the way employers describe jobs and the way they describe the people they are looking for. We examine the discriminatory power between how employers describe jobs and what the work on the job actually entails.

Current Pipeline

[Figure: Data Literacy Expectations Pipeline]

Files:

linkedin.py: Use this file to generate the DataFrame of LinkedIn postings. Results are stored in data/scraping_results (tagged with "linkedin").

indeed.py: Same as linkedin.py, but for Indeed postings. Results are stored in data/scraping_results (tagged with "indeed").

Notes

Data files:

  • merged_headings_df: Contains both the LinkedIn and Indeed postings in a single DataFrame

Utility Functions (in utilities.utils)

  • to_wcdf: Applies sklearn CountVectorizer
  • preprocess_heading_text: Takes the Heading Text (originally intended for merged_headings_df) and applies a preprocessing pipeline to it
  • visualize_counts: Takes in a Pandas Series of string row entries and uses Seaborn to visualize the top n words in that corpus
  • visualize_seq_lengths: Visualizes the distribution of word lengths in a sequence
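
As a rough illustration of the quantities these utilities work with, here is a minimal standard-library sketch. It is not the code in utilities.utils: it uses collections.Counter in place of sklearn's CountVectorizer, skips the Seaborn plotting, and the names tokenize, top_n_words, and sequence_lengths, as well as the sample headings, are hypothetical. It assumes "sequence length" means the number of word tokens per entry.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_n_words(texts, n=10):
    """Return the n most frequent words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(n)


def sequence_lengths(texts):
    """Return the number of word tokens in each string."""
    return [len(tokenize(text)) for text in texts]


# Made-up heading texts, for illustration only.
headings = [
    "Analyze data and build dashboards",
    "Data analysis with SQL",
    "Communicate data insights",
]

print(top_n_words(headings, n=1))   # [('data', 3)]
print(sequence_lengths(headings))   # [5, 4, 3]
```

The (word, count) pairs are what a bar chart of top words would plot, and the list of lengths is what a histogram of sequence lengths would summarize.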