A Python script to scan resumes for signals of elite class status.
Anonymous screening seems to be a good solution for reducing bias in hiring. However, it may not be possible to fully anonymize a resume, particularly with regard to class status (elite vs. non-elite), because class is signalled in many subtle ways. This script searches resumes for terms that signal elite class status and counts them, writing a CSV intended to be loaded into Stata for analysis.
python final.py
The script outputs a CSV in which each row is a resume and each column is a term. Each cell holds the number of occurrences of that term (and its synonyms) in that resume.
Sample resumes can be found in this Drive folder, and a sample output is available in /sample_output/.
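As a rough illustration of the output layout described above, the sketch below builds a tiny CSV in the same shape and loads it with pandas. The term columns (`lacrosse`, `sailing`), the file names, and the helper `load_counts` are invented for the example; only the `resumeName` identifier column comes from the suggested config value.

```python
import io

import pandas as pd

# Hypothetical output: one row per resume, one column per term.
# Term names, file names, and counts are invented for illustration.
SAMPLE_OUTPUT = """resumeName,lacrosse,sailing
resume_001.pdf,2,0
resume_002.docx,0,1
resume_003.pdf,0,0
"""


def load_counts(csv_source, id_column="resumeName"):
    """Load the term-count CSV, indexed by the resume identifier column."""
    return pd.read_csv(csv_source, index_col=id_column)


counts = load_counts(io.StringIO(SAMPLE_OUTPUT))

# Total number of matched terms per resume (sum across term columns).
totals = counts.sum(axis=1)

# Resumes that contain at least one term of interest.
flagged = totals[totals > 0].index.tolist()
print(flagged)
```

With a real run, the `io.StringIO` object would be replaced by the path to the CSV written into the output directory.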
These instructions will get you a copy of the project up and running.
- Clone the repo:
  git clone https://github.com/TheFirstQuestion/resume-parser.git
- Install dependencies via pip/conda/mamba:
  textract nltk tqdm pandas
- Run setup.py to download and generate necessary files:
  python setup.py
- Edit the terms lists (in /terms_of_interest) to suit your needs. Each line represents a concept, so each new term should be on a new line. Synonyms of the term should be comma-separated on the same line; their counts will be combined in the output. The terms are divided into different files for convenience only.
- Edit the config section at the top of the script to suit your needs.
  | Variable | Usage | Suggested Value |
  | --- | --- | --- |
  | RESUME_DIRECTORY | Path (relative to script location) to the directory containing the resumes. | "./sample_resumes/" |
  | TERMS_LOCATION | Path (relative to script location) to the directory containing the CSV file(s) defining the terms of interest. | "./terms_of_interest/" |
  | OUTPUT_DIRECTORY | Path (relative to script location) to the directory wherein the script will write the output files. | "./output/" |
  | RESUME_ID_COLUMN_NAME | The header for the CSV column that identifies each resume. | "resumeName" |
  | SKIP_GREEK | Should the script skip searching for all the Greek terms of interest? | False |
- Run the script:
  python final.py
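To make the terms-file format from the steps above concrete, here is a minimal sketch of how one line per concept with comma-separated synonyms could be parsed and counted. It is not the script's actual implementation: the function names `parse_terms` and `count_concepts`, the choice to label each concept by its first term, and the whole-word, case-insensitive matching are all assumptions made for illustration, and the example terms and resume text are invented.

```python
import re


def parse_terms(lines):
    """Map each concept (labelled by the first term on its line) to its synonyms."""
    concepts = {}
    for line in lines:
        # Split on commas, trim whitespace, and skip empty entries or blank lines.
        terms = [t.strip().lower() for t in line.split(",") if t.strip()]
        if terms:
            concepts[terms[0]] = terms
    return concepts


def count_concepts(text, concepts):
    """Count whole-word, case-insensitive matches, combining synonyms per concept."""
    lowered = text.lower()
    counts = {concept: 0 for concept in concepts}
    for concept, synonyms in concepts.items():
        for term in synonyms:
            # Assumption: match on word boundaries; the real script may differ.
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[concept] += len(re.findall(pattern, lowered))
    return counts


# Invented example: two concepts, the first with two synonyms.
terms_file_lines = ["sailing, yacht club, regatta", "lacrosse", ""]
resume_text = "Captain of the Yacht Club; placed second in a regatta. Varsity lacrosse."

concepts = parse_terms(terms_file_lines)
counts = count_concepts(resume_text, concepts)
print(counts)
```

Here the matches for "yacht club" and "regatta" are both credited to the "sailing" concept, which mirrors the combined synonym counts described for the output.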
On the sample set of 2538 resumes, the script finishes in ~7 minutes, with the main loop running at ~6 resumes per second.
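The two figures above are consistent with each other, and the same arithmetic gives a rough runtime estimate for other corpus sizes. The helper `estimate_minutes` below is hypothetical, and it assumes the ~6 resumes per second reported for the sample set carries over, which will vary with hardware and file types.

```python
def estimate_minutes(n_resumes, resumes_per_second=6.0):
    """Estimate main-loop runtime in minutes at a given throughput."""
    return n_resumes / resumes_per_second / 60


# 2538 resumes at ~6 per second is 423 seconds, i.e. about 7 minutes.
sample_minutes = estimate_minutes(2538)
print(round(sample_minutes))
```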
Collaboration is what makes the world such an amazing place to learn, inspire, and create. Any contributions or suggestions you make are greatly appreciated!
Feel free to do any of the following:
- send me an email
- open an issue
- fork the repo and create a pull request
- Most of the sample resumes used in testing came from the Kaggle resume dataset, which was a super convenient resource.