career_village_entities

Object model for the Kaggle Data Science for Good: CareerVillage.org initiative.

Primary language: Python. License: MIT.

Getting started

First, we want to read in the CSV files to build the object model. We then save that to a pickle file.

from career_village_entities import CareerVillage

# Load the raw data and save as a pickle file
CareerVillage.load_raw('input/').save('data/cv.p')

After that, we can just load the pickle file.

from career_village_entities import CareerVillage

# In the future, we can just load the pickle
cv = CareerVillage.load('data/cv.p')

Note that the pickle file is around 350 MB and can take up to 30 seconds to load. I therefore recommend using it within a Jupyter notebook, so that you only have to load it once and can then perform all of your analysis in the same session.
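Outside a notebook, one way to avoid paying the load cost more than once per process is to memoize the loader. The following is a minimal sketch using only the standard library. `load_cached` is a hypothetical helper (not part of this package), and a small dictionary stands in for the real `CareerVillage` object so that the example runs without the data set.

```python
import functools
import os
import pickle
import tempfile


@functools.lru_cache(maxsize=None)
def load_cached(path):
    """Unpickle `path` once per process; later calls reuse the same object."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in for the real pickle file (hypothetical toy data)
demo_path = os.path.join(tempfile.mkdtemp(), 'cv_demo.p')
with open(demo_path, 'wb') as f:
    pickle.dump({'tags': ['college', 'career']}, f)

first = load_cached(demo_path)
second = load_cached(demo_path)  # served from the cache, no second read
print(first is second)
```

With the real data, the body of `load_cached` would presumably call `CareerVillage.load(path)` instead of `pickle.load`.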

The CareerVillage instance contains several collections, one for each type of entity in the data set. Each collection is a scalaps list, which provides a lot of useful helper methods.

from career_village_entities import CareerVillage

cv = CareerVillage.load('data/cv.p')

print(cv.tags.length, 'tags')

cv.questions.take(5).for_each(print)
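For readers unfamiliar with this chaining style, the snippet above reads roughly like the following plain Python. This is only a sketch of what the three helpers used here (`length`, `take`, `for_each`) appear to do. The two lists are hypothetical stand-ins for `cv.tags` and `cv.questions`, not real data.

```python
from itertools import islice

# Hypothetical stand-ins for cv.tags and cv.questions
tags = ['college', 'career', 'engineering']
questions = ['Question %d' % i for i in range(1, 8)]

# Roughly: print(cv.tags.length, 'tags')
print(len(tags), 'tags')

# Roughly: cv.questions.take(5).for_each(print)
first_five = list(islice(questions, 5))
for question in first_five:
    print(question)
```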

Each entity is linked to other entities. For example, an Answer is linked to its question and its author. Similarly, each person (Student or Professional) is linked to the questions they've asked and the answers they've provided. These links help us find patterns in the data for use in developing methods to recommend specific questions to specific professionals.
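To make the idea of linked entities concrete, here is a minimal, self-contained sketch of two-way links between objects. The class definitions and field names below are illustrative assumptions, not the package's actual classes; only the attribute names `answers`, `question`, and `author` mirror the ones used elsewhere in this README.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare by identity, so linked objects stay hashable
class Professional:
    name: str
    answers: List['Answer'] = field(default_factory=list)


@dataclass(eq=False)
class Question:
    title: str
    answers: List['Answer'] = field(default_factory=list)


@dataclass(eq=False)
class Answer:
    body: str
    question: Question
    author: Professional

    def __post_init__(self):
        # Register the reverse links so navigation works in both directions
        self.question.answers.append(self)
        self.author.answers.append(self)


pro = Professional('Pat')
q = Question('How do I become a data scientist?')
a = Answer('Start with statistics and Python.', question=q, author=pro)

print(a.question.title)                    # from an answer to its question
print([ans.body for ans in pro.answers])   # from a person to their answers
```

Keeping the entities hashable matters for analyses that put them into sets, as the email example in this README does.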

Here's an example where we check how important emails are for encouraging answers. We simply check how frequently a question was answered by a professional who was emailed with a suggestion to answer that question.

# Count how many questions were answered by a professional emailed about the question
# vs. how many questions were answered w/o prompting
from career_village_entities import CareerVillage

cv = CareerVillage.load('data/cv.p')

def is_question_answered_by_emailed_professional(question):
    # Professionals who received an email suggesting this question
    emailed_professionals = question.emails.map('recipient')
    # Authors of the answers to this question
    authors = question.answers.map('author')
    # True if at least one emailed professional also wrote an answer
    return bool(set(emailed_professionals) & set(authors))

(cv
 .questions
 .filter(lambda q: q.answers.length > 0) # Only consider questions that were answered
 .map(is_question_answered_by_emailed_professional)
 .value_counts()
 .items()
 .for_each(print))

The results are

(True, 10452)
(False, 12658)

Hence, 10,452 of the 23,110 answered questions (45.2%) received an answer from at least one professional who had been emailed about that question.
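The percentage can be checked directly from the two counts in the output above. The variable names are just labels chosen for this sketch.

```python
# Counts reported in the results above
answered_by_emailed = 10452   # the True bucket
answered_by_others = 12658    # the False bucket

total_answered = answered_by_emailed + answered_by_others
share = answered_by_emailed / total_answered

print(total_answered)          # 23110
print('{:.1%}'.format(share))  # 45.2%
```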

This project is very much a work in progress. I'd appreciate other people's input, so feel free to submit a PR.

Contact: Matt Hagy (matthew.hagy@gmail.com)