WiDS-datathon

Use ML to predict gender from sparse data

Primary language: Jupyter Notebook

Project Description

This is our analysis for the 2018 Women in Data Science (WiDS) Datathon hosted by Stanford University. The goal is to predict whether a survey respondent is male or female from more than 1,200 variables. The data are very sparse. Our model achieved a ROC AUC of 0.965.
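ROC AUC and accuracy are different metrics, so the score above should be read as a ranking measure rather than a share of correct predictions. The toy example below is only a sketch: the labels, the probabilities, the 0.5 threshold, and the encoding (1 = female, 0 = male) are made up for illustration and do not come from the competition data.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels (assumed encoding: 1 = female, 0 = male)
# and hypothetical predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.45, 0.8]

# ROC AUC depends only on how well the probabilities rank
# positives above negatives, not on any threshold.
auc = roc_auc_score(y_true, y_prob)

# Accuracy needs hard labels, obtained here with an assumed 0.5 threshold.
y_pred = [int(p >= 0.5) for p in y_prob]
acc = accuracy_score(y_true, y_pred)

print(f"ROC AUC: {auc:.2f}, accuracy: {acc:.2f}")
```

Here every positive is ranked above every negative, so the ROC AUC is perfect even though one thresholded prediction is wrong.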

Data Background

The data contain demographic and behavioral information from a representative sample of survey respondents in India, including their usage of traditional and mobile financial services. The data were obtained by a research group that aims to help the world’s poorest people take advantage of widely available mobile phones and other digital technology to access financial tools and participate more fully in their local economies. Women in these communities, in particular, are often largely excluded from the formal financial system. By predicting gender, datathon participants explore the key differences in the behavior patterns of men and women and how those differences may affect their use of new financial services.

Work Summary

  • Imputing sparse data
  • Modeling with Random Forest, SVM, Naive Bayes
  • Feature generation followed by modeling with linear models
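The first two steps above can be sketched roughly as follows. This is not the notebook code from this repository: the synthetic data, the median imputation strategy, the missing-value indicators, and all hyperparameters are assumptions chosen to keep the example small and self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey data: 1000 rows, 20 numeric features.
X = rng.normal(size=(1000, 20))
# Hypothetical binary target driven by the first two features.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Blank out about half of the entries to mimic sparse survey responses.
X[rng.random(X.shape) < 0.5] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Impute missing values with the column median and add indicator columns
# so the model can also learn from the fact that a value was missing.
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(X_train, y_train)

# Score the held-out rows with ROC AUC, the metric reported above.
test_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, test_prob)
print(f"Held-out ROC AUC: {auc:.3f}")
```

Wrapping the imputer and the classifier in one pipeline means the imputation statistics are learned from the training rows only, which avoids leaking information from the held-out rows. Swapping `RandomForestClassifier` for another estimator such as `LogisticRegression` would give a linear-model variant of the same sketch.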

Python Installation

pip install scikit-learn
pip install pandas