MBTI-ML

Project Description

For our machine learning final project, we decided to predict Myers-Briggs Type Indicator (MBTI) personality types using a dataset we found on Kaggle. The dataset consists of posts that users made on a personality forum: https://www.kaggle.com/datasnaek/mbti-type

Methods

First, we cleaned the text data and created NLP features. We then fit a range of models, including Multinomial Naive Bayes, Logistic Regression, Random Forest, XGBoost, LightGBM, and an LSTM.
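The cleaning step might look roughly like the sketch below. It is not the project's actual code. It assumes the Kaggle file stores each user's posts in one string separated by `|||`, and that the four-letter type is split into four binary axes. The helper names `clean_posts` and `type_to_axes` are hypothetical, and stripping MBTI type mentions (so the label does not leak into the features) is our assumption about a sensible preprocessing choice.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Matches type mentions such as "intj" or "enfps" in lowercased text.
MBTI_RE = re.compile(r"\b[ie][ns][tf][jp]s?\b")


def clean_posts(raw: str) -> str:
    """Join a user's posts, lowercase them, and strip URLs, type mentions, and non-letters."""
    text = raw.replace("|||", " ").lower()
    text = URL_RE.sub(" ", text)
    text = MBTI_RE.sub(" ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def type_to_axes(mbti: str) -> dict:
    """Split a four-letter type into four binary labels (1 = I, N, T, J)."""
    mbti = mbti.upper()
    return {
        "I": int(mbti[0] == "I"),
        "N": int(mbti[1] == "N"),
        "T": int(mbti[2] == "T"),
        "J": int(mbti[3] == "J"),
    }
```

For example, `type_to_axes("INFP")` yields one binary target per axis, which lets each axis be modeled as its own classification problem.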

We evaluated the models using metrics such as AUC, F1 score, and accuracy. We then fine-tuned the models and combined them in a voting classifier ensemble. Finally, to build an additional test set, we scraped tweets from celebrities such as Obama and Lady Gaga and used our model to predict their MBTI types. The predictions look interesting.
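As an illustration of the ensembling and evaluation described above, here is a minimal scikit-learn sketch for a single binary axis. It is not the project's notebook code: the toy texts and labels are made up, TF-IDF is only one possible choice of NLP features, soft voting is an assumption, and Multinomial Naive Bayes stands in for the gradient-boosted models to keep the example dependency-free beyond scikit-learn.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-in for cleaned posts; 1 = introvert, 0 = extrovert.
train_texts = [
    "quiet evening reading alone at home",
    "stayed home reading a long book",
    "thinking alone in my quiet room",
    "reading and writing alone all weekend",
    "big party with friends last night",
    "went out dancing with many friends",
    "love meeting new people at the party",
    "talking with friends all night long",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = [
    "reading alone at home tonight",
    "quiet weekend with a book",
    "party with new friends tonight",
    "dancing and talking with people",
]
test_labels = [1, 1, 0, 0]

# Soft voting averages the predicted class probabilities of the base models.
ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        voting="soft",
    ),
)
ensemble.fit(train_texts, train_labels)

pred = ensemble.predict(test_texts)
proba = ensemble.predict_proba(test_texts)[:, 1]  # probability of class 1

metrics = {
    "auc": roc_auc_score(test_labels, proba),
    "f1": f1_score(test_labels, pred),
    "accuracy": accuracy_score(test_labels, pred),
}
print(metrics)
```

Soft voting requires every base estimator to implement `predict_proba`; with `voting="hard"` the ensemble would instead take a majority vote over predicted labels.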

Results

The data was heavily imbalanced: most users identified as Introverted (I) and Intuitive (N) rather than Extroverted (E) and Sensing (S). Because of this, all the models we tried (Logistic Regression, Random Forest, Multinomial Naive Bayes, SVM, LightGBM, and XGBoost) had trouble classifying extroversion vs. introversion and intuition vs. sensing. However, a Voting Classifier combining Logistic Regression, Random Forest, LightGBM, and XGBoost achieved the best AUC-ROC and F1 scores.
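To see why imbalance makes accuracy alone misleading, consider the small sketch below. The 80/20 split is a hypothetical illustration, not the dataset's actual class distribution, and the helper functions are written out by hand only to keep the example self-contained.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical split: 80 introverts, 20 extroverts.
y_true = ["I"] * 80 + ["E"] * 20
# A trivial baseline that always predicts the majority class.
y_pred = ["I"] * 100

baseline_accuracy = accuracy(y_true, y_pred)     # 0.8
minority_f1 = f1_for_class(y_true, y_pred, "E")  # 0.0
```

The baseline scores 80% accuracy without ever identifying an extrovert, which is why per-class F1 and AUC are more informative on data like this.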

Team (alphabetical order)

Ben Khuong
Donya Fozoonmayeh
Nan Lin
Tomohiko Ishihara
Zack Pan