Try to predict MBTI type based on posts data from https://www.kaggle.com/datasnaek/mbti-type. Since there is only one given feature and label, all of the features used in the model are generated through NLP analysis of the posts data.
All analysis was done in Google Colab, with my personal Google Drive mounted to the file system.
Based on the theory of the MBTI, there are 16 personality types. Therefore, this is a multivariate multiclass classification task.
However, the given data does not have an equal distribution of entries for each personality type, so stratified train-test data splitting must be used for best representation of each class:
Type | Row Count |
---|---|
INFP | 1832 |
INFJ | 1470 |
INTP | 1304 |
INTJ | 1091 |
ENTP | 685 |
ENFP | 675 |
ISTP | 337 |
ISFP | 271 |
ENTJ | 231 |
ISTJ | 205 |
ENFJ | 190 |
ISFJ | 166 |
ESTP | 89 |
ESFP | 48 |
ESFJ | 42 |
ESTJ | 39 |
Features that have been generated during analysis using TextBlob library (https://textblob.readthedocs.io/en/dev/index.html#):
- Average post sentence length in characters
- Standard deviation of post sentence length in characters
- Average post sentence number of words
- Standard deviation of post sentence number of words
- Average post sentence descriptor length in characters
- Standard deviation of post sentence descriptor length in characters
- Words that each personality type uses relatively more frequently than other personality types (see
Determining the Distinguishing Words for each Personality Type
in.ipynb
file) - Sentiment polarity of posts
- Sentiment subjectivity of posts
- Count of words in posts which are considered to have "extreme" polarity (|polarity| >= some threshold)
- Count of words in posts which are considered to have "extreme" subjectivity (|subjectivity| >= some threshold)
- Count of words in posts belong to certain classes of words ("mood", "profanity", "irony")
- Count of punctuation marks (e.g. ',', '.', '!', '?') used in posts
All features have been included together for analysis and feature importance determined by a random forest classifier has allowed me to arbitrarily exclude features (see .ipynb
file for features marked with tag # UF).
Since generating some of the features takes a significant amount of time, I generated the features and saved them in pickle files, found under the GitHub folder preprocessed_data
. In the .ipynb
code, the paths are relative to my Google Drive file system, but are easily adaptable to new filesystems. Simply loading the pickle files and swapping in and out features to include saves a lot of time from repeated feature computation.
Scikit-Learn's implementations of SVC
, RandomForestClassifier
, and LogisticRegressor
have all achieved around 62% accuracy. Using the ensemble VotingClassifier
, which aggregrates the votes based on soft voting (i.e. averaging class probabilities), the accuracy has increased to around 65%. However, due to the imbalance of data representating each personality type and the possibility that personality types cannot be distinguished solely on website posts, I believe 65% is the best that I can achieve.