
A project from the first Asian machine learning camp held on Jeju Island, South Korea.


Classification of Depression on Social Media Using Text Mining

Author - 저자

Name: Nikie Jo Elauria Deocampo

Country: Philippines

Educational Background:

Undergraduate: Bachelor of Science in Information Systems

Graduate: Master's in Information Technology

School: West Visayas State University

Mentor: Dr. Bobby Gerardo

Motto: I work hard so my dog can have a better life.

Introduction - 소개

Mental illness is prevalent around the world, and depression is one of the most common psychological problems; I would like to help as much as I can. Being a fan of Anthony Bourdain and Robin Williams has propelled me to explore this study. With the large amount of tweets and Facebook posts available online, I can use machine learning to mine this data and produce meaningful and useful outcomes.

Social media generates countless data every day as millions of active users share and communicate across entire communities; it has changed human interaction. For this project, I will be using Python and various modules and libraries.

The Project - 프로젝트

Requirements:

  • Python 3.6.1 or higher
  • Twitter developer account
  • Several Python modules (Keras, TensorFlow, NumPy, scikit-learn, pandas, and itertools)
  • A lot of patience and a love for machine learning.

The aim of the project is to predict early signs of depression through social media text mining. Below are the steps to run the Python code using the data sets uploaded in this repository, or you can download your own.

Follow the steps below:

  1. Create a Twitter developer account (Register Here). From that account you will need four credentials:
  2. consumer_key = '', consumer_secret = '', access_token = '', access_secret = ''
  3. Insert the credentials into the file "Download_twitter_Api.py"; you can then download current tweets using keywords such as depression, anxiety, or sadness (a minimal download sketch follows this list). When the data sets are ready, you may proceed to the preprocessing stage.
  4. Run "preprocessor.py". This stage goes through your data sets with the given dictionary. The dictionary contains words with their corresponding polarity, which is essential for calculating the sentiment of each tweet: each word is separated, tokenized, and assigned its polarity. The sentiment of a tweet is the sum of the polarities of all its words divided by the number of words in that tweet (see the scoring sketch after this list).
  5. Once preprocessing is done, you can find the output at "processed_data/output.xlsx". Opening it, you will see the ID (tweet) and the sentiment of each tweet separated into two columns. With this output you now have a Twitter data set, filtered by depression keywords, with its corresponding sentiment (Positive, Neutral, or Negative).
  6. Now for training and predicting: make sure all files are located in the proper folders, then run "depression_sentiment_analysis.py". The code runs through the output.xlsx file and at the same time recovers the tweet corresponding to the ID of each sentiment; this original data is then fed to the classifiers. When everything is done, the AUC of each classifier is listed in the console (a training sketch also follows this list).
  7. But wait, there's more: you can also type in a sample tweet, which is run through the classifier with the highest AUC to predict its sentiment.
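
A minimal sketch of what the credential setup and keyword download in "Download_twitter_Api.py" could look like, assuming the Tweepy library (3.x, where the search endpoint is api.search; in Tweepy 4.x it is api.search_tweets). The output path and query term are illustrative:

    import csv

    import tweepy

    # Credentials from your Twitter developer account (step 2).
    consumer_key = ''
    consumer_secret = ''
    access_token = ''
    access_secret = ''

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Download recent English tweets matching a depression-related keyword.
    with open('data/depression_tweets.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'text'])
        for tweet in tweepy.Cursor(api.search, q='depression', lang='en',
                                   tweet_mode='extended').items(1000):
            writer.writerow([tweet.id_str, tweet.full_text.replace('\n', ' ')])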
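
The scoring rule in step 4 (sum of word polarities divided by word count) can be sketched as follows; the dictionary entries and the neutral threshold are made-up placeholders, not the repository's actual dictionary:

    import re

    # Hypothetical polarity dictionary: word -> polarity in [-1, 1].
    POLARITY = {'sad': -0.8, 'hopeless': -1.0, 'alone': -0.6, 'happy': 0.9}

    def tweet_sentiment(tweet):
        """Sum the polarity of every word, then divide by the word count."""
        words = re.findall(r"[a-z']+", tweet.lower())  # crude tokenization
        if not words:
            return 0.0
        return sum(POLARITY.get(w, 0.0) for w in words) / len(words)

    def label(score, threshold=0.05):
        """Map the averaged score to the three classes stored in output.xlsx."""
        if score > threshold:
            return 'Positive'
        if score < -threshold:
            return 'Negative'
        return 'Neutral'

    print(label(tweet_sentiment("I feel so sad and hopeless")))  # -> Negative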
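
And a hedged sketch of the training and prediction loop in steps 6 and 7, assuming scikit-learn, TF-IDF features, and a binarized negative-vs-rest label for the AUC. The column names and file layout are assumptions; the real script recovers each tweet's text by its ID before this point:

    import time

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Assumed layout: tweet text already joined to its sentiment by ID.
    data = pd.read_excel('processed_data/output.xlsx')  # columns: Tweet, Sentiment
    y = (data['Sentiment'] == 'Negative').astype(int)   # binarize for AUC

    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(data['Tweet'].astype(str))
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    classifiers = {
        'Naive Bayes': MultinomialNB(),
        'Decision Tree': DecisionTreeClassifier(),
        'Support Vector Machine': SVC(),
        'KNeighbors': KNeighborsClassifier(),
        'Random Forest': RandomForestClassifier(),
    }

    # Train each classifier, report its AUC and timing, and keep the best one.
    best_name, best_auc, best_clf = None, -1.0, None
    for name, clf in classifiers.items():
        start = time.time()
        clf.fit(X_train, y_train)
        auc = roc_auc_score(y_test, clf.predict(X_test))
        print(f'{name}: AUC {auc:.4f} ({time.time() - start:.5f} seconds)')
        if auc > best_auc:
            best_name, best_auc, best_clf = name, auc, clf

    # Step 7: classify a tweet you type in, using the highest-AUC classifier.
    sample = input('Type a sample tweet: ')
    pred = best_clf.predict(vectorizer.transform([sample]))[0]
    print(f'{best_name} predicts: {"Negative" if pred else "Not negative"}')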

What could the result mean? Positive means the person is unlikely to have depression or anxiety. Neutral is the middle level, where the user may or may not have depression but may be more prone to becoming depressed; at that stage the user may display some depression-like symptoms. Lastly, Negative is the lowest level, where depression and anxiety symptoms are detected through the user's tweets. The more negative words a user uses, the more negative emotion the tweet carries.


Results - 결과들

Below are the metrics for the five classifiers, with the decision tree having the highest score.

Using the same data set to test accuracy, I trained and tested on about 10,000 tweets:

AUC is an abbreviation for area under the curve. It is used in classification analysis to determine which of the models predicts the classes best.

Accuracy:

  • Naive Bayes: 93.79406648429645 %
  • Decision Tree: 98.55668748040587 %
  • Support Vector Machine: 50.0 %
  • KNeighbors: 81.464022923447 %
  • Random Forest: 49.1038137743686 %

Completion Time:

  • Naive Bayes: 0.59779 seconds
  • Decision Tree: 3.40457 seconds
  • Support Vector Machine: 29.83311 seconds
  • KNeighbors: 7.99048 seconds
  • Random Forest: 0.60994 seconds

Future Plans - 향후 계획

This study is not yet perfect, and I am still aiming to improve it.

  • Use contextual semantic segmentation
  • Use stopword removal to increase the model's accuracy
  • Eliminate features with extremely low frequency
  • Use complex features: n-grams and part-of-speech tags (see the sketch after this list)
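
A sketch of how stopword removal, rare-feature elimination, and n-gram features could be combined with scikit-learn's TfidfVectorizer (parameter values here are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(
        stop_words='english',  # stopword removal
        ngram_range=(1, 2),    # unigrams and bigrams as complex features
        min_df=1,              # raise on a real corpus to drop very rare features
    )
    features = vectorizer.fit_transform([
        "i feel so hopeless and alone",
        "had a happy and productive day",
    ])
    print(sorted(vectorizer.vocabulary_))  # surviving unigram/bigram features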


Acknowledgement - 승인

This work would not have been possible without the overwhelming support of Jeju National University, the Jeju Development Center, and other selfless sponsors. I would like to give a big thanks to Prof. Yungcheol Byun for being the best host ever, and to my mentor Dr. Bobby Gerardo for his help and guidance.