
Suicidal Ideation Detection in Online User Content

This project is archived: there will be no further updates and no dataset licensing. Please consider using the UMD Suicidality dataset instead.

Getting Started

Because online media and social networks offer anonymity, people tend to express their feelings and suffering in online communities. To help prevent suicide, it is necessary to detect suicide-related posts and users' suicidal ideation in cyberspace using natural language processing methods. We focus on the online community Reddit and the social networking website Twitter, and classify users' posts as showing potential suicidal risk or no suicidal risk, using text feature processing and machine-learning-based methods.


We collected two datasets, one from Reddit and one from Twitter. The Reddit dataset includes 5,326 suicidal ideation samples and about 20k non-suicide texts. The Twitter dataset has 10k tweets in total, of which 594 (around 6%) contain suicidal ideation.
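The class proportions above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it treats "10k" and "20k" as exact totals, which is an assumption, and the function names are hypothetical. The inverse-frequency class weights are one common way to handle such imbalance, not necessarily what the scripts in this project do.

```python
def positive_rate(n_positive, n_total):
    """Fraction of samples labelled as containing suicidal ideation."""
    return n_positive / n_total


def balanced_class_weights(n_positive, n_negative):
    """Inverse-frequency weights: n_total / (n_classes * n_class)."""
    n_total = n_positive + n_negative
    return {
        "negative": n_total / (2 * n_negative),
        "positive": n_total / (2 * n_positive),
    }


# Assumption: "10k" tweets and "20k" non-suicide Reddit texts are exact counts.
twitter_rate = positive_rate(594, 10_000)
reddit_rate = positive_rate(5_326, 5_326 + 20_000)
twitter_weights = balanced_class_weights(594, 10_000 - 594)

print(f"Twitter positive rate: {twitter_rate:.4f}")
print(f"Reddit positive rate:  {reddit_rate:.4f}")
print(f"Twitter class weights: {twitter_weights}")
```

Under these assumptions the Twitter data is far more imbalanced than the Reddit data, which is worth keeping in mind when comparing accuracy across the two.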

The Reddit word cloud (left) and Twitter word cloud (right) are shown as follows:

[Figure: Reddit word cloud (left), Twitter word cloud (right)]

The original text data cannot be made public, in order to protect users' privacy. It is available on request; see the data availability statement in our paper.

Notice: Only the Reddit dataset is available for sharing. When requesting it, please contact me from your institutional email address to identify yourself. Because I receive a large number of messages, I may miss yours, or it may be misclassified as spam. Sorry for that.

If you'd like to collect your own data, please refer to this repository: web spider.

The two xlsx files in this project contain sample data composed by the author. Notice: when running the scripts, please replace them with the requested data or your own data.

Feature Processing

We extracted six sets of features: statistical features, POS counts, TF-IDF, topic probabilities, LIWC, and pre-trained word2vec word embeddings.
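To make two of these feature sets concrete, here is a minimal standard-library sketch of simple statistical features and a smoothed TF-IDF weighting. It is not the project's own extraction code: the tokenizer, the choice of statistics, the function names, and the sample sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a simplifying assumption)."""
    return TOKEN_RE.findall(text.lower())


def statistical_features(text):
    """A few length-based statistics; the exact set used in the project may differ."""
    tokens = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = len(tokens)
    return {
        "n_chars": len(text),
        "n_tokens": n_tokens,
        "n_sentences": len(sentences),
        "avg_token_len": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
    }


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * smoothed idf."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Smoothed idf: ln((1 + n_docs) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


# Neutral placeholder texts, not taken from either dataset.
docs = ["I feel tired. I cannot sleep!", "Great game last night."]
stats = statistical_features(docs[0])
vectors = tfidf(docs)
print(stats)
print(vectors[0])
```

In practice a library vectorizer would normally replace the hand-written `tfidf` function; the explicit version is shown only to make the weighting visible.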

The csv files are the processed features using LIWC. Notice: when running the scripts, please replace them with your own data.

All these features are visualized in the following six pictures, using PCA for dimensionality reduction.
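As a rough illustration of the PCA step, the sketch below projects a feature matrix onto its first two principal components using NumPy's SVD. The random matrix stands in for real features, and the function name `pca_2d` is hypothetical; the project's plotting code may use a different PCA implementation.

```python
import numpy as np


def pca_2d(features):
    """Project rows of `features` onto the first two principal components."""
    X = np.asarray(features, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:2].T
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    return projected, explained_ratio[:2]


# Placeholder data: 50 samples with 6 features each (an assumption for the demo).
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 6))
projected, explained = pca_2d(features)
print(projected.shape, explained)
```

The two columns of `projected` are what would be drawn on the x and y axes of a scatter plot, colored by class label.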


Running the scripts

Six models were implemented: logistic regression, random forest, gradient boosting decision tree, XGBoost, support vector machine, and LSTM networks.
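For readers who want a feel for how several of these classifiers can be compared, here is a small scikit-learn sketch using cross-validated AUC on synthetic data. It is not taken from clf.py: the synthetic dataset, the hyperparameters, and the 5-fold setup are assumptions, and XGBoost and the LSTM are left out to keep the example dependency-light.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for a real feature matrix (assumption).
X, y = make_classification(
    n_samples=400,
    n_features=20,
    n_informative=8,
    weights=[0.8, 0.2],
    random_state=0,
)

# Default-ish hyperparameters chosen for illustration only.
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

auc_by_model = {}
for name, model in models.items():
    # Stratified 5-fold cross-validation, scored by area under the ROC curve.
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    auc_by_model[name] = scores.mean()

for name, auc in auc_by_model.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```

Scaling is applied only to the linear and kernel models, since tree ensembles are insensitive to feature scale.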

The first five models for Reddit and Twitter are run with python clf.py and python clf_reddit.py. The LSTM model for Reddit and Twitter is run with python lstm.py and python lstm_reddit.py, and the LSTM with word2vec embeddings with python lstm_word2vec.py and python lstm_word2vec_reddit.py.

These scripts were written in Python 3.6. Please check the requirements before running.


Part of the experimental results on the Reddit SuicideWatch vs. all dataset, with 5,326 posts containing suicidal ideation, is shown below.

Model  Acc.      Pre.      Rec.      F1        AUC
RF     0.941440  0.958286  0.906931  0.931861  0.986029
GBDT   0.961845  0.964161  0.948894  0.956437  0.991860
XGB    0.965660  0.969280  0.952525  0.960796  0.993403
LSTM   0.961098  0.959305  0.952117  0.955449  0.992637
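For reference, the column headings correspond to the standard definitions of accuracy, precision, recall, F1, and AUC. The sketch below computes them in plain Python; the confusion counts and scores are made-up numbers for illustration, not values from these experiments, and the function names are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"acc": accuracy, "pre": precision, "rec": recall, "f1": f1}


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, fn=20, tn=880)
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
print(metrics, auc)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a demonstration but not for large evaluation sets.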


There is also remarkable work from the University of Maryland, which was completed at almost the same time as ours.

