/RED

Retrieve dialogues from Reddit and generate an empathetic conversation dataset.

Reddit Empathetic Dialogue Dataset (RED Dataset)

Many people experience emotional distress caused by a significant life change, a financial crisis, caregiving responsibilities, or physical and mental health conditions. An inability to regulate emotion during such episodes can lead to self-destructive behavior such as substance abuse, self-harm or suicide. However, because of the public and personal stigma associated with mental health, most people do not reach out for help. Therapeutic consultations are also limited and are not available 24/7 to support people going through a traumatic episode. It is therefore important to assess whether AI-driven chatbots can help people deal with emotional distress and regulate their emotions. One major limitation in developing such a chatbot is the lack of a curated dialogue dataset containing emotional support. With this project, we aim to curate and analyse such a dataset, which has the potential to be used to train and evaluate a mental-health caregiving chatbot that supports people in emotional distress.

Dependencies

The code is implemented in Python 3. You will need the following dependencies installed:

Files Description

  • reddit-scrape-pushshift.ipynb: Scrapes Reddit text data using the Pushshift APIs.
  • preprocess.ipynb: Preprocesses the raw scraped conversation data and converts it into table-like data frames.
  • EDA.ipynb: Presents analyses and plots used to gain insights and find patterns in the data.
  • utils4text.py: Contains the supporting functions used in EDA.ipynb.
  • EmoBERT.ipynb: Predicts the emotion of each message in the dialogues. Before running it, make sure to load the checkpoints HERE.
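
As a rough illustration of the kind of paginated scraping that reddit-scrape-pushshift.ipynb performs, here is a minimal sketch. It is not the notebook's actual code. The endpoint and the `subreddit`, `size` and `before` parameters follow the historical public Pushshift API and are assumptions here (access to Pushshift may have changed since). The helpers `build_query` and `scrape` are hypothetical names, and `fetch` is any callable you supply that takes a URL and returns the decoded JSON response.

```python
from urllib.parse import urlencode

# Assumed endpoint, based on the historical public Pushshift API.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query(subreddit, before=None, size=100):
    """Build the request URL for one page of submissions (hypothetical helper)."""
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        # Only ask for posts created before this epoch timestamp.
        params["before"] = before
    return PUSHSHIFT_URL + "?" + urlencode(params)


def scrape(subreddit, fetch, max_pages=10):
    """Page backwards in time until a page is empty or max_pages is reached.

    `fetch` takes a URL and returns a dict with a "data" list of posts,
    each post carrying a "created_utc" timestamp (assumed response shape).
    """
    posts, before = [], None
    for _ in range(max_pages):
        page = fetch(build_query(subreddit, before)).get("data", [])
        if not page:
            break
        posts.extend(page)
        # The oldest post on this page becomes the cursor for the next request.
        before = min(post["created_utc"] for post in page)
    return posts
```

Passing `fetch` in as a parameter keeps the paging logic separate from the HTTP client, so it can be exercised offline with a stub and later wired to a real client, for example `lambda url: requests.get(url).json()` if the requests package is installed.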

Dataset

  • The dataset is stored in two folders, raw and dataset, on Google Drive.
  • Categories
    • raw: Raw data scraped via Pushshift can be found HERE.
    • dataset: Refined data after preprocessing can be found HERE.

Steps towards Results

  1. To scrape dialogues from Reddit, run the notebook reddit-scrape-pushshift.ipynb. Note that scraping subreddits such as r/depression, r/offmychest and r/suicidewatch can take several hours.
  2. Run preprocess.ipynb to transform the scraped JSON data into data frames.
  3. To explore the dialogues, see EDA.ipynb.
  4. Run EmoBERT.ipynb to predict the emotion of each utterance.
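
To make step 2 more concrete, the sketch below shows one way a scraped thread could be flattened into table-like rows, one row per dialogue turn. It is an illustration only and does not reproduce preprocess.ipynb. The input schema (`id`, `subreddit`, `author`, `title`, `selftext`, and a `comments` list with `author` and `body`) is an assumption about how a thread might be stored, and `thread_to_rows` is a hypothetical helper.

```python
# Placeholder bodies Reddit shows for comments that are no longer available.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def thread_to_rows(thread):
    """Flatten one thread (assumed schema) into a list of per-turn rows."""
    opening = (thread.get("title", "") + " " + thread.get("selftext", "")).strip()
    turns = [(thread.get("author", ""), opening)]
    for comment in thread.get("comments", []):
        body = comment.get("body", "").strip()
        # Skip empty or unavailable comments so turn numbers stay contiguous.
        if body and body not in PLACEHOLDERS:
            turns.append((comment.get("author", ""), body))
    return [
        {
            "conversation_id": thread["id"],
            "subreddit": thread.get("subreddit", ""),
            "turn": turn,
            "author": author,
            "text": text,
        }
        for turn, (author, text) in enumerate(turns)
    ]


example = {
    "id": "t1",
    "subreddit": "offmychest",
    "author": "alice",
    "title": "Rough week",
    "selftext": "I lost my job.",
    "comments": [
        {"author": "bob", "body": "I'm sorry to hear that."},
        {"author": "carol", "body": "[deleted]"},
        {"author": "alice", "body": "Thank you."},
    ],
}
rows = thread_to_rows(example)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to obtain a data frame with one column per key.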

References

  1. The Pushshift Reddit Dataset
  2. EmpatheticIntents

License

Licensed under the MIT License.