Persona Classification

Medical social media is the subset of social media restricted to health-care topics. Different kinds of people (personae) contribute to it, such as patients, caretakers, consultants, pharmacists, medical researchers, and medical journalists. The problem at hand: given a blog post as input, the system should return which persona wrote that post.

Dependencies

This code is written in Python. To use it you will need:

Getting started

You will first need to download the model files, word embeddings, and blog post data (see below). The embedding files (utable and btable) are quite large (>2GB), so make sure enough disk space is available. The encoder vocabulary can be found in dictionary.txt.

Dataset

Instructions (before running the code)

  • The Google News vectors file and the GloVe vectors files need to be in the same directory as the code
  • All the files mentioned in the "Getting started" section also need to be in the same directory as the code
  • Use the config file in the skip-thoughts-master directory and modify the base path of the Persona_Classification directory
  • Add the path of the skip-thoughts-master directory in your PYTHONPATH environment variable
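The setup steps above can be checked from Python before running anything. The following is a minimal sketch, not part of the repository: the helper names `missing_files` and `add_to_module_path` are hypothetical, and the list of required file names is left to the caller because the exact names of the downloaded files are not spelled out here. Adding a directory to `sys.path` only affects the current process; it is an in-process alternative to setting PYTHONPATH, not a replacement for it.

```python
import os
import sys


def missing_files(code_dir, required_names):
    """Return the names in required_names that are not files inside code_dir."""
    return [name for name in required_names
            if not os.path.isfile(os.path.join(code_dir, name))]


def add_to_module_path(directory):
    """Put directory on sys.path for this process only (hypothetical helper)."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


if __name__ == "__main__":
    # dictionary.txt is named in "Getting started"; extend the list with the
    # other downloaded files once their exact names are known.
    print("missing:", missing_files(os.getcwd(), ["dictionary.txt"]))
```

Calling `add_to_module_path("skip-thoughts-master")` at the top of a script would make that directory importable without touching the shell environment, assuming the directory sits next to the script.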

To run the code

  • python downsample.py -> creates the document label files
  • python create_word_embeddings.py -> creates document embeddings for our corpus using averaged word embeddings
  • python create_sentence_embeddings.py -> creates document embeddings for our corpus using averaged sentence embeddings from the pre-trained skip-thoughts module
  • python train_skipthought.py -> trains skip-thoughts on the GloVe/Google News vectors for our sentence corpus
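
To illustrate what "averaged word embeddings" means in the steps above, here is a minimal sketch. It is not the code in create_word_embeddings.py: the function names, the whitespace tokenization, the choice to skip out-of-vocabulary words, and the toy two-dimensional vectors are all assumptions made for illustration. Plain Python lists are used to keep the sketch dependency-free.

```python
def average_vectors(vectors, dim):
    """Element-wise mean of equal-length vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def document_embedding(tokens, word_vectors, dim):
    """Average the vectors of tokens found in word_vectors.

    Tokens without a vector are skipped (an assumption of this sketch).
    """
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    return average_vectors(found, dim)


# Toy vectors standing in for the real pre-trained embeddings.
toy_vectors = {"doctor": [1.0, 0.0], "patient": [0.0, 1.0]}
tokens = "the patient saw the doctor".split()
print(document_embedding(tokens, toy_vectors, dim=2))  # prints [0.5, 0.5]
```

The same `average_vectors` helper would apply to the sentence-level case: encode each sentence of a post into a vector, then average those vectors to get one embedding per document.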