Primary language: Python. License: Apache-2.0.

overdosed 0.1

What linguistic features are unique to discussions of nonmedical substance use?

Background

Social media (Twitter, Facebook, websites like CrazyMeds) can show how the general population uses substances for nonmedical purposes. Social media may, in fact, provide a more accurate picture of usage than data from surveys or emergency rooms. Surveys ask a small sample of the population to recall activities that are sometimes illicit and to report them to a federal authority under the promise of anonymity. Emergency rooms see only the part of the story in which substance use goes wrong.

Methodology

overdosed 0.1 uses latent semantic analysis to identify the words or phrases that distinguish tweets discussing substance use from other tweets. There are two phases:

Phase 1

  1. Sample two streams from Twitter gardenhose (1% sampler).
    Stream 1: Unfiltered.
    Stream 2: Filtered for keywords describing the substance of interest.

  2. Develop the classifier.
    Sensitive (rule-in) component: Identify words present in both streams.
    Specific (rule-out) component: Identify words present in filtered stream but not unfiltered stream. (Filtered stream - unfiltered stream)

  3. Analyze the classifier.
    Identify groups of semantically related words in the rule-in component.
    Do the same for the rule-out component (i.e., build a taxonomy of each).

  4. Test the classifier.
    Curate new samples from the two streams.
    Adjust which words must be present in or absent from a tweet until the classifier reaches an acceptable sensitivity and specificity.
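
Steps 2 and 4 of Phase 1 can be sketched with plain set operations. This is a minimal illustration, not the project's implementation. The whitespace tokenizer, the function names, the sample tweets, and the decision rule (flag a tweet that contains at least `min_hits` words from the specific component) are all assumptions made for the example.

```python
def tokenize(tweet):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(tweet.lower().split())


def build_components(unfiltered, filtered):
    """Return (sensitive, specific) word sets from two lists of tweets."""
    unfiltered_vocab = set().union(*(tokenize(t) for t in unfiltered))
    filtered_vocab = set().union(*(tokenize(t) for t in filtered))
    sensitive = filtered_vocab & unfiltered_vocab  # words present in both streams
    specific = filtered_vocab - unfiltered_vocab   # filtered stream - unfiltered stream
    return sensitive, specific


def classify(tweet, specific, min_hits=1):
    """Assumed rule: flag a tweet containing at least min_hits specific words."""
    return len(tokenize(tweet) & specific) >= min_hits


def sensitivity_specificity(labeled, specific, min_hits=1):
    """labeled: (tweet, is_about_substance) pairs curated by hand."""
    tp = fp = tn = fn = 0
    for tweet, truth in labeled:
        predicted = classify(tweet, specific, min_hits)
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical samples standing in for the two Twitter streams.
unfiltered = ["coffee with friends this morning", "took the train home last night"]
filtered = ["took two pills last night", "crushed pills with friends"]
sensitive, specific = build_components(unfiltered, filtered)

# Hypothetical hand-labeled test set (step 4).
labeled = [
    ("pills again tonight", True),
    ("long night at work", False),
    ("feeling great today", True),
    ("two coffees please", False),
]
loose = sensitivity_specificity(labeled, specific, min_hits=1)
strict = sensitivity_specificity(labeled, specific, min_hits=2)
```

Raising `min_hits` trades sensitivity for specificity, which is the kind of adjustment step 4 describes. A real classifier would need a tokenizer that handles hashtags, mentions, and punctuation.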

Phase 2

  1. Sample the unfiltered Twitter gardenhose (1% sampler)
    Valid sample statistics cannot be calculated if the streams are combined.

  2. Partition the unfiltered Twitter stream into:
    All tweets discussing use of the substance
    All other tweets

  3. Calculate the relative abundance of each component of the metadata, e.g.
    Are the geographic distributions the same?
    What latent attributes differ?
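
The Phase 2 partition and comparison might look like the sketch below. It assumes each tweet is a dictionary with a `text` key and a hypothetical `country` metadata field. The keyword predicate is a stand-in for the Phase 1 classifier, and the sample data is invented for illustration.

```python
from collections import Counter


def partition(tweets, predicate):
    """Split tweets into (matching, other) lists using a classifier predicate."""
    matching, other = [], []
    for tweet in tweets:
        (matching if predicate(tweet) else other).append(tweet)
    return matching, other


def relative_abundance(tweets, field):
    """Fraction of tweets taking each value of one metadata field."""
    counts = Counter(t.get(field, "unknown") for t in tweets)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


# Hypothetical sample; real tweets carry far richer metadata.
sample = [
    {"text": "two pills tonight", "country": "US"},
    {"text": "pills and regret", "country": "US"},
    {"text": "crushed pills", "country": "CA"},
    {"text": "morning coffee", "country": "US"},
    {"text": "train delayed again", "country": "GB"},
]

# Stand-in predicate: a single keyword instead of the trained classifier.
substance, other = partition(sample, lambda t: "pills" in t["text"].split())
substance_geo = relative_abundance(substance, "country")
other_geo = relative_abundance(other, "country")
```

Comparing `substance_geo` with `other_geo` answers the geographic question for one field. With realistic sample sizes, a contingency-table test such as `scipy.stats.chi2_contingency` could indicate whether the two distributions differ by more than chance.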

Quickstart

 git clone https://github.com/mac389/overdosed.git
 cd overdosed
 sh setup.sh

Dependencies

  1. Tweepy (3.3.0)
  2. Gensim (0.10.3)
  3. Seaborn (0.6dev, for visualization, also requires pandas)
  4. NumPy (1.9.1)
  5. Matplotlib
  6. SciPy