What linguistic features are unique to discussions of nonmedical substance use?
Social media (Twitter, Facebook, websites like CrazyMeds) can provide us with information on how the general population uses substances for nonmedical purposes. Social media may, in fact, provide a more accurate picture of usage than data from surveys or emergency rooms. Surveys ask a small sample of the population to remember (sometimes illicit) activities and report them to a federal authority under the promise of anonymity. Emergency rooms see only the part of the story where substance use goes wrong.
overdosed 0.1 uses latent semantic analysis to identify the words or phrases that distinguish tweets discussing the use of substances from other tweets. There are two phases:
Phase 1
- Sample two streams from Twitter gardenhose (1% sampler).
  - Stream 1: Unfiltered.
  - Stream 2: Filtered for keywords describing substance of interest.
- Develop the classifier.
  - Sensitive (rule-in) component: Identify words present in both streams.
  - Specific (rule-out) component: Identify words present in filtered stream but not unfiltered stream (filtered stream - unfiltered stream).
- Analyze the classifier.
  - Identify groups of semantically related words in the rule-in component.
  - Same for rule-out component (i.e., taxonomize).
- Test the classifier.
  - Curate new samples from the two streams.
  - Adjust the words needed to be present or absent in a tweet to achieve an acceptable sensitivity and specificity.
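
The word-set steps of Phase 1 can be sketched roughly as below. This is a minimal illustration, not the code in this repository: the helper names (`tokenize`, `build_components`, `classify`, `sensitivity_specificity`), the tokenization rule, the `required`/`excluded` word sets, and the toy tweets are all assumptions made up for the example.

```python
import re


def tokenize(tweet):
    """Lowercase a tweet and return its set of word tokens (assumed tokenizer)."""
    return set(re.findall(r"[a-z']+", tweet.lower()))


def vocabulary(tweets):
    """Union of the tokens seen in a collection of tweets."""
    vocab = set()
    for tweet in tweets:
        vocab |= tokenize(tweet)
    return vocab


def build_components(filtered, unfiltered):
    """Return (rule_in, rule_out) word sets as defined in Phase 1."""
    f, u = vocabulary(filtered), vocabulary(unfiltered)
    rule_in = f & u    # words present in both streams
    rule_out = f - u   # filtered stream minus unfiltered stream
    return rule_in, rule_out


def classify(tweet, required, excluded):
    """Hypothetical rule: flag a tweet that has a required word and no excluded word."""
    words = tokenize(tweet)
    return bool(words & required) and not (words & excluded)


def sensitivity_specificity(predicted, actual):
    """Compute (sensitivity, specificity) from parallel lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy data, invented for illustration only.
filtered = ["took two xanax last night", "xanax bars hit hard"]
unfiltered = ["last night was great", "work is hard today"]
rule_in, rule_out = build_components(filtered, unfiltered)

# Hand-labelled test sample: (tweet, discusses nonmedical use?)
labelled = [
    ("took two xanax last night", True),
    ("my doctor renewed my xanax prescription", False),
    ("popped a bar before class", True),
    ("work is hard today", False),
]
predicted = [classify(t, {"xanax"}, {"prescription"}) for t, _ in labelled]
actual = [label for _, label in labelled]
sens, spec = sensitivity_specificity(predicted, actual)
```

Adjusting `required` and `excluded` and recomputing the two numbers on a fresh labelled sample corresponds to the "Test the classifier" step; how the rule-in and rule-out components map onto those two sets is left open here.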
Phase 2
- Sample the unfiltered Twitter gardenhose (1% sampler).
  - Cannot calculate valid sample statistics if you combine streams.
- Partition the unfiltered Twitter stream into
  - All tweets discussing use of the substance
  - All other tweets
- Calculate the relative abundance of each component of the metadata, e.g.
  - Are the geographic distributions the same?
  - What latent attributes differ?
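
A rough sketch of the Phase 2 partition and relative-abundance calculation follows. It assumes each tweet is a dictionary with `text` and `region` keys and that a classifier function is already available; those field names, the `partition` and `relative_abundance` helpers, and the sample tweets are hypothetical and chosen only for the example.

```python
from collections import Counter


def partition(tweets, is_substance_tweet):
    """Split tweets into (substance, other) using a classifier function."""
    substance, other = [], []
    for tweet in tweets:
        (substance if is_substance_tweet(tweet) else other).append(tweet)
    return substance, other


def relative_abundance(tweets, field):
    """Fraction of tweets taking each value of a metadata field."""
    counts = Counter(tweet.get(field, "unknown") for tweet in tweets)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


# Toy unfiltered sample, invented for illustration only.
sample = [
    {"text": "took two xanax last night", "region": "NY"},
    {"text": "xanax bars hit hard", "region": "CA"},
    {"text": "nice weather today", "region": "CA"},
    {"text": "go team", "region": "CA"},
]

# Stand-in classifier; in practice this would be the Phase 1 classifier.
substance, other = partition(sample, lambda t: "xanax" in t["text"])

geo_substance = relative_abundance(substance, "region")
geo_other = relative_abundance(other, "region")
```

Comparing `geo_substance` with `geo_other` is one way to ask whether the geographic distributions are the same; the same function can be pointed at any other metadata field.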
Installation
git clone https://github.com/mac389/overdosed.git
cd overdosed
sh setup.sh
Dependencies
- Tweepy (3.3.0)
- Gensim (0.10.3)
- Seaborn (0.6dev, for visualization, also requires pandas)
- NumPy (1.9.1)
- Matplotlib
- SciPy