/distributed_multinomial_regression

Testing out Matt Taddy's distrom package on some real data.

Primary LanguageR

distributed_multinomial_regression

Testing out Taddy's distrom package on some real data.

Steps:

  • Pulled bios of moms on Twitter (~1M).
  • Created the DTM in Python Scikit-learn and exported in Market Matrix format (tried using tm & quanteda R packaes for this but they kept bombing out)
  • Read in DTM to R and run DMR. Word ~ f(state)
  • Plot over/under index terms by comparing the coef for each term & state: CA vs NY, CA vs CO etc. The regressions normalize each word to the same unit interval so we can compare relative popularity of words in each state, controlling for overall popularity, gender, etc.
  • Plan to extend this so I can isolate the time trend for each word, controlling for the changing composition of the underlying user base.