Topic analysis of US Presidents' State of the Union Addresses and Messages
The text is scraped from: The American Presidency Project
Inspiration for this project: Topic Modeling the State of the Union: TV and Partisanship
State of the Union messages to Congress are mandated by the US Constitution. In modern times the message is delivered orally to a joint session of Congress, but in the past the State of the Union was a written report sent to Congress to coincide with a new session of Congress.
In the texts considered here, Nixon submitted multiple documents or gave both oral and written messages. Roosevelt's last (1945) and Eisenhower's fourth (1956) were technically written messages, although both presidents also addressed the American people via radio.
The text is scraped using urllib2 and BeautifulSoup from this site: The American Presidency Project
To avoid scraping the site too often, the scraped texts are stored in documents_raw.pkl using pickle.
See Scrape.ipynb for the code doing the scraping.
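The caching idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from Scrape.ipynb: the helper name load_or_scrape, the stand-in fake_scrape, and the assumption that the documents are held in a dict are all hypothetical.

```python
import os
import pickle


def load_or_scrape(cache_path, scrape_fn):
    """Return the cached documents if the pickle exists, else scrape and cache them."""
    if os.path.exists(cache_path):
        # Cache hit: read the previously scraped texts instead of hitting the site.
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    # Cache miss: run the (slow) scraper once and store its result.
    documents = scrape_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(documents, fh)
    return documents


def fake_scrape():
    # Hypothetical stand-in for the real scraper; the dict layout is an assumption.
    return {"1945 Roosevelt": "placeholder text"}
```

A call such as load_or_scrape("documents_raw.pkl", fake_scrape) would run the scraper only on the first invocation and read from the pickle afterwards. Deleting the file forces a fresh scrape.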
The text is imported from documents_raw.pkl and preprocessed. Preprocessing includes removing non-unicode characters, removing words that both start and end with non-letter characters ("1st" is ok, "123" is not), removing punctuation and stop words ("and", "won't"), and lemmatization.
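As a rough illustration of these filtering steps, here is a standard-library-only sketch. It is not the project's actual pipeline: the function names and the tiny stop-word set are hypothetical, reading "non-unicode" as "non-ASCII" is an assumption, only surrounding punctuation is stripped, and lemmatization is left out (a lemmatizer such as NLTK's WordNetLemmatizer could be applied to the returned tokens).

```python
import string

# Tiny illustrative stop-word set (assumption); a real run would use a fuller list.
STOP_WORDS = {"and", "the", "of", "to", "a", "in", "won't"}


def keep_token(token):
    """Keep a token unless it both starts and ends with a non-letter ("123")."""
    return token[0].isalpha() or token[-1].isalpha()


def preprocess(text):
    """Return a list of cleaned, lower-cased tokens (no lemmatization)."""
    # Assumption: "non-unicode characters" is read here as non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    tokens = []
    for raw in text.lower().split():
        # Strip punctuation around the token; inner apostrophes ("won't") survive.
        token = raw.strip(string.punctuation)
        if not token or token in STOP_WORDS:
            continue
        if keep_token(token):
            tokens.append(token)
    return tokens
```

For example, preprocess("The 1st Congress won't meet in 123 days, and adjourns.") keeps "1st" but drops "123" along with the stop words.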
After that, Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) are applied. The topics and the analysis are plotted using pyLDAvis and WordCloud.
Install the dependencies inside a virtual environment, or add the --user flag to the pip command:
pip3 install -r requirements.txt
Also download NLTK data with a command similar to the following (more details on www.nltk.org):
python -m nltk.downloader -d /usr/local/share/nltk_data all