Presidentspeech

Topic analysis on US Presidents' State of the Union Addresses and Messages

Primary language: Python. License: MIT.

The text is scraped from: The American Presidency Project

Inspiration for this project: Topic Modeling the State of the Union: TV and Partisanship


US Presidents' State of the Union Addresses and Messages

State of the Union messages to Congress are mandated by the US Constitution. In modern times the message is delivered orally to a joint session of Congress, but in earlier periods the State of the Union was a written report sent to Congress to coincide with a new session of Congress.

In the texts considered here, Nixon submitted multiple documents or gave both oral and written messages. Roosevelt's last (1945) and Eisenhower's 4th (1956) were technically written messages, although both presidents also addressed the American people via radio.

Scraping

The text is scraped using urllib2 and BeautifulSoup from this site: The American Presidency Project

To avoid scraping the site more often than necessary, the scraped texts are stored in documents_raw.pkl using Python's pickle module.
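The caching idea can be sketched as follows. This is a minimal illustration, not the code from Scrape.ipynb: the helper name load_or_scrape and the scrape_fn argument are hypothetical, and the sketch assumes the scraped documents are a picklable Python object.

```python
import os
import pickle

CACHE_PATH = "documents_raw.pkl"  # file name taken from this README


def load_or_scrape(scrape_fn, cache_path=CACHE_PATH):
    """Return cached documents if the pickle exists, else scrape and cache.

    scrape_fn is a hypothetical zero-argument callable that performs the
    actual scraping and returns the documents.
    """
    if os.path.exists(cache_path):
        # Cache hit: read the stored documents instead of hitting the site.
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Cache miss: scrape once, then store the result for later runs.
    documents = scrape_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(documents, f)
    return documents
```

Deleting documents_raw.pkl forces a fresh scrape on the next call. Pickle files should only be loaded from sources you trust.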

See Scrape.ipynb for the code doing the scraping.


Topic analysis

The text is imported from documents_raw.pkl and preprocessed. Preprocessing includes removing non-unicode characters, removing words that both start and end with non-letter characters ("1st" is kept, "123" is not), removing punctuation and stop words ("and", "won't"), and lemmatization.
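The token-filtering rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's code: the function name preprocess and the tiny STOP_WORDS set are made up for the example (the project presumably relies on a full stop-word list such as NLTK's), and the non-unicode cleanup and lemmatization steps are left out.

```python
import string

# Tiny illustrative stop-word list (assumption, not the project's list).
STOP_WORDS = {"and", "the", "of", "to", "a", "in", "is", "won't"}


def preprocess(text):
    """Lowercase, drop stop words and number-like tokens, strip punctuation."""
    tokens = []
    for raw in text.lower().split():
        # Trim punctuation at the edges so "strong." matches "strong".
        word = raw.strip(string.punctuation)
        if not word or word in STOP_WORDS:
            continue
        # Drop tokens that both start and end with a non-letter:
        # "1st" ends with a letter and is kept, "123" is dropped.
        if not (word[0].isalpha() or word[-1].isalpha()):
            continue
        # Remove any punctuation left inside the token.
        word = "".join(ch for ch in word if ch not in string.punctuation)
        tokens.append(word)
    return tokens


sample = "The 1st session of Congress won't meet in 1790, and the Union is strong."
print(preprocess(sample))
```

Stop words are checked before inner punctuation is removed, so that a contraction such as "won't" still matches its entry in the list.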

After that, Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) are applied. The topics and the analysis are plotted using pyLDAvis and WordCloud.

See Presidentspeech.ipynb

Requirements

Install the dependencies inside a virtual environment, or pass the --user flag to pip3 install.

pip3 install -r requirements.txt

Also download NLTK data with a command similar to the following (more details on www.nltk.org):

python -m nltk.downloader -d /usr/local/share/nltk_data all