jieliang/analyzing_presidency_similarity

Python script analyzing 130K presidential documents to calculate presidency similarity in terms of most covered topics.

Project Description Slides

analyzing_presidency_similarity.pdf

Data Acquisition

web_scrape_american_presidency_raw_text.ipynb

Retrieves presidential documents from the American Presidency Project in units of 1,000 documents, then saves each unit to a pickle file.
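As a rough illustration of the "units of 1,000 docs, one pickle file per unit" idea, here is a minimal sketch. The function name `save_in_units`, the file naming pattern and the in-memory list of documents are assumptions made for this example; the notebook itself may fetch and store pages differently.

```python
import pickle
from pathlib import Path


def save_in_units(docs, out_dir, unit_size=1000):
    """Split docs into consecutive units and pickle each unit to its own file.

    Hypothetical helper: the name and the file naming scheme are illustrative.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(docs), unit_size):
        unit = docs[start:start + unit_size]
        # One file per unit, numbered in order (docs_0000.pkl, docs_0001.pkl, ...)
        path = out_dir / f"docs_{start // unit_size:04d}.pkl"
        with path.open("wb") as fh:
            pickle.dump(unit, fh)
        paths.append(path)
    return paths
```

Saving in fixed-size units means an interrupted scrape only loses the unit in progress, and later steps can load one file at a time instead of the whole corpus.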

Data Processing

process_data_raw_text_to_DB.ipynb

Extracts the date, title, author and text from each web page scraped by web_scrape_american_presidency_raw_text.ipynb and inserts each record into MongoDB.

process_data_stem_text_to_DB.ipynb
1. retrieve records of presidential docs from database
2. clean and tokenize raw text
3. insert processed text along with date, author and title as new records into a new collection in database
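The cleaning and tokenizing step could look roughly like the sketch below. It uses only the standard library; the stop word list, the minimum token length and the record field names (`date`, `author`, `title`, `text`, `tokens`) are assumptions for illustration. The notebook name suggests stemming is also applied, which this sketch leaves out.

```python
import re

# Tiny illustrative stop word list; a real pipeline would use a fuller one.
STOPWORDS = frozenset({"the", "and", "for", "that", "with", "this", "are", "was"})


def clean_and_tokenize(text):
    """Lowercase the text, keep alphabetic runs, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) > 2 and t not in STOPWORDS]


def process_record(raw):
    """Build the processed record, carrying date, author and title over unchanged."""
    return {
        "date": raw["date"],
        "author": raw["author"],
        "title": raw["title"],
        "tokens": clean_and_tokenize(raw["text"]),
    }
```

Each dictionary returned by `process_record` would then be inserted into the new collection as one document.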

Data Transformation

transform_data_tfidf_topic_modelers.ipynb
1. transform tokenized text to a word matrix using a TF-IDF vectorizer
2. build 3 topic modelers: LSA, LDA and NMF
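The two steps above can be sketched with scikit-learn, assuming that is the library behind the "tfidf vectorizer" (the README does not say). The toy documents, the number of topics and all parameter values are placeholders, and LSA is implemented here with `TruncatedSVD`.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for tokenized documents joined back into strings.
docs = [
    "taxes budget economy jobs growth",
    "budget deficit taxes spending economy",
    "war troops peace treaty allies",
    "treaty allies peace security war",
]

# Step 1: documents -> TF-IDF weighted document-term matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Step 2: three topic models fitted on the same matrix.
n_topics = 2  # assumed value for the toy corpus
models = {
    "lsa": TruncatedSVD(n_components=n_topics, random_state=0),
    "lda": LatentDirichletAllocation(n_components=n_topics, random_state=0),
    "nmf": NMF(n_components=n_topics, init="nndsvda", max_iter=500, random_state=0),
}

# Each entry is a (n_documents, n_topics) matrix of topic weights per document.
doc_topics = {name: model.fit_transform(X) for name, model in models.items()}
```

LDA is usually fitted on raw term counts rather than TF-IDF weights; it is fitted on the TF-IDF matrix here only to mirror the order of the steps listed above.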

Data Analysis

data_analysis_presidency_similarity_and_clusters.ipynb
1. given an author, find the most similar presidents in terms of cosine similarity
2. group presidents into clusters
3. visualize the results
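To make the similarity lookup in step 1 concrete, here is a small standard-library sketch. The per-president topic vectors and the placeholder names are invented for illustration and are not results from the project. How the notebook aggregates documents into one vector per president is not described in this README.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(author, vectors, top_n=3):
    """Rank every other author by cosine similarity to the given author."""
    target = vectors[author]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != author
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical topic weights per president (placeholder names and numbers).
topic_vectors = {
    "president_a": [0.9, 0.1, 0.0],
    "president_b": [0.8, 0.2, 0.0],
    "president_c": [0.0, 0.1, 0.9],
}
ranking = most_similar("president_a", topic_vectors, top_n=2)
```

In this toy data `president_b` ranks first for `president_a`, because their weights concentrate on the same topic.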

Flask directory

Files for creating a web-based Flask app that takes user input and returns a list of similar presidents.
Files in flask/data/ are not included in this submission because their size exceeds the GitHub limit.
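A Flask app of this kind might be structured like the sketch below. The route `/similar`, the query parameter `president` and the in-memory `SIMILAR` lookup are all hypothetical; the real app presumably loads its results from the files in flask/data/.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical precomputed lookup; placeholder names, not project results.
SIMILAR = {
    "president_a": ["president_b", "president_c"],
}


@app.route("/similar")
def similar():
    """Return the presidents most similar to the one named in the query string."""
    name = request.args.get("president", "")
    if name not in SIMILAR:
        return jsonify(error=f"unknown president: {name}"), 404
    return jsonify(president=name, similar=SIMILAR[name])
```

With the development server running, a request such as `/similar?president=president_a` would return the stored list as JSON.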

Data directory

Files in this directory are not included in this submission because their size exceeds the GitHub limit:
1. scraped presidential documents from data sources
2. saved pickle files