- analyzing_presidency_similarity.pdf
- web_scrape_american_presidency_raw_text.ipynb
retrieves presidential documents from the American Presidency Project in units of 1000 docs, then saves each unit in a pickle file
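
The chunked saving described above could look roughly like the following. This is a minimal sketch, not the notebook's code: the function names `save_in_units` and `load_unit` and the `docs_00000.pkl` file naming scheme are assumptions made for illustration, and the documents are assumed to be already downloaded into a Python list.

```python
import pickle
from pathlib import Path


def save_in_units(docs, out_dir, unit_size=1000):
    """Write docs to pickle files holding at most unit_size documents each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(docs), unit_size):
        unit = docs[start:start + unit_size]
        # Hypothetical naming scheme: docs_00000.pkl, docs_00001.pkl, ...
        path = out_dir / f"docs_{start // unit_size:05d}.pkl"
        with path.open("wb") as fh:
            pickle.dump(unit, fh)
        paths.append(path)
    return paths


def load_unit(path):
    """Read one pickled unit back into a list of documents."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)
```

Saving in fixed-size units keeps each file small and lets an interrupted scrape resume from the last completed unit rather than from the beginning.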
- process_data_raw_text_to_DB.ipynb
extracts date, title, author and text from each web page scraped by web_scrape_american_presidency_raw_text.ipynb and inserts the record into MongoDB
- process_data_stem_text_to_DB.ipynb
- retrieve records of presidential docs from database
- clean and tokenize raw text
- insert processed text along with date, author and title as new records into a new collection in database
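
The clean-and-tokenize step above can be sketched with the standard library alone. This is an assumption-laden illustration: the stop-word list is a tiny placeholder, `clean_and_tokenize` is a hypothetical name, and the stemming implied by the notebook's filename is not shown here (it would be an extra pass over the returned tokens).

```python
import re

# Small illustrative stop-word list (an assumption; not the notebook's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}

# Keep runs of ASCII letters only; digits and punctuation are dropped.
TOKEN_RE = re.compile(r"[a-z]+")


def clean_and_tokenize(raw_text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    tokens = TOKEN_RE.findall(raw_text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `clean_and_tokenize("The Union of the States, in 1861!")` returns `["union", "states"]`.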
- transform_data_tfidf_topic_modelers.ipynb
- transform tokenized text to word matrix using tfidf vectorizer
- build 3 topic modelers: LSA, LDA and NMF
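
To make the TF-IDF step concrete, here is a hand-rolled sketch of the weighting in plain Python. It is not the vectorizer used in the notebook; the function name `tfidf_matrix` is hypothetical, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization is one common convention, assumed here for illustration. The topic modelers are not shown.

```python
import math
from collections import Counter


def tfidf_matrix(tokenized_docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed inverse document frequency (assumed convention).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocab, rows
```

Terms that appear in fewer documents receive a larger weight, so in a corpus of `[["war", "union"], ["union", "tax"]]` the first row weights `war` more heavily than `union`.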
- data_analysis_presidency_similarity_and_clusters.ipynb
- given author, find most similar presidents in terms of cosine similarity
- group presidents in clusters
- visualization
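
The similarity lookup above can be illustrated as follows. This is a minimal sketch, assuming each author is already represented by one numeric vector (for instance an averaged document vector); `most_similar`, `top_n` and the placeholder labels in the usage note are hypothetical and not taken from the notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(author, vectors, top_n=3):
    """Rank every other author by cosine similarity to the given author."""
    target = vectors[author]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != author
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]
```

With placeholder vectors `{"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}`, calling `most_similar("A", vectors)` ranks `B` ahead of `C`, because `B` points in nearly the same direction as `A` while `C` is orthogonal to it.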
- files for creating a web-based Flask app that takes user input and returns a list of similar presidents
- files in flask/data/ not included in submission because their size exceeds the GitHub limit
- files in this directory not included in submission because their size exceeds the GitHub limit
- scraped presidential documents from data sources
- saved pickle files