/news_analyzer

National news articles recommender and analyzer using LDA. DATA 515 spring 2018 group project.

Primary LanguagePythonMIT LicenseMIT

NARA: News Articles Recommender and Analyzer

Build Status Coverage Status

Background

Conventional news recommendation systems use a small set of keywords to identify the top recommended articles to users based on keywords frequency. We built upon this framework by enabling users to find recommended articles by providing an entire news article. The recommendation process uses article topics to evaluate which articles to recommended.

Our system highlights topic insights through two different models: an unguided LDA for identifying topic recommended articles and a guided LDA with seed words from NYTimes to show interpretable topics. Our system also shows sentiment information and word cloud that summarize the query article for the user.

UIScreenshot

NARA does 3 things:

  • Recommends news articles based on topic relevance
  • Analyzes sentiment of news articles
  • Visualizes news articles

How does it work?

  • Takes user input query article or keyword/phrases from UI
  • Recommends relevant articles from corpus using LDA
  • Lists relevant topics for query article
  • Presents sentiment analysis based on number of positive/negative/neutral sentences in input
  • Visualizes query article using word cloud

For more details, please see: FunctionalDesign.md

Data

All-the-news Kaggle dataset

It consists of over 140,000 articles from 15 US national publishers between 2015 - 2017. The distribution of the publishers is shown below:

publishers distribution

The New York Times seed words dataset

We used labelled article information from the New York Times to seed words for the guided LDA. We aggregated and analyzed the article titles, summaries and categories from a series of days to generate the list of seed words.

Installation

  1. First clone the repo in your local directory. Then in the repo root directory, run the set up file to install the dependencies:
pip install -r requirements.txt
  1. Now, run the setup.py file as following to download other data dependencies:
python setup.py build 

Currently NARA only supports Python 3.4 or newer.

Demo Example

Because the full news articles corpus and resulting topic models are too large to be stored in the repo, we provide a smaller corpus that is subset of the full corpus. This demo section goes over how to run Now to run the user interface to start using NARA.

  1. Follow the installations steps above

  2. Run the Dash UI

python path_to_libraries/user_interface.py
  1. Copy the url that shows up in your command line to your browser to start the UI.

  2. Enter any sample article from the examples folder or of your choice

Refer to the NARA_UserGuide.pdf in the examples directory for more detailed information.

Full Build

The repo includes files for a subset of the Kaggle data. To download and build the topic models based on the full dataset you'll need to register for a Kaggle account and follow instructions in this link under the API Credentials section: https://github.com/Kaggle/kaggle-api. Once Kaggle credentials are set up, simply run the build_resources.py script as follows from the root directory:

python <path to root folder>/news_analyzer/scripts/build_resources.py --download

The build script will automatically download the csv files using Kaggle's API, and build the topic models. Once complete, you can simply run the UI as shown above in the demo example.

The full build using the entire corpus of 140,000 articles is quite computationally expensive. It is recommended that the build is performed on a cloud machine with at least 48GB of memory. The full build took about ~24 hours on compute optimized AWS machine with 60GB memory.

If the data has already been downloaded and you would like rebuild the topic model with different settings, you can run the build_resources.py script without the --download flag.

Component Design

ComponentDesignFlowChart
For more details, please see: ComponentDesign.md

Directory Structure

news_analyzer
├── LICENSE
├── README.md
├── doc
│   ├── ComponentDesign.md
│   ├── Final_presentation.pptx
│   ├── FunctionalDesign.md
│   ├── WeReadTheNews.pptx
│   ├── mockup.png
│   ├── ui_screenshot.png
│   └── news-nlp-flowchart-2.png
├── examples
│   ├── NARA_UserGuide.pdf
│   ├── example_article1.txt
│   ├── example_article2.txt
│   └── example_article3.txt
├── news_analyzer
│   ├── __init__.py
│   ├── libraries
│   │   ├── __init__.py
│   │   ├── article_recommender.py
│   │   ├── configs.py
│   │   ├── handler.py
│   │   ├── nytimes_article_retriever.py
│   │   ├── sentiment_analyzer.py
│   │   ├── temp_wordcloud.png
│   │   ├── text_processing.py
│   │   ├── topic_modeling.py
│   │   ├── user_interface.py
│   │   └── word_cloud_generator.py
│   ├── resources
│   │   ├── articles.csv
│   │   ├── guidedlda_model.pkl
│   │   ├── nytimes_data
│   │   │   ├── NYtimes_data_20180507.csv
│   │   │   ├── NYtimes_data_20180508.csv
│   │   │   ├── NYtimes_data_20180509.csv
│   │   │   ├── NYtimes_data_20180513.csv
│   │   │   ├── NYtimes_data_20180515.csv
│   │   │   ├── NYtimes_data_20180527.csv
│   │   │   ├── NYtimes_data_20180528.csv
│   │   │   ├── NYtimes_data_20180529.csv
│   │   │   ├── NYtimes_data_20180603.csv
│   │   │   └── aggregated_seed_words.txt
│   │   ├── preprocessor.pkl
│   │   └── unguidedlda_model.pkl
│   ├── scripts
│   │   └── build_resources.py
│   └── tests
│       ├── __init__.py
│       ├── test_article_recommender.py
│       ├── test_handler.py
│       ├── test_nytimes_article_retriever.py
│       ├── test_preprocessing.py
│       ├── test_resources
│       │   ├── test_dtm.pkl
│       │   └── test_topics_raw.pkl
│       ├── test_sentiment_analyzer.py
│       ├── test_topic_modeling.py
│       ├── test_user_interface.py
│       └── test_word_cloud_generator.py
├── requirements.txt
└── setup.py

Team:WeReadTheNews

MS Data Science, University of Washington
DATA 515 Software Design for Data Science (Spring 2018)

Team members:

References: