This repository contains the code and documentation for the News Correlation Analysis project. The project aims to perform exploratory data analysis (EDA), statistical analysis, sentiment analysis, topic modeling, and more on a dataset comprising news articles from various sources.
It explores the relationships among news websites and quantifies their characteristics through comparison. Analyzing different aspects and features of these websites can yield meaningful insights.
The gathered data, including news articles, domain locations, and website traffic data, is compared and analyzed from several angles, such as the sentiment of article titles, traffic ranks, and domain locations.
Quantifying these characteristics and comparing them across news websites can reveal patterns and correlations. This analysis may provide insights into how factors such as sentiment, traffic, and geographic location contribute to the success or popularity of news websites.
- Introduction
- How to get started
- Project Structure
- Task Overview
- Task 1: Project Setup and EDA
- Task 2: Data Science Component Building
- Task 3: PostgreSQL
- Task 4: Dashboards
- Contributions
- License
This project is centered around the analysis of news data to reveal valuable insights including top news websites, traffic patterns, sentiment analysis, topic modeling, and other pertinent aspects. The dataset encompasses details about news articles such as their source, author, content, sentiment, and publication date. Through the application of diverse data science methodologies, the goal is to extract actionable insights that provide a deeper understanding of the news landscape.
This section will guide you through cloning the repository and setting up your development environment.
Prerequisite: Git must be installed on your system. You can download and install Git from https://git-scm.com/downloads
- Open a terminal window.
- Navigate to the directory on your local machine where you want to clone the repository. You can use the `cd` command to change directories.
- Clone the repository and move into it using the following commands:
  `git clone git@github.com:Betfsh/news_correlation_10ac_week0-.git`
  `cd news_correlation_10ac_week0`
- Creating a Virtual Environment
If Conda is your preferred package manager:
Open your terminal or command prompt.
Navigate to the project directory.
`cd path/to/news-correlation`
Run the following commands to create a new virtual environment.
```bash
conda create --name env_name python=3.8.10
```
Replace ```env_name``` with the desired name of the virtual environment and ```3.8.10``` with your preferred Python version.
Activate the virtual environment.
```bash
conda activate env_name
```
- Install the required dependencies: `pip install -r requirements.txt`
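After installing, it can be useful to confirm that the packages listed in `requirements.txt` are actually present in the active environment. The following is a minimal sketch, not part of the repository: the helper name `missing_requirements` is hypothetical, and it only handles simple requirement lines (a package name followed by an optional version specifier).

```python
import re
from importlib import metadata


def missing_requirements(lines):
    """Return the names of requirements in `lines` that are not installed.

    Hypothetical helper for illustration; handles only simple lines such as
    "pandas==1.5.3" and skips comments, blanks, and pip options.
    """
    missing = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blanks and pip options such as "-r other.txt"
        # Keep only the distribution name (drop extras, specifiers, markers).
        name = re.split(r"[\s<>=!~;\[@]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


# Example usage from the project root:
# with open("requirements.txt", encoding="utf-8") as fh:
#     print(missing_requirements(fh) or "All listed packages are installed.")
```

An empty list means every listed package was found; any names returned can be reinstalled with `pip install`.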
The project is structured as follows:
- .github
  - workflows
    - flake8_check.yml
    - unittests.yml
    - docstring_tests.yml
- .vscode
  - settings.json
- model
  - saved_model_weights.h5
- notebooks
  - news_correlation.ipynb
- screenshoots: screenshots of the Streamlit dashboard.
- src
  - csv_handler.py
  - database.py
  - loader.py
  - main.py
  - utils.py
- tests
  - `__init__.py`
- .gitignore
- README.md
- app.py
- config.json
- requirements.txt
The project is divided into multiple tasks:
- Project Setup and EDA: Set up a Python environment, perform exploratory data analysis, and answer specific questions about the data.
- Data Science Component Building: Develop components for machine learning operations (MLOps), conduct time series analysis, classification of headlines, topic modeling, sentiment analysis, and predictive modeling.
- PostgreSQL: Design database schema, load data into PostgreSQL, and utilize it for storing ML features.
- Dashboards: Design and implement a dashboard using Streamlit or React to visualize analysis results.
- Deployment: Deploy the project using GitHub Actions for continuous deployment, and configure environment variables and PostgreSQL database.
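To illustrate the PostgreSQL task listed above (design a schema, then load CSV data into it), here is a minimal sketch. It makes several assumptions: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server, the table and column names are hypothetical rather than the project's actual schema, and the sample rows are made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical schema; the project's real tables and columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    article_id   INTEGER PRIMARY KEY,
    source_name  TEXT NOT NULL,
    title        TEXT NOT NULL,
    published_at TEXT
)
"""


def load_articles(conn, csv_file):
    """Create the table if needed and insert every CSV row; return the row count."""
    conn.execute(SCHEMA)
    reader = csv.DictReader(csv_file)
    rows = [
        (int(r["article_id"]), r["source_name"], r["title"], r["published_at"])
        for r in reader
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO articles (article_id, source_name, title, published_at) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Made-up sample data standing in for the real CSV file.
sample_csv = io.StringIO(
    "article_id,source_name,title,published_at\n"
    "1,Example News,Markets rally,2023-10-01\n"
    "2,Example News,Storm warning issued,2023-10-02\n"
    "3,Sample Daily,Local team wins,2023-10-02\n"
)

conn = sqlite3.connect(":memory:")
loaded = load_articles(conn, sample_csv)
per_source = conn.execute(
    "SELECT source_name, COUNT(*) FROM articles "
    "GROUP BY source_name ORDER BY source_name"
).fetchall()
```

Against a real PostgreSQL instance, the same pattern would go through a PostgreSQL driver and its own placeholder syntax instead of `sqlite3`.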
- Git and GitHub Setup: Created a GitHub repository and set up version control.
- Python Environment Setup: Prepared a Python environment for data analysis.
- Exploratory Data Analysis (EDA): Analyzed the dataset to answer various questions about news articles, including top websites, traffic analysis, sentiment analysis, and more.
- Topic Modeling: Implemented topic modeling on news articles to uncover underlying themes.
- Sentiment Analysis: Conducted sentiment analysis on news article titles to understand public perception.
- Database Schema Design: Designed a schema for PostgreSQL to store ML features.
- Data Loading: Loaded data from CSV into PostgreSQL database for efficient storage and retrieval.
- Streamlit Dashboard: Designed and implemented a Streamlit dashboard to visualize EDA and model prediction results.
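As a rough illustration of the title-level sentiment analysis mentioned above, the sketch below labels a headline by counting words from two small word lists. This is a toy lexicon approach chosen so the example is self-contained; it is not the method used in the project, and the word lists and the function name `title_sentiment` are assumptions made for the example.

```python
import re

# Tiny illustrative word lists; a real analysis would use a full lexicon or a trained model.
POSITIVE = {"win", "wins", "growth", "record", "success", "rally"}
NEGATIVE = {"crash", "crisis", "loss", "fears", "war", "scandal"}


def title_sentiment(title):
    """Label a title 'positive', 'negative', or 'neutral' by counting lexicon hits."""
    words = re.findall(r"[a-z']+", title.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [
    title_sentiment(t)
    for t in (
        "Local team wins record title",
        "Markets crash amid crisis fears",
        "Council meets on Tuesday",
    )
]
```

Per-title labels like these can then be aggregated by source or domain to compare sentiment across websites.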
Contributions are welcome! Feel free to open issues or pull requests for any suggestions, bug fixes, or enhancements.
This project is licensed under the MIT License.