This News Scraper showcases my ability to build a bot for process automation.
This project automates the extraction of news articles from a news website using Robotic Process Automation (RPA). It uses the RPA framework and Selenium to drive the browser.
- Search for news articles based on a specified search phrase.
- Filter news articles by category, section, or topic.
- Extract data such as title, date, description, and picture URL for each news article.
- Store extracted data in an Excel file for further analysis or reporting.
- Download images associated with news articles and link them to the corresponding Excel entry.
- Count occurrences of the search phrase in the title and description of each news article.
- Identify if the title or description contains any monetary values.
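
The last two features are plain text checks, so they can be illustrated without a browser. The following is a minimal sketch, not the project's actual code: the function names and the money pattern (dollar amounts such as `$1,200.50`, or a number followed by `dollars` or `USD`) are assumptions chosen for illustration.

```python
import re

# Assumed money formats: "$11.1", "$111,111.11", "11 dollars", "11 USD".
# The real scraper may recognise a different set of formats.
MONEY_PATTERN = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?|\b\d+(?:\.\d+)?\s*(?:dollars|USD)\b",
    re.IGNORECASE,
)


def count_phrase(phrase: str, title: str, description: str) -> int:
    """Count case-insensitive, non-overlapping matches in title and description."""
    if not phrase:
        return 0
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return len(pattern.findall(title)) + len(pattern.findall(description))


def contains_money(title: str, description: str) -> bool:
    """Return True if either field contains something that looks like money."""
    return bool(MONEY_PATTERN.search(title) or MONEY_PATTERN.search(description))
```

Escaping the phrase with `re.escape` keeps characters such as `.` or `?` in a search phrase from being read as regex syntax.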
- Clone the repository to your local machine:
  git clone https://github.com/tony-rsa/rpa-news-scraper.git
- Navigate to the project directory:
  cd rpa-news-scraper
- Install the necessary dependencies:
  pip install -r requirements.txt
- Download the appropriate WebDriver for your browser (e.g., ChromeDriver for Google Chrome) and place it in the project directory.
- Open the config.ini file and set the desired parameters:
  - search_phrase: The keyword or phrase to search for in news articles.
  - news_category: Optional parameter to filter news articles by category, section, or topic.
  - num_months: Specifies the number of months for which to retrieve news articles (0 or 1 for the current month, 2 for the current and previous month, and so on).
- Run the main Python script to start the automation process:
  python main.py
- After execution, the extracted data is saved in an Excel file (news_data.xlsx) located in the output directory.
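
The num_months setting from the configuration step translates into a cutoff date: articles older than the first day of the oldest covered month are skipped. One way to compute that cutoff is sketched below. The function name `earliest_date` is hypothetical, and the scraper itself may implement the rule differently.

```python
from datetime import date


def earliest_date(num_months: int, today: date) -> date:
    """Return the first day of the oldest month covered by num_months."""
    # 0 and 1 both mean "current month only"; 2 adds the previous month, etc.
    months_back = max(num_months, 1) - 1
    # Work in "months since year 0" so year boundaries need no special case.
    total = today.year * 12 + (today.month - 1) - months_back
    return date(total // 12, total % 12 + 1, 1)
```

For example, with `today` set to 10 January 2024 and `num_months` set to 2, the cutoff falls on 1 December 2023.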
- Search Phrase: The keyword or phrase to search for in news articles.
- News Category/Section/Topic: Optional parameter to filter news articles by category, section, or topic.
- Number of Months: Specifies the number of months for which to retrieve news articles (0 or 1 for the current month, 2 for the current and previous month, and so on).
These parameters can be provided via the config.ini file or as command-line arguments.
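
A rough sketch of how the two sources could be combined is shown below, using the standard `configparser` and `argparse` modules. The `[settings]` section name, the flag names (`--search-phrase`, `--news-category`, `--num-months`), and the rule that command-line values override config.ini are all assumptions for illustration; check the project's own script for the real names.

```python
import argparse
import configparser
from typing import Sequence


def load_settings(config_text: str, argv: Sequence[str]) -> dict:
    """Merge config.ini contents with command-line overrides."""
    config = configparser.ConfigParser()
    config.read_string(config_text)

    # Defaults apply when the assumed [settings] section or a key is missing.
    settings = {
        "search_phrase": config.get("settings", "search_phrase", fallback=""),
        "news_category": config.get("settings", "news_category", fallback=""),
        "num_months": config.getint("settings", "num_months", fallback=1),
    }

    cli = argparse.ArgumentParser(description="RPA News Scraper settings")
    cli.add_argument("--search-phrase")
    cli.add_argument("--news-category")
    cli.add_argument("--num-months", type=int)
    args = cli.parse_args(list(argv))

    # Any flag given on the command line takes precedence over config.ini.
    for key in settings:
        value = getattr(args, key)
        if value is not None:
            settings[key] = value
    return settings
```

To read from disk, pass the file contents, for example `load_settings(open("config.ini").read(), sys.argv[1:])` after importing `sys`.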
- src/: Contains the main Python script for the RPA News Scraper.
- output/: Directory to store output files such as Excel files and downloaded images.
- tests/: Directory containing unit tests for the RPA News Scraper.
Contributions are welcome! If you have any suggestions, improvements, or feature requests, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.