rpa-project-nytimes

Automation that extracts information from news articles on the NY Times website.

Your challenge is to automate the process of extracting data from the news site. Link to the news site: www.nytimes.com

You must have three configurable variables (you can save them in the configuration file, but it is better to store them in Robocorp Cloud Work Items):

search phrase
news category or section
number of months for which you need to receive news

Example of how this should work: 0 or 1 - only the current month, 2 - current and previous month, 3 - current and two previous months, and so on
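The month rule above can be sketched as a small date calculation. This is only an illustration: the function name `earliest_date` and the choice to return the first day of the earliest month are assumptions, not part of the project's code.

```python
from datetime import date


def earliest_date(months: int, today: date) -> date:
    """Return the first day of the earliest month covered by `months`."""
    # 0 and 1 both mean "current month only".
    months_back = max(months, 1) - 1
    # Count months on a single axis so the subtraction rolls over year boundaries.
    total = today.year * 12 + (today.month - 1) - months_back
    return date(total // 12, total % 12 + 1, 1)
```

A news item would then be kept when its date is on or after `earliest_date(months, date.today())`.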

The main steps:

1. Open the site by following the link
2. Enter a phrase in the search field
3. On the result page, apply the following filters:
- select a news category or section
  
  your automation should have the option to choose from none to any number of categories/sections. This should be specified via the config file and/or Robocorp Cloud Work Items
- choose the latest (i.e., newest) news
4. Get the values: title, date, and description.
5. Store in an Excel file:
- title
- date
- description (if available)
- picture filename
- count of search phrases in the title and description
- True or False, depending on whether the title or description contains any amount of money
Possible formats: $11.1 | $111,111.11 | 11 dollars | 11 USD
6. Download the news picture and specify the file name in the Excel file
7. Follow steps 4-6 for all news that falls within the required time period
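The two computed Excel columns (phrase count and money flag) could be derived roughly as follows. This is a minimal sketch: the helper names and the regular expression are assumptions written to cover the four listed money formats, and the project's actual regex lives in config.ini and may differ.

```python
import re

# Assumed pattern for: $11.1 | $111,111.11 | 11 dollars | 11 USD
MONEY_RE = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?"                        # $11.1, $111,111.11
    r"|\b\d[\d,]*(?:\.\d+)?\s*(?:dollars|USD)\b",  # 11 dollars, 11 USD
    re.IGNORECASE,
)


def count_phrase(phrase: str, title: str, description: str = "") -> int:
    """Count case-insensitive occurrences of the phrase in title and description."""
    needle = phrase.lower()
    if not needle:
        return 0  # an empty phrase would otherwise match at every position
    # Count each field separately so a match cannot span the title/description boundary.
    return title.lower().count(needle) + description.lower().count(needle)


def contains_money(title: str, description: str = "") -> bool:
    """Return True if either field mentions an amount of money."""
    return any(MONEY_RE.search(text) for text in (title, description))
```

The description defaults to an empty string because it is only stored when available.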

Project structure

The project is divided into three folders:

config: configuration files;
outputs: one folder per run, named with the current datetime so that it is unique, containing the Excel file, a log file and the images folder with all downloaded images;
src: folder containing all scripts
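A per-run output folder of the kind described above could be created like this. The function name `make_run_folder` and the timestamp format are assumptions for illustration; the project may name its folders differently.

```python
from datetime import datetime
from pathlib import Path


def make_run_folder(base: Path, now: datetime) -> Path:
    """Create <base>/<timestamp>/images and return the run folder."""
    # Assumed naming scheme: one folder per second, e.g. 20240315_103000.
    run_dir = base / now.strftime("%Y%m%d_%H%M%S")
    # parents=True also creates the run folder (and base) when missing.
    (run_dir / "images").mkdir(parents=True)
    return run_dir
```

The Excel file and the log file would then be written directly inside the returned folder.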

Main libraries used in this project (also available in requirements.txt):

Python version: 3.11

The config.ini file is located in the config folder and contains all the configuration parameters divided by section:

website_parameters: URLs, XPaths, ids and parameters related to the website http://www.nytimes.com;
input_parameters: the variables configured by the user:
- search phrase: words must be separated by spaces;
- news category or section: must be a list, e.g. [Books,Fashion,Movies,Opinion,U.S.];
- number of months for which you need to receive news: must be an integer;
browser_parameters: the Chrome version;
general_parameters: folder paths, wait time for clicks and regex patterns
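Reading the input_parameters section with the standard library's configparser might look like the sketch below. The section name comes from the description above, but the key names (`search_phrase`, `categories`, `months`) and the sample values are hypothetical; check config.ini for the real ones.

```python
import configparser

# Hypothetical excerpt of config.ini; the key names are assumptions.
SAMPLE = """
[input_parameters]
search_phrase = climate change
categories = [Books,Fashion,Movies,Opinion,U.S.]
months = 2
"""


def parse_categories(raw: str) -> list[str]:
    """Turn '[Books,Fashion]' into ['Books', 'Fashion']; '[]' gives an empty list."""
    inner = raw.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


def load_inputs(text: str) -> dict:
    """Parse the three user inputs from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["input_parameters"]
    return {
        "search_phrase": section.get("search_phrase", fallback=""),
        "categories": parse_categories(section.get("categories", fallback="[]")),
        "months": section.getint("months", fallback=0),
    }
```

An empty list is a valid value here, since the automation should accept anything from no category to any number of categories.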