News Scraper is a web scraping project built using Scrapy to extract articles from various news websites. The extracted data includes article URLs, titles, publication dates, authors, and content. The data is saved in JSON files categorized by the website from which they were scraped.
- Scrapes multiple news websites sequentially.
- Extracts and saves articles in JSON format.
- Organizes scraped data by website.
- Configurable through a text file containing URLs.
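Since output files are organized per website, a small helper can derive the per-site filename from an article URL. The actual naming logic lives in `pipelines.py`; this sketch only mirrors the filenames seen in `JSON_files/` and is an assumption, not the repository's code:

```python
from urllib.parse import urlparse

def site_filename(url):
    # Derive a per-site output filename from an article URL.
    # Mirrors names like economictimes_articles.json (an assumption,
    # not necessarily the logic used in pipelines.py).
    host = urlparse(url).netloc              # e.g. "economictimes.indiatimes.com"
    labels = host.split(".")
    site = labels[1] if labels[0] == "www" else labels[0]
    return f"{site}_articles.json"
```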
- Python 3.x
- Scrapy
- Clone the repository:

  ```bash
  git clone https://github.com/Ganesh-VG/Web-Scraping-with-Scrapy
  cd Web-Scraping-with-Scrapy
  ```
- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- `config.txt`: Add the list of URLs to be scraped in this file, one URL per line.
- `settings.py`: Adjust the Scrapy settings as needed.
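Because `config.txt` is plain text with one URL per line, it can be read with a few lines of standard-library Python. How `main.py` actually parses it is an assumption; this is a minimal sketch:

```python
def load_urls(path="config.txt"):
    # Read one URL per line, skipping blank lines.
    # (A sketch -- main.py's actual parsing may differ.)
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```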
```
Web-Scraping-with-Scrapy/
├── config.txt
├── main.py
├── news_spider.py
├── pipelines.py
├── requirements.txt
└── JSON_files/
```
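Conceptually, `pipelines.py` buffers scraped items per website and writes each group to its own JSON file. The class below is a hedged, Scrapy-free stand-in to illustrate the idea; the names, method signatures, and behavior are assumptions, not the repository's actual pipeline:

```python
import json
import os

class JsonPerSitePipeline:
    # Illustration of a per-site JSON pipeline (an assumption, not
    # the real pipelines.py): collect items keyed by site name, then
    # dump each group to JSON_files/<site>_articles.json.
    def __init__(self, out_dir="JSON_files"):
        self.out_dir = out_dir
        self.items = {}

    def process_item(self, item, site):
        # Buffer the item under its site's key.
        self.items.setdefault(site, []).append(item)
        return item

    def close(self):
        # Write one JSON file per site when scraping finishes.
        os.makedirs(self.out_dir, exist_ok=True)
        for site, articles in self.items.items():
            path = os.path.join(self.out_dir, f"{site}_articles.json")
            with open(path, "w", encoding="utf-8") as f:
                json.dump(articles, f, ensure_ascii=False, indent=2)
```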
- Add URLs to `config.txt`.
- Run the scraper:

  ```bash
  python main.py
  ```

- The scraped data is saved as JSON files in the `JSON_files` directory, categorized by website name.
- To run the spider on a single URL:

  ```bash
  scrapy crawl getnews -a url=<Input URL>
  ```
An example of a URL in `config.txt`:

```
https://economictimes.indiatimes.com/
```
After running the scraper, JSON files will be created for each website:
```
JSON_files/
├── livemint_articles.json
└── economictimes_articles.json
```
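The output files are ordinary JSON and can be inspected with the standard library. The file path and record fields below are assumptions based on the fields described above:

```python
import json

def load_articles(path):
    # Load one per-site output file,
    # e.g. "JSON_files/economictimes_articles.json".
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```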
Contributions are welcome! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch:

  ```bash
  git checkout -b my-feature-branch
  ```

- Make your changes and commit them:

  ```bash
  git commit -m 'Add some feature'
  ```

- Push to the branch:

  ```bash
  git push origin my-feature-branch
  ```

- Submit a pull request.