News Scraper is a web scraping project built using Scrapy to extract articles from various news websites. The extracted data includes article URLs, titles, publication dates, authors, and content. The data is saved in JSON files categorized by the website from which they were scraped.
- Scrapes multiple news websites sequentially.
- Extracts and saves articles in JSON format.
- Organizes scraped data by website.
- Configurable through a text file containing URLs.
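Since output files are organized per website, a small helper can derive the per-site filename from an article URL. The actual naming logic lives in `pipelines.py`; this sketch only mirrors the filenames seen in `JSON_files/` and is an assumption, not the repository's code:

```python
from urllib.parse import urlparse

def site_filename(url):
    # Derive a per-site output filename from an article URL.
    # Mirrors names like economictimes_articles.json (an assumption,
    # not necessarily the logic used in pipelines.py).
    host = urlparse(url).netloc              # e.g. "economictimes.indiatimes.com"
    labels = host.split(".")
    site = labels[1] if labels[0] == "www" else labels[0]
    return f"{site}_articles.json"
```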
- Python 3.x
- Scrapy
- Clone the repository:

  ```bash
  git clone https://github.com/Ganesh-VG/Web-Scraping-with-Scrapy
  cd Web-Scraping-with-Scrapy
  ```
- Create and activate a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```
- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
- `config.txt`: Add the list of URLs to be scraped in this file, one URL per line.
- `settings.py`: Adjust the Scrapy settings as needed.
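Because `config.txt` is plain text with one URL per line, it can be read with a few lines of standard-library Python. How `main.py` actually parses it is an assumption; this is a minimal sketch:

```python
def load_urls(path="config.txt"):
    # Read one URL per line, skipping blank lines.
    # (A sketch -- main.py's actual parsing may differ.)
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```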
```
Web-Scraping-with-Scrapy/
├── config.txt
├── main.py
├── news_spider.py
├── pipelines.py
├── requirements.txt
└── JSON_files/
```
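Conceptually, `pipelines.py` buffers scraped items per website and writes each group to its own JSON file. The class below is a hedged, Scrapy-free stand-in to illustrate the idea; the names, method signatures, and behavior are assumptions, not the repository's actual pipeline:

```python
import json
import os

class JsonPerSitePipeline:
    # Illustration of a per-site JSON pipeline (an assumption, not
    # the real pipelines.py): collect items keyed by site name, then
    # dump each group to JSON_files/<site>_articles.json.
    def __init__(self, out_dir="JSON_files"):
        self.out_dir = out_dir
        self.items = {}

    def process_item(self, item, site):
        # Buffer the item under its site's key.
        self.items.setdefault(site, []).append(item)
        return item

    def close(self):
        # Write one JSON file per site when scraping finishes.
        os.makedirs(self.out_dir, exist_ok=True)
        for site, articles in self.items.items():
            path = os.path.join(self.out_dir, f"{site}_articles.json")
            with open(path, "w", encoding="utf-8") as f:
                json.dump(articles, f, ensure_ascii=False, indent=2)
```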
- Add URLs to `config.txt`.
- Run the scraper:

  ```bash
  python main.py
  ```

- The scraped data is saved as JSON files in the `JSON_files` directory, categorized by website name.
- To run the spider on a single URL:

  ```bash
  scrapy crawl getnews -a url=<Input URL>
  ```
An example of a URL in `config.txt`:

```
https://economictimes.indiatimes.com/
```
After running the scraper, JSON files will be created for each website:
```
JSON_files/
├── livemint_articles.json
└── economictimes_articles.json
```
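The output files are ordinary JSON and can be inspected with the standard library. The file path and record fields below are assumptions based on the fields described above:

```python
import json

def load_articles(path):
    # Load one per-site output file,
    # e.g. "JSON_files/economictimes_articles.json".
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```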
Contributions are welcome! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch:

  ```bash
  git checkout -b my-feature-branch
  ```

- Make your changes and commit them:

  ```bash
  git commit -m 'Add some feature'
  ```

- Push to the branch:

  ```bash
  git push origin my-feature-branch
  ```

- Submit a pull request.