This project crawls websites and extracts URLs efficiently. It crawls the entire website using Scrapy, creates a directory named after the project, and stores the extracted URLs in a file named `urls.json` inside that directory.
**Features:**

- Crawls the entire website using Scrapy.
- Stores the extracted URLs in a JSON file.

**Requirements:**

- Python 3.x
- Scrapy
**Getting Started:**

- **Clone the Repository**

  Clone the repository to your local machine.

  ```bash
  git clone https://github.com/naitridoshi/Web-Crawling-with-Scrapy.git
  ```
- **Install the Required Packages**

  Install the required Python packages using pip.

  ```bash
  pip install scrapy
  ```
You can change the crawler's parameters by modifying the `config.py` file. This file includes settings such as the project name and the starting URL.
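The repository's actual `config.py` isn't reproduced here; the sketch below is a hypothetical example of what such a file might contain, with placeholder names and values:

```python
# config.py -- hypothetical sketch; the variable names and values are
# placeholders, not the repository's actual settings.
PROJECT_NAME = "example_project"    # output directory name
START_URL = "https://example.com"   # where the crawl begins
```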
- **Configure the Parameters**

  Open `config.py` and set your desired parameters, such as the project name and the starting URL.

- **Run the Crawler**

  Run the Python script to start the crawling process.

  ```bash
  scrapy runspider spider.py -o project_name/urls.json
  ```
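Because the output file ends in `.json`, Scrapy's `-o` feed export writes the yielded items as a JSON array. Assuming the spider yields one item per page with a single `url` field (a plausible shape given the file name; the actual fields may differ), `project_name/urls.json` would look roughly like:

```json
[
  {"url": "https://example.com/"},
  {"url": "https://example.com/about"},
  {"url": "https://example.com/contact"}
]
```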
`spider.py` contains the Scrapy spider for crawling the entire website when no sitemap is found. It extracts URLs and saves them in `urls.json`.
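For orientation, here is a minimal sketch of what such a whole-site URL spider could look like. It is an assumption, not the repository's actual code; `PROJECT_NAME` and `START_URL` are the hypothetical `config.py` names from the sketch above.

```python
# A minimal whole-site URL spider -- a sketch, not the repository's code.
from urllib.parse import urlparse

import scrapy
from scrapy.linkextractors import LinkExtractor

from config import PROJECT_NAME, START_URL  # hypothetical config names


class UrlSpider(scrapy.Spider):
    name = PROJECT_NAME
    start_urls = [START_URL]

    def parse(self, response):
        # Record the URL of every page visited; with `-o urls.json`,
        # Scrapy's feed export collects these items into the output file.
        yield {"url": response.url}
        # Follow only links that stay on the starting domain.
        domain = urlparse(START_URL).netloc
        for link in LinkExtractor(allow_domains=[domain]).extract_links(response):
            yield response.follow(link.url, callback=self.parse)
```

Restricting the `LinkExtractor` to the starting domain keeps the crawl from wandering off-site, and Scrapy's built-in duplicate request filter prevents infinite loops on pages that link to each other.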
Contributions are welcome! Please fork the repository and submit a pull request with your changes.
This project is licensed under the MIT License. See the LICENSE file for details.