This web crawler is designed to crawl product pages from e-commerce websites. Built with FastAPI, it supports crawling via either BeautifulSoup or Playwright, and it uses AI to determine which pages are product pages, improving the accuracy of the crawl. It is built with scalability in mind.
- AI-Powered URL Discovery: Uses AI to intelligently identify product URLs within a website, improving the accuracy and efficiency of the crawling process.
- Concurrent Crawling: Efficiently crawl multiple domains at once with configurable concurrency settings.
- Dual Crawling Methods: Choose between Playwright for browser-based crawling or BeautifulSoup for HTTP-based crawling.
- Dynamic Content Handling: Supports infinite scrolling to capture content loaded dynamically.
- Configurable Settings: Easily adjust settings like concurrent requests per domain and across domains, download delays, and more.
- Automatic URL Pattern Detection: Utilizes regex patterns to identify and filter product URLs.
- Comprehensive Logging: Detailed logging for monitoring and debugging the crawling process.
- RESTful API Interface: Interact with the crawler through a well-documented API with Swagger support.
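The concurrency features above come down to bounding the number of in-flight requests per domain. A minimal `asyncio` sketch of that idea (this is an illustration, not the crawler's actual implementation; the real limits come from environment configuration):

```python
import asyncio

# Illustrative per-domain limit; the real crawler reads its limits
# from environment configuration.
MAX_PER_DOMAIN = 2

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    """Stand-in for an HTTP fetch, bounded by the per-domain semaphore."""
    async with sem:
        await asyncio.sleep(0.01)  # simulate network latency
        return url

async def crawl_domain(urls: list[str]) -> list[str]:
    """Fetch all URLs for one domain with at most MAX_PER_DOMAIN in flight."""
    sem = asyncio.Semaphore(MAX_PER_DOMAIN)
    return await asyncio.gather(*(fetch(u, sem) for u in urls))

urls = [f"https://example.com/p/{i}" for i in range(5)]
results = asyncio.run(crawl_domain(urls))
print(len(results))  # → 5
```

Because `asyncio.gather` preserves input order, results line up with the submitted URLs even though completion order may differ.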
- Clone the Repository:

  ```bash
  git clone git@github.com:whatmanathinks/web-crawler.git
  cd web-crawler
  ```
- Install Dependencies: Ensure you have Python installed, then run:

  ```bash
  pip install -r requirements.txt
  ```
- Environment Configuration: Copy `.env.example` to `.env` and configure your environment variables as needed:

  ```bash
  cp .env.example .env
  ```
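The only variable documented here is `CRAWLER`; a `.env` might look like the following, where the remaining keys are illustrative placeholders (check `.env.example` for the actual names):

```
# Crawler backend: "playwright" or "beautifulsoup"
CRAWLER=playwright

# Hypothetical throttling settings (names are placeholders)
CONCURRENT_REQUESTS_PER_DOMAIN=4
DOWNLOAD_DELAY=0.5
```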
- Benefits: Ideal for websites with dynamic content that requires JavaScript execution; it can handle complex interactions and infinite scrolling, though it is slower than the BeautifulSoup crawler.
- Usage: Set `CRAWLER=playwright` in your environment configuration.
- Benefits: Lightweight and faster for static websites where JavaScript execution is not required. It uses HTTP requests to fetch page content.
- Usage: Set `CRAWLER=beautifulsoup` in your environment configuration.
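The BeautifulSoup crawler pairs plain HTTP fetching with the regex-based URL filtering mentioned in the feature list. A minimal sketch of that filtering step, using illustrative patterns (the crawler's actual regexes may differ):

```python
import re

# Common product-path patterns seen on e-commerce sites; the exact
# regexes used by this crawler live in its source and may differ.
PRODUCT_URL_PATTERN = re.compile(r"/(product|products|item|p)/[\w-]+")

def filter_product_urls(urls: list[str]) -> list[str]:
    """Keep only URLs whose path matches a known product pattern."""
    return [u for u in urls if PRODUCT_URL_PATTERN.search(u)]

urls = [
    "https://example.com/products/blue-t-shirt",
    "https://example.com/about-us",
    "https://example.com/p/12345",
]
print(filter_product_urls(urls))
# → ['https://example.com/products/blue-t-shirt', 'https://example.com/p/12345']
```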
- Run the Application: Start the FastAPI server:

  ```bash
  fastapi dev main.py
  ```
- Access the API Documentation: Open your browser and navigate to `http://localhost:8000/docs` to explore the API endpoints and test the crawler.
- Domains: A list of domain URLs to crawl for product links.
- Product URLs: A dictionary containing the domain as the key and a list of discovered product URLs as the value.
To crawl domains, send a POST request to the `/crawl` endpoint with a JSON body containing the list of domains:

```bash
curl -X 'POST' \
  'http://127.0.0.1:8000/crawl' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "domains": [
      "https://tiesta.in"
    ]
  }'
```
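Per the response description above, the result maps each requested domain to its discovered product URLs. A hypothetical response for the request above (the key name and URLs are illustrative placeholders, not actual crawl output):

```json
{
  "product_urls": {
    "https://tiesta.in": [
      "https://tiesta.in/products/example-product-1",
      "https://tiesta.in/products/example-product-2"
    ]
  }
}
```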