This project is a web crawler application in Python that, given a seed URL, follows the links on each page down to a specified depth. The application returns URLs that contain specific search texts and ranks them based on several criteria.
The core functionality of the web crawler involves the following features:
- Input:
  - URL Seed: e.g., `www.hackernews.com`
  - Depth: e.g., `5` (the crawler will follow links on a page up to 5 levels deep; see the sketch after this list)
  - Search Text: e.g., `"python"`
- Output:
  - A list of URLs that contain the specified search text.
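As a rough illustration of the input above, a depth-limited crawl is commonly implemented as a breadth-first traversal that records each page's distance from the seed. The sketch below assumes the `requests` and `beautifulsoup4` packages; the actual script may use different libraries and names:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth):
    """Breadth-first crawl from seed_url, yielding (url, html, depth)."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])  # (url, depth from the seed)
    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        yield url, html, depth
        if depth < max_depth:
            # Enqueue every link on the page one level deeper.
            for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                absolute = urljoin(url, link["href"])
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
```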
For those seeking an additional challenge, the following advanced features have been implemented:
- Substring filter: The crawler only returns URLs that contain a user-specified substring, which keeps the results relevant to the user's query.
- Multiple search strings: The crawler supports searching for several search strings simultaneously.
- Ranking: URLs are ranked based on the following criteria (see the sketch after this list):
  - The number of different search strings found on the page.
  - The total occurrences of each search string within the page.
  - The depth level of the URL relative to the seed URL (URLs closer to the seed are prioritized when the other criteria are equal).
- Future Implementation: A planned feature will flag output URLs that appear on a long blacklist (roughly 10,000 entries), helping users avoid known problematic or irrelevant sites.
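The three ranking criteria can be expressed as a single sort key: more unique terms first, then more total occurrences, then shallower depth. A minimal sketch, assuming each result carries those three numbers (the tuple layout here is illustrative, not taken from the actual script):

```python
def rank_key(unique_terms, total_occurrences, depth):
    """Sort ascending: negate the 'higher is better' criteria."""
    return (-unique_terms, -total_occurrences, depth)

# (url, unique_terms, total_occurrences, depth)
results = [
    ("http://example.com/page2", 1, 3, 1),
    ("http://example.com/page1", 2, 5, 2),
]
results.sort(key=lambda r: rank_key(r[1], r[2], r[3]))
# page1 ranks first: it matches more unique search terms.
```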
To run this application, you'll need Python installed on your machine. You can install the required packages using pip:

```bash
pip install -r requirements.txt
```
The application can be run from the command line. Here is the basic usage:

```bash
python3 script.py <start_url> <crawl_depth> <required_substring> <search_term1> [search_term2 ...]
```

For example:

```bash
python3 script.py http://example.com 5 "example" "python" "crawler"
```
- `start_url` (e.g., `http://example.com`): The seed URL where the crawl begins.
- `crawl_depth` (e.g., `5`): The number of link levels to crawl.
- `required_substring` (e.g., `"example"`): The substring that must be present in the returned URLs.
- `search_terms` (e.g., `"python" "crawler"`): The search terms to find on the pages.
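These arguments map naturally onto Python's standard `argparse` module. A hedged sketch of the parsing step (the real script may read `sys.argv` differently):

```python
import argparse

parser = argparse.ArgumentParser(description="Depth-limited web crawler.")
parser.add_argument("start_url", help="seed URL where the crawl begins")
parser.add_argument("crawl_depth", type=int, help="number of link levels to crawl")
parser.add_argument("required_substring", help="substring that must appear in returned URLs")
parser.add_argument("search_terms", nargs="+", help="one or more search terms to find")
args = parser.parse_args()
```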
The output will list URLs containing the specified search terms, ranked according to the implemented rules:
- Number of unique search terms found
- Total occurrences of those search terms
- URL depth relative to the seed URL
Each output line will display the URL, the number of unique terms found, total occurrences, and the depth level.
```
http://example.com/page1 (Unique Terms: 2, Total Occurrences: 5, Level: 2)
http://example.com/page2 (Unique Terms: 1, Total Occurrences: 3, Level: 1)
```
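For reference, the per-page numbers in these lines can be produced with simple case-insensitive substring counting. A minimal sketch, assuming the page text has already been extracted (the function name is illustrative):

```python
def score_page(page_text, search_terms):
    """Return (unique_terms, total_occurrences) for one page."""
    text = page_text.lower()
    counts = {term: text.count(term.lower()) for term in search_terms}
    unique_terms = sum(1 for n in counts.values() if n > 0)
    total_occurrences = sum(counts.values())
    return unique_terms, total_occurrences

unique, total = score_page("Python makes writing a crawler easy. python!", ["python", "crawler"])
print(f"Unique Terms: {unique}, Total Occurrences: {total}")
# -> Unique Terms: 2, Total Occurrences: 3
```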
Planned future enhancements:

- Blacklist Check: Integrate a feature to check URLs against a blacklist and highlight those that are flagged.
- Improved User Interface: Add a web-based interface for easier use and visualization of results.