This project is a web crawler application in Python that, given a seed URL, follows the links on each page down to a specified depth. The application returns URLs that contain specific search texts and ranks them based on several criteria.
The core functionality of the web crawler involves the following features:
- Input:
  - URL Seed: e.g., `www.hackernews.com`
  - Depth: e.g., `5` (the crawler will follow links on a page up to 5 levels deep; see the sketch after this list)
  - Search Text: e.g., `"python"`
- Output:
  - A list of URLs that contain the specified search text.
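As a rough illustration of the input above, a depth-limited crawl is commonly implemented as a breadth-first traversal that records each page's distance from the seed. The sketch below assumes the `requests` and `beautifulsoup4` packages; the actual script may use different libraries and names:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_depth):
    """Breadth-first crawl from seed_url, yielding (url, html, depth)."""
    seen = {seed_url}
    queue = deque([(seed_url, 0)])  # (url, depth from the seed)
    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip pages that fail to load
        yield url, html, depth
        if depth < max_depth:
            # Enqueue every link on the page one level deeper.
            for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                absolute = urljoin(url, link["href"])
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))
```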
For those seeking an additional challenge, the following advanced features have been implemented:
- Substring filter: The crawler only returns URLs that contain a user-specified substring, which keeps the results relevant to the user's query.
- Multiple search strings: The crawler supports searching for several search strings simultaneously.
- Ranking: URLs are ranked based on the following criteria (see the sketch after this list):
  - The number of different search strings found on the page.
  - The total occurrences of each search string within the page.
  - The depth level of the URL relative to the seed URL (URLs closer to the seed are prioritized when the other criteria are equal).
- Future Implementation: A planned feature will flag output URLs that appear on a long blacklist (roughly 10,000 entries), helping users avoid known problematic or irrelevant sites.
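The three ranking criteria can be expressed as a single sort key: more unique terms first, then more total occurrences, then shallower depth. A minimal sketch, assuming each result carries those three numbers (the tuple layout here is illustrative, not taken from the actual script):

```python
def rank_key(unique_terms, total_occurrences, depth):
    """Sort ascending: negate the 'higher is better' criteria."""
    return (-unique_terms, -total_occurrences, depth)

# (url, unique_terms, total_occurrences, depth)
results = [
    ("http://example.com/page2", 1, 3, 1),
    ("http://example.com/page1", 2, 5, 2),
]
results.sort(key=lambda r: rank_key(r[1], r[2], r[3]))
# page1 ranks first: it matches more unique search terms.
```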
To run this application, you'll need Python installed on your machine. You can install the required packages using pip:

```bash
pip install -r requirements.txt
```
The application can be run from the command line. Here is the basic usage:

```bash
python3 script.py <start_url> <crawl_depth> <required_substring> <search_term1> [search_term2 ...]
```

For example:

```bash
python3 script.py http://example.com 5 "example" "python" "crawler"
```
- `start_url` (e.g., `http://example.com`): The seed URL where the crawl begins.
- `crawl_depth` (e.g., `5`): The number of link levels to crawl.
- `required_substring` (e.g., `"example"`): The substring that must be present in the returned URLs.
- `search_terms` (e.g., `"python" "crawler"`): The search terms to find on the pages.
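These arguments map naturally onto Python's standard `argparse` module. A hedged sketch of the parsing step (the real script may read `sys.argv` differently):

```python
import argparse

parser = argparse.ArgumentParser(description="Depth-limited web crawler.")
parser.add_argument("start_url", help="seed URL where the crawl begins")
parser.add_argument("crawl_depth", type=int, help="number of link levels to crawl")
parser.add_argument("required_substring", help="substring that must appear in returned URLs")
parser.add_argument("search_terms", nargs="+", help="one or more search terms to find")
args = parser.parse_args()
```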
The output will list URLs containing the specified search terms, ranked according to the implemented rules:
- Number of unique search terms found
- Total occurrences of those search terms
- URL depth relative to the seed URL
Each output line will display the URL, the number of unique terms found, total occurrences, and the depth level.
```
http://example.com/page1 (Unique Terms: 2, Total Occurrences: 5, Level: 2)
http://example.com/page2 (Unique Terms: 1, Total Occurrences: 3, Level: 1)
```
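For reference, the per-page numbers in these lines can be produced with simple case-insensitive substring counting. A minimal sketch, assuming the page text has already been extracted (the function name is illustrative):

```python
def score_page(page_text, search_terms):
    """Return (unique_terms, total_occurrences) for one page."""
    text = page_text.lower()
    counts = {term: text.count(term.lower()) for term in search_terms}
    unique_terms = sum(1 for n in counts.values() if n > 0)
    total_occurrences = sum(counts.values())
    return unique_terms, total_occurrences

unique, total = score_page("Python makes writing a crawler easy. python!", ["python", "crawler"])
print(f"Unique Terms: {unique}, Total Occurrences: {total}")
# -> Unique Terms: 2, Total Occurrences: 3
```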
Planned future enhancements:

- Blacklist Check: Integrate a feature to check URLs against a blacklist and highlight those that are flagged.
- Improved User Interface: Add a web-based interface for easier use and visualization of results.