LinkHound

A fast, concurrent web crawler that sniffs out links on a domain whose URLs contain specific strings. It follows redirect chains and checks every URL along the way.

Features

  • Automatic sitemap.xml discovery for fast page enumeration
  • Crawls all pages on a domain
  • Checks every link found on each page
  • Follows redirect chains and checks each URL in the chain
  • Concurrent requests for fast scanning
  • Live progress display showing:
    • Pages crawled
    • Links checked
    • Matches found
    • Queue size
  • Detailed output showing:
    • Which page contains the matching link
    • The original link URL
    • The full redirect chain (if any)
    • Which URL in the chain matches
  • Supports both sitemap indexes and regular sitemaps
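The last feature, handling both sitemap indexes and regular sitemaps, comes down to distinguishing a `<sitemapindex>` root (which lists further sitemap files) from a `<urlset>` root (which lists pages directly). A minimal sketch of that distinction using the standard library; this is an illustration, not LinkHound's actual code:

```python
import xml.etree.ElementTree as ET

# Sitemaps live in this XML namespace per the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return ("index", locs) for a sitemap index, ("urlset", locs) otherwise."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    return kind, locs

index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""
print(parse_sitemap(index_xml))  # → ('index', ['https://example.com/sitemap-posts.xml'])
```

A crawler that sees `"index"` would fetch each listed sitemap and parse it in turn, while `"urlset"` entries go straight into the crawl queue.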

Installation

Install dependencies using uv (recommended):

uv sync

Or using pip:

pip install -e .

Usage

Basic usage:

python linkhound.py example.com "suspicious-string"

Search for multiple strings:

python linkhound.py example.com "malware" "phishing" "spam"

With options:

python linkhound.py example.com "tracker" -c 20 -t 60

Arguments

  • domain: The domain to crawl (e.g., example.com or https://example.com)
  • search_strings: One or more strings to search for in URLs (case-insensitive)
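Case-insensitive matching means each search string is compared against the lowercased URL. A one-function sketch of that logic (an illustration, not LinkHound's exact code):

```python
def matching_strings(url, search_strings):
    """Return the search strings that appear in the URL, ignoring case."""
    url_lower = url.lower()
    return [s for s in search_strings if s.lower() in url_lower]

print(matching_strings("https://example.com/?UTM_source=ad", ["utm_", "fbclid"]))  # → ['utm_']
```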

Options

  • -c, --concurrent: Maximum concurrent requests (default: 10)
  • -t, --timeout: Request timeout in seconds (default: 30)
  • -v, --verbose: Enable verbose output for debugging (recommended to see what links are being scanned)
  • --no-sitemap: Skip sitemap.xml discovery and crawl pages manually

Tip: Use the -v flag to see real-time progress of which pages and links are being scanned. This is especially helpful for understanding what the crawler is doing and troubleshooting any issues.

Examples

Find affiliate links

python linkhound.py myblog.com "affiliate" "ref=" "partner"

Find tracking parameters

python linkhound.py example.com "utm_" "fbclid" "gclid"

Scan for suspicious redirects

python linkhound.py example.com "bit.ly" "tinyurl" "redirect"

With verbose output (recommended)

python linkhound.py example.com "tracker" -v

This will show real-time output of every page and link being scanned, helping you understand the crawler's progress and troubleshoot any issues.

How It Works

  1. Sitemap Discovery: Automatically checks for sitemap.xml or sitemap_index.xml to quickly discover all pages
  2. Crawling: Crawls all discovered pages (from sitemap or by following links)
  3. Link Extraction: Extracts all links from each page (both internal and external)
  4. Link Checking: For each unique link:
    • Makes a request without following redirects
    • Checks if the URL contains any search strings
    • If it's a redirect, follows to the next URL
    • Repeats until the final destination is reached or the chain exceeds 10 redirects
  5. Reporting: Shows all matches with their source page and redirect chain
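The per-link loop in step 4 can be sketched as follows. This is a simplified illustration, not LinkHound's actual code; the `fetch` callable stands in for an HTTP request made without following redirects, returning the status code and the `Location` header (or `None`):

```python
def check_link(url, search_strings, fetch, max_redirects=10):
    """Follow redirects manually, collecting the chain and any matching URLs."""
    chain, matches = [], []
    for _ in range(max_redirects + 1):
        chain.append(url)
        # Case-insensitive check of the current URL in the chain.
        if any(s.lower() in url.lower() for s in search_strings):
            matches.append(url)
        status, location = fetch(url)
        if status not in (301, 302, 303, 307, 308) or not location:
            break  # final destination reached
        url = location  # follow one hop and check the next URL
    return chain, matches

# Stubbed fetch simulating a single redirect hop:
hops = {"https://example.com/out/1": (302, "https://tracker.example.net/click")}
fetch = lambda u: hops.get(u, (200, None))
print(check_link("https://example.com/out/1", ["tracker"], fetch))
```

Taking the `fetch` callable as a parameter keeps the redirect-walking logic testable without a network; in the real crawler it would be an HTTP client call with redirect-following disabled.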

Output Example

LinkHound
Domain: https://example.com
Looking for: tracking, analytics

Pages crawled    50
Links checked    234
Matches found    3
Queue size       5

Crawling complete!

Found 3 matching links:

1. Match found:
   Found on page: https://example.com/blog/post-1
   Link URL: https://example.com/out/link123
   Redirect chain:
        https://example.com/out/link123
     -> https://tracker.example.net/click?id=123

2. Match found:
   Found on page: https://example.com/about
   Link URL: https://analytics.service.com/track
   Matching URL: https://analytics.service.com/track

Use Cases

  • Finding and auditing affiliate links
  • Detecting tracking pixels and analytics
  • Identifying malicious redirects
  • Checking for broken or suspicious links
  • Compliance audits for link policies
  • SEO analysis

License

MIT