LinkHound

A fast, concurrent web crawler that sniffs out links on a domain whose URLs contain specific strings. It follows redirect chains and checks every URL along the way.

Features

  • Automatic sitemap.xml discovery for fast page enumeration
  • Crawls all pages on a domain
  • Checks every link found on each page
  • Follows redirect chains and checks each URL in the chain
  • Concurrent requests for fast scanning
  • Live progress display showing:
    • Pages crawled
    • Links checked
    • Matches found
    • Queue size
  • Detailed output showing:
    • Which page contains the matching link
    • The original link URL
    • The full redirect chain (if any)
    • Which URL in the chain matches
  • Supports both sitemap indexes and regular sitemaps
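The last feature, handling both sitemap indexes and regular sitemaps, comes down to distinguishing a `<sitemapindex>` root (which lists further sitemap files) from a `<urlset>` root (which lists pages directly). A minimal sketch of that distinction using the standard library; this is an illustration, not LinkHound's actual code:

```python
import xml.etree.ElementTree as ET

# Sitemaps live in this XML namespace per the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return ("index", locs) for a sitemap index, ("urlset", locs) otherwise."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == NS + "sitemapindex" else "urlset"
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    return kind, locs

index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""
print(parse_sitemap(index_xml))  # → ('index', ['https://example.com/sitemap-posts.xml'])
```

A crawler that sees `"index"` would fetch each listed sitemap and parse it in turn, while `"urlset"` entries go straight into the crawl queue.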

Installation

Install dependencies using uv (recommended):

uv sync

Or using pip:

pip install -e .

Usage

Basic usage:

python linkhound.py example.com "suspicious-string"

Search for multiple strings:

python linkhound.py example.com "malware" "phishing" "spam"

With options:

python linkhound.py example.com "tracker" -c 20 -t 60

Arguments

  • domain: The domain to crawl (e.g., example.com or https://example.com)
  • search_strings: One or more strings to search for in URLs (case-insensitive)
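Case-insensitive matching means each search string is compared against the lowercased URL. A one-function sketch of that logic (an illustration, not LinkHound's exact code):

```python
def matching_strings(url, search_strings):
    """Return the search strings that appear in the URL, ignoring case."""
    url_lower = url.lower()
    return [s for s in search_strings if s.lower() in url_lower]

print(matching_strings("https://example.com/?UTM_source=ad", ["utm_", "fbclid"]))  # → ['utm_']
```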

Options

  • -c, --concurrent: Maximum concurrent requests (default: 10)
  • -t, --timeout: Request timeout in seconds (default: 30)
  • -v, --verbose: Enable verbose output for debugging (recommended to see what links are being scanned)
  • --no-sitemap: Skip sitemap.xml discovery and crawl pages manually

Tip: Use the -v flag to see real-time progress of which pages and links are being scanned. This is especially helpful for understanding what the crawler is doing and troubleshooting any issues.

Examples

Find affiliate links

python linkhound.py myblog.com "affiliate" "ref=" "partner"

Find tracking parameters

python linkhound.py example.com "utm_" "fbclid" "gclid"

Scan for suspicious redirects

python linkhound.py example.com "bit.ly" "tinyurl" "redirect"

With verbose output (recommended)

python linkhound.py example.com "tracker" -v

This will show real-time output of every page and link being scanned, helping you understand the crawler's progress and troubleshoot any issues.

How It Works

  1. Sitemap Discovery: Automatically checks for sitemap.xml or sitemap_index.xml to quickly discover all pages
  2. Crawling: Crawls all discovered pages (from sitemap or by following links)
  3. Link Extraction: Extracts all links from each page (both internal and external)
  4. Link Checking: For each unique link:
    • Makes a request without following redirects
    • Checks if the URL contains any search strings
    • If it's a redirect, follows to the next URL
    • Repeats until the final destination is reached or the chain exceeds 10 redirects
  5. Reporting: Shows all matches with their source page and redirect chain
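The per-link loop in step 4 can be sketched as follows. This is a simplified illustration, not LinkHound's actual code; the `fetch` callable stands in for an HTTP request made without following redirects, returning the status code and the `Location` header (or `None`):

```python
def check_link(url, search_strings, fetch, max_redirects=10):
    """Follow redirects manually, collecting the chain and any matching URLs."""
    chain, matches = [], []
    for _ in range(max_redirects + 1):
        chain.append(url)
        # Case-insensitive check of the current URL in the chain.
        if any(s.lower() in url.lower() for s in search_strings):
            matches.append(url)
        status, location = fetch(url)
        if status not in (301, 302, 303, 307, 308) or not location:
            break  # final destination reached
        url = location  # follow one hop and check the next URL
    return chain, matches

# Stubbed fetch simulating a single redirect hop:
hops = {"https://example.com/out/1": (302, "https://tracker.example.net/click")}
fetch = lambda u: hops.get(u, (200, None))
print(check_link("https://example.com/out/1", ["tracker"], fetch))
```

Taking the `fetch` callable as a parameter keeps the redirect-walking logic testable without a network; in the real crawler it would be an HTTP client call with redirect-following disabled.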

Output Example

LinkHound
Domain: https://example.com
Looking for: tracking, analytics

Pages crawled    50
Links checked    234
Matches found    3
Queue size       5

Crawling complete!

Found 3 matching links:

1. Match found:
   Found on page: https://example.com/blog/post-1
   Link URL: https://example.com/out/link123
   Redirect chain:
        https://example.com/out/link123
     -> https://tracker.example.net/click?id=123

2. Match found:
   Found on page: https://example.com/about
   Link URL: https://analytics.service.com/track
   Matching URL: https://analytics.service.com/track

Use Cases

  • Finding and auditing affiliate links
  • Detecting tracking pixels and analytics
  • Identifying malicious redirects
  • Checking for broken or suspicious links
  • Compliance audits for link policies
  • SEO analysis

License

MIT