Triangle Web Scraper

A lightweight, modular, terminal-first Python web scraper that extracts links, phone numbers, emails (with optional MX verification), and addresses from web pages. It provides a clean CLI, a startup ASCII banner, and quiet exits on Ctrl+C.

Features

Modular Design: Extractors for links, phone numbers, emails, and addresses are organized in separate modules under extractors/.
CLI-Friendly: Supports both interactive and non-interactive modes for flexible usage.
Graceful Exit: Exits cleanly with code 0 on Ctrl+C.
Optional MX Verification: Email extraction includes optional domain MX record checks (requires dnspython).
Responsive: Uses requests with timeouts to handle network issues gracefully.
Customizable: Easily extendable with new extractors and configurable settings.

Requirements

The project includes a requirements.txt file with all necessary dependencies:

pyfiglet>=0.8.1 — For ASCII banner generation.
colorama>=0.4.6 — For colored terminal output.
requests>=2.28.2 — For HTTP requests.
beautifulsoup4>=4.12.2 — For HTML parsing.
phonenumbers>=8.13.12 — For phone number extraction and validation.
dnspython>=2.4.2 (optional) — For MX record verification during email extraction.
Pillow>=9.5.0 (optional) — For image processing features (if implemented).

Notes:

dnspython is only required for email MX verification.
Pillow is only needed for image-related features (not used in core scraping).
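
One common way to keep dnspython optional is to guard its import and fall back when the package is missing. The sketch below is an assumption about how that could look, not the project's actual code: `HAVE_DNSPYTHON`, `has_mx_record`, and the `lookup` parameter are hypothetical names chosen for illustration.

```python
# Hypothetical sketch: treat dnspython as an optional dependency.
try:
    import dns.resolver  # provided by the optional dnspython package
    HAVE_DNSPYTHON = True
except ImportError:
    HAVE_DNSPYTHON = False


def has_mx_record(domain, lookup=None):
    """Return True or False when a check can run, or None when it cannot.

    `lookup` is an illustrative hook: any callable that takes a domain and
    returns its MX answers. It defaults to a dnspython query when available.
    """
    if lookup is None:
        if not HAVE_DNSPYTHON:
            return None  # verification unavailable; the caller can report it
        lookup = lambda name: dns.resolver.resolve(name, "MX")
    try:
        return len(list(lookup(domain))) > 0
    except Exception:
        # NXDOMAIN, empty answers, timeouts and similar count as "no MX".
        return False
```

A caller can treat `None` as "verification unavailable" and still report the extracted address.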

Installation

1. Install System Packages (Debian/Ubuntu)

Ensure Python 3 and pip are installed (administrator privileges may be required):

sudo apt update
sudo apt install -y python3 python3-pip

For other operating systems (e.g., macOS, Windows, or other Linux distributions), install the equivalent Python 3 and pip packages using your package manager or the official Python website.

2. Set Up a Virtual Environment (Recommended)

To avoid conflicts with system Python packages, use a virtual environment:

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

3. Install Python Dependencies

From the project root, install the required packages:

python3 -m pip install -r requirements.txt

To update or regenerate requirements.txt after adding new dependencies:

Install new packages in your virtual environment and run:
```
python3 -m pip freeze > requirements.txt
```
This pins the exact version of every package installed in the environment, which makes installs reproducible.
Alternatively, use pipreqs to generate a requirements file based on imports:
```
python3 -m pip install pipreqs
pipreqs --force .
```

Usage

Run the scraper in either interactive or non-interactive mode.

Interactive Mode

Launch the script and follow the prompts to enter a URL and select an extraction option:

python3 webscraper.py

Non-Interactive Mode

Provide the URL via stdin and specify an extraction option (1–4):

echo "https://example.com" | python3 webscraper.py --choice 1
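
For readers curious how a `--choice` flag and a URL piped on stdin can be combined, here is a minimal sketch using argparse. It is not taken from webscraper.py: `OPTIONS`, `parse_choice`, and `read_url` are hypothetical names, and the option numbering assumes the order of the Extraction Options list.

```python
import argparse

# Assumed numbering, following the order of the Extraction Options list.
OPTIONS = {1: "links", 2: "phone numbers", 3: "emails", 4: "locations"}


def parse_choice(argv):
    """Parse --choice; None means no choice was given (interactive mode)."""
    parser = argparse.ArgumentParser(description="Scraper CLI sketch")
    parser.add_argument("--choice", type=int, choices=sorted(OPTIONS))
    return parser.parse_args(argv).choice


def read_url(lines):
    """Take the first line from any iterable of lines, such as sys.stdin."""
    return next(iter(lines), "").strip()


# Typical wiring (commented out so the sketch has no side effects):
# import sys
# choice = parse_choice(sys.argv[1:])
# url = read_url(sys.stdin) if choice is not None else input("URL: ")
```

With this layout, argparse rejects a value outside 1 to 4 with a usage error before any network request is made.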

Extraction Options:

1. Links: Extracts all hyperlinks (optionally checks reachability).
2. Phone Numbers: Extracts and validates phone numbers using phonenumbers.
3. Emails: Extracts email addresses with optional MX record verification (if dnspython is installed).
4. Locations: Extracts addresses using the OpenStreetMap Nominatim API.
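
As a rough illustration of the Links option, the sketch below collects hyperlinks and resolves them against the page URL. The project itself lists beautifulsoup4 for HTML parsing; this example deliberately swaps in the standard library's `html.parser` so it runs without extra packages. `LinkCollector` and `extract_links` are hypothetical names, not the project's API.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # keep the first occurrence only
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return unique absolute links in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Note that this sketch does not filter schemes, so `mailto:` or `javascript:` links would be returned as well.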

Early Termination

Press Ctrl+C at any time to exit the program cleanly (exit code 0).
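
A quiet Ctrl+C exit is usually implemented by catching `KeyboardInterrupt` around the entry point. The following is a minimal sketch of that pattern, assuming a `main` callable; `run_quietly` is a hypothetical helper and may differ from what webscraper.py does.

```python
def run_quietly(main):
    """Call main() and map Ctrl+C (KeyboardInterrupt) to exit code 0."""
    try:
        result = main()
    except KeyboardInterrupt:
        print()  # move past the "^C" that the terminal echoes
        return 0
    # Treat a main() that returns nothing as success.
    return 0 if result is None else result
```

The script's entry point would then call `sys.exit(run_quietly(main))`, so an interrupt produces no traceback and the shell sees exit code 0.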

Examples

Extract links non-interactively:

echo "https://example.com" | python3 webscraper.py --choice 1

Extract emails (MX verification runs only if dnspython is installed):

echo "https://example.com" | python3 webscraper.py --choice 3

Run interactively and follow prompts:
```
python3 webscraper.py
```

Behavior & Troubleshooting

Missing Dependencies: If you encounter an ImportError, ensure all packages in requirements.txt are installed using the same Python interpreter running the script.
Network Issues: The scraper uses timeouts in requests to prevent hanging. Network errors are logged but won’t crash the program.
MX Verification: If dnspython is not installed, email extraction will skip MX checks and note that verification is unavailable.
Disable ASCII Banner: To remove the startup banner, comment out the pyfiglet import and the banner call in webscraper.py, or edit triangle.py.
Logs: Check terminal output for detailed error messages or debugging information.
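
The network behavior described above can be sketched with requests as follows. This is an illustration under assumptions, not the project's code: `fetch_html`, its 10-second default, and the `session` parameter are hypothetical.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the page body, or None if the request fails.

    `session` is an illustrative hook; it defaults to a fresh requests.Session.
    """
    session = session or requests.Session()
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.RequestException as exc:
        # Timeouts, connection errors and HTTP errors all land here.
        print(f"[warn] could not fetch {url}: {exc}")
        return None
    return response.text
```

Returning `None` instead of raising lets the caller log the failure and continue rather than crash.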

Contributing

Contributions are welcome! To contribute:

Add New Extractors: Place new extractor modules under extractors/ and update requirements.txt with any new dependencies.
Update Dependencies: Use pinned versions (pip freeze > requirements.txt) for reproducible CI builds or pipreqs for minimal requirements.
Submit Pull Requests: Fork the repository, create a feature branch, and submit a pull request with clear descriptions of your changes.
Report Issues: Use the issue tracker to report bugs or suggest improvements.

Happy scraping with Triangle Web Scraper!
