bs4tools is a collection of utilities for web scraping and HTML parsing with BeautifulSoup 4 (BS4). The tools aim to be efficient, robust, and maintainable across common web scraping tasks.
webfetch.py is a command-line utility that downloads web pages for offline development. It fetches the HTML content at each given URL and saves it under a filename derived from the domain and URL structure.
- Download Web Pages: Fetch HTML content from given URLs.
- Offline Development: Save web pages locally for offline access.
- Debug Mode: Enable debug information for detailed logging.
python3 webfetch.py <URL1> <URL2> ...
python3 webfetch.py --debug <URL1> <URL2> ...
python3 webfetch.py --help
- Python 3
- BeautifulSoup 4
- requests
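The exact filename scheme is not described here, so the following is only a minimal sketch of one way a filename could be derived from the domain and URL structure. The helper name `url_to_filename` and its naming rules are assumptions for illustration, not necessarily what webfetch.py does.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url: str) -> str:
    """Build a filesystem-safe filename from a URL (hypothetical naming scheme)."""
    parsed = urlparse(url)
    # Combine domain and path, e.g. "example.com" + "/docs/intro"
    raw = parsed.netloc + parsed.path
    # Replace every run of characters other than letters, digits, dots,
    # and hyphens with a single underscore, then trim stray underscores
    safe = re.sub(r"[^A-Za-z0-9.\-]+", "_", raw).strip("_")
    # Make sure the saved file carries an HTML extension
    if not safe.endswith((".html", ".htm")):
        safe += ".html"
    return safe


print(url_to_filename("https://example.com/docs/intro"))
# example.com_docs_intro.html
```

Keeping the domain in the name avoids collisions when pages from several sites are saved into the same directory.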
datascraper.py is a command-line utility for extracting structured data, such as tables, from web pages. It lets you target specific tags and attributes and export the results in more than one format.
- Extract Structured Data: Target specific tags and attributes.
- Multiple Export Options: Supports CSV and JSON formats.
- User-Friendly Command-Line Interface: Customizable extraction parameters.
python3 datascraper.py <URL> [--tag TAG] [--attrs ATTRS] [--output-format FORMAT] [--file FILE_PATH]
- Python 3
- BeautifulSoup 4
- requests
- csv (Python standard library)
- json (Python standard library)
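As a rough illustration of the CSV and JSON export step, the sketch below serializes already-extracted rows with the standard `csv` and `json` modules. The function name `export_rows` and the list-of-dicts row shape are assumptions made for this example; datascraper.py may structure its data differently.

```python
import csv
import io
import json


def export_rows(rows, output_format="csv"):
    """Serialize a list of dicts as CSV or JSON text (hypothetical helper)."""
    if output_format == "json":
        return json.dumps(rows, indent=2)
    if output_format == "csv":
        buffer = io.StringIO()
        # Use the keys of the first row as the CSV header
        fieldnames = list(rows[0].keys()) if rows else []
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {output_format}")


rows = [{"name": "Ada", "score": "90"}, {"name": "Linus", "score": "85"}]
print(export_rows(rows, "csv"))
print(export_rows(rows, "json"))
```

Returning text instead of writing directly to disk keeps the serialization easy to test; the caller can write the result to whatever path was passed with `--file`.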
contentextractor.py is a command-line utility for extracting specific content from HTML files or URLs. It helps you explore a page's HTML structure and identify the right tags, classes, and attributes to use in BeautifulSoup 4 code.
- Interactive Exploration: Query different tags, classes, and attributes to view content.
- Batch Extraction: Extract specific content based on user-defined parameters.
- Preview Mode: Preview targeted content.
- Export Options: Supports text format.
- URL Support: Extract content directly from a URL.
python3 contentextractor.py --file <HTML_FILE> --interactive
python3 contentextractor.py --url <URL> --interactive
python3 contentextractor.py --file <HTML_FILE> --tag TAG [--attrs ATTRS] [--preview] [--export FORMAT]
python3 contentextractor.py --url <URL> --tag TAG [--attrs ATTRS] [--preview] [--export FORMAT]
python3 contentextractor.py --file <HTML_FILE> --search "your_search_text"
python3 contentextractor.py --url <URL> --search "your_search_text"
- Python 3
- BeautifulSoup 4
- requests (for URL support)
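For orientation, the tag, attribute, and search options map naturally onto BeautifulSoup's `find_all`. The snippet below is a minimal sketch of that idea using an inline HTML sample; the sample markup and variable names are made up for illustration, and it does not reproduce contentextractor.py's own code.

```python
import re

from bs4 import BeautifulSoup

# Small inline sample standing in for a downloaded page (illustrative only)
html = """
<html><body>
  <h1 class="title">Release notes</h1>
  <p class="intro">Version 2 adds a search option.</p>
  <p>Unrelated paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag plus attribute filter: every <p> whose class is "intro"
intro = [p.get_text(strip=True) for p in soup.find_all("p", attrs={"class": "intro"})]

# Text search: every text node containing the word "search"
matches = [s.strip() for s in soup.find_all(string=re.compile("search"))]

print(intro)
print(matches)
```

Trying filters like these against a saved page is a quick way to confirm which tag and attribute combination isolates the content you want before scripting a full extraction.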
- HTML Validator (htmlvalidator.py): Validates downloaded HTML files.
- Site Map Generator (sitemapgenerator.py): Creates a website's structure map.
- Link Checker (linkchecker.py): Validates internal and external links.
- Search Engine (searchengine.py): Performs keyword searches within content.
- Visualization Tool (visualizationtool.py): Visualizes HTML elements or extracted data.
- Rate Limiter (ratelimiter.py): Manages download rates.
- Proxy Manager (proxymanager.py): Manages proxies for anonymous scraping.
- User-Agent Rotator (useragentrotator.py): Rotates user-agent strings.
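To make the rate-limiting idea concrete, here is a minimal sketch of one common approach: enforcing a minimum interval between consecutive requests. The `RateLimiter` class and its parameters are hypothetical and say nothing about how ratelimiter.py is or will be implemented.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive requests (illustrative sketch)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested without waiting
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


limiter = RateLimiter(min_interval=0.1)
for url in ["https://example.com/a", "https://example.com/b"]:
    limiter.wait()  # pauses before the second iteration
    print("fetching", url)
```

Calling `wait()` before each download keeps request spacing in one place, so the fetching code does not need its own sleep logic.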
If you have ideas, suggestions, or improvements, please feel free to contribute!
MIT License
Draeician, 2023