This package contains classes and modules for web scraping tasks.
Web scraping is the process of extracting data from websites. This package provides functionality to scrape content from web pages and to read text from PDF files.
The WebScraper class in the scraper.py module is responsible for scraping content from web pages. It adheres to the following SOLID principles:
- Single Responsibility: the class handles only web scraping tasks and the error handling related to scraping.
- Open/Closed: the class is open for extension, so additional features can be added without modifying existing code.
- Interface Segregation: the class provides methods specifically for web scraping, so callers do not depend on functionality they do not use.
- Dependency Inversion: the class works through the high-level interfaces of the requests and BeautifulSoup libraries rather than implementing HTTP or HTML parsing itself, which keeps it flexible and easy to maintain.
- Error handling (a robustness goal rather than a SOLID principle): the class handles unexpected situations gracefully and reports the issues it encounters.
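The dependency and error-handling points above can be sketched roughly as follows. This is a minimal illustration, not the package's actual WebScraper: the names WebScraperSketch, fetch_with_requests, extract_text_with_bs4, and scrape are hypothetical, as are the 10-second timeout and the choice to return None on failure. The two helpers import requests and BeautifulSoup lazily so that either one can be swapped out.

```python
from typing import Callable, Optional


def fetch_with_requests(url: str) -> str:
    """Download a page with requests (imported lazily)."""
    import requests

    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    return response.text


def extract_text_with_bs4(html: str) -> str:
    """Strip markup with BeautifulSoup (imported lazily)."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True)


class WebScraperSketch:
    """Hypothetical stand-in for WebScraper; not the package's real API."""

    def __init__(
        self,
        fetch: Callable[[str], str] = fetch_with_requests,
        extract_text: Callable[[str], str] = extract_text_with_bs4,
    ) -> None:
        # The class only knows "something that fetches" and "something that
        # extracts text", so either collaborator can be replaced.
        self._fetch = fetch
        self._extract_text = extract_text

    def scrape(self, url: str) -> Optional[str]:
        """Return the page text, or None if fetching or parsing fails."""
        try:
            return self._extract_text(self._fetch(url))
        except Exception as exc:  # broad on purpose: report and carry on
            print(f"Failed to scrape {url}: {exc}")
            return None
```

With both libraries installed, WebScraperSketch().scrape("https://example.com") would use the default helpers. Passing a different fetch or extract_text callable extends the behavior without editing the class, which is the sense in which the sketch is open for extension.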
The PdfReader class in the pdf_reader.py module is responsible for reading text content from PDF files. It adheres to the following SOLID principles:
- Single Responsibility: the class handles only PDF file reading tasks and the error handling related to reading.
- Open/Closed: the class is open for extension, so additional features can be added without modifying existing code.
- Interface Segregation: the class provides methods specifically for reading PDF files, so callers do not depend on functionality they do not use.
- Dependency Inversion: the class works through the high-level interface of the PyPDF2 library rather than implementing PDF parsing itself, which keeps it flexible and easy to maintain.
- Error handling (a robustness goal rather than a SOLID principle): the class handles unexpected situations gracefully and reports the issues it encounters.
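A comparable sketch for the PDF side is shown below. It is an assumption-laden illustration rather than the package's real PdfReader: PdfReaderSketch, page_texts_with_pypdf2, and read_text are hypothetical names, and returning None on failure is a design choice made for the example. The helper assumes a PyPDF2 release that exposes PdfReader, pages, and extract_text (2.0 or later).

```python
from typing import Callable, Iterable, Optional


def page_texts_with_pypdf2(path: str) -> Iterable[str]:
    """Yield the text of each page using PyPDF2 (imported lazily)."""
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0 naming

    reader = PdfReader(path)
    for page in reader.pages:
        # Fall back to an empty string if a page yields no text.
        yield page.extract_text() or ""


class PdfReaderSketch:
    """Hypothetical stand-in for the package's PdfReader class."""

    def __init__(
        self,
        load_pages: Callable[[str], Iterable[str]] = page_texts_with_pypdf2,
    ) -> None:
        # The page loader is injected so another PDF backend can be used
        # without changing this class.
        self._load_pages = load_pages

    def read_text(self, path: str) -> Optional[str]:
        """Return all page text joined by newlines, or None on failure."""
        try:
            return "\n".join(self._load_pages(path))
        except Exception as exc:  # broad on purpose: report and carry on
            print(f"Failed to read {path}: {exc}")
            return None
```

Because the loader is a generator, a missing file or a missing PyPDF2 installation surfaces while the pages are being joined, inside the try block, so read_text reports the problem and returns None instead of raising.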
The package requires the following external libraries, which can be installed using pip: