Project Summary: Web Scraping Tool with GUI
Objective: Develop a web scraping tool with a graphical user interface (GUI) that allows users to extract data from websites that require a login and offer search functionality. The tool will automate the process of logging in, performing a search, and scraping the desired data from the website. The scraped data will be stored in a SQLite database for further analysis and processing.
Key Features:
- GUI Interface: The tool will provide a GUI built with PyQt5, allowing users to input the website URL, login credentials, search query, and the CSS selectors used for data extraction and navigation.
- Login Automation: The tool will automate the login process by filling in the login form with user-provided credentials and confirming success by checking for the presence of a specific element on the page (see the login sketch after this list).
- Search Functionality: After a successful login, the tool will allow users to enter a search query and perform a search on the website. The search results will then be loaded and ready for data extraction.
- Data Extraction: Users will provide CSS selectors that identify the desired data elements on the page. The tool will extract the matching data from the search results and store it in raw form in a SQLite database (see the extraction sketch after this list).
- Pagination Handling: If the search results are paginated, the tool will automatically navigate through the pages by clicking the "next" button until all pages have been processed (see the pagination sketch after this list).
- Progress Tracking: The GUI will display the progress of the scraping process, including the page currently being scraped, the number of items extracted so far, and any errors encountered.
- Database Storage: The scraped data will be stored in a SQLite database for persistent storage and future analysis. The database will be created and managed by the tool, using the following columns: URL, date, and div contents (see the storage sketch after this list).
- Error Handling: The tool will incorporate robust error handling to gracefully handle and report errors related to login, search, data extraction, and database operations.
- Logging: Detailed logging will capture important events, errors, and debug information during the scraping process. Logs will be written to a file for later review and troubleshooting (see the logging sketch after this list).
- Configuration Management: The tool will store user-provided settings such as the website URL, login credentials, and CSS selectors in a configuration file. These settings will be loaded automatically when the tool is launched, allowing users to quickly resume their scraping tasks (see the configuration sketch after this list).
- Multithreading: The scraping process will run in a separate thread to keep the GUI responsive and prevent blocking of user interactions (see the worker sketch after this list).
- Modularity and Reusability: The codebase will be organized into modular components, separating the GUI, scraping, database, and utility functions. This modular architecture will enhance code reusability and maintainability.
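
A minimal login sketch with Selenium, assuming the form fields, submit button, and success marker are all identified by user-supplied CSS selectors (the `selectors` dictionary keys below are illustrative placeholders, not a fixed API):

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def login(driver, url, username, password, selectors):
    """Fill in the login form and wait for a user-supplied success marker."""
    driver.get(url)
    driver.find_element(By.CSS_SELECTOR, selectors["username"]).send_keys(username)
    driver.find_element(By.CSS_SELECTOR, selectors["password"]).send_keys(password)
    driver.find_element(By.CSS_SELECTOR, selectors["submit"]).click()
    # A successful login is detected by the presence of a specific element;
    # a TimeoutException here means the login most likely failed.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selectors["success"]))
    )


# Example wiring (Chrome needs a matching ChromeDriver available on the PATH):
# driver = webdriver.Chrome()
# login(driver, "https://example.com/login", "user", "secret", selectors)
```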
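
Extraction can then run over the page source Selenium has already loaded; a sketch with BeautifulSoup, where `item_selector` stands in for whatever CSS selector the user typed into the GUI:

```python
from bs4 import BeautifulSoup


def extract_items(page_source, item_selector):
    """Return the raw HTML of every element matching the user's CSS selector."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [str(element) for element in soup.select(item_selector)]


# e.g. items = extract_items(driver.page_source, "div.search-result")
```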
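
For pagination, one simple strategy is to keep clicking a user-supplied "next" selector until it no longer appears; this sketch assumes the last page simply lacks that button (sites that disable the button instead would need an extra check):

```python
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def scrape_all_pages(driver, next_selector, handle_page):
    """Run handle_page on every results page, following the "next" button."""
    while True:
        handle_page(driver.page_source)
        try:
            next_button = driver.find_element(By.CSS_SELECTOR, next_selector)
        except NoSuchElementException:
            break  # no "next" button: the last page has been processed
        next_button.click()
        # Wait for the old page to go stale before parsing the next one.
        WebDriverWait(driver, 10).until(EC.staleness_of(next_button))
```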
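
A storage sketch matching the column layout named above (URL, date, div contents); the table name and the ISO-8601 date format are assumptions:

```python
import sqlite3
from datetime import datetime, timezone


def init_db(path="scraped_data.db"):
    """Create the database and its table if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scraped_items (
               url          TEXT NOT NULL,
               date         TEXT NOT NULL,
               div_contents TEXT NOT NULL
           )"""
    )
    conn.commit()
    return conn


def save_items(conn, url, items):
    """Insert one row per extracted element, stamped with the scrape time."""
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO scraped_items (url, date, div_contents) VALUES (?, ?, ?)",
        [(url, now, item) for item in items],
    )
    conn.commit()
```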
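
Logging can use the standard library directly; the file name and message format below are placeholders:

```python
import logging


def setup_logger(log_file="scraper.log"):
    """Send DEBUG-and-above messages to a file for later troubleshooting."""
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```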
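
Configuration management could be a thin wrapper around configparser writing to the config.ini file listed in the project structure; the section and key names are assumptions, and keeping credentials in a plain-text file is a convenience-versus-security trade-off worth flagging to users:

```python
import configparser

CONFIG_PATH = "config.ini"


def save_settings(settings):
    """Persist a flat dict of user settings, e.g. URL, username, selectors."""
    parser = configparser.ConfigParser()
    parser["scraper"] = settings
    with open(CONFIG_PATH, "w") as config_file:
        parser.write(config_file)


def load_settings():
    """Load saved settings; returns an empty dict on first launch."""
    parser = configparser.ConfigParser()
    parser.read(CONFIG_PATH)  # a missing file is silently skipped
    return dict(parser["scraper"]) if "scraper" in parser else {}
```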
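
Finally, a worker sketch covering the threading and progress-tracking features, using the standard PyQt5 worker-object pattern; the signal signatures and the placeholder `_scrape_pages` generator are assumptions:

```python
from PyQt5.QtCore import QObject, QThread, pyqtSignal


class ScraperWorker(QObject):
    progress = pyqtSignal(int, int)  # current page, total items extracted
    error = pyqtSignal(str)
    finished = pyqtSignal()

    def run(self):
        try:
            total_items = 0
            for page_number, items in enumerate(self._scrape_pages(), start=1):
                total_items += len(items)
                self.progress.emit(page_number, total_items)
        except Exception as exc:
            self.error.emit(str(exc))
        finally:
            self.finished.emit()

    def _scrape_pages(self):
        yield []  # placeholder: yield each page's extracted items here


# Wiring from the main window, so the GUI thread never blocks:
# thread = QThread()
# worker = ScraperWorker()
# worker.moveToThread(thread)
# thread.started.connect(worker.run)
# worker.finished.connect(thread.quit)
# thread.start()
```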
Technologies and Libraries:
- Python: The primary programming language for the project.
- PyQt5: The library used for building the GUI.
- Selenium: A web automation library used for interacting with web pages, handling login, and performing searches.
- BeautifulSoup: A library for parsing HTML and extracting the desired data from web pages.
- SQLite: A lightweight, embedded database engine for storing the scraped data, accessed through Python's built-in sqlite3 module.
- Requests: A library for making HTTP requests to web pages.
- ChromeDriver: The web driver used by Selenium to automate the Google Chrome browser.
Project Structure: The project will be organized into the following files:
- main.py: The entry point of the application, responsible for creating the GUI and starting the application.
- main_window.py: Defines the main window of the GUI and handles user interactions.
- scraper.py: Defines the core scraping functionality, including login, search, and data extraction.
- database.py: Defines the database operations for storing scraped data.
- config.py: Handles configuration management for storing and loading user settings.
- logger.py: Implements logging functionality for capturing important events and errors.
- config.ini: A configuration file for storing user-provided settings.
- requirements.txt: A file listing the project dependencies (a sketch follows below).
- README.md: A readme file providing an overview of the project and installation instructions.
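
A plausible requirements.txt given the libraries above (versions left unpinned as an assumption; sqlite3 ships with Python, and ChromeDriver is installed alongside Chrome rather than via pip):

```
PyQt5
selenium
beautifulsoup4
requests
```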