This repository contains a Python-based project demonstrating web scraping techniques to extract book data and images from a website. The project is inspired by the tutorial “Web Scraping JSON Data and Images from HTML: Python Tutorial 📖” by Moulaye Eli, published on CamelCodes.net.
This project focuses on:
- Extracting book details and images from the CamelCodes Books Page.
- Structuring the extracted data into a JSON format.
- Processing and saving images as thumbnails.
The aim is to provide a comprehensive guide to web scraping, including handling both textual and visual data.
To run this project, ensure that Python 3 is installed on your system. Additionally, you will need to install the following Python libraries:
beautifulsoup4for parsing HTML content.pillowfor image processing.requestsfor making HTTP requests.
Create a requirements.txt file with the following content:
beautifulsoup4==4.12.3
pillow==10.2.0
requests==2.31.0
Install the dependencies using the following command:
python3 -m pip install -r requirements.txtfile_utils.py: Contains utility functions for file operations such as creating directories and saving JSON files.image_utils.py: Includes functions for downloading, resizing, and saving images.web_utils.py: Manages web requests and HTML content retrieval.data_extractors.py: Extracts and processes book information from HTML content.timing_utils.py: Provides timing utilities to measure script execution duration.main.py: Orchestrates the web scraping process, including data extraction and image processing.
- Setup: Ensure the necessary folders for JSON and image storage are created by the script.
- Execution: The script will fetch HTML content from the target URL, extract book details, and process images.
- Output: The extracted data will be saved in a JSON file, and images will be resized and stored in the designated folder.
To execute the script, run:
python3 main.pyUpon successful execution, you will find:
- A
books.jsonfile containing the structured book data. - Resized images saved in the specified directory.
This project is inspired by the tutorial Web Scraping JSON Data and Images from HTML: Python Tutorial 📖 by Moulaye Eli.