This assignment focuses on scraping product information from Amazon, including product URLs, names, prices, ratings, and review counts. The scraped data is stored in a CSV file. The assignment is divided into two parts:
Part 1: Scraping Product Listings
In this part, we scrape product listings from Amazon. The requirements are:
- Scraping at least 20 pages of product listings.
- Collecting the following information:
- Product URL
- Product Name
- Product Price
- Rating
- Number of reviews
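As a rough sketch of how the 20 listing pages could be enumerated, the snippet below builds one search URL per page. The domain, the search term, and the `k` and `page` query parameters are assumptions made for illustration (the source does not state them), and `build_search_urls` is a hypothetical helper, not a function from the project.

```python
from urllib.parse import urlencode

# Hypothetical values: the source names neither the Amazon domain nor the search term.
BASE_URL = "https://www.amazon.com/s"
SEARCH_TERM = "bags"


def build_search_urls(term, pages=20, base_url=BASE_URL):
    """Return one search-results URL per page, with pages numbered from 1."""
    return [
        f"{base_url}?{urlencode({'k': term, 'page': page})}"
        for page in range(1, pages + 1)
    ]


search_urls = build_search_urls(SEARCH_TERM)
print(len(search_urls), "pages, starting at", search_urls[0])
```

Each URL in `search_urls` would then be fetched and parsed for the five fields listed above.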
Part 2: Scraping Individual Product Pages
Using the product URLs obtained in Part 1, we visit each URL and collect additional information for around 200 products. The information to be scraped includes:
- Description
- ASIN
- Product Description
- Manufacturer
The entire dataset will be exported in CSV format.
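The export step might look roughly like the sketch below. It uses the standard-library `csv` module instead of pandas so that it runs without extra installs. The column names in `FIELDS`, the helper `write_products_csv`, and the sample row are all made up for illustration.

```python
import csv
import io

# Hypothetical column names covering the fields from Part 1 and Part 2.
FIELDS = [
    "url", "name", "price", "rating", "review_count",
    "description", "asin", "product_description", "manufacturer",
]


def write_products_csv(rows, out):
    """Write product dicts to an open text stream; missing fields become empty cells."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row; a real run would pass the scraped records instead.
sample = [{"url": "https://www.example.com/dp/B000000000", "name": "Sample bag", "price": "499"}]
buffer = io.StringIO()
write_products_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk rather than to memory, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` as `out`.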
To run the code, you'll need the following Python packages:
- requests: for making HTTP requests (pip install requests)
- beautifulsoup4: for parsing HTML content (pip install beautifulsoup4)
- pandas: for working with CSV files (pip install pandas)
- validators: for URL validation (pip install validators)
- fake_useragent: for generating random user agents to prevent bot detection (pip install fake-useragent)
The code consists of several key components:
- Fetching Product URLs: hit_all_pages() scrapes product URLs from Amazon's search pages and saves them in a CSV file.
- Cleaning Valid URLs: valid_csv_generator() reads the CSV file generated in the previous step, validates the URLs, and creates a new CSV file containing only the valid URLs.
- Scraping Product Information: product_info_scraper() takes the valid product URLs, scrapes the required information from the individual product pages, and stores it in a CSV file.
- Scraping Functions: These functions (scrap_name(), scrap_price(), etc.) scrape specific fields from product pages.
- Main block: The code in __main__ executes the entire workflow.
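The URL-cleaning step can be illustrated with a small sketch. It is not the project's valid_csv_generator(): it swaps the `validators` package for the standard-library `urllib.parse` so that it runs without extra installs, and the helper names `is_valid_url` and `keep_valid_urls` are hypothetical.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Accept only absolute http(s) URLs that include a host name."""
    if not isinstance(url, str):
        return False
    try:
        parts = urlparse(url.strip())
    except ValueError:
        # urlparse raises ValueError on some malformed hosts.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def keep_valid_urls(urls):
    """Return the valid URLs in their original order, dropping duplicates."""
    seen = set()
    valid = []
    for url in urls:
        if not is_valid_url(url):
            continue
        url = url.strip()
        if url not in seen:
            seen.add(url)
            valid.append(url)
    return valid


scraped = ["https://www.example.com/dp/B000000000", "/dp/B000000000", "", None]
print(keep_valid_urls(scraped))
```

Relative links such as `/dp/B000000000` are rejected here; whether to drop them or resolve them against the site's base URL is a design choice the source does not specify.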
During scraping, Amazon occasionally returns a different response, such as a CAPTCHA HTML page, even though the HTTP status code is 200. This issue occurs when retrieving details from individual product pages and parsing them with BeautifulSoup.
This appears to be caused by Amazon's anti-scraping measures on product pages.
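Because the status code alone cannot be trusted here, one possible mitigation is to inspect the response body before parsing it and to retry after a pause. The sketch below is an illustration under assumptions: the marker strings in `CAPTCHA_MARKERS` are guesses at what the CAPTCHA page contains and should be checked against a real blocked response, and `fetch` is a hypothetical callable that returns the page HTML (for example, a thin wrapper around `requests.get(url, headers=...).text`).

```python
import time

# Assumed markers of the CAPTCHA page; verify them against an actual blocked response.
CAPTCHA_MARKERS = ("/errors/validateCaptcha", "Enter the characters you see below")


def looks_like_captcha(html):
    """Return True if the HTML contains any of the assumed CAPTCHA markers."""
    lowered = html.lower()
    return any(marker.lower() in lowered for marker in CAPTCHA_MARKERS)


def fetch_with_retry(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url) until it returns non-CAPTCHA HTML; return None if every attempt fails."""
    for attempt in range(attempts):
        html = fetch(url)
        if html and not looks_like_captcha(html):
            return html
        if attempt < attempts - 1:
            # Wait a little longer after each failed attempt.
            time.sleep(delay * (attempt + 1))
    return None
```

Returning `None` rather than raising lets the caller skip the product and continue with the next URL. Retrying does not guarantee success, since the block may persist for the session.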