Amazon Web Scraper Project

Scrapes product data from Amazon and cleans it with Python, automating the process so the intended product data is collected and cleaned on a daily basis.

Overview

This web scraper, built using Python, Beautiful Soup, and Requests, is designed to extract information from Amazon product pages. The goal is to retrieve product details such as title, price, ratings, and the date of collection, storing the data in a CSV file for further analysis or use.

Skills Demonstrated

  • Web Scraping Proficiency
  • Data Processing and Cleaning
  • Automation and Scripting
  • CSV Data Storage

Tools

  • Python (download from python.org)
  • Jupyter Notebook, with the Beautiful Soup, pandas, and Requests libraries

Features

  • Product Information Extraction: Extracts essential product details, including title, price, ratings, and date.
  • CSV Data Storage: Saves the extracted information in a CSV file (AmazonWebScraperDataset.csv) for convenient data manipulation.
  • Customizable: Easily adaptable for different Amazon product pages or additional information extraction.

Usage

  1. Install Jupyter
pip install jupyter
  2. Launch Jupyter Notebook
jupyter notebook
  3. Create a new Python notebook in the Jupyter Notebook interface.
  4. Import the necessary libraries
    from bs4 import BeautifulSoup
    import requests
    import time      # used in step 10 to schedule the daily scrape
    import datetime  # used to timestamp each row
    import smtplib   # only needed for the optional email-alert sketch after the Usage steps
  5. Connect to the website to pull the data
       URL = 'https://www.amazon.com/Funny-Data-Systems-Business-Analyst/dp/B07FNW9FGJ/ref=sr_1_5?crid=2B4LQHJDAJHLR&dib=eyJ2IjoiMSJ9.WiKhGOLdBAacALLGC9ayNOuSFgH2acw5wZ3-xstLg4_swdoRKRSjtvVVD-eNmgait23JGAUqu0oK-D8jDjw0oaPjZ3j0poQyDL2ZxeamSjs0xqmLeBl6pagYqF4RZGE7sGRH2FOV-St2pHZjjKVX_8Rnx4RWyzGhZk-xwOva4C6alCFwePf4O0l7aJ-HhLgxSQZiLtenf5ghDJZZ-i7ZFvyPsP0__0KA0B4qKsmVAdAgMv-07nnPCiwUPa1ghSFCCy2mBxCWGCK2PrfPRYEIX1QXzfyGy41-LImlMCNDZ8E.962m5aTRMj4rk4pjp5JZ6P9Y5RYt0594NNnjEwvPpMg&dib_tag=se&keywords=data+analyst+shirt&qid=1708148038&sprefix=data+analyst+shirt%2Caps%2C729&sr=8-5'
       headers = {"User-Agent": "# YOUR USER AGENT GOES HERE", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}
       
       page = requests.get(URL, headers=headers)
       
       soup1 = BeautifulSoup(page.content, "html.parser")
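       # (Optional sanity check, not in the original notebook: Amazon returns a
       # non-200 status or a CAPTCHA page when it blocks a request, so it is
       # worth confirming the fetch succeeded before parsing.)
       # print(page.status_code)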
  6. Organize the data scraped from the website
       soup2 = BeautifulSoup(soup1.prettify(), "html.parser")  # re-parse the prettified markup; optional, but it normalizes whitespace
  7. Query and clean the required data from the web scrape
    product = soup2.find(id='productTitle').get_text(strip=True)

    # The price is split across two spans; guard against pages where the
    # price block is missing so the lookup does not raise an error.
    price = None
    price_span = soup2.find('span', class_='a-price aok-align-center reinventPricePriceToPayMargin priceToPay')
    if price_span:
        whole_part = price_span.find('span', class_='a-price-whole').get_text(strip=True)
        fraction_part = price_span.find('span', class_='a-price-fraction').get_text(strip=True)
        price = f"{whole_part}{fraction_part}"

    ratings_span = soup2.find('span', class_='a-size-base a-color-base')
    ratings = ratings_span.get_text(strip=True) if ratings_span else None

    print(product)
    print(price)
    print(ratings)
  8. Create a timestamp to track when the data was collected
    today = datetime.date.today()
    print(today)
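    # (Assumption, beyond the original: datetime.datetime.now() would record the
    # time of day as well, useful if you ever scrape more than once per day.)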
  9. Create a CSV file and populate it with the data we pulled and cleaned
      import csv

      header = ['Product', 'Price', 'Rating', 'Date']
      data = [product, price, ratings, today]

      # Write mode ('w') creates the file and writes the header row once.
      with open('AmazonWebScraperDataset.csv', 'w', newline='', encoding='UTF8') as f:
          writer = csv.writer(f)
          writer.writerow(header)
          writer.writerow(data)

      import pandas as pd

      # The CSV is created in the notebook's working directory, so the same
      # relative filename reads it back.
      df = pd.read_csv('AmazonWebScraperDataset.csv')
      print(df)

      # On subsequent runs, append mode ('a+') adds rows without rewriting the header.
      with open('AmazonWebScraperDataset.csv', 'a+', newline='', encoding='UTF8') as f:
          writer = csv.writer(f)
          writer.writerow(data)

      df = pd.read_csv('AmazonWebScraperDataset.csv')
      print(df)
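      # (Assumption, not in the original notebook: parsing the Date column as
      # datetimes makes the accumulated rows easy to sort or plot over time.)
      # df = pd.read_csv('AmazonWebScraperDataset.csv', parse_dates=['Date'])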
  10. Automate the process to pull the data on a set schedule, clean it, and append it to the CSV file
    def check_price():
        URL = 'https://www.amazon.com/Funny-Data-Systems-Business-Analyst/dp/B07FNW9FGJ/ref=sr_1_5?crid=2B4LQHJDAJHLR&dib=eyJ2IjoiMSJ9.WiKhGOLdBAacALLGC9ayNOuSFgH2acw5wZ3-xstLg4_swdoRKRSjtvVVD-eNmgait23JGAUqu0oK-D8jDjw0oaPjZ3j0poQyDL2ZxeamSjs0xqmLeBl6pagYqF4RZGE7sGRH2FOV-St2pHZjjKVX_8Rnx4RWyzGhZk-xwOva4C6alCFwePf4O0l7aJ-HhLgxSQZiLtenf5ghDJZZ-i7ZFvyPsP0__0KA0B4qKsmVAdAgMv-07nnPCiwUPa1ghSFCCy2mBxCWGCK2PrfPRYEIX1QXzfyGy41-LImlMCNDZ8E.962m5aTRMj4rk4pjp5JZ6P9Y5RYt0594NNnjEwvPpMg&dib_tag=se&keywords=data+analyst+shirt&qid=1708148038&sprefix=data+analyst+shirt%2Caps%2C729&sr=8-5'
        headers = {"User-Agent": "# YOUR USER AGENT GOES HERE", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

        page = requests.get(URL, headers=headers)
        soup1 = BeautifulSoup(page.content, "html.parser")
        soup2 = BeautifulSoup(soup1.prettify(), "html.parser")

        product = soup2.find(id='productTitle').get_text(strip=True)

        price = None
        price_span = soup2.find('span', class_='a-price aok-align-center reinventPricePriceToPayMargin priceToPay')
        if price_span:
            whole_part = price_span.find('span', class_='a-price-whole').get_text(strip=True)
            fraction_part = price_span.find('span', class_='a-price-fraction').get_text(strip=True)
            price = f"{whole_part}{fraction_part}"

        ratings_span = soup2.find('span', class_='a-size-base a-color-base')
        ratings = ratings_span.get_text(strip=True) if ratings_span else None

        today = datetime.date.today()

        data = [product, price, ratings, today]

        # Append mode: each run adds one row under the existing header.
        with open('AmazonWebScraperDataset.csv', 'a+', newline='', encoding='UTF8') as f:
            writer = csv.writer(f)
            writer.writerow(data)

    # Run the scraper once every 24 hours; each pass appends a row to the CSV.
    while True:
        check_price()
        time.sleep(86400)

    # At any point (from a separate notebook cell, since the loop above blocks),
    # inspect the collected data:
    df = pd.read_csv('AmazonWebScraperDataset.csv')
    print(df)
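The imports in step 4 include smtplib, which the walkthrough itself never calls. Below is a minimal sketch of the price-drop email alert it could power; the SMTP host, addresses, and app password are placeholders, not values from the original notebook:

    # Hypothetical alert helper -- all credentials below are placeholders.
    def send_alert(product, price):
        server = smtplib.SMTP_SSL('smtp.gmail.com', 465)  # assumed Gmail SMTP; use your provider's host
        server.login('you@example.com', 'your-app-password')
        msg = f"Subject: Amazon price drop\n\n{product} is now at {price}"
        server.sendmail('you@example.com', 'you@example.com', msg)
        server.quit()

check_price() could then call send_alert(product, price) whenever the scraped price falls below a threshold you choose.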

Customization

Feel free to customize the notebook to meet your specific scraping needs. Modify headers, adapt to different Amazon product pages, or add more features as required.

    # Example customization: add more fields to the CSV header...
    header = ['Product', 'Price', 'Rating', 'Date', 'AdditionalField']
    # ...and append the matching values to the data list
    data = [product, price, ratings, today, additional_data]
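For instance, the extra field might be the review count. A short sketch, assuming Amazon's acrCustomerReviewText element id (an assumption that may change with the page's markup):

    # Hypothetical extra field: number of customer reviews.
    # The element id is an assumption and may vary across product pages.
    review_span = soup2.find(id='acrCustomerReviewText')
    additional_data = review_span.get_text(strip=True) if review_span else 'N/A'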

Contributing

Contributions are welcome! If you have ideas for improvements, bug fixes, or additional features, open an issue or submit a pull request.


#Pikkachoo 😫😁🦾