Toscrape

🛠️ Web Scraping Exploration with Selenium

Take a gentle dive into the basics of web scraping with this repository! Using Selenium, the project walks you through extracting data from books and quotes websites. It's a simple yet effective exercise to get hands-on experience with web scraping techniques. The data collected is neatly organized into a CSV file, offering a practical glimpse into data processing. Whether you're new to web scraping or just looking for a straightforward example, this repository provides a humble starting point for your exploration. Happy coding!

Files

Steps

Setting Up Libraries

Selenium is a powerful web automation library for Python, widely used for web scraping and testing.
pip install selenium
Pandas is a versatile data manipulation library in Python, commonly employed for data analysis and storage, such as saving data to CSV files.
pip install pandas

Getting Started

  1. Create a webdriver instance and open the site
from selenium import webdriver

driver = webdriver.Chrome()
url = "http://books.toscrape.com/"
driver.get(url)
  2. Chrome should open and show the banner
    Chrome is being controlled by automated test software.

Explicit Waits

Use explicit waits so the script pauses until an element is actually present and clickable before interacting with it:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# next_page_button_locator is assumed to be a (By, value) tuple defined earlier in the script
try:
    # Explicitly wait for the next page button to be present
    WebDriverWait(driver, 20).until(EC.presence_of_element_located(next_page_button_locator))

    # Explicitly wait for the next page button to be clickable
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable(next_page_button_locator))

    # Find the next page button and click it
    next_page_button = driver.find_element(*next_page_button_locator)
    next_page_button.click()

except Exception as e:
    # Report the failure and reload the page so the click can be tried again
    print(f"Exception: {type(e).__name__} - {e}. Refreshing the page and retrying click.")
    driver.refresh()
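The snippet above refreshes the page after a failure, but the retry itself has to come from a surrounding loop. A minimal sketch of a bounded retry is shown here; the helper name `retry` and its parameters are hypothetical and not part of Selenium or of this repository.

```python
import time


def retry(action, recover, attempts=3, delay=0.0):
    """Call action(); on failure call recover() and try again.

    Returns whatever action() returns. Re-raises the last exception
    once all `attempts` tries have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as e:
            print(f"Attempt {attempt} failed: {type(e).__name__} - {e}")
            if attempt == attempts:
                raise  # out of tries: let the caller see the error
            recover()          # for a scraper this could be driver.refresh
            time.sleep(delay)  # optional pause before the next try
```

With Selenium, `action` could be a small function that waits for the next page button and clicks it, and `recover` could be `driver.refresh`. Capping the number of attempts keeps the scraper from looping forever on the last page, where no next button exists.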

Data Extraction

Use the By class to choose a locator strategy when identifying elements:

from selenium.webdriver.common.by import By
  • find_element(By.CSS_SELECTOR, some_string) Finds an element using a CSS selector. It replaces the older find_element_by_css_selector.
  • find_element(By.XPATH, some_string) Finds an element by XPath. It replaces the older find_element_by_xpath.
  • find_element(By.CLASS_NAME, some_string) Finds an element by class name. It replaces the older find_element_by_class_name.

These methods return an instance of WebElement.

WebElement

  • element.click() Clicks the element
  • element.get_attribute('class') Reads an attribute such as class or title
  • element.text Returns the element's visible text
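Values read through `element.text` and `element.get_attribute('class')` arrive as plain strings, so they usually need a small conversion step before storage. The sketch below assumes that the books site shows prices as text like "£51.77" and encodes the star rating in a class value like "star-rating Three". Both formats, and the helper names `parse_price` and `parse_rating`, are assumptions made for illustration.

```python
import re

# Assumed mapping for class values such as "star-rating Three".
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Turn price text such as '£51.77' into a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group())


def parse_rating(class_attr):
    """Turn a class attribute such as 'star-rating Three' into 3.

    Returns None when no known rating word is present.
    """
    for word in class_attr.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None
```

In a scraper, these would be called as `parse_price(price_element.text)` and `parse_rating(rating_element.get_attribute('class'))`, where both elements are hypothetical WebElement instances found with `find_element`.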

Store data

Save a list of lists as a data frame using Pandas

import pandas as pd

df = pd.DataFrame(books_list)

Save the data frame to a CSV file for further use

df.to_csv('path-to-folder/booksToScrape.csv', index=True)
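Without column names, the CSV header is just 0, 1, 2 and so on. A minimal sketch of naming the columns follows; the sample rows and the column names `title`, `price`, and `rating` are illustrative assumptions, not the repository's actual schema.

```python
import io

import pandas as pd

# Hypothetical sample rows; in the scraper these would come from books_list.
books_list = [
    ["A Light in the Attic", 51.77, 3],
    ["Tipping the Velvet", 53.74, 1],
]

# Naming the columns gives the CSV a readable header row.
df = pd.DataFrame(books_list, columns=["title", "price", "rating"])

# Write to an in-memory buffer here; pass a file path instead to save to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row-number column
csv_text = buffer.getvalue()
```

Whether to keep the index is a matter of preference: `index=True` adds a leading column of row numbers, while `index=False` writes only the scraped fields.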

Finally

Close the browser

driver.quit()