WEB SCRAPING
Introduction
- For data scientists, data is the key resource before proceeding to the next step. Data can be collected in different ways, some of which are:
1. From databases
2. From different online open-source datasets
3. Through web scraping
4. From some other sources
RESOURCES
- For web scraping we can proceed in different ways using different resources; here I'm using Selenium:
Selenium is an automation testing framework for web applications/websites that can also control the browser to navigate a website just like a human. Selenium uses a web-driver package that can take control of the browser and mimic user actions to trigger desired events.
Setup:
Selenium:
To download the Selenium package, execute the below pip command in a terminal:
pip install selenium
Selenium Drivers:
Web drivers enable Python to control the browser via OS-level interactions. Web drivers use the browser's built-in support for automation, so in order to control the browser, the web driver must be installed and accessible via the PATH variable of the operating system (only required in the case of manual installation).
Driver Installation:
Download the drivers from the official sites for Chrome, Firefox, and Edge. Opera drivers can also be downloaded from the Opera Chromium project hosted on GitHub.
Safari 10 on OS X El Capitan and macOS Sierra has built-in support for the automation driver. This guide contains snippets to interact with popular web drivers, though Safari is used as the default browser throughout this guide.
Other browsers like UC and Netscape cannot be used for automation. The legacy Selenium RC (Remote Control) tool can control browsers by injecting its own JavaScript code and can be used for UI testing.
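If a manually installed driver is not on the PATH, Selenium can also be pointed at the binary explicitly. A minimal sketch, assuming Selenium 4 and a hypothetical ChromeDriver location:

# Assumes Selenium 4+; the driver path below is a hypothetical placeholder.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at the manually installed driver binary.
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)
driver.quit()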
Data Extraction :
Let's get started by searching for sarees, collecting their prices and ratings, and downloading the CSV file(s), with the following steps:
- Import Dependencies and Create a Driver Instance: The initial step is to create a webdriver object for a particular browser by importing it from selenium, as sketched below.
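A minimal sketch of this step, assuming Selenium 4 with a driver already on the PATH; the search URL is a hypothetical placeholder, not the actual site used:

# Assumes Selenium 4+ and a driver available on the PATH.
from selenium import webdriver

driver = webdriver.Chrome()  # or webdriver.Safari(), webdriver.Firefox(), ...
driver.get("https://www.example.com/search?q=sarees")  # placeholder URL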
Locating WebElements:
Selenium offers a wide variety of functions to locate an element on the web page, such as:
- find_element_by_id: Use an id to find an element.
- find_element_by_name: Use a name to find an element.
- find_element_by_xpath: Use an XPath to find an element.
- find_element_by_link_text: Use the text value of a link to find an element.
- find_element_by_partial_link_text: Find an element by matching part of a hyperlink's text (anchor tag).
- find_element_by_tag_name: Use a tag name to find an element.
- find_element_by_class_name: Use the value of the class attribute to find an element.
- find_element_by_css_selector: Use a CSS selector (id, class, etc.) to find an element. Alternatively, use find_element with a By locator, as:
images = driver.find_element(By.XPATH, "")
Use the plural versions of these functions to find all occurrences of a searched value; just use elements instead of element, as:
images = driver.find_elements_by_xpath("")
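A hedged sketch putting these locators together with the driver created in the previous step; the class names and XPath below are hypothetical, since the exact markup depends on the target site:

# Assumes Selenium 4+; class names and XPath here are hypothetical examples.
from selenium.webdriver.common.by import By

# Collect every product card on the results page (hypothetical class name).
cards = driver.find_elements(By.CLASS_NAME, "product-card")
for card in cards:
    # A relative XPath (".//") searches within each card (hypothetical structure).
    name = card.find_element(By.XPATH, ".//h2").text
    price = card.find_element(By.CLASS_NAME, "price").text
    print(name, price)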
XPath:
XPath is an expression-path syntax for finding an object in the DOM. XPath has its own syntax to find a node from the root element, either via an absolute path or anywhere in the document using a relative path. Below is an explanation of the XPath syntax with examples:
- /: Selects a node from the root. /html/body/div[1] will find the first div.
- //: Selects nodes in the document from the current node. //form[1] will find the first form element.
- [@attributename='value']: Use this syntax to find a node whose attribute has the required value.
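As a hedged illustration of the three forms above, reusing the driver from earlier (the class value in the last expression is hypothetical):

# XPath locator examples; the class value 'rating' is hypothetical.
from selenium.webdriver.common.by import By

# Absolute path from the root of the document.
first_div = driver.find_element(By.XPATH, "/html/body/div[1]")

# Relative path: the first form element anywhere in the document.
first_form = driver.find_element(By.XPATH, "//form[1]")

# Attribute predicate: a div whose class attribute equals 'rating'.
rating = driver.find_element(By.XPATH, "//div[@class='rating']")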
Another simple way to get an XPath is via the inspect-element option: just right-click on the desired node and choose the Copy XPath option.
The final step is to download the images and store the scraped data in CSV format, as sketched below.
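A minimal sketch of this final step, reusing the hypothetical class names and structure from the sketches above; csv and urllib are from Python's standard library:

# Assumes the hypothetical locators from the earlier sketches.
import csv
import urllib.request
from selenium.webdriver.common.by import By

rows = []
for card in driver.find_elements(By.CLASS_NAME, "product-card"):
    name = card.find_element(By.XPATH, ".//h2").text
    price = card.find_element(By.CLASS_NAME, "price").text
    rating = card.find_element(By.CLASS_NAME, "rating").text
    img_url = card.find_element(By.TAG_NAME, "img").get_attribute("src")
    rows.append([name, price, rating, img_url])

# Store the extracted fields in CSV format.
with open("sarees.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price", "rating", "image_url"])
    writer.writerows(rows)

# Download each product image; file names are derived from the row index.
for i, row in enumerate(rows):
    urllib.request.urlretrieve(row[3], f"saree_{i}.jpg")

driver.quit()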
References
- To install and learn more about Selenium, visit this URL
- For more information about web scraping with Python, visit Web Scraping with Selenium
- Web scraping examples with Python
- Web scraping using Selenium