This repository contains various Python scripts to check web scraping techniques for certain sites.
How we can use on our local machine?
Make sure your OS has already installed Python 3.x.x
, that comes with pip
, pip
is a package manager for Python and is included by default with the Python installer. It helps to install and uninstall Python packages (such as Django!).
Setting up a virtual environment:
After cloning this repo, enter the following command on your terminal:
python -m venv venv
It will create the virtual environment for the following project, where we will install all our project dependencies. There is a venv
directory automatically created in your project that Git will not track.
After that we have to activate the virtual environment and install the project dependencies there.
For Windows:
Enter the following command at the root of your cloned repo.
venv\Scripts\activate
For Unix:
Enter the following command at the root of your cloned repo.
. venv\bin\activate
Install dependencies:
pip install -r requirements.txt
Scrapy commands:
Scrapy start a new project in your virtual environment:
scrapy startproject ietf_scraper
Scrapy will create a template directory with all its stub files.
cd ietf_scraper
ietf_scraper/scrapy.cfg
holds the configuration information.
ietf_scraper/ietf_scraper/items.py
defines the objects or the entities that we are scraping.
ietf_scraper/ietf_scraper/middlewares.py
contains various Scrapy hooks.
ietf_scraper/ietf_scraper/pipelines.py
defined functions that create and filter items.
ietf_scraper/ietf_scraper/settings.py
contains project settings.
ietf_scraper/ietf_scraper/spiders/
scrapy project is basically collection of spiders. We have to tell scrapy each new spider we want to make with the new command.
Create a new scrapy spider for the website pythonscraping.com:
Go to the following directory in your virtual environment:
cd ietf_scraper/ietf_scraper/spiders
We will create a new spider.
Run the command:
scrapy genspider ietf pythonscraping.com
This command will create a file named ietf.py
and that file will contain class definition:
import scrapy
class IetfSpider(scrapy.Spider):
name = "ietf"
allowed_domains = ["pythonscraping.com"]
start_urls = ["https://pythonscraping.com/linkedin/ietf.html"]
def parse(self, response):
pass
Modified:
import scrapy
class IetfSpider(scrapy.Spider):
name = "ietf"
allowed_domains = ["pythonscraping.com"]
start_urls = ["https://pythonscraping.com/linkedin/ietf.html"]
# Scrapy will scrape the content, and it will pass this in second parameter
# named response which is basically a parsed Python dictionary of data
# We have to capture the title written on the span tag something like this:
# <span class="title">A Standard for the Transmission of IP Datagrams on Avian Carriers</span>
# There are 2 ways to capture this title string with scrapy using
# 1.XPath
# 2.Using CSS
def parse(self, response):
# Using CSS ::text is pseudo selector and get function will
# get the first tag retrieved that matches the CSS.
# If we wanted to get list of tags then we have to use getAll()
# title = response.css('span.title::text').get()
# Using same with XPath approach,
title = response.xpath('//span[@class="title"]/text()').get()
return {"title": title}
Run the spider:
scrapy runspider ietf.py
We will see a lot of output in the Terminal, please search and find following text in your Terminal:
{'title': 'A Standard for the Transmission of IP Datagrams on Avian Carriers'}
It means the spider has run successfully.
Notes:
//h1
double slash h1
allow us to select tags anywhere in the page without starting from the
top level e.g. /html/body/div/h1
of the tree hierarchy of HTML.
//div/h1
This will select only the h1
tags that are immediate children of the div
tag.
//div//h1
This would select an h1
tag that falls anywhere under a div
tag, regardless of whether or not it was an immediate child.
span[@class="title"]
This would select element by attribute using this [@class="title"]
symbol.
//span[@class="title"]
This will select any span
tags that have the class "title"
or the attribute of the span
tag is equal to the string title
e.g. class="title"
.
Case:
We can also select the text from the attribute of the tags themselves, so if we wanted to select the id
attribute
we do @id
, and this will select the value
of the id
attribute where the class attribute is "title"
.
//span[@class="title"]/@id
but if we don't want the attribute text or attribute value, What we want the inner text from the tags. For this we will use XPath selector /text()
.
//span[@class="title"]/text()