modern_webscraping

udemy course - Modern Web Scraping with Python using Scrapy Splash Selenium

Resources

Sites

  • Try jsoup - Interactive HTML/XML Parser
  • XPath Playground - Interactive XPath Parser

StackOverflow

  • Running Scrapy on Lambda
  • Get results from Scrapy on Lambda
  • Scrapy in Lambda as a layer
  • Package scrapy dependencies to Lambda
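
The gist of those links, as a minimal sketch (assuming Scrapy and its dependencies are packaged with the function, e.g. as a layer; the spider, selectors and settings here are illustrative, not from the course):

```python
# Minimal sketch of running a Scrapy spider inside an AWS Lambda handler.
# Items are written to a JSON feed in /tmp (the only writable path on Lambda)
# and returned as the response body.
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }


def handler(event, context):
    feed_path = "/tmp/items.json"
    process = CrawlerProcess(settings={
        "LOG_ENABLED": False,
        "FEEDS": {feed_path: {"format": "json", "overwrite": True}},
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl finishes
    # Caveat: Twisted's reactor cannot be restarted, so a warm (reused)
    # container will fail on a second invocation - see the links above.
    with open(feed_path) as f:
        return {"statusCode": 200, "body": f.read()}
```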

CSS Selectors

  • Note: CSS selectors are used when covering Splash
  • Be careful with tag selection - avoid selecting tags by name alone
  1. By class: .className
     • Can belong to multiple elements
  2. By id: #idName
     • Belongs to a single element
  3. Multiple classes: .classOne.classTwo
  4. Attributes: elementName[attributeName='value']

Examples

  1. 'a' elements where href starts with https: a[href^='https']
  2. 'a' elements where href ends with fr: a[href$='fr']
  3. 'a' elements where href contains google: a[href*='google']
  4. All paragraphs inside a div with a specific class: div.intro p
     • Note: span elements inside the div are also descendants, but they are not selected because the selector targets p
     • To include them, change to div.intro p, div.intro span
  5. All direct children of an element: div.intro > p
  6. Element immediately after an element: div.intro + p
  7. First item in a list: li:nth-child(1)
  8. First and third items: li:nth-child(1), li:nth-child(3)
  9. Odd/even items in a list: li:nth-child(odd) / li:nth-child(even)
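
A quick way to sanity-check these selectors is parsel, the selector library Scrapy uses under the hood; the HTML snippet below is made up purely for illustration:

```python
from parsel import Selector

html = """
<div class="intro">
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <span>Not a paragraph</span>
</div>
<ul>
  <li><a href="https://www.google.fr">Google France</a></li>
  <li><a href="http://example.com">Example</a></li>
</ul>
"""
sel = Selector(text=html)

sel.css("a[href^='https']::attr(href)").getall()   # href starts with https
sel.css("a[href$='fr']::attr(href)").getall()      # href ends with fr
sel.css("a[href*='google']::attr(href)").getall()  # href contains google
sel.css("div.intro p::text").getall()              # paragraphs only, the span is excluded
sel.css("div.intro p, div.intro span").getall()    # paragraphs plus the span
sel.css("li:nth-child(odd) a::text").getall()      # odd list items
```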

Theory

  1. Foreign Attributes: [attributeName='value']
  2. Value lookup: [attributeName^='start'] / [attributeName*='contains'] / [attributeName$='end']
  3. Position
  4. Direct children: element > element

CSS Combinators

  1. All p elements placed after the div
     • There can be other elements in between
     • div ~ p
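
A small, made-up example of how >, + and ~ differ (again with parsel):

```python
from parsel import Selector

html = """
<div class="intro"><p>inside</p></div>
<h2>heading</h2>
<p>first after the div</p>
<p>second after the div</p>
"""
sel = Selector(text=html)

sel.css("div.intro > p::text").getall()  # ['inside'] - direct child only
sel.css("div.intro + p::text").getall()  # [] - the element right after the div is an h2, not a p
sel.css("div.intro ~ p::text").getall()  # both paragraphs after the div; the h2 in between is allowed
```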

XPath Fundamentals

  • Richer in functionality than CSS selectors
  • Unlike CSS selectors, XPath can also navigate up the tree (covered below)

Examples

  1. All h1 elements, regardless of position: //h1
  2. p elements within a div with a specific class: //div[@class='intro']/p
  3. div elements with a class of intro or outro
     • or logical operator: //div[@class='intro' or @class='outro']/p
  4. Text value of the selected elements: //div[@class='intro' or @class='outro']/p/text()
  5. href value from link elements: //a/@href
  6. Links where href starts with https
     • starts-with() function: //a[starts-with(@href, 'https')]
  7. Links where href ends with fr
     • NOTE: ends-with() is only supported in XPath 2.0; lxml and Chrome only support 1.0
     • //a[ends-with(@href, 'fr')]
  8. Links where href contains specific text: //a[contains(@href, 'google')]
  9. Links where the link text (not the href) contains specific text
     • NOTE: the value passed is case sensitive
       • text(), 'france' vs text(), 'France': //a[contains(text(), 'France')]
  10. First list item from an element: //ul[@id='items']/li[1]
  11. First and last list items: //ul[@id='items']/li[position() = 1 or position() = last()]
  12. All list items after the first: //ul[@id='items']/li[position() > 1]
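
The same expressions can be exercised with parsel's .xpath(); the HTML below is made up for illustration:

```python
from parsel import Selector

html = """
<div class="intro"><p>Intro text</p></div>
<div class="outro"><p>Outro text</p></div>
<a href="https://www.google.fr">Google France</a>
<ul id="items">
  <li>one</li><li>two</li><li>three</li>
</ul>
"""
sel = Selector(text=html)

sel.xpath("//div[@class='intro' or @class='outro']/p/text()").getall()
sel.xpath("//a/@href").getall()
sel.xpath("//a[starts-with(@href, 'https')]/@href").getall()
sel.xpath("//a[contains(text(), 'France')]/text()").getall()
sel.xpath("//ul[@id='items']/li[position() = 1 or position() = last()]/text()").getall()
sel.xpath("//ul[@id='items']/li[position() > 1]/text()").getall()
```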

XPath - Navigating Up The Tree

  • Cannot do this with CSS Selectors
  1. Parent of a p element where the p's id is unique: //p[@id='unique']/parent::div
     • parent in XPath is called an axis
     • Axes are used to navigate the HTML markup
  2. NOTE: Sometimes we do not know what the parent element is; node() matches it regardless of its tag: //p[@id='unique']/parent::node()
  3. Ancestors - returns the parent, grandparent, and so on: //p[@id='unique']/ancestor::node()
  4. Return the ancestors or the element itself: //p[@id='unique']/ancestor-or-self::node()
  5. Preceding - returns all elements that precede the p element
     • Excludes ancestors: //p[@id='unique']/preceding::node()
     • Example: get the h1 element that precedes the p: //p[@id='unique']/preceding::h1
  6. Preceding sibling
     • "Brother" element
     • Elements are siblings if they share the same parent: //p[@id='outside']/preceding-sibling::node()
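
A short parsel sketch of the upward axes, with made-up HTML that mirrors the ids used above:

```python
from parsel import Selector

html = """
<div>
  <h1>Title</h1>
  <p id="outside">outside</p>
  <div class="intro">
    <p id="unique">unique</p>
  </div>
</div>
"""
sel = Selector(text=html)

sel.xpath("//p[@id='unique']/parent::div").get()           # the wrapping div
sel.xpath("//p[@id='unique']/parent::node()").get()        # same, without naming the tag
sel.xpath("//p[@id='unique']/ancestor::node()").getall()   # parent, grandparent, ... up to the root
sel.xpath("//p[@id='unique']/preceding::h1/text()").get()  # the h1 that comes before the p
sel.xpath("//p[@id='outside']/preceding-sibling::node()").getall()  # nodes before it under the same parent
```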

XPath - Navigating Down The Tree

  1. Get all p children of an element: //div[@id='intro']/child::p
  2. Get all children of an element, regardless of type: //div[@id='intro']/child::node()
  3. All elements that come after a specific element
     • i.e. after its closing tag: //div[@class='intro']/following::node()
  4. All elements after an element that share the same parent: //div[@class='intro']/following-sibling::node()
  5. All children and grandchildren of an element: //div[@class='intro']/descendant::node()
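
And the downward axes, same idea (made-up HTML):

```python
from parsel import Selector

html = """
<div class="intro" id="intro">
  <p>child one</p>
  <p>child two <b>bold</b></p>
</div>
<h2>after the intro</h2>
"""
sel = Selector(text=html)

sel.xpath("//div[@id='intro']/child::p/text()").getall()               # p children only
sel.xpath("//div[@id='intro']/child::node()").getall()                 # all children, text nodes included
sel.xpath("//div[@class='intro']/following::node()").getall()          # everything after the closing tag
sel.xpath("//div[@class='intro']/following-sibling::node()").getall()  # later nodes sharing the parent
sel.xpath("//div[@class='intro']/descendant::node()").getall()         # children and grandchildren
```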

XPath - Theory

  1. Any element: //elementName
  2. Class name, id, or attribute: elementName[@attribute='value'], elementName[@id='value'], elementName[@class='value']
  3. Position: //li[1], //li[position() = 1 or position() = 2], //li[position() = 1 and contains(text(), 'hello')]
  4. Functions: starts-with(), contains(), ends-with() (not supported in XPath 1.0)
  5. Predicates
     • Conditions
     • The content within []
  6. Axes: axisName::elementName
     • Up:
       • parent
       • ancestor
       • preceding
       • preceding-sibling
     • Down:
       • child
       • following - elements after the closing tag of an element
       • following-sibling - sibling elements that come after the element
       • descendant - children and grandchildren of an element
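
For instance, the combined predicate from point 3 above behaves like this (made-up HTML):

```python
from parsel import Selector

html = "<ul><li>hello world</li><li>goodbye</li></ul>"
sel = Selector(text=html)

# Both conditions inside the predicate must hold: first position AND matching text.
sel.xpath("//li[position() = 1 and contains(text(), 'hello')]/text()").get()  # 'hello world'
```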

Scraping APIs

  • quotes.toscrape.com/scroll
  • To check for APIs, go to the Network tab and filter to XHR requests
  • XHR stands for XMLHttpRequest

Notes

  • The API URL will typically be different from the website URL
  • When scraping APIs, use the basic spider template
    • Unlike the crawl template it has no Rule objects, but that is rarely a problem here
    • Most of the time, there are no links to follow
  • Flesh out the API structure before assembling the scraper (see the sketch below)
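
A sketch of what such a spider can look like against the /scroll page's JSON endpoint; the URL pattern and field names below are what the Network tab shows for this particular site and should be verified there first:

```python
import json

import scrapy


class QuotesApiSpider(scrapy.Spider):
    # Basic template spider, e.g. created with:
    #   scrapy genspider quotes_api quotes.toscrape.com
    name = "quotes_api"
    # Endpoint and page parameter assumed from the XHR request in the Network tab.
    api_url = "http://quotes.toscrape.com/api/quotes?page={}"
    start_urls = [api_url.format(1)]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "text": quote["text"],
                "author": quote["author"]["name"],
                "tags": quote["tags"],
            }
        # The payload says whether another page exists; request it manually,
        # since there are no links for a Rule to extract.
        if data.get("has_next"):
            yield scrapy.Request(
                self.api_url.format(data["page"] + 1), callback=self.parse
            )
```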