Udemy course - Modern Web Scraping with Python using Scrapy Splash Selenium
- Try jsoup - Interactive HTML/XML Parser
- XPath Playground - Interactive XPath Parser
- Running Scrapy on Lambda
- Get results from Scrapy on Lambda
- Scrapy in Lambda as a layer
- Package scrapy dependencies to Lambda
- Note: CSS selectors are used when covering Splash
- Be careful with tag selection
- Don't want to just call tags by name
- By class
.className
- Can belong to multiple elements
- By id
#idName
- Belongs to single element
- Multiple classes
.classOne.classTwo
- Attributes
elementName[identifierName=identifierValue]
- 'a' elements where href starts with https
a[href^='https']
- 'a' elements where href ends with .fr
a[href$='.fr']
- 'a' elements where href contains google
a[href*='google']
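A minimal sketch of what the three attribute operators test, emulated with plain string checks on hrefs collected by the standard library's `html.parser`. The sample markup and the `LinkCollector` name are hypothetical; a real scraper would pass the selectors to its selector engine instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Hypothetical sample markup, not taken from the course.
SAMPLE = """
<a href="https://www.google.com">Google</a>
<a href="http://www.example.fr">Example</a>
<a href="https://www.google.fr">Google France</a>
"""

collector = LinkCollector()
collector.feed(SAMPLE)
hrefs = collector.hrefs

# Each list mirrors one selector from the notes.
starts_with_https = [h for h in hrefs if h.startswith("https")]  # a[href^='https']
ends_with_fr = [h for h in hrefs if h.endswith(".fr")]           # a[href$='.fr']
contains_google = [h for h in hrefs if "google" in h]            # a[href*='google']
```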
- All paragraphs inside a div with specific class
div.intro p
Note: span elements within the div are also descendants, but this selector does not match them
- To include them, change to
div.intro p, div.intro span
- All direct children of an element
div.intro > p
- Element immediately after another element (adjacent sibling)
div.intro + p
- First item in a list
li:nth-child(1)
- First and third item
li:nth-child(1), li:nth-child(3)
- Odd/even items in a list
li:nth-child(odd)
li:nth-child(even)
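The position rules are easy to get off by one, since nth-child counts from 1 while Python indexes from 0. A small sketch with a hypothetical list standing in for the li children of a single ul (assuming the ul contains only li elements, since nth-child counts every sibling element):

```python
# Hypothetical list items standing in for the <li> children of one <ul>.
items = ["first", "second", "third", "fourth", "fifth"]

# li:nth-child(1): CSS positions start at 1, Python indexes start at 0.
first = items[0]

# li:nth-child(1), li:nth-child(3)
first_and_third = [items[0], items[2]]

# li:nth-child(odd) covers positions 1, 3, 5, ...
odd_items = items[0::2]

# li:nth-child(even) covers positions 2, 4, ...
even_items = items[1::2]
```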
- Foreign Attributes
[attributeName='value']
- Value lookup
[attributeName^='start']
[attributeName*='between']
[attributeName$='end']
- Position
- Direct Children
Element > element
- All p elements placed after the div (same parent)
- There can be other elements in between
div ~ p
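The four combinators differ only in which relatives of the div they look at. A rough emulation with the standard library's `xml.etree.ElementTree`, which has no CSS support, so each selector is rebuilt by hand. The sample markup is hypothetical and has to be well-formed XML for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree parses XML, not loose HTML).
SAMPLE = """
<body>
  <div class="intro">
    <p>direct child</p>
    <section><p>nested descendant</p></section>
  </div>
  <p>adjacent sibling</p>
  <span>in between</span>
  <p>later sibling</p>
</body>
"""

body = ET.fromstring(SAMPLE)
intro = body.find("div[@class='intro']")

# div.intro p: every p at any depth below the div
descendants = [p.text for p in intro.iter("p")]

# div.intro > p: only p elements that are direct children of the div
children = [p.text for p in intro.findall("p")]

# Siblings are the elements that come after the div under the same parent.
siblings = list(body)
after_div = siblings[siblings.index(intro) + 1:]

# div.intro + p: the very next sibling, and only if it is a p
adjacent = [el.text for el in after_div[:1] if el.tag == "p"]

# div.intro ~ p: every later p sibling, other elements may sit in between
general = [el.text for el in after_div if el.tag == "p"]
```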
- Richer in functionality than CSS selectors
- Do not explicitly
- all h1 elements, regardless of position
//h1
- p elements within div with specific class
//div[@class='intro']/p
- p elements within div elements with class of intro or outro
- or is a logical operator
//div[@class='intro' or @class='outro']/p
- Text value of selected elements
//div[@class='intro' or @class='outro']/p/text()
- href value from link elements
//a/@href
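A sketch of the basic expressions using the standard library's `xml.etree.ElementTree`. It implements only a subset of XPath: paths need a leading `.`, there is no `or` inside predicates, and `text()` and `@href` are read through `.text` and `.get()` instead. The sample markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
SAMPLE = """
<html><body>
  <h1>Title</h1>
  <div class="intro"><p>Intro text</p><a href="https://www.google.com">Google</a></div>
  <div class="outro"><p>Outro text</p></div>
  <div class="other"><p>Other text</p></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# //h1: ElementTree needs a leading "." to search below the root element
headings = [h.text for h in root.findall(".//h1")]

# //div[@class='intro']/p/text(): select the elements, then read .text
intro_text = [p.text for p in root.findall(".//div[@class='intro']/p")]

# //div[@class='intro' or @class='outro']/p/text()
# ElementTree has no "or" in predicates, so run one query per class.
# Results come back grouped by class rather than in strict document order.
either_text = [
    p.text
    for cls in ("intro", "outro")
    for p in root.findall(f".//div[@class='{cls}']/p")
]

# //a/@href: select the elements, then read the attribute
hrefs = [a.get("href") for a in root.findall(".//a")]
```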
- Links where href starts with https
- Uses the starts-with() function
//a[starts-with(@href, 'https')]
- Links where href ends with fr
- NOTE: ends-with is only supported in XPath 2.0
- XML, Chrome only support 1.0
//a[ends-with(@href, 'fr')]
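A common XPath 1.0 workaround builds ends-with out of `substring()` and `string-length()`. The sketch below keeps that expression as a string and replays its arithmetic in plain Python so the index maths can be checked; `xpath_substring` and `ends_with_xpath1` are hypothetical helper names, not part of any library.

```python
# XPath 1.0 expression that behaves like ends-with(@href, 'fr').
XPATH_1_0_ENDS_WITH = (
    "//a[substring(@href, string-length(@href) - string-length('fr') + 1) = 'fr']"
)


def xpath_substring(value: str, start: int) -> str:
    """Mimic XPath 1.0 substring(value, start) for integer starts (positions begin at 1)."""
    return value[max(start, 1) - 1:]


def ends_with_xpath1(value: str, suffix: str) -> bool:
    """Replay: substring(value, string-length(value) - string-length(suffix) + 1) = suffix."""
    start = len(value) - len(suffix) + 1
    return xpath_substring(value, start) == suffix


matches_fr = ends_with_xpath1("https://www.google.fr", "fr")
matches_com = ends_with_xpath1("https://www.google.com", "fr")
```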
- Links where href has specific text
//a[contains(@href, 'google')]
- Links where the link text (not href) contains specific text
- NOTE: Value passed is case sensitive
- text(), 'france' vs text(), 'France'
//a[contains(text(), 'France')]
- First list item from an element
//ul[@id='items']/li[1]
- First and last list elements
//ul[@id='items']/li[position() = 1 or position() = last()]
- All list items after the first
//ul[@id='items']/li[position() > 1]
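A sketch of the position predicates with the standard library's `xml.etree.ElementTree`. It understands integer positions and `last()`, but not comparisons such as `position() > 1`, so that case is handled with a list slice. The sample list is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
SAMPLE = """
<body>
  <ul id="items">
    <li>one</li><li>two</li><li>three</li><li>four</li>
  </ul>
</body>
"""

root = ET.fromstring(SAMPLE)

# //ul[@id='items']/li[1]: XPath positions start at 1
first = root.find(".//ul[@id='items']/li[1]").text

# last() is accepted by ElementTree as a position predicate
last = root.find(".//ul[@id='items']/li[last()]").text

# li[position() = 1 or position() = last()]
first_and_last = [first, last]

# li[position() > 1]: ElementTree cannot compare positions, so slice the list
items = root.findall(".//ul[@id='items']/li")
after_first = [li.text for li in items[1:]]
```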
- Cannot do this with CSS Selectors
- Parent of a p element where the p id = unique
- Parent in XPath is called an axis
- Used to navigate the HTML markup
//p[@id='unique']/parent::div
- NOTE: Sometimes we do not know what the parent is
- node() figures out the parent element
//p[@id='unique']/parent::node()
- Ancestors - returns the parent, grandparent, and every ancestor above them
//p[@id='unique']/ancestor::node()
- Return ancestors or the element itself
//p[@id='unique']/ancestor-or-self::node()
- Preceding - Returns all elements that precede the p element
- Excludes ancestors
//p[@id='unique']/preceding::node()
- Example: Get the h1 element that precedes a p
//p[@id='unique']/preceding::h1
- Preceding Sibling
- Brother element
- Elements are siblings if they share the same parent
//p[@id='outside']/preceding-sibling::node()
- Get all p children of an element
//div[@id='intro']/child::p
- Get all general children of an element
//div[@id='intro']/child::node()
- Following - All elements that are listed after a specific element
- After the closing tag
//div[@class='intro']/following::node()
- All elements after an element that share the same parent
//div[@class='intro']/following-sibling::node()
- All children + grandchildren of an element
//div[@class='intro']/descendant::node()
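The axes can be pictured as walks through the element tree. The sketch below emulates a few of them with the standard library's `xml.etree.ElementTree`, which does not implement axis syntax and exposes elements only (no text nodes or document node, unlike `node()`). The sample markup and the `parent_of` map are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
SAMPLE = """
<body>
  <h1>Heading</h1>
  <div class="intro">
    <p id="first">first</p>
    <p id="unique">unique</p>
    <span>after</span>
  </div>
  <footer>end</footer>
</body>
"""

root = ET.fromstring(SAMPLE)
target = root.find(".//p[@id='unique']")

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

# parent::node()
parent = parent_of[target]

# ancestor::node(): walk upwards until the root element is reached
ancestors = []
node = target
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# preceding-sibling:: and following-sibling:: share the target's parent
siblings = list(parent)
position = siblings.index(target)
preceding_siblings = [el.tag for el in siblings[:position]]
following_siblings = [el.tag for el in siblings[position + 1:]]

# descendant::node() on the div: every element below it, at any depth
descendants = [el.tag for el in parent.iter() if el is not parent]
```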
- Any element
//elementName
- Class name, ID or attribute
elementName[@attribute='value']
elementName[@id='value']
elementName[@class='value']
- Position
//li[1]
//li[position() = 1 or position() = 2]
//li[position() = 1 and contains(text(), 'hello')]
- Functions
starts-with()
contains()
ends-with() (not supported in XPath 1.0)
- Predicates
- Conditions
- Content within []
- Axes
axisName::elementName
Up:
parent
ancestor
preceding
preceding-sibling
Down:
child
following - elements after the closing tag of an element
following-sibling - elements after the closing tag of an element that share its parent
descendant - children and grandchildren of element
quotes.toscrape.com/scroll
- To check for APIs, go to Network tab, filter to XHR requests
- XHR stands for XMLHttpRequest
- API URL will typically be different from website URL
- When scraping APIs, always use the basic template
- If you use a separate template, you cannot define the Rule object
- Most of the time, there are no links to follow
- Flesh out the API structure before assembling scraper
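Once the API structure is known, parsing one page usually reduces to loading the JSON and deciding whether there is a next page. A minimal sketch using only the standard library; the field names (`quotes`, `author`, `text`, `has_next`, `page`) and the `parse_page` helper are assumptions for illustration and should be checked against the actual response in the Network tab.

```python
import json

# Hypothetical response body; the field names are assumptions, not a documented contract.
SAMPLE_BODY = json.dumps({
    "has_next": True,
    "page": 1,
    "quotes": [
        {"author": {"name": "Author A"}, "text": "First quote"},
        {"author": {"name": "Author B"}, "text": "Second quote"},
    ],
})


def parse_page(body: str):
    """Return (items, next_page) for one JSON page; next_page is None on the last page."""
    data = json.loads(body)
    items = [
        {"author": quote["author"]["name"], "text": quote["text"]}
        for quote in data.get("quotes", [])
    ]
    next_page = data["page"] + 1 if data.get("has_next") else None
    return items, next_page


items, next_page = parse_page(SAMPLE_BODY)
```

In a spider, the same logic would sit in the parse callback: yield each item, then request the next page number while one exists.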