modern_webscraping

udemy course - Modern Web Scraping with Python using Scrapy Splash Selenium

Resources

Sites

  • Try jsoup - Interactive HTML/XML Parser
  • XPath Playground - Interactive XPath Parser

StackOverflow

  • Running Scrapy on Lambda
  • Get results from Scrapy on Lambda
  • Scrapy in Lambda as a layer
  • Package scrapy dependencies to Lambda
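
The gist of those links, as a minimal sketch (assuming Scrapy and its dependencies are packaged with the function, e.g. as a layer; the spider, selectors and settings here are illustrative, not from the course):

```python
# Minimal sketch of running a Scrapy spider inside an AWS Lambda handler.
# Items are written to a JSON feed in /tmp (the only writable path on Lambda)
# and returned as the response body.
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }


def handler(event, context):
    feed_path = "/tmp/items.json"
    process = CrawlerProcess(settings={
        "LOG_ENABLED": False,
        "FEEDS": {feed_path: {"format": "json", "overwrite": True}},
    })
    process.crawl(QuotesSpider)
    process.start()  # blocks until the crawl finishes
    # Caveat: Twisted's reactor cannot be restarted, so a warm (reused)
    # container will fail on a second invocation - see the links above.
    with open(feed_path) as f:
        return {"statusCode": 200, "body": f.read()}
```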

CSS Selectors

  • Note: CSS selectors are used when covering Splash
  • Be careful with tag selection - avoid selecting tags by name alone
  1. By class: .className
     • Can belong to multiple elements
  2. By id: #idName
     • Belongs to a single element
  3. Multiple classes: .classOne.classTwo
  4. Attributes: elementName[attributeName='value']

Examples

  1. 'a' elements where href starts with https: a[href^='https']
  2. 'a' elements where href ends with fr: a[href$='fr']
  3. 'a' elements where href contains google: a[href*='google']
  4. All paragraphs inside a div with a specific class: div.intro p
     • Note: span elements inside the div are also descendants, but they are not selected because the selector targets p
     • To include them, change to div.intro p, div.intro span
  5. All direct children of an element: div.intro > p
  6. Element immediately after an element: div.intro + p
  7. First item in a list: li:nth-child(1)
  8. First and third items: li:nth-child(1), li:nth-child(3)
  9. Odd/even items in a list: li:nth-child(odd) / li:nth-child(even)
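
A quick way to sanity-check these selectors is parsel, the selector library Scrapy uses under the hood; the HTML snippet below is made up purely for illustration:

```python
from parsel import Selector

html = """
<div class="intro">
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <span>Not a paragraph</span>
</div>
<ul>
  <li><a href="https://www.google.fr">Google France</a></li>
  <li><a href="http://example.com">Example</a></li>
</ul>
"""
sel = Selector(text=html)

sel.css("a[href^='https']::attr(href)").getall()   # href starts with https
sel.css("a[href$='fr']::attr(href)").getall()      # href ends with fr
sel.css("a[href*='google']::attr(href)").getall()  # href contains google
sel.css("div.intro p::text").getall()              # paragraphs only, the span is excluded
sel.css("div.intro p, div.intro span").getall()    # paragraphs plus the span
sel.css("li:nth-child(odd) a::text").getall()      # odd list items
```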

Theory

  1. Foreign Attributes: [attributeName='value']
  2. Value lookup: [attributeName^='start'] / [attributeName*='contains'] / [attributeName$='end']
  3. Position
  4. Direct children: element > element

CSS Combinators

  1. All p elements placed after the div
     • There can be other elements in between
     • div ~ p
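
A small, made-up example of how >, + and ~ differ (again with parsel):

```python
from parsel import Selector

html = """
<div class="intro"><p>inside</p></div>
<h2>heading</h2>
<p>first after the div</p>
<p>second after the div</p>
"""
sel = Selector(text=html)

sel.css("div.intro > p::text").getall()  # ['inside'] - direct child only
sel.css("div.intro + p::text").getall()  # [] - the element right after the div is an h2, not a p
sel.css("div.intro ~ p::text").getall()  # both paragraphs after the div; the h2 in between is allowed
```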

XPath Fundamentals

  • Richer in functionality than CSS selectors
  • Unlike CSS selectors, XPath can also navigate up the tree (covered below)

Examples

  1. All h1 elements, regardless of position: //h1
  2. p elements within a div with a specific class: //div[@class='intro']/p
  3. div elements with a class of intro or outro
     • or logical operator: //div[@class='intro' or @class='outro']/p
  4. Text value of the selected elements: //div[@class='intro' or @class='outro']/p/text()
  5. href value from link elements: //a/@href
  6. Links where href starts with https
     • starts-with() function: //a[starts-with(@href, 'https')]
  7. Links where href ends with fr
     • NOTE: ends-with() is only supported in XPath 2.0; lxml and Chrome only support 1.0
     • //a[ends-with(@href, 'fr')]
  8. Links where href contains specific text: //a[contains(@href, 'google')]
  9. Links where the link text (not the href) contains specific text
     • NOTE: the value passed is case sensitive
       • text(), 'france' vs text(), 'France': //a[contains(text(), 'France')]
  10. First list item from an element: //ul[@id='items']/li[1]
  11. First and last list items: //ul[@id='items']/li[position() = 1 or position() = last()]
  12. All list items after the first: //ul[@id='items']/li[position() > 1]
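
The same expressions can be exercised with parsel's .xpath(); the HTML below is made up for illustration:

```python
from parsel import Selector

html = """
<div class="intro"><p>Intro text</p></div>
<div class="outro"><p>Outro text</p></div>
<a href="https://www.google.fr">Google France</a>
<ul id="items">
  <li>one</li><li>two</li><li>three</li>
</ul>
"""
sel = Selector(text=html)

sel.xpath("//div[@class='intro' or @class='outro']/p/text()").getall()
sel.xpath("//a/@href").getall()
sel.xpath("//a[starts-with(@href, 'https')]/@href").getall()
sel.xpath("//a[contains(text(), 'France')]/text()").getall()
sel.xpath("//ul[@id='items']/li[position() = 1 or position() = last()]/text()").getall()
sel.xpath("//ul[@id='items']/li[position() > 1]/text()").getall()
```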

XPath - Navigating Up The Tree

  • Cannot do this with CSS Selectors
  1. Parent of a p element where the p's id is unique: //p[@id='unique']/parent::div
     • parent in XPath is called an axis
     • Axes are used to navigate the HTML markup
  2. NOTE: Sometimes we do not know what the parent element is; node() matches it regardless of its tag: //p[@id='unique']/parent::node()
  3. Ancestors - returns the parent, grandparent, and so on: //p[@id='unique']/ancestor::node()
  4. Return the ancestors or the element itself: //p[@id='unique']/ancestor-or-self::node()
  5. Preceding - returns all elements that precede the p element
     • Excludes ancestors: //p[@id='unique']/preceding::node()
     • Example: get the h1 element that precedes the p: //p[@id='unique']/preceding::h1
  6. Preceding sibling
     • "Brother" element
     • Elements are siblings if they share the same parent: //p[@id='outside']/preceding-sibling::node()
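
A short parsel sketch of the upward axes, with made-up HTML that mirrors the ids used above:

```python
from parsel import Selector

html = """
<div>
  <h1>Title</h1>
  <p id="outside">outside</p>
  <div class="intro">
    <p id="unique">unique</p>
  </div>
</div>
"""
sel = Selector(text=html)

sel.xpath("//p[@id='unique']/parent::div").get()           # the wrapping div
sel.xpath("//p[@id='unique']/parent::node()").get()        # same, without naming the tag
sel.xpath("//p[@id='unique']/ancestor::node()").getall()   # parent, grandparent, ... up to the root
sel.xpath("//p[@id='unique']/preceding::h1/text()").get()  # the h1 that comes before the p
sel.xpath("//p[@id='outside']/preceding-sibling::node()").getall()  # nodes before it under the same parent
```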

XPath - Navigating Down The Tree

  1. Get all p children of an element: //div[@id='intro']/child::p
  2. Get all children of an element, regardless of type: //div[@id='intro']/child::node()
  3. All elements that come after a specific element
     • i.e. after its closing tag: //div[@class='intro']/following::node()
  4. All elements after an element that share the same parent: //div[@class='intro']/following-sibling::node()
  5. All children and grandchildren of an element: //div[@class='intro']/descendant::node()
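
And the downward axes, same idea (made-up HTML):

```python
from parsel import Selector

html = """
<div class="intro" id="intro">
  <p>child one</p>
  <p>child two <b>bold</b></p>
</div>
<h2>after the intro</h2>
"""
sel = Selector(text=html)

sel.xpath("//div[@id='intro']/child::p/text()").getall()               # p children only
sel.xpath("//div[@id='intro']/child::node()").getall()                 # all children, text nodes included
sel.xpath("//div[@class='intro']/following::node()").getall()          # everything after the closing tag
sel.xpath("//div[@class='intro']/following-sibling::node()").getall()  # later nodes sharing the parent
sel.xpath("//div[@class='intro']/descendant::node()").getall()         # children and grandchildren
```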

XPath - Theory

  1. Any element: //elementName
  2. Class name, id, or attribute: elementName[@attribute='value'], elementName[@id='value'], elementName[@class='value']
  3. Position: //li[1], //li[position() = 1 or position() = 2], //li[position() = 1 and contains(text(), 'hello')]
  4. Functions: starts-with(), contains(), ends-with() (not supported in XPath 1.0)
  5. Predicates
     • Conditions
     • The content within []
  6. Axes: axisName::elementName
     • Up:
       • parent
       • ancestor
       • preceding
       • preceding-sibling
     • Down:
       • child
       • following - elements after the closing tag of an element
       • following-sibling - sibling elements that come after the element
       • descendant - children and grandchildren of an element
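
For instance, the combined predicate from point 3 above behaves like this (made-up HTML):

```python
from parsel import Selector

html = "<ul><li>hello world</li><li>goodbye</li></ul>"
sel = Selector(text=html)

# Both conditions inside the predicate must hold: first position AND matching text.
sel.xpath("//li[position() = 1 and contains(text(), 'hello')]/text()").get()  # 'hello world'
```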

Scraping APIs

  • quotes.toscrape.com/scroll
  • To check for APIs, go to the Network tab and filter to XHR requests
  • XHR stands for XMLHttpRequest

Notes

  • The API URL will typically be different from the website URL
  • When scraping APIs, use the basic spider template
    • Unlike the crawl template it has no Rule objects, but that is rarely a problem here
    • Most of the time, there are no links to follow
  • Flesh out the API structure before assembling the scraper (see the sketch below)
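
A sketch of what such a spider can look like against the /scroll page's JSON endpoint; the URL pattern and field names below are what the Network tab shows for this particular site and should be verified there first:

```python
import json

import scrapy


class QuotesApiSpider(scrapy.Spider):
    # Basic template spider, e.g. created with:
    #   scrapy genspider quotes_api quotes.toscrape.com
    name = "quotes_api"
    # Endpoint and page parameter assumed from the XHR request in the Network tab.
    api_url = "http://quotes.toscrape.com/api/quotes?page={}"
    start_urls = [api_url.format(1)]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "text": quote["text"],
                "author": quote["author"]["name"],
                "tags": quote["tags"],
            }
        # The payload says whether another page exists; request it manually,
        # since there are no links for a Rule to extract.
        if data.get("has_next"):
            yield scrapy.Request(
                self.api_url.format(data["page"] + 1), callback=self.parse
            )
```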