YusukeIwaki/puppeteer-ruby

XPath attributes without evaluation()

Closed this issue · 0 comments

n1xn commented

Simple description about the feature

The title is the simplest description possible.

Description

I am scraping a fairly large list of product links (> 1000) and Puppeteer throws cannot find context with specified id undefined. This is because I collect the nodes with XPath and then evaluate each element of the resulting array in a second step. I would like to skip the evaluate step after calling Sx() and read the desired attribute directly into a variable. While experimenting with the results of Sx(), I found that the attributes are in fact already loaded, which means the evaluation step afterwards is not needed. My problem is that I have to strip the JSHandle: prefix from the string whenever I access an attribute this way.

Puppeteer reference

Current issue

  1. Here is the example that actually fails after a few hundred iterations.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)

  link_nodes.each do |product_link|

    # evaluation will be called thousands of times.      <---------
    href = page.evaluate('e => e.href', product_link)
    # href = product_link.evaluate('e => e.href')

    product = { href:, category: }
    products.push(product)
  end
end
  2. After realizing that this exceeds some limit in the browser / Puppeteer, I tried to optimize by evaluating only once per pagination step and mapping each node to the desired attribute, href.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)

  # executing now only on each pagination step - better.       <---------
  product_links = page.evaluate('e => e.map((el) => el.href)', link_nodes)
  product_links.each do |product_link|

    # but product_link is actually empty.       <---------
    product = { href: product_link, category: }
    products.push(product)
  end
end
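
One way to sidestep passing element handles to evaluate altogether is to resolve the XPath inside the page and return only plain strings, so each pagination step costs a single round trip. This is a minimal sketch, not a confirmed fix for the error above. It assumes page.evaluate accepts a function string plus serializable arguments (the same call shape as in the examples here) and that the matched nodes are anchors with an href property. The helper name hrefs_by_xpath is hypothetical and not part of puppeteer-ruby.

```ruby
# Hypothetical helper (not part of puppeteer-ruby): runs the XPath query
# in the browser and returns the href of every match as plain strings.
def hrefs_by_xpath(page, xpath)
  # document.evaluate and XPathResult are standard DOM APIs; the snapshot
  # result type lets us iterate over all matches by index.
  js = <<~JAVASCRIPT
    (xpath) => {
      const snapshot = document.evaluate(
        xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
      );
      const hrefs = [];
      for (let i = 0; i < snapshot.snapshotLength; i++) {
        hrefs.push(snapshot.snapshotItem(i).href);
      }
      return hrefs;
    }
  JAVASCRIPT

  # Assumption: the XPath string is passed through as the function's
  # first argument and the returned array is serialized back to Ruby.
  page.evaluate(js, xpath)
end
```

Inside the pagination loop this would be called as product_links = hrefs_by_xpath(page, xpath_links), after which each product_link is already a string.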
  3. As noted in the comment in example 2, the problem is that the mapped evaluation does not return any values (I also tried el.getAttribute('href')). So I tried to access the properties of the Sx results directly in Ruby via property('href') and actually got the value, but prefixed with JSHandle:. I stripped the prefix and got it working.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)
  
  # do not evaluate anything - loop through nodes
  link_nodes.each do |product_link|
   
    # access the current nodes property and remove JSHandle: prefix.      <---------
    href = product_link.property("href").to_s.gsub('JSHandle:', '')

    product = { href:, category: }
    products.push(product)
  end
end
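
The prefix removal in example 3 can be wrapped in a small helper so the workaround lives in one place. This is a minimal sketch, assuming that to_s on the handle returned by property("href") yields the value prefixed with JSHandle:, as described above. The helper name strip_js_handle_prefix is hypothetical and not part of puppeteer-ruby.

```ruby
# Hypothetical helper (not part of puppeteer-ruby): removes a leading
# "JSHandle:" marker from the string form of a handle.
def strip_js_handle_prefix(handle)
  # delete_prefix only touches the start of the string, whereas gsub
  # would also remove "JSHandle:" if it appeared inside the value.
  handle.to_s.delete_prefix('JSHandle:')
end
```

In the loop above it would be used as href = strip_js_handle_prefix(product_link.property("href")).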

Usecase / Motivation

I am not sure whether I am using this correctly or have missed a concept, but as mentioned, I have a problem with page.evaluate(). I would like to get attributes by XPath without resorting to the .to_s.gsub('JSHandle:', '') hack.
See the code below for my suggestion.

xpath = '//expression'
xpath_nodes = page.Sx(xpath)

xpath_nodes.each do |node|
  href = node.attribute('href')
end