XPath attributes without evaluation()
Simple description about the feature
The title is the simplest description possible.
Description
I am scraping a fairly large list of product links (> 1000) and run into the problem that Puppeteer throws "cannot find context with specified id undefined". This happens because I use XPath to collect the nodes and then evaluate the resulting array in a second step. I would like to skip the evaluate step after calling Sx() and map the desired attribute straight to a variable. While experimenting with what Sx()[] returns, I found that the attributes are in fact already loaded, which means the evaluation step afterwards is not needed. My problem is that I have to strip the JSHandle: prefix from the string whenever I read an attribute this way.
Puppeteer reference
- Here is a reference on the same issue within puppeteer itself
Current issue
- Here is the example that actually fails after a few hundred iterations.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)
  link_nodes.each do |product_link|
    # evaluation will be called thousands of times. <---------
    href = page.evaluate('e => e.href', product_link)
    # href = product_link.evaluate('e => e.href')
    product = { href:, category: }
    products.push(product)
  end
end
- After realizing that this exceeds some limit in the browser or in Puppeteer, I tried to optimize it so that the evaluation runs only once per pagination step and maps the desired attribute, href.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)
  # executing now only on each pagination step - better. <---------
  product_links = page.evaluate('e => e.map((el) => el.href)', link_nodes)
  product_links.each do |product_link|
    # but product_link is actually empty. <---------
    product = { href: product_link, category: }
    products.push(product)
  end
end
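One way to keep a single evaluation per pagination step without passing element handles at all might be to resolve the XPath inside the page and return plain strings. This is only a sketch: collect_hrefs is a hypothetical helper, and it assumes page.evaluate(page_function, *args) accepts an arrow function string plus a string argument, as in the examples above. It has not been checked against the context error described here.

```ruby
# Hypothetical helper (not part of puppeteer-ruby): runs the XPath inside the
# page with the browser's own document.evaluate and returns an array of hrefs.
def collect_hrefs(page, xpath)
  js = <<~JS.strip
    (xpath) => {
      const snapshot = document.evaluate(
        xpath, document, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null
      );
      const hrefs = [];
      for (let i = 0; i < snapshot.snapshotLength; i++) {
        hrefs.push(snapshot.snapshotItem(i).href);
      }
      return hrefs;
    }
  JS
  # Only the XPath string crosses into the page; only strings come back.
  page.evaluate(js, xpath)
end
```

Inside the pagination loop this would replace both the Sx call and the mapped evaluate, for example collect_hrefs(page, xpath_links).each { |href| products.push({ href:, category: }) }.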
- As mentioned in the comment in the second example, the problem is that the mapped evaluation does not contain any values (I also tried el.getAttribute('href')). So I tried to access the properties from Sx directly in Ruby via property('href') and actually got the value, but prefixed with JSHandle:, which I stripped to get it working.
paginations.each do |pagination_step|
  xpath_links = '(//a[contains(concat(" ", normalize-space(@class), " "), " productlist__link ")])'
  link_nodes = page.Sx(xpath_links)
  # do not evaluate anything - loop through nodes
  link_nodes.each do |product_link|
    # access the current nodes property and remove JSHandle: prefix. <---------
    href = product_link.property("href").to_s.gsub('JSHandle:', '')
    product = { href:, category: }
    products.push(product)
  end
end
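If the string workaround stays in place for now, a slightly safer variant might remove the prefix only when it appears at the start of the string, so a URL that happens to contain "JSHandle:" further in is left alone. A minimal sketch; strip_js_handle_prefix is a hypothetical name:

```ruby
# Hypothetical helper: turns the string form of a handle ("JSHandle:<value>")
# back into the bare value. Unlike gsub, only a leading prefix is removed.
def strip_js_handle_prefix(text)
  text.to_s.delete_prefix('JSHandle:')
end
```

In the loop above it would be called as strip_js_handle_prefix(product_link.property("href")).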
Usecase / Motivation
I am not sure whether I am using this correctly or have missed a concept, but as mentioned I have a problem with page.evaluate(). I would like to get attributes by XPath without the .to_s.gsub('JSHandle:', '') hack.
See the code below for my suggestion.
xpath = '//expression'
xpath_nodes = page.Sx(xpath)
xpath_nodes.each do |node|
  # proposed API: read the attribute directly, without evaluate
  href = node.attribute('href')
end
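Until something like this exists, the suggestion could perhaps be approximated with a small wrapper. The sketch below assumes that the handle returned by property(name) responds to json_value (the counterpart of Puppeteer's jsonValue); that should be verified against the installed puppeteer-ruby version. attribute_of is a hypothetical name, and whether this path avoids the context error has not been tested.

```ruby
# Hypothetical helper, not part of puppeteer-ruby.
# Assumption: node.property(name) returns a handle that responds to json_value.
# Note that this reads the DOM property (e.g. the resolved href), which can
# differ from the raw attribute text in the markup.
def attribute_of(node, name)
  node.property(name).json_value
end
```

With it, the loop body would become href = attribute_of(node, 'href').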