propublica/upton

find by xpath

abacha opened this issue · 5 comments

Is it possible to do something like this:

page = Upton::Scraper.new(url)
page.find_by_xpath("//body/div/a").value

Hi @abacha,

Yes, Upton supports searching by XPath.

If you had an index page (that is, a page with links you want to scrape), you could do something like this:

scraper = Upton::Scraper.new(url, "//body/div/a")
scraper.scrape do |instance_html, instance_url, instance_index|
  puts "The title of the page at #{instance_url} is #{Nokogiri::HTML(instance_html).title}"
end

Thanks to #11, you can use XPath or CSS selectors interchangeably.

I wish I could do it in a simple way, like the one I demonstrated. I need to run lots of searches with different XPaths against the same URL.

Is the value of the content specified by the XPath expression another link to be scraped? Or just data you want to access?

And do you have lots of pages, or just one page to be scraped?

If you just want to scrape lots of data from one page, just use Nokogiri. (Upton uses Nokogiri for HTML parsing.)

Nokogiri::HTML(Net::HTTP.get(URI(url))).xpath("//body/div/a").text

Were you able to find a solution, @abacha?