felipecsl/wombat

Is xpath working properly?

Closed this issue · 3 comments

My test:

some_text xpath: '//*[@id="Content_Regs"]/table[1]/tbody/tr/td[2]/table/tbody/tr[4]/td[2]/text()[2]'

It returns:

{"some_text"=>nil}

In the Google Chrome console:

$x('//*[@id="Content_Regs"]/table[1]/tbody/tr/td[2]/table/tbody/tr[4]/td[2]/text()[2]')
["
            São Paulo - SP"
]

What might be happening? Did I forget something?
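One possible cause worth ruling out: browsers insert a `tbody` element into the DOM for tables whose source markup has none, so an XPath copied from Chrome can contain `tbody` steps that do not exist in the HTML a scraper downloads. The sketch below illustrates that effect under stated assumptions. The markup in `SOURCE` is made up for the example (the real page has not been inspected), `cell_texts` is a hypothetical helper, and REXML is used only so the snippet runs without extra gems; wombat itself parses pages differently.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the raw page source: note there is no <tbody>.
SOURCE = <<~HTML
  <div id="Content_Regs">
    <table>
      <tr><td>Cidade</td><td>São Paulo - SP</td></tr>
    </table>
  </div>
HTML

# XPath in the shape a browser reports it, with the implied tbody step.
BROWSER_XPATH = '//*[@id="Content_Regs"]/table[1]/tbody/tr/td[2]'

# The same path with the tbody step dropped.
SOURCE_XPATH = '//*[@id="Content_Regs"]/table[1]/tr/td[2]'

# Hypothetical helper: evaluate an XPath and return the text of each match.
def cell_texts(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map(&:text)
end

puts cell_texts(SOURCE, BROWSER_XPATH).inspect  # no tbody in SOURCE, so no match
puts cell_texts(SOURCE, SOURCE_XPATH).inspect   # matches the second cell
```

If this turns out to be the issue, removing the `tbody` steps from the selector (or using `//tr` to skip over them) would be the thing to try.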

Can you tell me which page you are trying to scrape? That would help me dig into the problem.

OK, no problem. Here is the scraper:

require 'wombat'

class TelelistaScraper
  include Wombat::Crawler
  base_url "http://www.telelistas.net/br/restaurantes"
  path "/?pagina=9"

  some_text xpath: '//*[@id="Content_Regs"]/table[1]/tbody/tr/td[2]/table/tbody/tr[4]/td[2]/text()[2]'
end

puts TelelistaScraper.new.crawl

I suppose they are using some kind of JavaScript hack to avoid being scraped:

1.9.3-p194 :018 > Nokogiri::HTML("http://www.telelistas.net/br/restaurantes/?pagina=9")
 => #<Nokogiri::HTML::Document:0x3ffeed8eb83c name="document" children=[#<Nokogiri::XML::DTD:0x3ffeed8f04b8 name="html">, #<Nokogiri::XML::Element:0x3ffeed8ef16c name="html" children=[#<Nokogiri::XML::Element:0x3ffeed8f2b3c name="body" children=[#<Nokogiri::XML::Element:0x3ffeed8f26dc name="p" children=[#<Nokogiri::XML::Text:0x3ffeed8f1f5c "http://www.telelistas.net/br/restaurantes/?pagina=9">]>]>]>]> 
1.9.3-p194 :019 > html.inner_html
 => "<html><body><p>http://www.telelistas.net/br/restaurantes/?pagina=9</p></body></html>"