Nokogiri is an awesome scraping library for Rails. We can use it to get data from websites that don't have public APIs. It works by navigating to a URL and then parses out the HTML into the different tags. Then, you can access the information through CSS selectors and persist it to your own database or do anything you want with it.
Note: You may want to think about why the site you want to scrape doesn't have a public API in the first place and whether you're allowed to scrape their data.
Let's say I want to scrape Craigslist's Manhattan apartments. So let's jump right in...
gem 'nokogiri'
andrequire 'open-uri'
in yourGemfile
and then runbundle install
.- You may also want to use
gem 'addressable'
andrequire 'addressable/uri'
to build the URLs you scrape. This is not necessary, but oh so convenient. I will be using it here.
First, you open the page you want and load it into memory, and Nokogiri parses it:
url = Addressable::URI.parse("http://newyork.craigslist.org/search/aap/mnh")
query_hash = {"s" => 0,
"nh" => 129,
"minAsk" => 1800,
"maxAsk" => 3500}
url.query = URI.encode_www_form(query_hash)
url.query #=> "s=0&nh=129&minAsk=1800&maxAsk=3500"
url.to_s #=> "http://newyork.craigslist.org/search/aap/mnh?s=0&nh=129&minAsk=1800&maxAsk=3500"
page = Nokogiri::HTML(open(url.to_s))
Once you have this, you can access the content with CSS selectors:
page.css(".row")
This will return an array of all the elements with class .row
. In the example URL, .row
matches each row in the list of apartments. Then, I can do something like this:
page.css(".row").each do |row|
apt_url = row.at_css("a")[:href]
craigslist_id = apt_url[/\d{6,}/].to_i
title = row.at_css("a").text
Listing.create(url: apt_url, title: title, craigslist_id: craigslist_id)
end
Note that #css
returns all matching nodes in the selection while #at_css only returns the first
. There are many, many other DOM navigation methods you
can use. Here, have some docs
I use Selector Gadget to find the best CSS selectors for what I need. It's really good and much quicker than looking through the HTML yourself.
Check out lib/craigslist_scraper.rb
,
config/initializers/initialize_modules.rb
, and app/models/scraping.rb
.
I put the scraping code into a module in the lib
folder and then initialize
it in the initializers to it's in scope by the time I extend
the Scraping
class.