This is a fairly simple gem that helps you parse web pages.
The gem is based on several libraries that do most of the work:
- HTTParty is an HTTP client
- Parallel allows running requests in multiple threads
- Nokogiri is an HTML, XML, SAX, and Reader parser
Add this line to your application's Gemfile:
```ruby
gem 'simple-scraper'
```

And then execute:
```
$ bundle
```
Or install it yourself as:
```
$ gem install simple-scraper
```
Define a parser and scrape one or more pages:

```ruby
require 'simple/scraper'

scraper = Simple::Scraper::Parser.new(
  title:      { selector: "//h1[@class='title']", handler: ->(els) { els.first.text }, default: 'Ruby' },
  summary:    { selector: "//h2[@class='summary']", handler: ->(els) { els.first.text } },
  link:       { selector: "//a[@class='link']", handler: ->(els) { els.first['href'] } },
  text_array: { selector: "//*[@class='link']", handler: ->(els) { els.map(&:text) } }
)

result1 = scraper.parse('https://www.codica.com/')
result2 = scraper.parse(['https://www.codica.com/1', 'https://www.codica.com/2'])
```

The response will be similar to:
```
[
  {
    "title": "scraped title text",
    "summary": "scraped summary text",
    "link": "https://www.codica.com/blog/top-ruby-gems-we-cant-live-without/",
    "text_array": ["text", "text", ...]
  },
  ...
]
```
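Each scraped page produces one hash in the returned array. As a minimal sketch of consuming the result (assuming string keys, as in the rendering above):

```ruby
results = scraper.parse(['https://www.codica.com/1', 'https://www.codica.com/2'])

# Hypothetical consumption: iterate over one hash per scraped page.
results.each do |page|
  puts "#{page['title']} -> #{page['link']}"
end
```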
Or just find a page:

```ruby
Simple::Scraper::Finder.find(url: 'https://www.codica.com/', query: {}, headers: {}) do |page|
  # page is an instance of Nokogiri::HTML::Document
end
```

The parser options are:

- `title`, `summary`, `link`, `text_array`: arbitrary hash keys; they may be whatever you want.
- `selector`: an XPath expression used to find the desired elements on the page.
- `handler`: any Ruby object that responds to `#call` (a proc, a lambda, or a plain Ruby class that defines a `#call` method). The handler receives one argument: the array of elements found on the page, where each element is an instance of `Nokogiri::XML::Element`. You can read the Nokogiri documentation for more info. A class-based handler is sketched below.
- `default`: the value returned when the scraper cannot find the desired elements using `selector`.
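For illustration, here is a minimal sketch of such a class-based handler. The `TextJoiner` class and the `tags` key are hypothetical; any object responding to `#call` works the same way:

```ruby
require 'simple/scraper'

# Hypothetical handler: a plain Ruby class that defines #call.
# It receives the array of Nokogiri::XML::Element objects found
# by the selector and joins their text into a single string.
class TextJoiner
  def call(els)
    els.map { |el| el.text.strip }.join(', ')
  end
end

scraper = Simple::Scraper::Parser.new(
  tags: { selector: "//a[@class='tag']", handler: TextJoiner.new, default: '' }
)
```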
You can pass query parameters and headers along with the request:

```ruby
query = { page: 2 }
headers = { 'Authorization': 'Bearer' }

result = scraper.parse('https://www.codica.com/', query: query, headers: headers)
```

To send requests through a proxy, configure it globally:

```ruby
Simple::Scraper.configure do |config|
  config.proxy_addr = 'proxy.something.com'
  config.proxy_port = 80
  config.proxy_user = 'user'
  config.proxy_pass = 'password'
end
```

You can set a custom logger:

```ruby
Simple::Scraper.configure do |config|
  config.logger = Logger.new('path/to/my/logs')
end
```

By default, logging is turned off.

You can set the number of threads the scraper uses:

```ruby
Simple::Scraper.configure do |config|
  config.number_of_threads = 20
end
```

By default, the scraper works in a single thread.
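For example, a sketch combining the thread setting with a multi-URL parse (the URLs are placeholders; per the overview above, the gem uses the Parallel library to run requests in multiple threads):

```ruby
Simple::Scraper.configure do |config|
  config.number_of_threads = 4
end

# With an array of URLs, pages can be fetched in parallel threads.
urls = (1..10).map { |i| "https://www.codica.com/page-#{i}" }
results = scraper.parse(urls)
```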
You might need to reset the configuration to its defaults:

```ruby
Simple::Scraper.reset
```

Now you can provide a new configuration if needed.
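For instance, a sketch of resetting and then reconfiguring (the values are illustrative):

```ruby
require 'logger'

Simple::Scraper.reset

# Fresh configuration after the reset.
Simple::Scraper.configure do |config|
  config.number_of_threads = 5
  config.logger = Logger.new($stdout)
end
```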
Copyright © 2015-2019 Codica. It is released under the MIT License.
simple-scraper is maintained and funded by Codica. The names and logos for Codica are trademarks of Codica.
We love open source software! See our other projects or hire us to design, develop, and grow your product.