# BoilerpipeArticle

This gem removes the surplus “clutter” (boilerplate, templates) around the main textual content of a web page. It is a pure Ruby implementation, designed especially for news website content. It can also extract schema.org microdata and other HTML metadata.

## Installation

    gem install BoilerpipeArticle

### Usage Example

    require 'boilerpipe_article'
    require 'net/http'

    # Download the raw HTML of the page
    uri = URI('http://www.bbc.com/news/election-us-2016-36935175')
    html = Net::HTTP.get(uri)

    parser = BoilerpipeArticle.new(html)

    article_text = parser.getArticle   # main article text, boilerplate removed
    metas        = parser.getMetas     # HTML meta data
    microdata    = parser.getMicroData # schema.org microdata
    all_text     = parser.getAllText

    puts article_text
    puts metas
    puts microdata
    puts all_text
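
`Net::HTTP.get` does not follow HTTP redirects, so a URL that redirects (for example from `http` to `https`) may return a redirect page instead of the article. The sketch below shows one way to fetch the HTML first, using only the Ruby standard library. `fetch_html` and its `limit` parameter are hypothetical names for illustration and are not part of this gem.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of BoilerpipeArticle): fetch a page body,
# following at most `limit` HTTP redirects.
def fetch_html(url, limit = 5)
  raise ArgumentError, 'too many HTTP redirects' if limit <= 0

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # Resolve a possibly relative Location header against the current URL.
    fetch_html(URI.join(url, response['location']).to_s, limit - 1)
  else
    raise "Request failed: #{response.code} #{response.message}"
  end
end
```

The returned string can then be passed to `BoilerpipeArticle.new` in place of the `Net::HTTP.get` result.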

Runtime Dependencies:

- nokogiri = 1.6.8
- mida = 0.3.9

### Support

Check out textracto.com for the latest updates and the API.