A generator for Yeoman.
It generates the basic structure of an HTML parser for Node.js.
Useful if you are scraping websites with Node.js.
To install generator-html-parser from npm, run:
$ npm install -g generator-html-parser
Then create a folder for your parser and run the generator inside it:
$ mkdir facebook-html-parser && cd $_
$ yo html-parser
That's it!
The main file is <site-name>-html-parser.js.
It contains two methods:
parse(html, url): receives the HTML to parse (string) and a URL (string). The URL is useful if you need to resolve relative URLs with the Node.js url module, which is already imported.
getNextPages(html, url): returns the URLs of the next pages to visit, which is usually useful when you are scraping a list that spans several pages. It takes the same inputs: the HTML to parse (string) and the URL (string) used to resolve any relative URLs extracted from the HTML.
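Both methods receive the page URL so that relative links found in the HTML can be turned into absolute ones. A minimal sketch of that step, using only the Node.js url module, is shown here. The helper name resolveAll and the sample URLs are hypothetical and are not part of the generated code; the hrefs array stands in for values you would extract from the HTML.

```javascript
// Hypothetical sketch: the file produced by the generator may look different.
var url = require('url');

// Resolve every href found on a page against the URL of that page.
// `hrefs` stands in for the values you would extract from the HTML.
function resolveAll(hrefs, pageUrl) {
  return hrefs.map(function (href) {
    return url.resolve(pageUrl, href);
  });
}

var nextPages = resolveAll(
  ['?page=2', '/archive', 'http://other.example/list'],
  'http://example.com/items?page=1'
);
console.log(nextPages);
```

Links that are already absolute pass through unchanged, so getNextPages could apply the same resolution step to every link it extracts without checking each one first.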
The generated project also includes tests.
Have a look at the test/ folder.
The generated parser uses cheerio to parse the HTML.
Cheerio implements a subset of the jQuery API for use on the server, without a browser DOM. For example:
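var cheerio = require('cheerio');
var $ = cheerio.load(html); // html is the string passed to parse()
var result = [];
// Collect the text of every element with the class "item"
$('.item').each(function() {
  var el = $(this);
  result.push(el.text());
});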