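A general-purpose extractor for the main text of news web pages. It also extracts the title, author, and publication date.
This project is inspired by kingname/GeneralNewsExtractor. It was ported from Python to Node.js, with some changes made to improve extraction accuracy.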
https://general-news-extractor-demo.stayin.cn/
Using npm:
npm i general-news-extractor
const GeneralNewsExtractor = require('general-news-extractor')
const htmlString = `` // HTML for a news page
const gne = new GeneralNewsExtractor()
// gne.extract( html: string, { titleSelector = '', authorSelector = '', dateTimeSelector = '', noiseNodeList = [] } = {})
const result = gne.extract(htmlString, {})
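The options in the signature above can be passed selectively. The following is a minimal sketch of one way to do that, not part of the package itself: `buildExtractOptions` and `extractNews` are hypothetical helper names, and the selector values in the trailing comment (`h1.headline`, `.related-links`) are made-up examples that assume the options accept CSS selectors, which the signature alone does not confirm. The wrapper takes the extractor as a parameter, so it works with any object exposing `extract(html, options)`.

```javascript
// Option names and defaults copied from the extract() signature above.
const DEFAULT_OPTIONS = {
  titleSelector: '',
  authorSelector: '',
  dateTimeSelector: '',
  noiseNodeList: []
}

// Hypothetical helper (not part of general-news-extractor): merges caller
// overrides over the defaults and rejects unknown option names early.
function buildExtractOptions (overrides = {}) {
  for (const key of Object.keys(overrides)) {
    if (!Object.prototype.hasOwnProperty.call(DEFAULT_OPTIONS, key)) {
      throw new TypeError(`Unknown extract option: ${key}`)
    }
  }
  const merged = { ...DEFAULT_OPTIONS, ...overrides }
  // Copy the array so callers never share the default list.
  merged.noiseNodeList = [...merged.noiseNodeList]
  return merged
}

// Hypothetical wrapper: `extractor` is any object with an
// extract(html, options) method, e.g. a GeneralNewsExtractor instance.
function extractNews (extractor, html, overrides = {}) {
  if (typeof html !== 'string' || html.trim() === '') {
    throw new TypeError('html must be a non-empty string')
  }
  return extractor.extract(html, buildExtractOptions(overrides))
}

// Usage with the real package (selector values are illustrative only):
//   const gne = new GeneralNewsExtractor()
//   const result = extractNews(gne, htmlString, {
//     titleSelector: 'h1.headline',
//     noiseNodeList: ['.related-links']
//   })
```

Validating option names up front is a design choice made here so that a typo such as `titleSelecter` fails loudly instead of being silently ignored.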
- Run in browser
MIT © zenghongtu