html-extractor
There are 9 repositories under the html-extractor topic.
miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
bookieio/breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)
cdimascio/essence
Automatically extract the main text content (and more) from an HTML document
cnyangkui/html-extractor
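An optimized general-purpose algorithm for extracting the main text of web pages, based on the line-block distribution function; implemented in Python.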
kwaziidev/textractor
Extracts the main text from HTML; intended for news web pages.
JanDC/css-from-html-extractor
PHP library that determines which CSS is used by HTML snippets.
importcjj/go-readability
Go package that cleans an HTML page for better readability.
davidmillerpak/Media-Graper
Media Graper is an open-source tool for Linux that extracts all images, links, and videos from a web page.
the-real-yey/Simple-HTML-Extractor-
A simple extractor based on BeautifulSoup. You can use it to iterate through all the HTML files in the website root directory and get the text, placeholders, and other text.