html-extractor

There are 9 repositories under the html-extractor topic.

miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
Language: Python | 3.5k 112 122525
bookieio/breadability
A reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative).
Language: HTML | 204 22 2325
cdimascio/essence
Automatically extract the main text content (and more) from an HTML document
Language: Kotlin | 115 7 516
cnyangkui/html-extractor
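An optimized general-purpose algorithm for extracting the main text of web pages, based on the line-block distribution function; implemented in Python.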
Language: Python | 51 1 011
kwaziidev/textractor
Extracts the main text from HTML; intended for news web pages.
Language: Go | 15 2 14
JanDC/css-from-html-extractor
PHP library that determines which CSS is used by HTML snippets.
Language: PHP | 8 5 42
importcjj/go-readability
Go package that cleans an HTML page for better readability.
Language: HTML | 2 1 0
davidmillerpak/Media-Graper
Media Graper is an open source tool for Linux that extracts all the images, links, and videos from a web page.
Language: Shell | 10
the-real-yey/Simple-HTML-Extractor-
A simple extractor based on BeautifulSoup. You can use it to iterate through all the HTML files in the website root directory and get the text, placeholders, and other text.
Language: Python | 0 1 00