html-extraction

There are 14 repositories under the html-extraction topic.

miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
Language: Python | 3.5k 113 124529
bookieio/breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)
Language: HTML | 204 22 2325
html-extract/hext
Domain-specific language for extracting structured data from HTML documents
Language: C++ | 51 4 333
Whomrx666/Xtract-html
Xtract-html is a tool for extracting the HTML display code from a website, which you can then reuse for your own website.
Language: Python | 31
zanachka/article-extraction-benchmark
Article extraction benchmark: dataset and evaluation scripts
Language: Python | 2 1 00
Whomrx666/Xtract-htmlV2
Xtract-htmlV2 is a tool for getting the HTML code of any website you choose; it is the successor to Xtract-html.
Language: Python | 1
zanachka/dateparser
Python parser for human-readable dates
Language: Python | 1 1 00
zanachka/extruct
Extract embedded metadata from HTML markup
Language: Python | 1 1 00
shmdoc/unit-parser
Script for extracting units from http://vocab.nerc.ac.uk/collection/P06/current/ to easily add units to the database (this should only be temporary, to demonstrate how units can work)
Language: HTML | 0 3 00
zanachka/html-text
Extract text from HTML
Language: HTML | 1 0
zanachka/jusText
Heuristic-based boilerplate removal tool
Language: Python | 1 0
zanachka/number-parser
Parse numbers written in natural language
Language: Python | 1 0
zanachka/price-parser
Extract price amount and currency symbol from a raw text string
Language: Python | 1 0
zanachka/python-readability
Fast Python port of arc90's readability tool, updated to match the latest readability.js
Language: Python | 1 0
