
Remove DIVs and styling, and normalize HTML while preserving structural information

Primary language: Python. License: MIT.



Clean and normalize HTML while preserving embedded content (e.g. Twitter, Instagram).

Install the library with pip:

pip install clear-html

Example usage with lxml:

from lxml.html import fromstring
from clear_html import clean_node, cleaned_node_to_html

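html = """
        <div style="color:blue" id="main_content">
            Some text to be
            <div>cleaned up!</div>
        </div>
     """
# Parse the markup, clean the resulting tree, then serialize it back to HTML.
node = fromstring(html)
cleaned_node = clean_node(node)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)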

Example usage with Parsel:

from parsel import Selector
from clear_html import clean_node, cleaned_node_to_html

selector = Selector(text="""<html>
                                <div style="color:blue" id="main_content">
                                    Some text to be
                                    <div>cleaned up!</div>
                                </div>
                            </html>""")
# Select the content node, then pass its underlying lxml element to clean_node.
selector = selector.css("#main_content")
cleaned_node = clean_node(selector[0].root)
cleaned_html = cleaned_node_to_html(cleaned_node)
print(cleaned_html)

Both approaches above print the following:


<p>Some text to be</p>

<p>cleaned up!</p>


Other interesting functions:

  • cleaned_node_to_text: converts the cleaned node to plain text
  • formatted_text.clean_doc: low-level function that gives finer control over the cleanup