/beautifulsoup

mirror of BeautifulSoup

Primary LanguagePythonOtherNOASSERTION

= Introduction =

  >>> from bs4 import BeautifulSoup
  >>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
  >>> print soup.prettify()
  <html>
   <body>
    <p>
     Some
     <b>
      bad
      <i>
       HTML
      </i>
     </b>
    </p>
   </body>
  </html>
  >>> soup.find(text="bad")
  u'bad'

  >>> soup.i
  <i>HTML</i>

  >>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
  >>> print soup.prettify()
  <?xml version="1.0" encoding="utf-8">
  <tag1>
   Some
   <tag2 />
   bad
   <tag3>
    XML
   </tag3>
  </tag1>

= About Beautiful Soup 4 =

This is a nearly-complete rewrite that removes Beautiful Soup's custom
HTML parser in favor of a system that lets you write a little glue
code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

 * Python's standard HTMLParser
 * lxml's HTML and XML parsers
 * html5lib's HTML parser

HTMLParser is the default, but I recommend you install one of the
other parsers, or you'll have problems handling real-world markup.

== The module name has changed ==

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

    >>> from bs4 import BeautifulSoup

== It works with Python 3 ==

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver for getting Python 3 support to the
finish line.

== Better method names ==

Methods and attributes have been renamed to comply with PEP 8. The old names
still work. Here are the renames:

 * replaceWith -> replace_with
 * replaceWithChildren -> replace_with_children
 * findAll -> find_all
 * findAllNext -> find_all_next
 * findAllPrevious -> find_all_previous
 * findNext -> find_next
 * findNextSibling -> find_next_sibling
 * findNextSiblings -> find_next_siblings
 * findParent -> find_parent
 * findParents -> find_parents
 * findPrevious -> find_previous
 * findPreviousSibling -> find_previous_sibling
 * findPreviousSiblings -> find_previous_siblings
 * nextSibling -> next_sibling
 * previousSibling -> previous_sibling

Methods have been renamed for compatibility with Python 3.

 * Tag.has_key() -> Tag.has_attr()

   (This was misleading, anyway, because has_key() looked at
   a tag's attributes and __in__ looked at a tag's contents.)

Some attributes have also been renamed:

 * Tag.isSelfClosing -> Tag.is_empty_element
 * UnicodeDammit.unicode -> UnicodeDammit.unicode_markup
 * Tag.next -> Tag.next_element
 * Tag.previous -> Tag.previous_element

So have some arguments to popular methods:

 * BeautifulSoup(parseOnlyThese=...) -> BeautifulSoup(parse_only=...)
 * BeautifulSoup(fromEncoding=...) -> BeautifulSoup(from_encoding=...)

== Generators are now properties ==

The generators have been given more sensible (and PEP 8-compliant)
names, and turned into properties:

 * childGenerator() -> children
 * nextGenerator() -> next_elements
 * nextSiblingGenerator() -> next_siblings
 * previousGenerator() -> previous_elements
 * previousSiblingGenerator() -> previous_siblings
 * recursiveChildGenerator() -> recursive_children
 * parentGenerator() -> parents

So instead of this:

 for parent in tag.parentGenerator():
     ...

You can write this:

 for parent in tag.parents:
     ...

(But the old code will still work.)

== tag.string is recursive ==

tag.string now operates recursively. If tag A contains a single tag B
and nothing else, then A.string is the same as B.string. So:

<a><b>foo</b></a>

The value of a.string used to be None, and now it's "foo".

== Empty-element tags ==

Beautiful Soup's handling of empty-element tags (aka self-closing
tags) has been improved, especially when parsing XML. Previously you
had to explicitly specify a list of empty-element tags when parsing
XML. You can still do that, but if you don't, Beautiful Soup now
considers any empty tag to be an empty-element tag.

The determination of empty-element-ness is now made at runtime rather
than parse time. If you add a child to an empty-element tag, it stops
being an empty-element tag.

== Entities are always converted to Unicode ==

An HTML or XML entity is always converted into the corresponding
Unicode character. There are no longer any smartQuotesTo or
convertEntities arguments. (Unicode, Dammit still has smart_quotes_to,
but its default is now to turn smart quotes into Unicode.)

== CDATA sections are normal text, if they're understood at all. ==

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

 <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

 <svg><![CDATA[foo]]></svg> => <p>foo</p>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

 <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

== Miscellaneous other stuff ==

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

    <?xml version="1.0" encoding="utf-8">
    <markup>
     ...
    </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.