jedie/python-creole

problem with html2creole AttributeError: 'NoneType' object has no attribute 'parent'

binarytemple opened this issue · 3 comments

I'm getting an error when running the following code

f = open("/tmp/test.html","r")
fr = f.read()
html2creole(unicode(fr,errors='ignore'))

In [53]: html2creole(unicode(fr,errors='ignore'))

AttributeError Traceback (most recent call last)

/tmp/ in ()

/usr/local/lib/python2.7/dist-packages/creole/__init__.pyc in html2creole(html_string, debug, parser_kwargs, emitter_kwargs, unknown_emit)
110 warnings.warn("parser_kwargs argument in html2creole would be removed in the future!", PendingDeprecationWarning)
111
--> 112 document_tree = parse_html(html_string, debug=debug)
113
114 emitter_kwargs2 = {

/usr/local/lib/python2.7/dist-packages/creole/__init__.pyc in parse_html(html_string, debug)
91
92 h2c = HtmlParser(debug=debug)
---> 93 document_tree = h2c.feed(html_string)
94 if debug:
95 h2c.debug()

/usr/local/lib/python2.7/dist-packages/creole/html_parser/parser.pyc in feed(self, raw_data)
157 # print("-"*79)
158
--> 159 HTMLParser2.feed(self, data)
160
161 return self.root

/usr/lib/python2.7/HTMLParser.pyc in feed(self, data)
107 """
108 self.rawdata = self.rawdata + data
--> 109 self.goahead(0)
110
111 def close(self):

/usr/lib/python2.7/HTMLParser.pyc in goahead(self, end)
151 k = self.parse_starttag(i)
152 elif startswith("</", i):
--> 153 k = self.parse_endtag(i)
154 elif startswith("<!--", i):
155 k = self.parse_comment(i)

/usr/local/lib/python2.7/dist-packages/creole/shared/html_parser.pyc in parse_endtag(self, i)
98 return j
99 # --- changed end -----------------------------------------------------
--> 100 self.handle_endtag(tag.lower())
101 self.clear_cdata_mode()
102 return j

/usr/local/lib/python2.7/dist-packages/creole/html_parser/parser.pyc in handle_endtag(self, tag)
255 self._go_up()
256 else:
--> 257 self.cur = self.cur.parent
258
259 #-------------------------------------------------------------------------

AttributeError: 'NoneType' object has no attribute 'parent'

Here's the actual html code (I don't know if I can attach files)

<html>
 <head>
  <title>
   Regions - Online Help - EN
  </title>
  <link href="AppStyles.css" type="text/css" rel="stylesheet" />
  <link href="pagestyles.css" type="text/css" rel="stylesheet" />
  <link href="style_blue.css" type="text/css" rel="stylesheet" />
  <script type="text/javascript" src="static_page.js">
  </script>
  <meta http-equiv="Cache-Control" content="no-cache" />
  <meta http-equiv="Pragma" content="no-cache" />
  <meta http-equiv="expires" content="FRI, 13 APR 1999 01:00:00 GMT" />
  <meta name="ROBOTS" content="NOINDEX, NOFOLLOW, NOARCHIVE" />
 </head>
 <body class="page_body">
  <p>
   <span class="breadcrumbs">
    <a href="Welcome.htm" title="">
     Home
    </a>
    &nbsp;&gt;&nbsp;
    <a href="Welcome.htm" title="">
     Welcome
    </a>
    &nbsp;&gt;&nbsp;
    <a href="reporting1.htm" title="">
     Reporting
    </a>
    &nbsp;&gt;&nbsp;
    <a href="regions.htm" title="">
     Regions
    </a>
   </span>
  </p>
  <p>
   <span class="heading">
    Regions
   </span>
  </p>
  <p>
   This demographic report allows you to view the regional breakdown of mentions by country.&nbsp;
   <br />
   Viewed via the
   <img alt="" style="border:0px solid;" src="./images/regions.gif" />
   icon in the
   <strong>
    Icon Panel
   </strong>
   . It can also be viewed by double-clicking on the
   <strong>
    <a href="summary_dashboard1.htm">
     Summary&nbsp;Dashboard
    </a>
   </strong>
   and selecting the&nbsp;appropriate&nbsp;option. &nbsp;This has two components,
   <strong>
    Report
   </strong>
   and
   <strong>
   </strong>
   <strong>
    <a href="data_explorer.htm">
     Data Explorer
    </a>
   </strong>
   .
  </p>
  <br />
  <span style="font-size: 18px;">
   <strong>
    Report
   </strong>
  </span>
  <br />
  <br />
  <img alt="" style="border:0px solid;" src="./images/Regions.png" />
  <br />
  <br />
  You can change the way that the mentions are displayed using the drop down list accessed via the
  <img alt="" style="border:0px solid;" src="./images/config-over.gif" />
  icon.
  <br />
  <br />
  <strong>
   <span style="font-size: 16px;">
   </span>
  </strong>
  <strong>
   <a href="data_explorer.htm">
    Data Explorer
   </a>
  </strong>
  <br />
  <br />
  The
  <strong>
  </strong>
  <strong>
   <a href="data_explorer.htm">
    Data Explorer
   </a>
  </strong>
  displays the mentions that make up the data shown in the
  <strong>
   Report
  </strong>
  panel. In addition there is the ability to filter the mentions by country via the filter located to the right of the
  <strong>
   Email
  </strong>
  button.&nbsp;
 </body>
</html>

Sorry for the late response.

You have to extract the body content and pass that to html2creole().

Something like this:

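import re
from creole import html2creole

body_re = re.compile(r'<body[^>]*>(.*?)</body>', re.S | re.I)

f = open("/tmp/test.html","r")
html = f.read()
f.close()
# search() returns a match object; group(1) is the text between the body tags
# (findall() would return a list, which html2creole() cannot take)
content = body_re.search(html).group(1)
# convert to unicode as in the original post
creole = html2creole(unicode(content, errors='ignore'))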
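
The body-extraction step can be tried on its own, without python-creole installed. Note that findall() returns a list of strings, while html2creole() takes a single string, so search() plus group(1) is the safer call here. The sketch below reuses the pattern from the snippet above; the helper name extract_body and its fallback of returning the input unchanged when there is no body element are assumptions made for illustration, not part of python-creole.

```python
import re

# Same pattern as above: non-greedy match of everything between the body tags.
BODY_RE = re.compile(r'<body[^>]*>(.*?)</body>', re.S | re.I)


def extract_body(html):
    """Return the markup inside <body>...</body>.

    Hypothetical helper: if no body element is found, the input is
    returned unchanged so that HTML fragments still pass through.
    """
    match = BODY_RE.search(html)
    if match is None:
        return html
    return match.group(1)
```

The returned string could then be passed to html2creole() in place of the whole document.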

I found a bug related to "AttributeError: 'NoneType' object has no attribute 'parent'" and fixed it in 9e5b5dd.

I have released a new version, v1.0.2.
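
For readers wondering how this class of error can arise at all: the traceback ends in handle_endtag() at `self.cur = self.cur.parent`, which fails once `self.cur` is already None. The sketch below is not python-creole's code and not the fix in 9e5b5dd; it is a minimal, hypothetical tree builder (the names Node, TreeBuilder, parse and the guard flag are all made up here) showing one way a cursor can walk past the root when end tags outnumber start tags, and one possible guard against it.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as in the traceback


class Node(object):
    """Hypothetical tree node; not python-creole's own node class."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class TreeBuilder(HTMLParser):
    """Moves a cursor down on start tags and up on end tags."""

    def __init__(self, guard=True):
        HTMLParser.__init__(self)
        self.root = Node("document")
        self.cur = self.root
        self.guard = guard

    def handle_starttag(self, tag, attrs):
        self.cur = Node(tag, self.cur)

    def handle_endtag(self, tag):
        if self.guard and self.cur.parent is None:
            # Already at the root: ignore the stray end tag.
            return
        # Without the guard, a stray end tag sets self.cur to None and the
        # next end tag raises AttributeError on None.parent.
        self.cur = self.cur.parent


def parse(html, guard=True):
    builder = TreeBuilder(guard=guard)
    builder.feed(html)
    return builder.root
```

Calling parse("<p>a</p></p></p>", guard=False) raises the same kind of AttributeError as in the report, while the default guard=True returns the tree built so far.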

Dude, it works beautifully now. You are the man!

Here is a full sample that works with the html from my first post, no need for the regex.

from creole import html2creole
f = open("/tmp/blah.html","r")
html = f.read()
f.close()
creole = html2creole(unicode(html))
print creole