problem with html2creole AttributeError: 'NoneType' object has no attribute 'parent'
binarytemple opened this issue · 3 comments
I'm getting an error when running the following code
from creole import html2creole
f = open("/tmp/test.html","r")
fr = f.read()
html2creole(unicode(fr,errors='ignore'))
In [53]: html2creole(unicode(fr,errors='ignore'))
AttributeError Traceback (most recent call last)
/tmp/ in ()
/usr/local/lib/python2.7/dist-packages/creole/__init__.pyc in html2creole(html_string, debug, parser_kwargs, emitter_kwargs, unknown_emit)
110 warnings.warn("parser_kwargs argument in html2creole would be removed in the future!", PendingDeprecationWarning)
111
--> 112 document_tree = parse_html(html_string, debug=debug)
113
114 emitter_kwargs2 = {
/usr/local/lib/python2.7/dist-packages/creole/__init__.pyc in parse_html(html_string, debug)
91
92 h2c = HtmlParser(debug=debug)
---> 93 document_tree = h2c.feed(html_string)
94 if debug:
95 h2c.debug()
/usr/local/lib/python2.7/dist-packages/creole/html_parser/parser.pyc in feed(self, raw_data)
157 # print("-"*79)
158
--> 159 HTMLParser2.feed(self, data)
160
161 return self.root
/usr/lib/python2.7/HTMLParser.pyc in feed(self, data)
107 """
108 self.rawdata = self.rawdata + data
--> 109 self.goahead(0)
110
111 def close(self):
/usr/lib/python2.7/HTMLParser.pyc in goahead(self, end)
151 k = self.parse_starttag(i)
152 elif startswith("</", i):
--> 153 k = self.parse_endtag(i)
154 elif startswith("<!--", i):
155 k = self.parse_comment(i)
/usr/local/lib/python2.7/dist-packages/creole/shared/html_parser.pyc in parse_endtag(self, i)
98 return j
99 # --- changed end -----------------------------------------------------
--> 100 self.handle_endtag(tag.lower())
101 self.clear_cdata_mode()
102 return j
/usr/local/lib/python2.7/dist-packages/creole/html_parser/parser.pyc in handle_endtag(self, tag)
255 self._go_up()
256 else:
--> 257 self.cur = self.cur.parent
258
259 #-------------------------------------------------------------------------
Here's the actual html code (I don't know if I can attach files)
<html>
<head>
<title>
Regions - Online Help - EN
</title>
<link href="AppStyles.css" type="text/css" rel="stylesheet" />
<link href="pagestyles.css" type="text/css" rel="stylesheet" />
<link href="style_blue.css" type="text/css" rel="stylesheet" />
<script type="text/javascript" src="static_page.js">
</script>
<meta http-equiv="Cache-Control" content="no-cache" />
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="expires" content="FRI, 13 APR 1999 01:00:00 GMT" />
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW, NOARCHIVE" />
</head>
<body class="page_body">
<p>
<span class="breadcrumbs">
<a href="Welcome.htm" title="">
Home
</a>
>
<a href="Welcome.htm" title="">
Welcome
</a>
>
<a href="reporting1.htm" title="">
Reporting
</a>
>
<a href="regions.htm" title="">
Regions
</a>
</span>
</p>
<p>
<span class="heading">
Regions
</span>
</p>
<p>
This demographic report allows you to view the regional breakdown of mentions by country.
<br />
Viewed via the
<img alt="" style="border:0px solid;" src="./images/regions.gif" />
icon in the
<strong>
Icon Panel
</strong>
. It can also be viewed by double-clicking on the
<strong>
<a href="summary_dashboard1.htm">
Summary Dashboard
</a>
</strong>
and selecting the appropriate option. This has two components,
<strong>
Report
</strong>
and
<strong>
</strong>
<strong>
<a href="data_explorer.htm">
Data Explorer
</a>
</strong>
.
</p>
<br />
<span style="font-size: 18px;">
<strong>
Report
</strong>
</span>
<br />
<br />
<img alt="" style="border:0px solid;" src="./images/Regions.png" />
<br />
<br />
You can change the way that the mentions are displayed using the drop down list accessed via the
<img alt="" style="border:0px solid;" src="./images/config-over.gif" />
icon.
<br />
<br />
<strong>
<span style="font-size: 16px;">
</span>
</strong>
<strong>
<a href="data_explorer.htm">
Data Explorer
</a>
</strong>
<br />
<br />
The
<strong>
</strong>
<strong>
<a href="data_explorer.htm">
Data Explorer
</a>
</strong>
displays the mentions that make up the data shown in the
<strong>
Report
</strong>
panel. In addition there is the ability to filter the mentions by country via the filter located to the right of the
<strong>
Email
</strong>
button.
</body>
</html>
Sorry for the late response.
You have to extract the body content and pass only that to html2creole().
Something like this:
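import re
from creole import html2creole
body_re = re.compile(r'<body[^>]*>(.*?)</body>', re.S | re.I)
f = open("/tmp/test.html","r")
html = f.read()
f.close()
# findall() returns a list of matches, so take the first (and only) body
content = body_re.findall(html)[0]
creole = html2creole(unicode(content, errors='ignore'))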
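To see what the body regex above actually captures, here is a small standalone sketch that needs no creole install. The helper name `extract_body` and the sample string are made up for illustration. Note that `re.findall` returns a list of matches rather than a single string, so this sketch uses `search()` and takes group 1.

```python
import re

# Same pattern as in the comment above: non-greedy, case-insensitive,
# and with re.S so that "." also matches newlines inside the body.
BODY_RE = re.compile(r'<body[^>]*>(.*?)</body>', re.S | re.I)


def extract_body(html):
    """Return the text between <body ...> and </body>, or None if there is no body.

    Hypothetical helper for illustration; it is not part of python-creole.
    """
    match = BODY_RE.search(html)
    return match.group(1) if match else None


sample = ('<html><head><title>t</title></head>'
          '<body class="page_body"><p>Regions</p></body></html>')
print(extract_body(sample))  # prints: <p>Regions</p>
```

The returned string is what you would then pass on to html2creole().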
I found a bug related to "AttributeError: 'NoneType' object has no attribute 'parent'" and fixed it in commit 9e5b5dd.
I have created a new release, v1.0.2.
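For anyone curious how this kind of error can come about in a tree-building parser, here is a rough sketch. It is NOT the python-creole code and not the contents of the fix commit. The `Node` and `TreeBuilder` classes and the `guard` flag are invented for illustration, and it uses the Python 3 standard library parser. The idea is that every end tag moves the cursor to its parent, so end tags with no matching start tag can walk past the root.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical tree node; only tracks its tag and its parent."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent


class TreeBuilder(HTMLParser):
    """Hypothetical parser that moves a cursor down on start tags and up on end tags."""

    def __init__(self, guard=True):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
        self.guard = guard

    def handle_starttag(self, tag, attrs):
        self.cur = Node(tag, self.cur)

    def handle_endtag(self, tag):
        if self.guard and self.cur.parent is None:
            return  # already at the root: ignore the stray end tag
        self.cur = self.cur.parent


# Two end tags too many: without the guard the cursor becomes None and
# the next end tag fails on None.parent.
try:
    TreeBuilder(guard=False).feed("<p>x</p></p></p>")
except AttributeError as err:
    print(err)

guarded = TreeBuilder(guard=True)
guarded.feed("<p>x</p></p></p>")
print(guarded.cur is guarded.root)
```

Whether the real bug was triggered by stray end tags or by something else in the posted HTML is not stated in this thread, so treat the input above as an assumption.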
Dude. It works beautifully now. You are the man!
Here is a full sample that works with the HTML from my first post, no need for the regex.
from creole import html2creole
# read the whole file, then convert it in one go
with open("/tmp/blah.html","r") as f:
    html = f.read()
creole = html2creole(unicode(html))
print creole