Some bills (maybe 1/7 of them) give module 'lxml.html' has no attribute 'entities'

Question

demongolem opened this issue 5 years ago · 1 comments

I just downloaded all the bills using ./run govinfo --bulkdata=BILLSTATUS. Then I ran ./run bills. I believe there are currently about 50,000 items to process. About 7,300 of them failed with exactly the same stack trace, which is included at the bottom of this issue report.

The failing files are scattered across bill types (hconres in this case) and sessions (113 in this case). I validated a few of the XML files returned by the govinfo run; they were valid and looked good. This leads me to believe (and this is my guess) that certain documents contain characters, or something of that sort, that cause the error. I will look into it more, but perhaps others have insight into what is going on. This might just be a problem with my using Python 3 and the difference between str and bytes in Python 2 and Python 3. However, so far I have gotten it to work with Python 3 (work I can share at some point if my version ever fully works).

[hconres25-113] Exception:

Traceback (most recent call last):

File "/home/gwerner/from_greg/congress/tasks/utils.py", line 178, in process_$
results = fetch_func(id, options, *extra_args)

File "/home/gwerner/from_greg/congress/tasks/bills.py", line 101, in process_$
bill_data = form_bill_json_dict(xml_as_dict)

File "/home/gwerner/from_greg/congress/tasks/bills.py", line 173, in form_bil$
'summary': bill_info.summary_for(bill_dict['summaries']['billSummaries']),

File "/home/gwerner/from_greg/congress/tasks/bill_info.py", line 185, in summ$
"text": strip_tags(summary['text']),

File "/home/gwerner/from_greg/congress/tasks/bill_info.py", line 199, in stri$
text = utils.unescape(text)

File "/home/gwerner/from_greg/congress/tasks/utils.py", line 470, in unescape
text = re.sub("&#?\w+;", fixup, text)

File "/usr/lib64/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)

File "/home/gwerner/from_greg/congress/tasks/utils.py", line 465, in fixup
text = chr(html.entities.name2codepoint[text[1:-1]])

AttributeError: module 'lxml.html' has no attribute 'entities'

Answer 1 · 2020-04-10T12:23:48.000Z

Yeah, I have found how to correct one file, hconres2-113. It contained &quot; around a word. If I replace it with the double quote character before passing the text to utils.unescape(text), the bill is processed successfully.

Also, something like Air Force RDT&E; creates problems because the regex matches &E; and treats it as an HTML entity, when really it is part of the actual text and not HTML encoding.
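To make the two cases concrete, here is a minimal sketch that runs the pattern from the traceback over both strings and checks each match against the standard library's entity table. The helper name find_entity_like is made up for illustration and is not part of the congress code; it only shows why &quot; resolves while &E; has no entry to look up.

```python
import re
from html.entities import name2codepoint

# Same pattern as the one shown in the traceback (utils.py, unescape).
ENTITY_PATTERN = re.compile(r"&#?\w+;")


def find_entity_like(text):
    """Return (match, is_known_named_entity) for everything the pattern flags.

    Hypothetical helper for illustration only.
    """
    results = []
    for match in ENTITY_PATTERN.findall(text):
        name = match[1:-1]  # strip the leading '&' and trailing ';'
        results.append((match, name in name2codepoint))
    return results


print(find_entity_like("&quot;word&quot;"))
print(find_entity_like("Air Force RDT&E; budget"))
```

The second call flags &E; even though E is not an entity name, so any code that looks it up in name2codepoint has to handle the missing key.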

I think that, for Python 3 anyway, the following would take care of much of it without resorting to the fixup function (using the alias ht because html is already a name in the code):

import html as ht
text = ht.unescape(text)

So in utils.py, the solution for Python 3 anyway would be to change the bit in the unescape function to this

try:
    text = ht.unescape(text)
except Exception as e:
    print(repr(e))
# this line does not appear necessary for Python 3
# in fact it will cause errors
# text = re.sub("&#?\w+;", fixup, text)
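As a self-contained sketch of the same idea, the function below wraps the standard library's html.unescape under the ht alias. The name unescape_entities and the UTF-8 decoding of bytes input are assumptions for illustration, not what utils.py does; the bytes branch is only there because str versus bytes was raised as a possible cause above.

```python
import html as ht  # aliased, since the name `html` is already taken in utils.py


def unescape_entities(text):
    """Decode HTML character references, leaving unrecognized ones as they are.

    Hypothetical stand-in for the entity-decoding step of utils.unescape.
    """
    if isinstance(text, bytes):
        # Assumption: the input is UTF-8 encoded.
        text = text.decode("utf-8")
    return ht.unescape(text)


print(unescape_entities("&quot;word&quot;"))   # the hconres2-113 case
print(unescape_entities("Air Force RDT&E;"))  # not an entity, left unchanged
```

One difference worth keeping in mind: html.unescape follows the HTML5 rules, so it also decodes some references that lack the trailing semicolon (for example &amp without ;), which the original regex would not have matched.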