edgi-govdata-archiving/web-monitoring-processing

Page with an unbelievable number of dropdown options fails to parse in the middle

Mr0grog opened this issue · 4 comments

I saw this error in Sentry: https://sentry.io/environmental-data-governance-/db-prod/issues/722659627/
Which led me to a diff involving this content: https://edgi-wm-archive.s3.amazonaws.com/039e5be6ffa4008eb0a702973c7b81fbf0be9fe997b0bacaba5ee352e683b4e3

This diff “works” on my local machine, but takes quite a lot of time. I’m guessing this failed in production because things were under load. I’d have to dig through logs to see what actually happened there. In any case, though, the diff that results from this is just absolutely terrible. When we parse the document and get all the text:

from bs4 import BeautifulSoup, Comment

def _get_text(html):
    "Extract textual content from HTML."
    soup = BeautifulSoup(html, 'lxml')
    # Remove comments before collecting the text nodes.
    for element in soup.find_all(string=lambda text: isinstance(text, Comment)):
        element.extract()
    return soup.find_all(string=True)

we get 333 reasonable text nodes, and then the entire rest of the document is stuffed into the last one, with several spaces between every character of HTML source. In the middle of one of the value attributes in this HTML, the parser just effectively stops:

<option value="20100920121200">2010-09-20 12:12:00 GMT</option>
<option value="20100920120900">2010-09-20 12:09:00 GMT</option>
<option value="20100920120600">2010-09-20 12:06:00 GMT</option>
<option value="20100920120300">2010-09-20 12:03:00 GMT</option>
<option value="20100920115900">2010-09-20 11:59:00 GMT</option>
<option value="20100920115600">2010-09-20 11:56:00 GMT</option>
<option value="20100920115300">2010-09-20 11:53:00 GMT</option>
<option value="20100920115000">2010-09-20 11:50:00 GMT</option>

Instead, it treats it like:

<option value="20100920121200">2010-09-20 12:12:00 GMT</option>
<option value="20100920120900">2010-09-20 12:09:00 GMT</option>
<option value="20100920120600">2010-09-20 12:06:00 GMT</option>
<option value="20100920120300">2010-09-20 12:03:00 GMT</option>
<option value="20100920115900">2010-09-20 11:59:00 GMT</option>
<option value="20100920115600">2010-09-20 11:56:00 GMT</option>
<option value="2010">0   9   2   0   1   1   5   3   0   0   "   &gt;   2   0   1   0   -   0   9   -   2   0       1   1   :   5   3   :   0   0       G   M   T   /   o   p   t   i   o   n   &gt;   \n   o   p   t   i   o   n       v   a   l   u   e   =   "   2   0   1   0   0   9   2   0   1   1   5   0   0   0   "   &gt;   2   0   1   0   -   0   9   -   2   0       1   1   :   5   0   :   0   0       G   M   T   /   o   p   t   i   o   n   &gt;

Need to dig in and see what’s happening here.
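For what it’s worth, a quick heuristic could flag this failure mode before it produces a terrible diff: when parsing goes off the rails like this, the final text node dwarfs all the others. This helper is hypothetical (not part of the codebase), just a sketch of the idea:

```python
import statistics

def looks_degenerate(text_nodes, factor=50):
    """Hypothetical check: flag a parse whose final text node dwarfs the
    rest, as happens when the parser stops mid-attribute and dumps the
    remaining markup into a single node."""
    lengths = [len(str(node)) for node in text_nodes]
    if len(lengths) < 2:
        return False
    typical = statistics.median(lengths[:-1])
    return lengths[-1] > factor * max(typical, 1)
```

With the document above, `looks_degenerate(_get_text(html))` would return True, since the last node contains the whole remaining page.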

The problem appears to be in Beautiful Soup itself, not lxml. Parsing directly with lxml does not exhibit the issue:

import requests
from lxml import etree

html = requests.get('https://edgi-wm-archive.s3.amazonaws.com/039e5be6ffa4008eb0a702973c7b81fbf0be9fe997b0bacaba5ee352e683b4e3').text
root = etree.HTML(html)
print(etree.tostring(root, pretty_print=True))

Looking at the logs, it’s unclear what is actually raising an exception on the server. We only get a log line for an exception at the point where we submit the diff job to the ProcessPoolExecutor. Either way, this broken parsing is an issue.
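One possible reason the logs are so thin: an exception raised inside a ProcessPoolExecutor worker isn’t reported anywhere on its own; it only surfaces in the parent process when something calls `.result()` on the future. A minimal illustration (the `diff_task` function here is made up, standing in for the real diff job):

```python
from concurrent.futures import ProcessPoolExecutor

def diff_task(x):
    # Stand-in for the real diff job; raises like a failing worker would.
    raise ValueError('diff failed for %r' % x)

message = None
with ProcessPoolExecutor(max_workers=1) as pool:
    future = pool.submit(diff_task, 'example')
    try:
        # The exception only surfaces here, not when the task runs,
        # so nothing is logged unless this call site logs it.
        future.result()
    except ValueError as exc:
        message = str(exc)

print(message)
```

So if the server-side code only logs around the submit/result call, the single log line we’re seeing is consistent with a worker blowing up for any reason at all.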

Worth noting: parsing with html5-parser also does not exhibit this issue (see #138).

import requests
from html5_parser import parse

html = requests.get('https://edgi-wm-archive.s3.amazonaws.com/039e5be6ffa4008eb0a702973c7b81fbf0be9fe997b0bacaba5ee352e683b4e3').text
root = parse(html, treebuilder='soup')
print(root)

Woohoo, this was fixed upstream in Beautiful Soup and resolved when we upgraded in #336! (Still, this is another potentially good reason to switch to html5-parser.)