Parser freezes terminal with no error
lampnout opened this issue · 4 comments
Thank you very much for maintaining this module!
I ran into an issue while parsing a feed. Specifically, the terminal freezes while parsing the feed, with no error output, as the screenshot shows:
The URL of the feed I'm trying to parse is:
https://www.cyber.gov.au/acsc/view-all-content/publications/rss
The W3C validator says the feed is valid (https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fwww.cyber.gov.au%2Facsc%2Fview-all-content%2Fpublications%2Frss), but it makes some comments about interoperability:
I'm not sure what exactly the issue is, and I'd appreciate your thoughts on this. Is there anything that can be done in feedparser to make this feed parse, or alternatively to print a verbose error message?
Thanks in advance
I tried to save the current state of that RSS file, but even wget is not able to load the content.
$ wget https://www.cyber.gov.au/acsc/view-all-content/publications/rss
--2023-03-09 16:23:30-- https://www.cyber.gov.au/acsc/view-all-content/publications/rss
Resolving www.cyber.gov.au (www.cyber.gov.au)... 184.25.239.48, 184.25.239.96
Connecting to www.cyber.gov.au (www.cyber.gov.au)|184.25.239.48|:443... connected.
HTTP request sent, awaiting response...
Then I used Microsoft Edge (not my choice!) to open the feed URL. It opened as a plain text file, which I saved to my local drive.
Then I parsed that file (see rss.txt), which worked fine.
>>> import pathlib
>>> import feedparser
>>> p = pathlib.Path.home() / '_t' / 'rss.txt'
>>> p.exists()
True
>>> f = feedparser.parse(p)
>>> f
{'bozo': False, 'entries': [{'title': 'Essential Eight Maturity Model to ISM Mapping', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://www.cyber.gov.au/', 'value': 'Essential Eight Maturity Model to ISM Mapping'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.cyber.gov.au/acsc/view-all-content/publications/essential-eight-maturity-model-ism-mapping'}], 'link': 'https://www.cyber.gov.au/acsc/view-all-content/publications/essential-eight-maturity-model-ism-mapping', 'summary': 'This publication provides a mapping between Maturity Level Two and Maturity Level Three of the Essential Eight Maturity Model and the controls within the Information Security Manual (ISM).', 'summary_detail': {'type': '
SNIPPED
So it seems that parsing is not the problem, but that something is wrong with the file transfer.
Except for the MS Edge download, I did all of this on Debian 11 with feedparser 6.0.10 and Python 3.9.10.
The server is blocking user agents it doesn't like. You'll need to set a fake user agent that looks like a desktop browser.
import feedparser
url = "https://www.cyber.gov.au/acsc/view-all-content/publications/rss"
agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5)"
result = feedparser.parse(url, agent=agent)
result["items"][0]["title"] # 'Essential Eight Maturity Model to ISM Mapping'
The server is blocking user agents it doesn't like.
How stupid is that!? 😆
But it seems that feedparser freezes or waits endlessly. Can we do something about that? Some kind of timeout, perhaps?
Years prior, I was unwilling to add timeouts because they wouldn't work in the supported Python 2.x series. I'm going to add timeouts when feedparser switches to the requests library internally.
In another project (listparser), I've introduced several changes that I want to bring over to feedparser, including using requests as the HTTP library.
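Until feedparser gains its own timeout, one possible workaround is to download the feed yourself and hand only the content to feedparser. The sketch below is not part of feedparser; it uses only the standard library, and the helper names `build_request` and `fetch_feed_bytes` as well as the 10-second default are assumptions made for illustration. It combines the browser-like user agent suggested above with the `timeout` argument of `urllib.request.urlopen`, which limits how long each blocking socket operation may wait (it is not a cap on the total download time).

```python
import urllib.request

# Browser-like agent string taken from the comment above.
BROWSER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5)"


def build_request(url, agent=BROWSER_AGENT):
    """Hypothetical helper: build a GET request with a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": agent})


def fetch_feed_bytes(url, agent=BROWSER_AGENT, timeout=10.0):
    """Hypothetical helper: return the raw feed bytes.

    Raises an OSError subclass (a socket timeout or urllib.error.URLError)
    instead of hanging when the server stays silent longer than `timeout`
    seconds on a connect or read.
    """
    request = build_request(url, agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

The returned bytes can then be passed to `feedparser.parse(...)`, which accepts feed content as well as a URL or file. Wrapping the `fetch_feed_bytes` call in `try`/`except OSError` gives you a place to print an error message instead of a frozen terminal.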