kurtmckee/feedparser

Parser freezes terminal with no error

lampnout opened this issue · 4 comments

Thank you very much for maintaining this module!

I ran into an issue while parsing a feed. More specifically, the terminal freezes while parsing the feed and prints no error output.

The URL of the feed I'm trying to parse is:

https://www.cyber.gov.au/acsc/view-all-content/publications/rss

The W3C validator says the feed is valid (https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fwww.cyber.gov.au%2Facsc%2Fview-all-content%2Fpublications%2Frss), but it makes some comments about interoperability.

I'm not sure what exactly the issue is, and I'd appreciate your thoughts on this. Is there anything that can be done in feedparser to make this parse work or, alternatively, to print a verbose error message?

Thanks in advance

buhtz commented

I tried to save the current state of that RSS file. But even wget is not able to load the content.

$ wget https://www.cyber.gov.au/acsc/view-all-content/publications/rss
--2023-03-09 16:23:30--  https://www.cyber.gov.au/acsc/view-all-content/publications/rss
Resolving www.cyber.gov.au (www.cyber.gov.au)... 184.25.239.48, 184.25.239.96
Connecting to www.cyber.gov.au (www.cyber.gov.au)|184.25.239.48|:443... connected.
HTTP request sent, awaiting response...

Then I used Microsoft Edge (not my choice!) to open the feed URL. It opened as a plain text file, which I saved to my local drive.
Then I parsed that file (see rss.txt), which worked fine.

>>> import pathlib
>>> import feedparser
>>> p = pathlib.Path.home() / '_t' / 'rss.txt'
>>> p.exists()
True
>>> f = feedparser.parse(p)
>>> f
{'bozo': False, 'entries': [{'title': 'Essential Eight Maturity Model to ISM Mapping', 'title_detail': {'type': 'text/plain', 'language': None, 'base': 'https://www.cyber.gov.au/', 'value': 'Essential Eight Maturity Model to ISM Mapping'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://www.cyber.gov.au/acsc/view-all-content/publications/essential-eight-maturity-model-ism-mapping'}], 'link': 'https://www.cyber.gov.au/acsc/view-all-content/publications/essential-eight-maturity-model-ism-mapping', 'summary': 'This publication provides a mapping between Maturity Level Two and Maturity Level Three of the Essential Eight Maturity Model and the controls within the Information Security Manual (ISM).', 'summary_detail': {'type': '

SNIPPED

So it seems that parsing is not the problem; something is wrong with the file transfer.

Except for the MS Edge download, I did all of that on Debian 11 with feedparser 6.0.10 and Python 3.9.10.

The server is blocking user agents it doesn't like. You'll need to set a fake user agent that looks like a desktop browser.

import feedparser

url = "https://www.cyber.gov.au/acsc/view-all-content/publications/rss"
# Browser-like User-Agent string, sent instead of feedparser's default
agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5)"
result = feedparser.parse(url, agent=agent)
result["items"][0]["title"]  # 'Essential Eight Maturity Model to ISM Mapping'

buhtz commented

The server is blocking user agents it doesn't like.

How stupid is that!? 😆

But it seems that feedparser freezes or waits endlessly. Can we do something about it? Some kind of timeout, perhaps?

Years ago, I was unwilling to add timeouts because they wouldn't work in the Python 2.x series that was supported at the time. I'm going to add timeouts when feedparser switches to the requests library internally.
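
Until a timeout option exists, one possible stopgap is to set Python's process-wide default socket timeout around the call. This is only a sketch, not something feedparser provides: `default_socket_timeout` is a hypothetical helper name, and it assumes that the HTTP code feedparser 6.0.x uses creates its sockets without an explicit timeout, so the global default applies to them.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the process-wide default timeout for new sockets."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore the previous default so other code is not affected.
        socket.setdefaulttimeout(previous)
```

Usage would look like `with default_socket_timeout(10): result = feedparser.parse(url, agent=agent)`. The setting is global, so it also affects sockets that other threads create while the block is active.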

In another project (listparser), I've introduced several changes that I want to bring over to feedparser, including using requests as the HTTP library.
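
In the meantime, another option is to do the download yourself with an explicit timeout and hand the bytes to feedparser. The following is a minimal sketch using only the standard library for the download. The names `fetch_feed_bytes`, `parse_with_timeout`, `BROWSER_AGENT`, and the `opener` parameter are made up for illustration and are not part of feedparser.

```python
import urllib.request

# User-Agent string taken from the workaround earlier in this thread.
BROWSER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5)"


def fetch_feed_bytes(url, agent=BROWSER_AGENT, timeout=10.0,
                     opener=urllib.request.urlopen):
    """Download url with a browser-like User-Agent and a socket timeout.

    The timeout applies to each blocking socket operation (connect, read),
    not to the download as a whole. `opener` is injectable for testing.
    """
    request = urllib.request.Request(url, headers={"User-Agent": agent})
    with opener(request, timeout=timeout) as response:
        return response.read()


def parse_with_timeout(url, agent=BROWSER_AGENT, timeout=10.0):
    """Fetch the feed ourselves, then let feedparser parse the raw bytes."""
    import feedparser  # imported here so the download helper stays stdlib-only
    return feedparser.parse(fetch_feed_bytes(url, agent, timeout))
```

With this approach a stalled server should raise an exception (for example `urllib.error.URLError` or a socket timeout error) instead of blocking forever. Because feedparser no longer sees the HTTP response, fields derived from it, such as the status code and response headers, will not be present in the result.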