kurtmckee/feedparser

feedparser.parse fails on https://www.howtogeek.com/feed/

feedparser 6.0.10
Ubuntu 24.04

feedparser.parse("https://www.howtogeek.com/feed/") is failing (possibly an HTTPS timeout?):

Traceback (most recent call last):
  File "xxxx/filter_rss4.py", line 47, in <module>
    original_feed = feedparser.parse(original_feed_url)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/feedparser/api.py", line 216, in parse
    data = _open_resource(url_file_stream_or_string, etag, modified, agent, referrer, handlers, request_headers, result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/feedparser/api.py", line 115, in _open_resource
    return http.get(url_file_stream_or_string, etag, modified, agent, referrer, handlers, request_headers, result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/feedparser/http.py", line 171, in get
    f = opener.open(request)
        ^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 515, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 532, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 1392, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/urllib/request.py", line 1348, in do_open
    r = h.getresponse()
        ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 1428, in getresponse
    response.begin()
  File "/usr/lib/python3.12/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/http/client.py", line 300, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

I'm not able to reproduce on Python 3.12 using feedparser 6.0.11 on Windows.

Feedparser will eventually remove its custom HTTP client code in favor of requests. For now I recommend installing requests, calling requests.get(), and handling exceptions yourself. If the request is successful, pass the response into feedparser.

Same issue with requests.get().

I solved it with pycurl:

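	from io import BytesIO
	import feedparser
	import pycurl

	# Fetch the feed with libcurl, bypassing feedparser's built-in HTTP client.
	buffer = BytesIO()
	c = pycurl.Curl()
	c.setopt(c.URL, original_feed_url)  # the feed URL from the report
	c.setopt(c.WRITEDATA, buffer)
	c.perform()
	c.close()
	# Hand the downloaded document (not the URL) to feedparser.
	original_feed = feedparser.parse(buffer.getvalue().decode('utf-8'))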

In my experience, this kind of failure is sometimes caused by the site blocking certain user agents, and the blocking can be inconsistent. pycurl may be sending a User-Agent header that doesn't trigger the site's blocking.
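
One way to test the User-Agent theory is to send an explicit header and see whether the site responds. The sketch below uses only the standard library. The helper names and the User-Agent string are hypothetical, and nothing here confirms which values the site actually accepts:

```python
import urllib.request

# Hypothetical value; substitute whatever identifies your script.
USER_AGENT = "my-feed-filter/1.0 (+https://example.com/contact)"


def build_request(url, user_agent=USER_AGENT):
    """Build a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_bytes(url, user_agent=USER_AGENT, timeout=30):
    """Download url and return the raw response body."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

The bytes returned by fetch_bytes() can be passed straight to feedparser.parse(). feedparser.parse() also accepts an agent argument that sets the User-Agent for its own HTTP client, so feedparser.parse(url, agent=USER_AGENT) is a shorter way to try the same thing.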

I'm glad that you were able to work around this!