HTTP Forbidden with urlopen but not with CURL
HerveMignot opened this issue · 4 comments
When using urlopen(), an HTTP 403 is returned, while the same URL works fine with curl or from a web browser.
Example:
from urllib.request import urlopen
response = urlopen('https://dpaste.org/guj5/raw')
raises an HTTPError: HTTP Error 403: Forbidden
curl -s https://dpaste.org/guj5/raw
is working fine.
I found this while updating the dpaste-magic Jupyter command (new raw output and the move to dpaste.org).
Yes, this is intentional; it's a Cloudflare feature: https://support.cloudflare.com/hc/en-us/articles/200170086 Is the raw mode actually usable for you now that it contains HTML? If so, I can whitelist it.
#120 is related.
Actually, I would prefer that you simply send a User-Agent header. That satisfies the check, and it gives me more insight into and control over the 'good' bots.
If you run into other issues in the future, I can whitelist more specifically.
>>> from urllib.request import Request, urlopen
>>> r = Request("https://dpaste.org/guj5/raw")
>>> r.add_header("User-Agent", "dpaste-magic Jupyter integration")
>>> urlopen(r).read()
b'<!DOCTYPE html>\n<html>\n<head>\n <title>guj5</title>\n <meta name="viewport" content="width=device-width, initial-scale=1.0"/>\n <meta name="robots" content="noindex, nofollow"/>\n</head>\n<body>\n\n<pre>ceci est un\nligne\nde \ncode\n</pre>\n\n</body>\n</html>\n'
I have added a basic HTML parser on the raw mode to get the content of the <pre> element.
So yes, the raw mode is usable.
(BTW, when I fetch a paste back, I get a spurious newline at the end of the text if the original text had none. I need to investigate whether it comes from HTMLParser or not, since the <pre> element content seems to be fine.)
I'll add a User-Agent; that's the best option IMHO. I need to re-engineer a little, since I was using an IPython function to fetch the paste, which gives no control over the User-Agent, but I can do it directly in my code.
Thank you for the quick answer.
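For reference, the approach described above (send a User-Agent, then pull the text out of the <pre> element) could look roughly like the sketch below. It is only an illustration built on the standard library, not the actual dpaste-magic code: the names USER_AGENT, PreExtractor, extract_pre and fetch_paste are hypothetical, and falling back to UTF-8 when the response declares no charset is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical value; any descriptive User-Agent string should pass the check.
USER_AGENT = "dpaste-magic Jupyter integration"


class PreExtractor(HTMLParser):
    """Collect the text found inside <pre> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_pre = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it as-is.
        if self._in_pre:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def extract_pre(html):
    """Return the <pre> content of an HTML document, unmodified."""
    parser = PreExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_paste(url, user_agent=USER_AGENT):
    """Fetch a raw paste URL with an explicit User-Agent header."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        # Assumption: fall back to UTF-8 if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_pre(response.read().decode(charset))
```

In this sketch the text is joined exactly as the parser delivers it, so a body containing <pre>test</pre> yields 'test' with nothing appended; any trailing newline would have to come from the page itself or from later processing.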
Yes, I guess it's the parser. I just double-checked, and the pre tags should be fine.
>>> r = Request("https://dpaste.org/jtzB/raw")
>>> r.add_header("User-Agent", "dpaste-magic Jupyter integration")
>>> urlopen(r).read()
b'<!DOCTYPE html>\n<html>\n<head>\n <title>jtzB</title>\n <meta name="viewport" content="width=device-width, initial-scale=1.0"/>\n <meta name="robots" content="noindex, nofollow"/>\n</head>\n<body>\n\n<pre>test</pre>\n\n</body>\n</html>\n'
Cloudflare also allows me to (hopefully) reliably block requests from Tor networks, which are the only source of legally problematic content for me. I'll keep watching it for a bit and hope I can bring back the 'real' raw view.