DarrenOfficial/dpaste

HTTP Forbidden with urlopen but not with CURL

HerveMignot opened this issue · 4 comments

When using urlopen(), an HTTP 403 is returned, while the same URL works fine with curl or from a web browser.

Example:

from urllib.request import urlopen
response = urlopen('https://dpaste.org/guj5/raw')

raises an HTTPError: HTTP Error 403: Forbidden

curl -s https://dpaste.org/guj5/raw is working fine.

I found this while updating the dpaste-magic Jupyter command (new raw output + move to dpaste.org).

Yes, this is intentional; it's a Cloudflare feature: https://support.cloudflare.com/hc/en-us/articles/200170086 Is the raw mode actually usable for you now that it contains HTML? If so, I can whitelist it.

#120 is related.

Actually, I would prefer it if you simply sent a User-Agent. That satisfies the check, and it gives me more insight into, and control over, the 'good' bots.

If you run into other issues in future I can whitelist more specifically.

>>> from urllib.request import Request, urlopen
>>> r = Request("https://dpaste.org/guj5/raw")
>>> r.add_header("User-Agent", "dpaste-magic Jupyter integration")
>>> urlopen(r).read()
b'<!DOCTYPE html>\n<html>\n<head>\n  <title>guj5</title>\n  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>\n  <meta name="robots" content="noindex, nofollow"/>\n</head>\n<body>\n\n<pre>ceci est un\nligne\nde \ncode\n</pre>\n\n</body>\n</html>\n'
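For context on why the explicit header makes a difference: when no User-Agent is set, urllib sends its own default of the form Python-urllib/X.Y. The sketch below only prints that default. That this particular value is what the Cloudflare check rejected is an assumption; the thread does not confirm it. The helper name default_user_agent is made up for illustration.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent urllib sends when the caller sets none."""
    # build_opener() returns an OpenerDirector whose addheaders list
    # holds the default ('User-agent', 'Python-urllib/X.Y') pair.
    opener = urllib.request.build_opener()
    return dict(opener.addheaders).get("User-agent")


if __name__ == "__main__":
    print(default_user_agent())
```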

I have added a basic HTML parser for the raw mode to extract the content of the <pre> element.
So yes, the raw mode is usable.
(BTW, when I fetch a paste back I get a spurious newline at the end of the text if the original had none. I still need to investigate whether it comes from HTMLParser or not, since the content of the <pre> element seems to be fine.)
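A minimal sketch of such a parser, built on the standard library's html.parser.HTMLParser. This is not the actual dpaste-magic code; the names PreExtractor and extract_pre and the UTF-8 default are assumptions. It returns the text between the pre tags exactly as served, which makes it easier to check where a trailing newline comes from.

```python
from html.parser import HTMLParser


class PreExtractor(HTMLParser):
    """Collect the text found inside <pre>...</pre> (hypothetical helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes &lt; and friends
        self._in_pre = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate and join later.
        if self._in_pre:
            self._chunks.append(data)

    @property
    def text(self):
        return "".join(self._chunks)


def extract_pre(html_bytes, encoding="utf-8"):
    """Decode the response body (UTF-8 assumed) and return the <pre> text."""
    parser = PreExtractor()
    parser.feed(html_bytes.decode(encoding))
    parser.close()  # flush any buffered text
    return parser.text
```

Calling extract_pre() on the bytes returned by urlopen(r).read() gives the paste text without the surrounding HTML, with no whitespace added or removed by the helper itself.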

I'll add a User-Agent, that's the best option IMHO. I need to re-engineer things a little, since I was using an IPython function to fetch the paste, which gives no control over the User-Agent, but I can make the request directly in my own code.
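Doing the request directly could look roughly like the sketch below. The function names build_request and fetch_raw and the 10-second timeout are assumptions for illustration; the User-Agent string is the one from the example earlier in this thread.

```python
from urllib.request import Request, urlopen

# User-Agent string taken from the example earlier in this thread.
USER_AGENT = "dpaste-magic Jupyter integration"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_raw(url, user_agent=USER_AGENT, timeout=10):
    """Fetch the raw view and return the response body as bytes."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()
```

Keeping the Request construction in its own function means the header can be checked without touching the network.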

Thank you for the quick answer.

Yes I guess it's the parser. I just double checked it and the pre tags should be fine.

>>> r = Request("https://dpaste.org/jtzB/raw")
>>> r.add_header("User-Agent", "dpaste-magic Jupyter integration")
>>> urlopen(r).read()
b'<!DOCTYPE html>\n<html>\n<head>\n  <title>jtzB</title>\n  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>\n  <meta name="robots" content="noindex, nofollow"/>\n</head>\n<body>\n\n<pre>test</pre>\n\n</body>\n</html>\n'

Cloudflare also allows me to (hopefully) reliably block requests from Tor networks, which are the only source of legally problematic content for me. I'll keep watching it for a bit and hope I can bring back the 'real' raw view.