Login failure due to Cloudflare intercepting requests
Closed this issue · 56 comments
Here is the response I'm getting. It seems to be blocked by Cloudflare...
Sorry, you have been blocked
You are unable to access kenpom.com
<span class="cf-no-screenshot error"></span>
</div>
</div>
</div><!-- /.captcha-container -->
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="blocked_why_headline">Why have I been blocked?</h2>
<p data-translate="blocked_why_detail">This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.</p>
</div>
<div class="cf-column">
<h2 data-translate="blocked_resolve_headline">What can I do to resolve this?</h2>
<p data-translate="blocked_resolve_detail">You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.</p>
</div>
</div>
</div><!-- /.section -->
Thanks for the report! This was previously discovered in #24. A patch was subsequently issued and merged into master as part of #25; however, the release on PyPI hasn't caught up to what we've got here.
If you need this functionality in the immediate term, you could install from the GitHub source but I imagine @j-andrews7 will push a new release with all the bug fixes we've got in the hopper before the season starts.
Thanks guys!!
Any update on this issue?
Thanks boss! Love what you've done here
This should be resolved in the latest release. Or at least the tests pass. If everything's still broken, blame @esqew.
Do not actually do that - he's the only reason this ever got fixed. Thanks Sean!
pip install --upgrade kenpompy
should get you the latest version with this fix and a few others.
Working fine for me.
What's pip freeze | grep kenpom give ya?
kenpompy==0.3.3
Hm. @esqew, have you noticed any inconsistencies with this? Working fine with python 3.8-3.10.
@trludt can you try a few more times and see if you can get through? We are somewhat at Ken's mercy here, and if for whatever reason your IP is subject to more stringent measures, there's likely not a ton we can do.
I've tried a bunch still no luck... This is wild. I probably pulled data once per day of the season last year. Could that have restricted my IP? And does a VPN change anything?
I asked Ken about this before I published it, and he basically said as long as people didn't abuse it, he was totally cool with it. I can't imagine once a day is a problem.
Not sure about the VPN, though it could make sense if it's changing your IP geolocation to an area identified as more likely to be an attack.
I just shot him an email. Hopefully he gets back to me. I've tried the browser = (email, pass) line at least 50 times today and still errors every time.
FYI I'm using Spyder v5.1.5 and Python 3.9.12 in my script if that helps towards anything
I unfortunately cannot reproduce this on my end, and the CI/CD tests checked out fine last night.
@trludt Where does your VPN reside geographically? Cloudflare will more closely scrutinize connections from IPs in higher-risk areas. Separately, can you provide a stripped down example that demonstrates this? Is it possible you're also trying to log in a bunch of times or scrape a ton of data in a short period?
Thanks for the context. Without being able to reproduce this issue, it's likely this is something relating to how Cloudflare is handling your specific connection or sessions.
It's possible Cloudflare's heuristics have detected either/both that (among other possibilities) (a) your account has logged in from multiple geographies in too short of a time span ("impossible travel" anomaly detection), or (b) your script, at one point or another, launched too many requests and Cloudflare is now taking it upon itself to block further requests for an indefinite amount of time.
If you're able, can you set your debugger to break on this exception and dump the contents of str(browser.page.contents) here to ensure that this exception is firing correctly at the very least?
Would that be in my script or altering the utils.py file under the Kenpompy code? @esqew
Neither, but now that I think about it, there may be an easier way. This simplified script should make it easier to get the contents of the page that Cloudflare is returning, so we can see if there's anything specific about your case we can pick out from it:
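import mechanicalsoup
from kenpompy.utils import login

try:
    browser = login('<email>', '<password>')
except Exception:
    # login() raised before returning a browser, so `browser` is unbound here.
    # Re-open the page directly to capture the HTML Cloudflare sent back.
    browser = mechanicalsoup.StatefulBrowser()
    browser.open('https://kenpom.com/index.php')
    print(str(browser.page.contents))
    raise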
Ultimately what I would really like to understand is what HTTP status code Cloudflare is throwing to see if that might be more illuminating in this situation, but the way we've got MechanicalSoup set up doesn't currently capture the right details to enable this. Hopefully the HTML Cloudflare is sending back will indicate something helpful to preclude this next level of debugging.
['html', '[if lt IE 7]> <![endif]', '[if IE 7]> <![endif]', '[if IE 8]> <![endif]', '[if gt IE 8]><!',
<title>Attention Required! | Cloudflare</title> <style>body{margin:0;padding:0}</style> <script> if (!navigator.cookieEnabled) { window.addEventListener('DOMContentLoaded', function () { var cookieEl = document.getElementById('cookie-alert'); cookieEl.style.display = 'block'; }) } </script>Sorry, you have been blocked
You are unable to access kenpom.com
Why have I been blocked?
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.
What can I do to resolve this?
You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.
Cloudflare Ray ID: 766f3588cdb85485 • Performance & security by Cloudflare
<script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>
Sorry, that was when I was connected to VPN.
['html', '[if lt IE 7]> <![endif]', '[if IE 7]> <![endif]', '[if IE 8]> <![endif]', '[if gt IE 8]><!',
<title>Attention Required! | Cloudflare</title> <style>body{margin:0;padding:0}</style> <script> if (!navigator.cookieEnabled) { window.addEventListener('DOMContentLoaded', function () { var cookieEl = document.getElementById('cookie-alert'); cookieEl.style.display = 'block'; }) } </script>Sorry, you have been blocked
You are unable to access kenpom.com
Why have I been blocked?
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.
What can I do to resolve this?
You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.
Cloudflare Ray ID: 766f4301ccfdf7e4 • Performance & security by Cloudflare
<script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>
So it's definitely a Cloudflare interstitial (so the exception itself is working as intended). Unfortunately without a reliable way to reproduce this behavior outside your specific environment there's not much else we can advise you to do. Cloudflare deliberately doesn't share a ton of details to make it more difficult to circumvent their protections. Unless you're able to get in touch with KenPom himself (or someone who handles his Cloudflare account) and can look at the logs specific to the Cloudflare Ray ID values from the markup you posted, we're just shooting in the dark.
If you have the resources available, you may consider attempting to try this same code from another machine and/or using another KenPom account to see if it's reproducible across environments/accounts. The only other course of action I could recommend at this point would be to drop this altogether for at least a couple days to let any rate limiting that may be at play get fully reset.
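For anyone collecting details to send along with a report, here is a minimal sketch (standard library only) that pulls the page title and the Ray ID out of a saved copy of the interstitial HTML. The helper name summarize_block_page and the assumption that a Ray ID is 16 hex characters are mine, based only on the two IDs posted above; Cloudflare's markup may differ or change.

```python
import re
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Assumption: Ray IDs look like the 16-hex-character values posted in this thread,
# optionally wrapped in one or more HTML tags after the label.
RAY_ID_PATTERN = re.compile(
    r"Cloudflare Ray ID:\s*(?:<[^>]+>\s*)*([0-9a-f]{16})", re.IGNORECASE
)


def summarize_block_page(html):
    """Return the title, a rough Cloudflare check, and the Ray ID (or None)."""
    parser = _TitleParser()
    parser.feed(html)
    match = RAY_ID_PATTERN.search(html)
    return {
        "title": parser.title.strip(),
        "is_cloudflare": "cloudflare" in parser.title.lower(),
        "ray_id": match.group(1) if match else None,
    }
```

Feeding it the text of the response (for example, the string form of browser.page) should give a small dict that is easier to paste into an email than the full markup.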
I tried a different KenPom account and that didn't work. Also tried from my PC instead of my Macbook, no luck. I shot Ken an email with the Cloudflare Ray ID and my IP... hopefully I can get unbanned 🙏
This tool rocks and would hate to lose access to it. Truly had no intention of abusing it or overloading his site.
I appreciate @esqew @j-andrews7 for the assistance, much appreciated!
@trludt Please let me know if he reaches out to you on this issue. I am having similar results being blocked by Cloudflare. I originally had my own scraping pipeline developed through beautifulsoup that I used last year (about daily similar to you). Recently I ran my script to see if it still worked for this season and I received the 403 forbidden code. I then tried this python library as an alternative and received the same block. In my personal pipeline I also tried modifying User-Agent status and several other request headers and nothing resolved the issue.
@trludt and @AVA-27 I have also received the same LinkNotFoundError (now replaced by Exception: Opening kenpom.com failed - request was intercepted by Cloudflare protection, after forcing a kenpompy update via pip).
I have also reached out to Ken via email and will let you know if I hear anything.
For context, I use a VPN (NordVPN) out of Atlanta (physically located in TN). Last night at about 2:00 AM CST I received the first error (on VPN), and tried again disconnected (still received LinkNotFound error).
Today, after updating, I explicitly tried to re-run while being disconnected from VPN to see if there was a difference, but I'm afraid I might already be blacklisted b/c of the re-geotagging of my first request via VPN.
@RobMepham I never utilized a VPN to access the website and still cannot gain access so I am not sure if that is playing a role or not.
Edit: I also reached out to Ken via email; if I receive a response I will update as well.
Good to know I'm not alone! Currently working on migrating my script over to a lambda function on AWS, which was ultimately my plan all along. Maybe this is a good thing my hand is being forced 🤣
Can confirm I am experiencing this issue as well.
So it seems Cloudflare interception may be more prevalent than I had initially anticipated. At this point, we know (or can reasonably infer) that the following factors play at least something of a role in determining whether or not to intercept requests, likely among many, many others:
- Requesting against KenPom.com without including a User-Agent header
- Using a "browser" without JavaScript support or otherwise disabled (curl kenpom.com will, more often than not, result in the same Cloudflare interstitial)
- Using while connected to a public VPN provider
- Connecting from an IP block associated with a "high risk" geolocation
The latter two are a bit out of our scope of control in this instance, but I will be carving time in the coming days to tee up some experimentation where we can try to reconfigure how the requests are sent for those affected by this. Watch this space!
This has been incredibly frustrating. I've been having my friend run this on his computer. Sometimes it works, sometimes it doesn't. It hasn't been working in an AWS lambda function either, triggering the Cloudflare exception.
Considering we all already pay for the access to the data, I wish we could at least get a response back from him why he's doing this
It's a busy time of year for him, I expect he's travelling a lot and such. He has responded to me in the past via e-mail, so I'd just advise some patience. It is an off-label use and all.
Like @esqew said, there are a few kind of hack-y workarounds we can try, but there's no guarantee. Given that neither of us is able to reliably replicate the issue, we may need folks to test, so stay tuned.
I had a few hours this morning to start experimenting specifically with how the User-Agent value actually affects what comes back from the server.
In short, using valid User-Agent values from modern browsers (h/t to @pzb's list) actually triggers the Cloudflare interstitial, and flipping back to the standard Mozilla/7.0 we've got in main right now works as intended (in my environment, anyway). On this same tested-working connection path to the target site, curl actually works as well when given the explicit User-Agent value to use: curl "https://kenpom.com" -A "Mozilla/7.0"
As such, I think it's fair to surmise the User-Agent header as it's currently configured in utils.py is unlikely to be the culprit causing these interstitials to pop.
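For anyone who wants to repeat this kind of User-Agent comparison from their own connection, here is a rough sketch using only the standard library. The helper names build_request and fetch_status are hypothetical, and urllib is a stand-in for the requests/MechanicalSoup stack kenpompy actually uses, so results may not match exactly.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx responses."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses; the code is what we want to compare.
        return err.code
```

Calling fetch_status("https://kenpom.com", "Mozilla/7.0") and then again with a different User-Agent string gives two status codes to compare. Keep the number of calls small so the test itself doesn't look like abuse.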
Knowing this, I would deduce that the blocks that have been reported are driven more by the IP ranges from which folks are running this code. Since popular VPN and cloud providers' IP blocks are well known, it is quite trivial for Cloudflare to block traffic originating from these sources. Since I haven't used Cloudflare in many years I had to do some research on how this is typically configured - turns out this blocking is fine-tuned by site owners/administrators. From a thread describing a similar issue on Cloudflare Community:
At Cloudflare you have many options to block. You can filter for ASN, IP, IP ranges, country etc… and many more. But if you are blocked, then the website owners configured it to do so. Feel free to contact the website owner and ask him these questions
What I am curious about is if this protection would also apply to a setup which would present itself as a modern browser with full-fledged support for JavaScript (a la Selenium, pyppeteer) that's running from within one of these IP ranges, but I dread the idea of bringing on something that has such a huge computational overhead compared to what we have working right now.
In any event, I'm going to start experimenting down this route running in an Azure/Colab/other cloud environment and report back. If this does seem to alleviate some of this blockage we'll have to see what we can do in terms of marking these more resource-heavy libraries as extras to be used in the event that all else fails.
I would also really like to see if Ken does return anyone's email and what his opinion on the whole thing is, but I'm not holding my breath - he is a pretty busy guy.
I'm curious if he's open to someone helping him build an API that people can pay a little extra for access to the data. I brought that up in my last email to him
Wanted to chime in here for anyone still having issues. We were able to work around the 403 error by adding this line to our code.
browser.set_user_agent('any-random-thing')
Added that line right below "browser = mechanicalsoup.StatefulBrowser()"
This was following advice from this post https://stackoverflow.com/questions/48506614/403-error-with-mechanicalsoup.
Edit: Looks like there might be something else involved. Our data analyst was using JupyterLab to run his code, and even with the change he was still getting 403'd. Maybe an issue with module versions or how Jupyter formats the traffic. I installed Python fresh with all updated modules and it worked running from the Python shell on two computers, each with a different outgoing IP.
Works in Python 3.11.0
kenpompy-0.3.3
mechanicalsoup-1.2.0
pandas-1.5.1
bs4-0.0.1
beautifulsoup4-4.11.1
requests-2.28.1
lxml-4.9.0 (manually installed because windows)
python-dateutil-2.8.2
certifi-2022.9.24
charset-normalizer-2.1.1
idna-3.4
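To make version lists like the one above easier to compare between a working and a non-working machine, here is a small sketch that reports installed versions with importlib.metadata. The function name installed_versions and the default package list are illustrative choices, not part of kenpompy.

```python
from importlib import metadata

# Packages mentioned in this thread; adjust as needed.
PACKAGES = (
    "kenpompy",
    "mechanicalsoup",
    "requests",
    "urllib3",
    "lxml",
    "beautifulsoup4",
)


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Printing the result of installed_versions() in both environments (for instance, a plain Python shell and a Jupyter kernel) should show quickly whether they are really using the same packages.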
@jkiddUA kudos to you, I was able to find a workaround. Updating to python 3.11 has fixed the Cloudflare issue, and I can now load data successfully.
I updated to python 3.11, however I tried to reinstall kenpompy, and receive an error "Error: failed to build wheel for lxml". How did you get around this once you upgraded?
So I had to download the lxml wheel (.whl) from here https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml. Just download the correct one for your machine and place it in your working directory. In my case it was lxml-4.9.0-cp311-cp311-win_amd64.whl because I'm using a 64bit windows machine. And then from my virtual environment I ran the command, pip install lxml-4.9.0-cp311-cp311-win_amd64.whl.
Once that's successfully installed you should be able to run the kenpompy install.
Hi all, my testing of browser control libraries to overcome the reported Cloudflare issues remains ongoing as my availability permits. To recap what we know for sure:
- Connections originating from infrastructure with outbound IPs assigned to popular public cloud providers (Azure/AWS/etc.) will be blocked/filtered
- Connections originating with outbound IPs known to be associated with popular public VPN providers will be blocked/filtered
- Connections without a "valid" User-Agent HTTP header will be blocked/filtered
Potential mitigations:
- Update to the latest version of kenpompy that's been published to PyPI, if you haven't already: python3 -m pip install --upgrade kenpompy
  This will ensure that the latest patch with an appropriate User-Agent header setting is in place.
- Some users indicate that updating to Python 3.11 may help (ref), but it's unclear if this will work for 100% of users experiencing Cloudflare-related issues, or even why it works in the first place
- Using a browser automation framework instead of the MechanicalSoup stateful browser (currently being developed and tested locally)
- For those who must run their workload in a public cloud, it may be worth exploring whether an HTTP proxy might help to mitigate this - I'm looking to see if it might be helpful to allow the passing of a preconfigured MechanicalSoup instance to login() to allow for this type of configuration up front. UPDATE: this is not likely to work after quick tests with cURL and some publicly-available proxies; KenPom.com returns 400 Bad Request for each. May be worthwhile to continue a bit more in-depth testing, but for now I'm considering this a non-viable option
- Write to KenPom and ask (beg/plead) how we might be able to sidestep this protection for scraping purposes
Hi all, I haven't lost sight of this! My local testing continues as my availability allows with a selenium-based retrofit for the MechanicalSoup functionality we use today. I know progress has been excruciatingly slow, and that's simply a byproduct of my full-time role (and my life more generally) being as hectic as ever with the end of the year fast approaching.
Interestingly, using selenium with a "headless" WebDriver is filtered by Cloudflare 100% of the time. This is slightly surprising (at least to me) because primitive/headless-only clients (like cURL), with the proper User-Agent header, are not normally filtered in the same way.
As a result, those who do require headless functionality are likely going to find this solution untenable, at least if/until we find a workaround. I'm aware of projects like selenium-stealth which have in the past been able to overcome this detection of headless webdrivers, but are now largely unmaintained (ostensibly due to the strides these providers have made in detecting them anyway). I'm open to any and all suggestions from the folks here if someone is aware of something more modern that might have a better success rate.
(Somewhat separately but for complete transparency, I had attempted to do this retrofit using a pyppeteer base a few weeks ago, but its asynchronous-first design leads to quite a bit of complexity trying to implement it in a reliable, robust way that also supports everyone who's using it in a Jupyter environment. In the same vein, unless I've missed something huge, I don't think it's got any mitigations for the Cloudflare detections that selenium does not considering it's the same Chrome/Chromium browser at the core. I take that back - seems there is a pyppeteer-stealth package that's actively maintained and may be worth re-exploring this thread as a result.)
The next test I'm taking on is running non-headlessly from within an Azure environment to see if our hypothesis holds true that this should bypass Cloudflare's heuristics. If successful, I'll publish an experimental branch for others to test in the near future. More to come!
I was trying to figure this out yesterday and I can't find anything that would indicate it is an issue specific to MechanicalSoup. In fact, I actually found the problem appeared to originate in the python requests library. There is clearly something that is not "liking" the way the library is sending the requests (Cloudflare most likely has a ruleset in place since the majority of scraping is through python). I can't quite put my finger on it.
I tested with a popular ruby library, HTTParty, while on my VPN (through NordVPN) and got a 200, and like you mentioned, cURL with the same exact headers works.
I am going to continue digging when I get a bit more time.
Hello,
I am new to this package, and I just purchased my kenpom account a few days ago. I am also having the same issue:
Exception: Opening kenpom.com failed - request was intercepted by Cloudflare protection
I am wondering if anyone has had any luck outside of this website to scrape data for NCAA mens while these issues with Cloudflare are being investigated further.
Thanks and looking forward to using this package when this is resolved
@johnfeldhausen Thanks for the report. Can you give us a bit more detail around your environment? Are you running on cloud infrastructure or through a VPN?
@esqew - I did get this to work with the following version (upgraded from 3.10 to 3.11):
Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
I deduce that this may be something with TLS fingerprinting in older versions of the ssl wrapper library. Here is the version of openssl (which is pertinent for the TLS fingerprint issue):
OpenSSL 1.1.1q 5 Jul 2022
@mbrundige Fascinating. I definitely missed an OpenSSL upgrade in the 3.11 RC2 release notes, but sure enough OpenSSL 1.1.1q is mentioned. I don’t doubt TLS fingerprinting plays a major role in Cloudflare heuristics, so this could indeed make a lot of sense if the changes in OpenSSL materially affect how this is carried out.
This would also track with earlier reports that a Ruby-based HTTP library didn't have issues being filtered when MechanicalSoup's requests baseline did.
In any event, this would certainly preclude the need for any browser-based workarounds.
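Since the working theory here hinges on which Python and OpenSSL builds are in play, a quick way to report both is sketched below. The function name tls_environment is made up for illustration; the ssl and sys attributes it reads are standard.

```python
import ssl
import sys


def tls_environment():
    """Report the interpreter version and the OpenSSL build it is linked against."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "openssl": ssl.OPENSSL_VERSION,
        "openssl_version_info": ssl.OPENSSL_VERSION_INFO,
    }
```

Including the output of tls_environment() in a report makes it easier to tell whether two "identical" setups actually negotiate TLS with the same library.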
@steveroks @trludt @AVA-27 @RobMepham @Harrisoneller If possible, would each of you mind updating to the latest kenpompy and Python 3.11.0 and report back if this resolves your Cloudflare-related issues?
I will assume that if we don't hear back from anyone in the next week or so that the proposed fix (updating to Python 3.11) is working for those still otherwise experiencing this issue, and will close this issue accordingly.
I'll also add a small section to README or a Wiki page to summarize the issue & fix as a place to point future users to.
Closing due to inactivity per my previous comment. We will consider the guidance to update to Python 3.11.x as the best solution. On the flipside, I am open to more reports in the future that might contradict that, in which case we'll re-open this and do further investigation on a case-by-case basis.
Thanks to everyone for their assistance to date!
@esqew I found a workaround for Python < 3.11
@mbrundige was right that it was a TLS fingerprinting issue.
It seems it is a cipher negotiation failure between requests and kenpom.
I noticed that the ciphers were different in 3.11 than in 3.9.
So, I installed Python 3.11 in a virtual env and ran the following to export the ciphers:
python -c "import ssl; print(ssl._DEFAULT_CIPHERS);"
I then went back to my 3.9 environment and appended the 3.11 ciphers:
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS += ':@SECLEVEL=2:ECDH+AESGCM:ECDH+CHACHA20:ECDH+AES:DHE+AES:!aNULL:!eNULL:!aDSS:!SHA1:!AESCCM'
I don't believe this is the "safest" workaround as it could break some requests, but it works for this use case.
For anyone that may proceed with this, I recommend exporting your current python environment's SSL ciphers and keeping them so you can revert this change in case your requests package breaks.
Let me know if anyone needs any more context on this workaround.
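As a rough illustration of comparing cipher sets between two interpreters, the sketch below lists the ciphers a default SSL context enables using the public ssl API (rather than the private ssl._DEFAULT_CIPHERS attribute) and diffs two such lists. The names cipher_names and diff_ciphers are hypothetical helpers, not something from requests or kenpompy.

```python
import ssl


def cipher_names(cipher_string=None):
    """List cipher names enabled on a default context, optionally restricted
    by an OpenSSL cipher string."""
    context = ssl.create_default_context()
    if cipher_string is not None:
        context.set_ciphers(cipher_string)
    return sorted(cipher["name"] for cipher in context.get_ciphers())


def diff_ciphers(ours, theirs):
    """Show which cipher names appear in only one of the two lists."""
    ours_set, theirs_set = set(ours), set(theirs)
    return {
        "only_ours": sorted(ours_set - theirs_set),
        "only_theirs": sorted(theirs_set - ours_set),
    }
```

One way to use it: run cipher_names() under each Python version, save both lists, and pass them to diff_ciphers to see what actually differs before changing anything.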
EDIT:
After reading up more on this issue, it is unsafe to change the ciphers this way.
Instead, the requests development team recommends enabling specific ciphers for specific sites.
Reference
We can create a requests session with the new ciphers and pass that session to mechanicalsoup.StatefulBrowser.
I modified some code from the reference above.
import mechanicalsoup
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.ssl_ import create_urllib3_context

# Cipher string taken from the Python 3.11 defaults described above.
CIPHERS = (
    ':@SECLEVEL=2:ECDH+AESGCM:ECDH+CHACHA20:ECDH+AES:DHE+AES:!aNULL:!eNULL:!aDSS:!SHA1:!AESCCM'
)

class DESAdapter(HTTPAdapter):
    """
    A TransportAdapter that applies the custom cipher list above in Requests.
    (The class name is carried over from the referenced example.)
    """
    def init_poolmanager(self, *args, **kwargs):
        # Use the custom ciphers for direct connections.
        context = create_urllib3_context(ciphers=CIPHERS)
        kwargs['ssl_context'] = context
        return super(DESAdapter, self).init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        # Use the same ciphers when connecting through a proxy.
        context = create_urllib3_context(ciphers=CIPHERS)
        kwargs['ssl_context'] = context
        return super(DESAdapter, self).proxy_manager_for(*args, **kwargs)

s = requests.Session()
# Only URLs starting with this prefix go through the custom adapter.
s.mount('https://kenpom.com/index.php', DESAdapter())
browser = mechanicalsoup.StatefulBrowser(s)
browser.set_user_agent('Mozilla/5.0')
r = browser.open('https://kenpom.com/index.php')
Result: Response [200]
Absolutely brilliant analysis @nickostendorf! Really appreciate you providing this info.
At first glance I am ok with the solution as proposed but would like to take some time to fully understand the (primarily security-related) implications (if any) of the fix for my own knowledge. I also want to make sure we logically gate this to apply only for those running Python 3.10.x and below.
Once I've gotten a grasp on it I plan to branch master and test in my own Azure environment. If you have time of your own in the next day or so and you're so inclined, feel free to take a stab at a PR for this.
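The version gate mentioned above could look something like the sketch below. This is only an illustration of the idea, with a made-up function name (needs_cipher_override); it is not necessarily how the fix ended up being implemented in kenpompy.

```python
import sys


def needs_cipher_override(version_info=None):
    """Return True when running on Python 3.10.x or older, where the
    custom cipher list discussed in this thread would be applied."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor) so patch releases don't matter.
    return tuple(version_info[:2]) < (3, 11)
```

A login helper could then mount the custom adapter only when needs_cipher_override() is true and otherwise leave the session's default TLS settings alone.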
I'm happy to report that the proposed fix seems to alleviate the Cloudflare issue in all the environments in which I've been able to re-create the Cloudflare filtering issue! I would encourage everyone who is so inclined to pull a new copy of the repo locally and run the test suite to confirm that this is working for them.
Thanks again to you @nickostendorf - will be lodging a PR against master in a few minutes to have @j-andrews7 merge and push a new release to PyPI.
@esqew 100% coverage on the test suite!
Great work and thank you for implementing.
Thanks for the work all. I'll push a new package release to PyPI in the next day or two.
New release is out. Merry Chrysler.