j-andrews7/kenpompy

Login failure due to Cloudflare intercepting requests

Closed this issue · 56 comments

Hello,

I am getting the following error when trying to use the login function: LinkNotFoundError()

I am using the following code:

browser = login(email, password)

This used to work fine last season

Appreciate the help!

Here is the response I'm getting. It seems the request is being blocked by Cloudflare:

Sorry, you have been blocked

You are unable to access kenpom.com

Why have I been blocked?

This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.

What can I do to resolve this?

You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.

esqew commented

Thanks for the report! This was previously discovered in #24. A patch was subsequently issued and merged into master as part of #25; however, the release on PyPI hasn't caught up to what we've got here.

If you need this functionality in the immediate term, you could install from the GitHub source but I imagine @j-andrews7 will push a new release with all the bug fixes we've got in the hopper before the season starts.

Thanks guys!!

Any update on this issue?

Thanks boss! Love what you've done here

This should be resolved in the latest release. Or at least the tests pass. If everything's still broken, blame @esqew.

Do not actually do that - he's the only reason this ever got fixed. Thanks Sean!

pip install --upgrade kenpompy should get you the latest version with this fix and a few others.

Still getting a Cloudflare error?

Working fine for me.

What's pip freeze | grep kenpom give ya?

kenpompy==0.3.3
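
For anyone running inside Spyder or Jupyter, where the interpreter may not be the one `pip` points at, the same check can be done from Python itself. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of kenpompy.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the version of `package` visible to this interpreter, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version this interpreter sees, or None if kenpompy is not installed.
print(installed_version("kenpompy"))
```

If this prints a different version than `pip freeze` does, the script is probably running under a different Python environment than the one that was upgraded.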

Hm. @esqew, have you noticed any inconsistencies with this? Working fine with python 3.8-3.10.

@trludt can you try a few more times and see if you can get through? We are somewhat at Ken's mercy here, and if for whatever reason your IP is subject to more stringent measures, there's likely not a ton we can do.

I've tried a bunch still no luck... This is wild. I probably pulled data once per day of the season last year. Could that have restricted my IP? And does a VPN change anything?

I asked Ken about this before I published it, and he basically said as long as people didn't abuse it, he was totally cool with it. I can't imagine once a day is a problem.

Not sure about the VPN, though it could make sense if it's changing your IP geolocation to an area identified as more likely to be an attack.

I just shot him an email. Hopefully he gets back to me. I've tried the browser = login(email, password) line at least 50 times today and it still errors every time.

FYI I'm using Spyder v5.1.5 and Python 3.9.12 in my script if that helps towards anything

esqew commented

I unfortunately cannot reproduce this on my end, and the CI/CD tests checked out fine last night.

@trludt Where does your VPN reside geographically? Cloudflare will more closely scrutinize connections from IPs in higher-risk areas. Separately, can you provide a stripped down example that demonstrates this? Is it possible you're also trying to log in a bunch of times or scrape a ton of data in a short period?

I am always running into the Cloudflare error on that login step. I'm not always using the VPN (ExpressVPN) but when I am using it, my connection runs through primarily big US cities (Atlanta, Dallas, Phoenix)

esqew commented

Thanks for the context. Without being able to reproduce this issue, it's likely this is something relating to how Cloudflare is handling your specific connection or sessions.

It's possible that Cloudflare's heuristics have detected one or both of the following (among other possibilities): (a) your account has logged in from multiple geographies in too short a time span ("impossible travel" anomaly detection), or (b) your script, at one point or another, launched too many requests and Cloudflare is now blocking further requests for an indefinite amount of time.

If you're able, can you set your debugger to break on this exception and dump the contents of str(browser.page.contents) here to ensure that this exception is firing correctly at the very least?

Would that be in my script or altering the utils.py file under the Kenpompy code? @esqew

esqew commented

Neither, but now that I think about it there may be an easier way. This simplified script should make it easier to get the contents of the page that Cloudflare is throwing, so we can see if there's anything specific about your case we can pick out from it:

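from kenpompy.utils import login

try:
    browser = login('<email>', '<password>')
except Exception:
    # NOTE: `browser` is only bound here if it survives from an earlier run
    # (e.g. in an interactive Spyder/Jupyter session); otherwise this print
    # raises NameError instead of dumping the page.
    print(str(browser.page.contents))
    raise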

Ultimately what I would really like to understand is what HTTP status code Cloudflare is throwing to see if that might be more illuminating in this situation, but the way we've got MechanicalSoup set up doesn't currently capture the right details to enable this. Hopefully the HTML Cloudflare is sending back will indicate something helpful to preclude this next level of debugging.
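
As a rough illustration of capturing both the status code and the block page in one place, here is a minimal sketch. The marker strings are taken from the Cloudflare page quoted in this thread; the helper names `looks_like_cloudflare_block` and `summarize` are hypothetical and not part of kenpompy or MechanicalSoup, and the 403 in the sample call is only the status some users in this thread reported.

```python
# Strings that appear in the Cloudflare block page quoted in this thread.
BLOCK_MARKERS = (
    "Attention Required! | Cloudflare",
    "Sorry, you have been blocked",
)


def looks_like_cloudflare_block(html: str) -> bool:
    """Heuristic only: True if the body contains a known block-page marker."""
    return any(marker in html for marker in BLOCK_MARKERS)


def summarize(status_code: int, html: str) -> str:
    """Combine the HTTP status and the block-page heuristic into one line."""
    if looks_like_cloudflare_block(html):
        verdict = "Cloudflare block page"
    else:
        verdict = "no block markers found"
    return f"HTTP {status_code}: {verdict}"


sample = "<title>Attention Required! | Cloudflare</title> Sorry, you have been blocked"
print(summarize(403, sample))
```

With MechanicalSoup, the object returned by `browser.open(url)` is a `requests.Response`, so `summarize(response.status_code, response.text)` would report both pieces at once.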

`['html', '[if lt IE 7]> <![endif]', '[if IE 7]> <![endif]', '[if IE 8]> <![endif]', '[if gt IE 8]><!',

<title>Attention Required! | Cloudflare</title> <style>body{margin:0;padding:0}</style> <script> if (!navigator.cookieEnabled) { window.addEventListener('DOMContentLoaded', function () { var cookieEl = document.getElementById('cookie-alert'); cookieEl.style.display = 'block'; }) } </script>

Sorry, you have been blocked

You are unable to access kenpom.com

Why have I been blocked?

This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.

What can I do to resolve this?

You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.

Cloudflare Ray ID: 766f3588cdb85485 Your IP: Click to reveal 185.92.26.78 Performance & security by Cloudflare

<script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>
<script> window._cf_translation = {}; </script> , '\n']`

Sorry that was when I was connected to VPN

['html', '[if lt IE 7]> <![endif]', '[if IE 7]> <![endif]', '[if IE 8]> <![endif]', '[if gt IE 8]><!',

<title>Attention Required! | Cloudflare</title> <style>body{margin:0;padding:0}</style> <script> if (!navigator.cookieEnabled) { window.addEventListener('DOMContentLoaded', function () { var cookieEl = document.getElementById('cookie-alert'); cookieEl.style.display = 'block'; }) } </script>

Sorry, you have been blocked

You are unable to access kenpom.com

Why have I been blocked?

This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.

What can I do to resolve this?

You can email the site owner to let them know you were blocked. Please include what you were doing when this page came up and the Cloudflare Ray ID found at the bottom of this page.

Cloudflare Ray ID: 766f4301ccfdf7e4 Your IP: Click to reveal 136.29.142.190 Performance & security by Cloudflare

<script>(function(){function d(){var b=a.getElementById("cf-footer-item-ip"),c=a.getElementById("cf-footer-ip-reveal");b&&"classList"in b&&(b.classList.remove("hidden"),c.addEventListener("click",function(){c.classList.add("hidden");a.getElementById("cf-footer-ip").classList.remove("hidden")}))}var a=document;document.addEventListener&&a.addEventListener("DOMContentLoaded",d)})();</script>
<script> window._cf_translation = {}; </script> , '\n']
esqew commented

So it's definitely a Cloudflare interstitial (so the exception itself is working as intended). Unfortunately without a reliable way to reproduce this behavior outside your specific environment there's not much else we can advise you to do. Cloudflare deliberately doesn't share a ton of details to make it more difficult to circumvent their protections. Unless you're able to get in touch with KenPom himself (or someone who handles his Cloudflare account) and can look at the logs specific to the Cloudflare Ray ID values from the markup you posted, we're just shooting in the dark.
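
Since the Ray ID is the one detail worth forwarding to the site owner, a small helper can pull it out of a dumped page. This is a sketch only: `extract_ray_id` is a hypothetical name, and the pattern assumes the "Cloudflare Ray ID: ..." wording seen in the dumps above, optionally with HTML tags between the label and the value.

```python
import re
from typing import Optional

# Matches the label used on the block page, skipping any tags before the ID.
RAY_ID_PATTERN = re.compile(
    r"Cloudflare Ray ID:\s*(?:<[^>]+>\s*)*([0-9a-f]+)", re.IGNORECASE
)


def extract_ray_id(page_text: str) -> Optional[str]:
    """Return the Ray ID found in the page text, or None if there is none."""
    match = RAY_ID_PATTERN.search(page_text)
    return match.group(1) if match else None


sample = "Cloudflare Ray ID: 766f3588cdb85485 Your IP: Click to reveal"
print(extract_ray_id(sample))
```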

If you have the resources available, you may consider attempting to try this same code from another machine and/or using another KenPom account to see if it's reproducible across environments/accounts. The only other course of action I could recommend at this point would be to drop this altogether for at least a couple days to let any rate limiting that may be at play get fully reset.

I tried a different KenPom account and that didn't work. Also tried from my PC instead of my Macbook, no luck. I shot Ken an email with the Cloudflare Ray ID and my IP... hopefully I can get unbanned 🙏

This tool rocks and would hate to lose access to it. Truly had no intention of abusing it or overloading his site.

I appreciate @esqew @j-andrews7 for the assistance, much appreciated!

@trludt Please let me know if he reaches out to you on this issue. I am having similar results being blocked by Cloudflare. I originally had my own scraping pipeline developed through beautifulsoup that I used last year (about daily similar to you). Recently I ran my script to see if it still worked for this season and I received the 403 forbidden code. I then tried this python library as an alternative and received the same block. In my personal pipeline I also tried modifying User-Agent status and several other request headers and nothing resolved the issue.

@trludt and @AVA-27 I have also received the same LinkNotFoundError (now replaced by "Exception: Opening kenpom.com failed - request was intercepted by Cloudflare protection" after forcing a kenpompy update via pip).

I have also reached out to Ken via email and will let you know if I hear anything.

For context, I use a VPN (NordVPN) out of Atlanta (physically located in TN). Last night at about 2:00 AM CST I received the first error (on VPN), and tried again disconnected (still received LinkNotFound error).

Today, after updating, I explicitly tried to re-run while being disconnected from VPN to see if there was a difference, but I'm afraid I might already be blacklisted b/c of the re-geotagging of my first request via VPN.

@RobMepham I never utilized a VPN to access the website and still cannot gain access so I am not sure if that is playing a role or not.
Edit: I also reached out to Ken via email; if I receive a response I will update as well.

Good to know I'm not alone! Currently working on migrating my script over to a lambda function on AWS, which was ultimately my plan all along. Maybe this is a good thing my hand is being forced 🤣

Can confirm I am experiencing this issue as well.

esqew commented

So it seems Cloudflare interception may be more prevalent than I had initially anticipated. At this point, we know (or can reasonably infer) that the following factors play at least something of a role in determining whether or not to intercept requests, likely among many, many others:

  • Requesting against KenPom.com without including a User-Agent header
  • Using a "browser" that lacks JavaScript support or has it disabled (curl kenpom.com will, more often than not, result in the same Cloudflare interstitial)
  • Using while connected to a public VPN provider
  • Connecting from an IP block associated with a "high risk" geolocation

The latter two are a bit out of our scope of control in this instance, but I will be carving time in the coming days to tee up some experimentation where we can try to reconfigure how the requests are sent for those affected by this. Watch this space!

This has been incredibly frustrating. I've been having my friend run this on his computer. Sometimes it works, sometimes it doesn't. It hasn't been working in an AWS lambda function either, triggering the Cloudflare exception.

Considering we all already pay for the access to the data, I wish we could at least get a response back from him why he's doing this

It's a busy time of year for him, I expect he's travelling a lot and such. He has responded to me in the past via e-mail, so I'd just advise some patience. It is an off-label use and all.

Like @esqew said, there are a few kind of hack-y workarounds we can try, but there's no guarantee. Given that neither of us is able to reliably replicate the issue, we may need folks to test, so stay tuned.

esqew commented

I had a few hours this morning to start experimenting specifically with how the User-Agent value actually affects what comes back from the server.

In short, using valid User-Agent values from modern browsers (h/t to @pzb's list) actually triggers the Cloudflare interstitial, and flipping back to the standard Mozilla/7.0 we've got in main right now works as intended (in my environment, anyway). On this same tested-working connection path to the target site, curl actually works as well when given the explicit User-Agent value to use: curl "https://kenpom.com" -A "Mozilla/7.0". As such, I think it's fair to surmise that the User-Agent header as it's currently configured in utils.py is unlikely to be the culprit causing these interstitials to pop.
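
For anyone who wants to repeat this kind of User-Agent comparison from their own connection, a minimal standard-library sketch follows. The helper names `build_request` and `probe` are hypothetical. Note the assumption: `urllib` presents a different client profile than curl or requests, so its results may not match what kenpompy itself would see.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a GET request with an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code, including error codes."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses surface as HTTPError; the code is still informative.
        return err.code


# Nothing is sent until probe() is called, for example:
#   probe("https://kenpom.com", "Mozilla/7.0")
request = build_request("https://kenpom.com", "Mozilla/7.0")
print(request.get_header("User-agent"))
```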

Knowing this, I would deduce that the blocks that have been reported are more so driven by the IP ranges from which folks are running this code. Since popular VPN and cloud providers' IP blocks are well known, it is quite trivial for Cloudflare to block traffic originating from these sources. Since I haven't used Cloudflare in many years I had to do some research on how this is typically configured. It turns out this blocking is fine-tuned by site owners/administrators. From a thread describing a similar issue on Cloudflare Community:

At Cloudflare you have many options to block. You can filter for ASN, IP, IP ranges, country etc… and many more. But if you are blocked, then the website owners configured it to do so. Feel free to contact the website owner and ask him these questions
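
To make the idea of range-based filtering concrete, here is a small sketch of how a check against configured IP ranges works in principle. The ranges below are reserved documentation networks used purely as stand-ins; they are not KenPom's or Cloudflare's actual rules, which are not public, and `is_blocked` is a hypothetical name.

```python
import ipaddress

# Hypothetical blocklist built from reserved documentation ranges (RFC 5737),
# standing in for the VPN/cloud ranges a site owner might configure.
BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip: str) -> bool:
    """Return True if the address falls inside any configured range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_RANGES)


print(is_blocked("203.0.113.7"))
print(is_blocked("192.0.2.1"))
```

Under a rule like this, nothing about the request itself matters: every client whose outbound address lands in a listed range gets the same block page.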

What I am curious about is whether this protection would also apply to a setup that presents itself as a modern browser with full-fledged support for JavaScript (a la Selenium, pyppeteer) running from within one of these IP ranges, but I dread the idea of bringing on something that has such a huge computational overhead compared to what we have working right now.

In any event, I'm going to start experimenting down this route running in an Azure/Colab/other cloud environment and report back. If this does seem to alleviate some of this blockage we'll have to see what we can do in terms of marking these more resource-heavy libraries as extras to be used in the event that all else fails.

I would also really like to see if Ken does return anyone's email and what his opinion on the whole thing is, but I'm not holding my breath - he is a pretty busy guy.

I'm curious if he's open to someone helping him build an API that people can pay a little extra for access to the data. I brought that up in my last email to him

Wanted to chime in here for anyone still having issues. We were able to workaround the 403 error by adding this line to our code.
browser.set_user_agent('any-random-thing')

Added that line right below "browser = mechanicalsoup.StatefulBrowser()"

This was following advice from this post https://stackoverflow.com/questions/48506614/403-error-with-mechanicalsoup.

Edit: Looks like there might be something else involved. Our data analyst was using JupyterLab to run his code and even with the change he was still getting 403'd. Maybe it's an issue with module versions or how Jupyter formats the traffic. I installed Python fresh with all updated modules and it worked running from the Python shell on two computers, both with different outgoing IPs.

Works in Python 3.11.0
kenpompy-0.3.3
mechanicalsoup-1.2.0
pandas-1.5.1
bs4-0.0.1
beautifulsoup4-4.11.1
requests-2.28.1
lxml-4.9.0 (manually installed because windows)
python-dateutil-2.8.2
certifi-2022.9.24
charset-normalizer-2.1.1
idna-3.4

@jkiddUA kudos to you, I was able to find a workaround. Updating to python 3.11 has fixed the Cloudflare issue, and I can now load data successfully.

I updated to python 3.11, however I tried to reinstall kenpompy, and receive an error "Error: failed to build wheel for lxml". How did you get around this once you upgraded?

So I had to download the lxml wheel (.whl) from here https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml. Just download the correct one for your machine and place it in your working directory. In my case it was lxml-4.9.0-cp311-cp311-win_amd64.whl because I'm using a 64-bit Windows machine. And then from my virtual environment I ran the command pip install lxml-4.9.0-cp311-cp311-win_amd64.whl.

Once that's successfully installed you should be able to run the kenpompy install.
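
To work out which wheel file matches a given machine, the interpreter and platform parts of the filename can be derived from Python itself. This is a rough sketch assuming CPython; `cpython_tag` and `platform_tag` are hypothetical helper names, and the simple mapping below is not guaranteed to match the tags used on every platform (Linux wheels, for instance, typically use manylinux tags).

```python
import sys
import sysconfig


def cpython_tag(version_info=sys.version_info) -> str:
    """Interpreter tag such as 'cp311' for CPython 3.11."""
    return f"cp{version_info[0]}{version_info[1]}"


def platform_tag(platform: str = "") -> str:
    """Platform tag such as 'win_amd64', derived from sysconfig's platform string."""
    platform = platform or sysconfig.get_platform()
    return platform.replace("-", "_").replace(".", "_")


# Tags for the interpreter running this script.
print(cpython_tag(), platform_tag())
```

On the 64-bit Windows, Python 3.11 setup described above, `sysconfig.get_platform()` reports `win-amd64`, which corresponds to the `cp311` and `win_amd64` parts of the wheel filename.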

esqew commented

Hi all, my testing the use of browser control libraries to overcome the reported Cloudflare issues remains ongoing as my availability permits. To recap what we know for sure:

  • Connections originating from infrastructure with outbound IPs assigned to popular public cloud providers (Azure/AWS/etc.) will be blocked/filtered
  • Connections originating with outbound IPs known to be associated with popular public VPN providers will be blocked/filtered
  • Connections without a "valid" User-Agent HTTP header will be blocked/filtered

Potential mitigations:

  • Update to the latest version of kenpompy that's been published to pypi, if you haven't already:

    python3 -m pip install --upgrade kenpompy

    This will ensure that the latest patch with an appropriate User-Agent header setting is in place.

  • Some users indicate that updating to Python 3.11 may help, but it's unclear if this will work for 100% of users experiencing Cloudflare-related issues, or even why it works in the first place

  • Using a browser automation framework instead of the MechanicalSoup stateful browser (currently being developed and tested locally)

  • For those who must run their workload in a public cloud, it may be worth exploring whether an HTTP proxy might help to mitigate this. I'm looking into whether it would be helpful to allow passing a preconfigured MechanicalSoup instance to login() to allow for this type of configuration up front. UPDATE: this is not likely to work. In quick tests with cURL and some publicly available proxies, KenPom.com returned 400 Bad Request for each. It may be worthwhile to continue more in-depth testing, but for now I'm considering this a non-viable option

  • Write to KenPom and ask how we might be able to sidestep this protection for scraping purposes

esqew commented

Hi all, I haven't lost sight of this! My local testing continues as my availability allows with a selenium-based retrofit for the MechanicalSoup functionality we use today. I know progress has been excruciatingly slow, and that's simply a byproduct of my full-time role (and my life more generally) being as hectic as ever with the end of the year fast approaching.

Interestingly, using selenium with a "headless" WebDriver is filtered by Cloudflare 100% of the time. This is slightly surprising (at least to me) because primitive/headless-only clients (like cURL), with the proper User-Agent header, are not normally filtered in the same way.

As a result, those who do require headless functionality are likely going to find this solution untenable, at least if/until we find a workaround. I'm aware of projects like selenium-stealth which have in the past been able to overcome this detection of headless webdrivers, but are now largely unmaintained (ostensibly due to the strides these providers have made in detecting them anyway). I'm open to any and all suggestions from the folks here if someone is aware of something more modern that might have a better success rate.

(Somewhat separately but for complete transparency, I had attempted to do this retrofit using a pyppeteer base a few weeks ago, but its asynchronous-first design leads to quite a bit of complexity trying to implement it in a reliable, robust way that also supports everyone who's using it in a Jupyter environment. In the same vein, unless I've missed something huge, I don't think it's got any mitigations for the Cloudflare detections that selenium does not considering it's the same Chrome/Chromium browser at the core. I take that back - seems there is a pyppeteer-stealth package that's actively maintained and may be worth re-exploring this thread as a result.)

The next test I'm taking on is running non-headlessly from within an Azure environment to see if our hypothesis holds true that this should bypass Cloudflare's heuristics. If successful, I'll publish an experimental branch for others to test in the near future. More to come!

I was trying to figure this out yesterday and I can't find anything that would indicate it is an issue specific to MechanicalSoup. In fact, the problem appeared to originate in the Python requests library. There is clearly something that is not "liking" the way the library is sending the requests (Cloudflare most likely has a ruleset in place, since the majority of scraping is done through Python). I can't quite put my finger on it.

I tested with a popular ruby library, HTTParty while on my VPN (through NordVPN) and got a 200 and like you mentioned cURL with the same exact headers works.

I am going to continue digging when I get a bit more time.

Hello,

I am new to this package, and I just purchased my kenpom account a few days ago. I am also having the same issue:

Exception: Opening kenpom.com failed - request was intercepted by Cloudflare protection

I am wondering if anyone has had any luck scraping NCAA men's basketball data from a source other than this website while these issues with Cloudflare are being investigated further.

Thanks and looking forward to using this package when this is resolved

esqew commented

@johnfeldhausen Thanks for the report. Can you give us a bit more detail around your environment? Are you running on cloud infrastructure or through a VPN?

@esqew - I did get this to work with the following version (upgraded from 3.10 to 3.11):

Python 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin.

I deduce that this may be something with TLS fingerprinting in older versions of the ssl wrapper library. Here is the version of openssl (which is pertinent for the TLS fingerprint issue):

OpenSSL 1.1.1q 5 Jul 2022
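
For anyone comparing environments, the Python and OpenSSL versions can be read from the interpreter directly rather than from release notes. A minimal sketch using only the standard library; `tls_environment` is a hypothetical helper name.

```python
import ssl
import sys


def tls_environment() -> dict:
    """Collect the version details relevant to the TLS fingerprint theory."""
    return {
        "python": sys.version.split()[0],
        "openssl": ssl.OPENSSL_VERSION,
        "openssl_info": ssl.OPENSSL_VERSION_INFO,
    }


for key, value in tls_environment().items():
    print(f"{key}: {value}")
```

Running this in both a working and a failing environment gives a quick side-by-side of the OpenSSL build each interpreter is linked against.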

esqew commented

@mbrundige Fascinating. I definitely missed an OpenSSL upgrade in the 3.11 RC2 release notes, but sure enough OpenSSL 1.1.1q is mentioned. I don’t doubt TLS fingerprinting plays a major role in Cloudflare heuristics, so this could indeed make a lot of sense if the changes in OpenSSL materially affect how this is carried out.

This would also track with earlier reports that a Ruby-based HTTP library didn't have issues being filtered when MechanicalSoup's requests baseline did.

In any event, this would certainly preclude the need for any browser-based workarounds.

@steveroks @trludt @AVA-27 @RobMepham @Harrisoneller If possible, would each of you mind updating to the latest kenpompy and Python 3.11.0 and report back if this resolves your Cloudflare-related issues?

esqew commented

I will assume that if we don't hear back from anyone in the next week or so that the proposed fix (updating to Python 3.11) is working for those still otherwise experiencing this issue, and will close this issue accordingly.

I'll also add a small section to README or a Wiki page to summarize the issue & fix as a place to point future users to.

esqew commented

Closing due to inactivity per my previous comment. We will consider the guidance to update to Python 3.11.x as the best solution. On the flipside, I am open to more reports in the future that might contradict that, in which case we'll re-open this and do further investigation on a case-by-case basis.

Thanks to everyone for their assistance to date!

@esqew I found a workaround for Python < 3.11

@mbrundige was right that it was a TLS fingerprinting issue.
It seems it is a cipher negotiation failure between requests and kenpom.

I noticed that the ciphers were different in 3.11 than in 3.9.
So, I installed Python 3.11 in a virtual env and ran the following to export the ciphers:

python -c "import ssl; print(ssl._DEFAULT_CIPHERS);"

I then went back to my 3.9 environment and appended the 3.11 ciphers:

requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS += ':@SECLEVEL=2:ECDH+AESGCM:ECDH+CHACHA20:ECDH+AES:DHE+AES:!aNULL:!eNULL:!aDSS:!SHA1:!AESCCM'

I don't believe this is the "safest" workaround as it could break some requests, but it works for this use case.

For anyone who proceeds with this, I recommend exporting your current Python environment's SSL ciphers and keeping them so you can revert this change in case your requests package breaks.

Let me know if anyone needs any more context on this workaround.
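
As a complement to printing the private `ssl._DEFAULT_CIPHERS` string, the cipher suites actually enabled in a default context can be listed through the public `ssl` API and compared across interpreters. This is a sketch under an assumption: it inspects the standard library's default context, which is not necessarily the cipher list that requests/urllib3 applies. `default_cipher_names` and `cipher_diff` are hypothetical helper names.

```python
import ssl
from typing import Iterable, List, Tuple


def default_cipher_names() -> List[str]:
    """Names of the cipher suites enabled in this interpreter's default context."""
    context = ssl.create_default_context()
    return [cipher["name"] for cipher in context.get_ciphers()]


def cipher_diff(old: Iterable[str], new: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (added, removed) cipher names when going from `old` to `new`."""
    old_set, new_set = set(old), set(new)
    return sorted(new_set - old_set), sorted(old_set - new_set)


# Save this output from each environment, then feed both lists to cipher_diff().
print("\n".join(default_cipher_names()))
```

Keeping the saved list from the original environment also serves as the backup recommended above.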

EDIT:

After reading up more on this issue, it is unsafe to change the ciphers this way.
Instead, the requests development team recommends to enable certain ciphers for certain sites.
Reference

We can create a requests session with the new ciphers and pass that session to mechanicalsoup.StatefulBrowser
I modified some code in the reference above.

import mechanicalsoup
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.ssl_ import create_urllib3_context

CIPHERS = (
    ':@SECLEVEL=2:ECDH+AESGCM:ECDH+CHACHA20:ECDH+AES:DHE+AES:!aNULL:!eNULL:!aDSS:!SHA1:!AESCCM'
)

class DESAdapter(HTTPAdapter):
    """
    A transport adapter that makes Requests use the CIPHERS string above
    for the URLs it is mounted on.
    """
    def init_poolmanager(self, *args, **kwargs):
        context = create_urllib3_context(ciphers=CIPHERS)
        kwargs['ssl_context'] = context
        return super(DESAdapter, self).init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        context = create_urllib3_context(ciphers=CIPHERS)
        kwargs['ssl_context'] = context
        return super(DESAdapter, self).proxy_manager_for(*args, **kwargs)

s = requests.Session()
s.mount('https://kenpom.com/index.php', DESAdapter())
browser = mechanicalsoup.StatefulBrowser(s)
browser.set_user_agent('Mozilla/5.0')
r = browser.open('https://kenpom.com/index.php')

Result: Response [200]

esqew commented

Absolutely brilliant analysis @nickostendorf! Really appreciate you providing this info.

At first glance I am ok with the solution as proposed but would like to take some time to fully understand the (primarily security-related) implications (if any) of the fix for my own knowledge. I also want to make sure we logically gate this to apply only for those running Python 3.10.x and below.
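
The version gate mentioned here could look something like the sketch below. It is an illustration only, not the code that was merged; `needs_cipher_workaround` is a hypothetical name, and the 3.11 cutoff is taken from the reports in this thread.

```python
import sys


def needs_cipher_workaround(version_info=sys.version_info) -> bool:
    """True for Python 3.10.x and below, where the custom ciphers were needed."""
    return tuple(version_info[:2]) < (3, 11)


if needs_cipher_workaround():
    print("Python < 3.11: mount the custom cipher adapter before logging in")
else:
    print("Python >= 3.11: default ciphers reported to work")
```

Comparing only the major and minor components keeps the check independent of patch releases and pre-release tags.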

Once I've gotten a grasp on it I plan to branch master and test in my own Azure environment. If you have time of your own in the next day or so and you're so inclined, feel free to take a stab at a PR for this.

esqew commented

I'm happy to report that the proposed fix seems to alleviate the Cloudflare issue in all the environments in which I've been able to re-create the Cloudflare filtering issue! I would encourage everyone who is so inclined to pull a new copy of the repo locally and run the test suite to confirm that this is working for them.

Thanks again to you @nickostendorf - will be lodging a PR against master in a few minutes to have @j-andrews7 merge and push a new release to PyPi.

@esqew 100% coverage on the test suite!
Great work and thank you for implementing.

Thanks for the work all. I'll push a new package release to PyPi in the next day or two.

New release is out. Merry Chrysler.