mongodb/mongo-php-driver

Sporadic "Authentication failed" errors

719media opened this issue · 8 comments

I know this isn't too helpful of a report, so my feelings won't be hurt if you just close it :)

I noticed I had a connection string of the form
mongodb+srv://username:password@example.mongodb.net/test?retryWrites=true&w=majority
and I assume the "test" db name is from some old tutorial from mongo. I seem to recall thinking that was odd years ago when I first saw it.

Anyway, today, on a dev mongodb 7.0.1 cluster, there were some sporadic issues with the mongo client throwing "Authentication failed" errors periodically. Many connections were working simultaneously, but I'd say one out of every 10-20 connections was throwing this error (using the same connection string).

During troubleshooting, I copied the connection string from the MongoDB Atlas webapp and noticed one difference: the connection string I had used for several years had this "test" db name in it.

My old connection string:
mongodb+srv://username:password@example.mongodb.net/test?retryWrites=true&w=majority
Atlas example provided connection string:
mongodb+srv://username:password@example.mongodb.net/?retryWrites=true&w=majority

Once I removed it, all of the connection failures stopped.

Any idea? Seems... strange!

Thanks

The database name that is part of the connection string is the default auth database, which can also be given in the authSource URI option. It is not a "default database" of some sort, as the driver API requires you to fetch a Database instance from the Client. I'm not sure why this started being an issue on 7.0.1, as I'd expect auth to fail on other systems as well if the user you are authenticating with was not created in the test database.

I'd say you did the right thing by using the connection string given to you by the Atlas Webapp, and I wouldn't specify a database name as part of the connection string unless Atlas tells you to do so. I believe Atlas always creates users in the admin database, meaning that you don't need to specify a default auth database (the database name in the connection string) or an authSource URI option. This is valid for SCRAM authentication, but exceptions may apply when different authentication mechanisms (such as X.509) are involved.

> I believe Atlas always creates users in the admin database, meaning that you don't need to specify a default auth database (the database name in the connection string) or an authSource URI option.

Note that SRV resolution may also provide an authSource URI option (see: SRV spec), so even if there is no database name or authSource option in the connection string presented in the Atlas UI that doesn't mean that one isn't being used. That said, "admin" is also used by default (per the Auth spec).
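To make the precedence discussed above concrete, here is a minimal sketch in plain PHP. It is not the driver's implementation. It assumes (based on my reading of the SRV and Auth specs) that an authSource in the URI wins over one from the SRV TXT record, that either wins over the database name in the URI path, and that "admin" is the fallback. The function name effectiveAuthSource and the $txtOptions parameter are hypothetical; no DNS lookup is performed.

```php
<?php
// Illustrative only: which database would SCRAM credentials be checked against?
// Assumed precedence: URI authSource > TXT authSource > URI path database > "admin".
// $txtOptions stands in for options returned by the SRV TXT lookup (hypothetical input).
function effectiveAuthSource(string $uri, array $txtOptions = []): string
{
    // Split the URI by hand: group 1 is the path database, group 2 the query string.
    if (!preg_match('#^mongodb(?:\+srv)?://[^/?]+(?:/([^?]*))?(?:\?(.*))?$#', $uri, $m)) {
        throw new InvalidArgumentException('Not a MongoDB connection string');
    }
    $database = rawurldecode($m[1] ?? '');
    parse_str($m[2] ?? '', $uriOptions);

    // URI option names are case-insensitive, so normalize the keys.
    $uriOptions = array_change_key_case($uriOptions, CASE_LOWER);
    $txtOptions = array_change_key_case($txtOptions, CASE_LOWER);

    if (!empty($uriOptions['authsource'])) {
        return $uriOptions['authsource'];
    }
    if (!empty($txtOptions['authsource'])) {
        return $txtOptions['authsource'];
    }
    if ($database !== '') {
        return $database;
    }
    return 'admin';
}

$old = 'mongodb+srv://username:password@example.mongodb.net/test?retryWrites=true&w=majority';
$new = 'mongodb+srv://username:password@example.mongodb.net/?retryWrites=true&w=majority';

echo effectiveAuthSource($old), PHP_EOL;                            // test
echo effectiveAuthSource($new), PHP_EOL;                            // admin
echo effectiveAuthSource($old, ['authSource' => 'admin']), PHP_EOL; // admin
```

If the TXT record really does supply authSource=admin, the third call suggests the "/test" path would not change where the user is looked up, which is why the sporadic failures are surprising.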


> It is not a "default database" of some sort

While this is true for PHP (and our own libraries), I'll note that other languages and/or ODMs might use that for some purpose. I have gone to great lengths to ask the Atlas team to remove this from the connection string presented to PHP users, so I'm glad it's no longer displayed.

  • I created DOCSP-6259 in 2019 to request its removal from our docs landing page. That was resolved soon after, but I see the page today no longer shows any connection string (just a '' placeholder).
  • I also created MMS-5936 in 2019 and then PRODTRIAGE-1477 in 2021 to request the same for Atlas UIs; however, it doesn't look like either of those issues moved forward.

Note: all three issues above are internal, so I'm mainly sharing for @alcaeus' benefit.

> I copied the connection string from the mongodb atlas webapp

@719media: Can you share where exactly in the UI you copied this from? Was it the "Connect to " screen where you select a driver/language (as an alternative to Compass and other tools) and are presented with either a connection string or code sample?

If so, that's great to hear and I'll plan to follow up on the two unresolved JIRA issues above to let the Atlas team know my original concerns were addressed.

Yes, it is from "Connect to" screen as you describe.

Thanks for the explanations.

@719media: Thanks for confirming. Two more questions:

  • What auth mechanism are you using with this Atlas cluster (assuming that didn't change between tests)?
  • What PHP driver version are you using?

> Anyway, today, on a dev mongodb 7.0.1 cluster, there were some sporadic issues with the mongo client throwing "Authentication failed" errors periodically. Many connections were working simultaneously, but I'd say one out of every 10-20 connections was throwing this error (using the same connection string).

I assume you're using an application server (e.g. FPM) where libmongoc clients would be persisted. Are you able to reproduce this at all using a CLI script that connects, issues a ping command, and terminates, given many independent executions? Alternatively, specifying 'disableClientPersistence' => true in the driver options (third param to Manager or Client constructor) in a web script and hitting it repeatedly with something like ab might be another test.

If this only appears with persisted libmongoc clients, I think we'd have something to continue investigating. But if this can be reproduced via CLI scripts or with disableClientPersistence: true, we'd have to approach this from a different angle (maybe ensuring the driver is successfully parsing authSource from SRV lookups, assuming you're using a DB-based auth mechanism like SCRAM).
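For anyone wanting to try the test described above, a minimal sketch of such a probe might look like the following. It assumes the mongodb extension is loaded; the environment variable names MONGODB_URI and DISABLE_PERSISTENCE and the helper names pingDriverOptions and pingOnce are made up for this example.

```php
<?php
// Hypothetical probe: connect, issue a ping command, report the result.
// MONGODB_URI and DISABLE_PERSISTENCE are assumed environment variable names.

// Build the driver options array (third constructor argument of Manager).
function pingDriverOptions(bool $disablePersistence): array
{
    return $disablePersistence ? ['disableClientPersistence' => true] : [];
}

// Returns true if the ping succeeded, false on any driver exception.
function pingOnce(string $uri, bool $disablePersistence): bool
{
    try {
        $manager = new MongoDB\Driver\Manager($uri, [], pingDriverOptions($disablePersistence));
        // Connections are opened lazily, so the ping is what triggers authentication.
        $manager->executeCommand('admin', new MongoDB\Driver\Command(['ping' => 1]));
        return true;
    } catch (MongoDB\Driver\Exception\Exception $e) {
        error_log(get_class($e) . ': ' . $e->getMessage());
        return false;
    }
}

// Only run the probe when the extension and a URI are actually available.
if (extension_loaded('mongodb') && getenv('MONGODB_URI') !== false) {
    $ok = pingOnce(getenv('MONGODB_URI'), getenv('DISABLE_PERSISTENCE') === '1');
    echo $ok ? 'ok' : 'failed', PHP_EOL;
}
```

Run from the CLI, each execution is a fresh process, so nothing is persisted between runs. Served through FPM and hit with ab, the same file exercises persisted clients unless DISABLE_PERSISTENCE is set to 1.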

Using SCRAM.

OK, so I made an endpoint that connects and does a simple ping, and then tested it with the two connection strings:

connection string without /test:
ab -n 100 -c 10 https://example.com/test
100 successes (consistently)

connection string with /test:
ab -n 100 -c 10 https://example.com/test
96 successes, 4 failures (consistently)

connection string with /test:
ab -n 100 -c 2 https://example.com/test
92 successes, 8 failures (consistently)

connection string with /test, and using 'disableClientPersistence' => true:
ab -n 100 -c 2 https://example.com/test
100 successes (consistently)

So, it appears the driver persistence has some sort of issue.

Using the mongodb-library version 1.16.0; driver information is below:

MongoDB support => enabled
MongoDB extension version => 1.16.2
MongoDB extension stability => stable
libmongoc bundled version => 1.24.3
libmongoc SSL => enabled
libmongoc SSL library => OpenSSL
libmongoc crypto => enabled
libmongoc crypto library => libcrypto
libmongoc crypto system profile => disabled
libmongoc SASL => disabled
libmongoc ICU => disabled
libmongoc compression => enabled
libmongoc compression snappy => disabled
libmongoc compression zlib => enabled
libmongoc compression zstd => disabled
libmongocrypt bundled version => 1.8.1
libmongocrypt crypto => enabled
libmongocrypt crypto library => libcrypto

So I switched to another user in the connection string, and couldn't reproduce the error. I updated the new user's permissions to make them identical to the user causing the problem, and still couldn't reproduce. I switched back to the original user and the problem still persisted.

I then rebooted php-fpm and nginx, and the problem disappeared.

Could've sworn that I rebooted the server several times prior, and the problem persisted, but regardless, I am no longer able to reproduce. I'll keep an eye on it, but it could just be something about some old "persisted" connections being bad and staying in the pool? I'm not sure how all that works so can't really comment helpfully there :)

Could be that the problem happened during the 6.0 to 7.0 upgrade, and the php server was never rebooted and something there messed it up? Sorry, I am not able to be more helpful here.

Anyway, if it crops up again I can reopen, but regardless, for now, this issue may be closed as I can't reproduce.

On a whim, I decided to try and change user authentication parameters (specifically password) to see if I could reproduce in some way. This wasn't the behavior that led to the problem in the first place (the credentials there have been the same for many months), but I figured I'd give it a try anyway.

I wasn't able to reproduce the issue, but I did notice that the "old" password connection string continued to work until I rebooted php-fpm. I suppose this makes sense, as the persisted connection is most likely not going through auth again... but it was interesting to note...

Given that disableClientPersistence: true had a 100% success rate and the problem was seemingly resolved by rebooting FPM (and thus clearing any persisted libmongoc clients), this does look tied to persisted clients.

If the URI was sometimes problematic and leading to an auth error, I expect those were requests that hit new FPM workers and thus had to recreate their connections using the problematic URI. Most requests were hitting older workers and re-using persisted clients.

> I did notice that the "old" password connection string continued to work until I rebooted php-fpm. I suppose this makes sense, as the persisted connection is most likely not going through auth again...

I wonder if this might be the result of the server keeping existing, authenticated connections open and usable despite the server-side credentials changing. db.changeUserPassword() doesn't talk about any impact (or lack thereof) on existing connections, but I wouldn't be surprised if those are left as-is. Otherwise, changing a user password would conceivably drop all connections on a server (I've never heard of that happening). This question might be more suited for the community forums if you'd like a definitive answer.