Possible back end issues with new unified login?

Question

Opened this issue 2 months ago · 4 comments

Brief description of problem/feature

I know that @yeisenberg was trying to look at the Mendota admin page earlier today. He said that he got it to load once, surprisingly! But that it failed later. It has always failed for me since unified login, timing out at 60 seconds before it finishes loading.

But then I noticed that the Mendota server restarted itself this morning as well. My guess is that it happened while Yochai or I were trying to access the admin page. Here are some of the errors that I see in the logs (I pasted only the part of each stack trace that references our code):

2024-11-04 11:18:24,426 - [ERROR] - from application in play-akka.actor.default-dispatcher-2480
Internal server error, for (HEAD) [/signIn] ->

play.api.Application$$anon$1: Execution exception[[SQLException: Timed out waiting for a free available connection.]]
...
        at scala.slick.jdbc.PlayDatabase.withTransaction(PlayDatabase.scala:6) ~[com.typesafe.play.play-slick_2.10-0.8.1.jar:0.8.1]
        at models.daos.slick.DBTableDefinitions$UserTable$.find(DBTableDefinitions.scala:66) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
        at controllers.UserController$$anonfun$2.apply(UserController.scala:117) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
        at controllers.UserController$$anonfun$2.apply(UserController.scala:117) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
        at scala.Option.getOrElse(Option.scala:120) ~[org.scala-lang.scala-library-2.10.7.jar:na]
        at controllers.UserController.logPageVisit(UserController.scala:117) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
        at controllers.UserController$$anonfun$signIn$1.apply(UserController.scala:35) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
        at controllers.UserController$$anonfun$signIn$1.apply(UserController.scala:33) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
...

And then this:

2024-11-04 11:23:43,151 - [ERROR] - from play.nettyException in New I/O worker #46
Exception caught in Netty
java.lang.OutOfMemoryError: GC overhead limit exceeded
...
        at scala.slick.jdbc.PlayDatabase.withTransaction(PlayDatabase.scala:6) ~[com.typesafe.play.play-slick_2.10-0.8.1.jar:0.8.1]
        at models.daos.slick.DBTableDefinitions$UserTable$.find(DBTableDefinitions.scala:66) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
        at controllers.UserController$$anonfun$2.apply(UserController.scala:117) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
        at controllers.UserController$$anonfun$2.apply(UserController.scala:117) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
        at scala.Option.getOrElse(Option.scala:120) ~[org.scala-lang.scala-library-2.10.7.jar:na]
        at controllers.UserController.logPageVisit(UserController.scala:117) ~[sidewalk-webpage.sidewalk-webpage-8.0.0.jar:8.0.0]
...

Potential solution(s)

Since the authentication info is now centralized, I wonder if we're having problems with concurrent reads/writes. One thing we might want to do is check the authentication code for any places where we're using withTransaction where withSession would do. I assume that withTransaction essentially takes out a lock on a table in order to make updates to it, but there's rarely a case where we need it if we're just doing a read. And I specifically see scala.slick.jdbc.PlayDatabase.withTransaction in the stack traces above.
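To make the withSession/withTransaction distinction concrete, here is a minimal sketch of what the two access styles usually amount to at the JDBC level. Everything in it is hypothetical: RecordingConn is a stand-in for a pooled connection that only records the calls made on it, and the two helpers merely borrow Slick's method names. It is not Slick's or play-slick's implementation. It also doesn't model database locking at all; whether any lock is taken depends on the statements that run and on the database.

```scala
import scala.collection.mutable.ArrayBuffer

object SessionVsTransaction {
  // Hypothetical stand-in for a pooled JDBC connection: records each call made on it.
  final class RecordingConn {
    val calls = ArrayBuffer.empty[String]
    def setAutoCommit(on: Boolean): Unit = calls += s"setAutoCommit($on)"
    def commit(): Unit = calls += "commit"
    def rollback(): Unit = calls += "rollback"
    def close(): Unit = calls += "close" // for a pooled connection: hand it back to the pool
  }

  // Session-style access: borrow the connection, run the work, give it back.
  def withSession[T](conn: RecordingConn)(work: RecordingConn => T): T =
    try work(conn) finally conn.close()

  // Transaction-style access: the same borrow/return, plus auto-commit switched off,
  // then a commit on success or a rollback on failure.
  def withTransaction[T](conn: RecordingConn)(work: RecordingConn => T): T =
    withSession(conn) { c =>
      c.setAutoCommit(false)
      try {
        val result = work(c)
        c.commit()
        result
      } catch {
        case e: Throwable =>
          c.rollback()
          throw e
      } finally c.setAutoCommit(true)
    }
}
```

In this sketch both styles hold exactly one connection for the duration of the block. The transaction style only adds the auto-commit and commit/rollback calls around the work, which is extra overhead for a pure read.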

I could also spend some time looking into what various database connections are doing during sign in and when loading the admin page. More investigation needed!

Answer 1 · 2024-11-05T04:07:32.000Z

Is this only affecting the admin page loading, or will normal users be affected too?

Answer 2 · 2024-11-05T04:24:57.000Z

I really don't have much information, and I don't even know for sure that it's related to the admin page. I just know that Yochai and I tried to load the admin page on a server within a couple hours of when that server restarted (I only know it restarted because I got an email notification). I don't know if there was any specific negative effect on anyone, nor do I know what triggered it right now!

Just adding what little information I have for now, and as I get reports of problems over the next couple weeks, I'll continue to document here!

Answer 3 · 2024-11-05T16:50:04.000Z

Gotcha. Thanks Mikey.

Answer 4 · 2024-11-18T19:47:53.000Z

Started to make some attempts at fixes in #3741 and #3737

I'm seeing a lot of errors where the server can't find a free available connection, even though our connection pool only has ~100 open connections (with our max set to 200, I believe). One thought I've had is to increase the max number of connections per city. We've adjusted the min number in the past (#3316) so that cities where nothing is happening don't hog idle connections. But maybe we should raise the cap on connections for cities when a lot of activity is happening? It's possible that we're having issues when trying to run clustering while loading the Admin page at the same time, for example. Documentation is linked below. If we continue to run into problems, I'll look through all of these settings and try out some tweaks to see if we can make any headway.
https://www.playframework.com/documentation/2.3.x/SettingsJDBC#Configuring-the-JDBC-pool
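
As an illustration of how a per-city cap could produce this error while the overall connection count stays well below the global max, here is a small self-contained sketch. It is not our actual pool: TinyPool, acquireTimeoutMs, and demo are hypothetical names, and the real pool is whatever Play's JDBC settings configure. The sketch models one city's pool as a semaphore with one permit per connection, and fails a request that can't get a permit within the timeout.

```scala
import java.sql.SQLException
import java.util.concurrent.{Semaphore, TimeUnit}

object PoolSketch {
  // Hypothetical fixed-size pool for a single city: one permit per connection.
  final class TinyPool(maxConnections: Int, acquireTimeoutMs: Long) {
    private val permits = new Semaphore(maxConnections)

    // Number of connections currently borrowed from this pool.
    def inUse: Int = maxConnections - permits.availablePermits()

    // Runs `work` while holding one permit; fails if no permit frees up in time.
    def withConnection[T](work: => T): T = {
      if (!permits.tryAcquire(acquireTimeoutMs, TimeUnit.MILLISECONDS))
        throw new SQLException("Timed out waiting for a free available connection.")
      try work finally permits.release()
    }
  }

  // Two long-running jobs hold both of this city's connections, so a third
  // request times out, no matter how many connections other cities have idle.
  def demo(): String = {
    val city = new TinyPool(maxConnections = 2, acquireTimeoutMs = 50)
    city.withConnection {
      city.withConnection {
        try city.withConnection("got a third connection")
        catch { case e: SQLException => e.getMessage }
      }
    }
  }
}
```

If something like this is what's happening, raising the per-city cap or shortening how long heavy jobs hold their connections would both reduce the timeouts. Which one applies to us still needs to be checked against the real pool settings.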