aws/aws-secretsmanager-jdbc

Secret not getting refreshed

ankit-joinwal opened this issue · 10 comments

Describe the bug
With default configuration, the secret cache should refresh the secret every 1 hour. However, as per the logs of our application, only DescribeSecret API call is made by the library. GetSecret API calls are not made every hour.
To Reproduce
Steps to reproduce the behavior:

Integrated library in java project

<dependency>
			<groupId>com.amazonaws.secretsmanager</groupId>
			<artifactId>aws-secretsmanager-jdbc</artifactId>
			<version>${aws-secretsmanager-jdbc.version}</version>
		</dependency>
		<dependency>
			<groupId>org.mariadb.jdbc</groupId>
			<artifactId>mariadb-java-client</artifactId>
			<version>2.6.0</version>
		</dependency>
		<dependency>
			<groupId>com.amazonaws.secretsmanager</groupId>
			<artifactId>aws-secretsmanager-caching-java</artifactId>
			<version>1.0.1</version>
		</dependency>

Set application properties to use Secrets for connecting to database

spring:
  datasource:
    url: jdbc-secretsmanager:mariadb:aurora://<rest_of_db_url>
    username: <secret_name_here>
    driver-class-name: com.amazonaws.secretsmanager.sql.AWSSecretsManagerMariaDBDriver
    hikari:
      maximum-pool-size: 1
      max-lifetime: 60000

logging:
  level:
    com:
      amazonaws:
        request: DEBUG

Expected behavior
Cache is refreshed every hour. Can be verified using sample log

Sending Request: POST https://secretsmanager.us-west-2.amazonaws.com / Headers: (amz-sdk-invocation-id: 9086716c-0b6d-e758-b77f-87240be95f54, Content-Length: 147, Content-Type: application/x-amz-json-1.1, User-Agent: aws-sdk-java/1.12.148 Linux/4.14.287-215.504.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.16.1+9-LTS java/11.0.16.1 kotlin/1.5.20 vendor/Amazon.com_Inc. cfg/retry-mode/legacy exec-env/AWS_ECS_EC2 AwsSecretCache/1.1.0.0, X-Amz-Target: secretsmanager.GetSecretValue, ) 

Actual Behaviour:
Only DescribeSecret call is made every hour approximately.

Verified using below log pattern

Sending Request: POST https://secretsmanager.us-west-2.amazonaws.com / Headers: (amz-sdk-invocation-id: af7b78bb-d81b-3024-250d-845ce0677141, Content-Length: 96, Content-Type: application/x-amz-json-1.1, User-Agent: aws-sdk-java/1.12.148 Linux/4.14.287-215.504.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/11.0.16.1+9-LTS java/11.0.16.1 kotlin/1.5.20 vendor/Amazon.com_Inc. cfg/retry-mode/legacy exec-env/AWS_ECS_EC2 AwsSecretCache/1.1.0.0, X-Amz-Target: secretsmanager.DescribeSecret, ) 

Environment:
Java VM | OpenJDK 64-Bit Server VM
Java VM version | 11.0.16.1+9-LTS
Java version | 11.0.16.1
Java vendor | Amazon.com Inc.
OS | Linux
OS version | 4.14.287-215.504.amzn2.x86_64

aws-secretsmanager-caching-java | 1.0.2
aws-secretsmanager-jdbc | 1.0.8

Additional context
Secret was rotated also in between the 1 hour windows, but still cache was not refreshed on the application side.
Also, this is happening intermittently. Sometimes cache is refreshed after secret rotation and in some cases it does not.

This is expected behavior: 'DescribeSecret' is being called to see if the secret has been modified since the last cache refresh. If the secret hasn't been modified, then the cache is current and no refresh takes place. If the secret has been modified, you should see a following call of 'GetSecretValue' and a cache refresh.

Something else to take into account: if the connection fails due to an authentication error, the JDBC driver will immediately attempt a refresh on the database credentials.

Does that answer your question?

On some instances, the authentication error (after password rotation) is not triggering cache refresh.

Error from log

caused by java.sql.SQLInvalidAuthorizationSpecException: (conn=522718) Access denied for user 'xxxxx'@'xxxxx' (using password: YES)Current charset is UTF-8. If password has been set using other charset, consider using option 'passwordCharacterEncoding'

Even after ~15-20 connection attempts the secret was not refreshed.

This is leaving application in an unhealthy state because connections are not available in the pool and some ECS containers are stopped because of failing health checks.

I'm curious what the error code you're getting with this error. Does it match the error code that forces a refresh for MariaDb? (I believe it is, but I know these exceptions can be wrapped in other exceptions)

Is the secret being rotated while being used? Is that the root cause of this issue? Are you rotating using our standard rotation lambdas?

When using JDBC driver and with rotation, it is recommended to have two active secrets (or users) and rotate between them. See the alternating users rotation strategy. So long as the cache is refreshed at twice the rotation rate, the cache will always contain working credentials.

Error code is not logged but because in most of the cases the secret is refreshed, I would believe it is matching with the expected error code.

We have single user strategy so the secret would be in use while it is rotated. Not sure if this is the root cause. We are rotating using the standard approach of lambdas.

Sounds good. The immediate refresh the JDBC driver does is perfectly valid to use if you don't mind the small delay in establishing the connection after rotation. I don't see any jitter in the retry logic so I think it'll retry 5 times immediately.

The first possible path forward might be to switch to the alternating users strategy where the cache will always have valid credentials.

The second might be to determine the error code that's coming back on the failed connections. We'd need to determine if that's different from the connections that end up succeeding. There's PR #18 for MySQL that loops and keeps unwrapping the exception until it either hits null or the error code we're looking for, which is what made me think it might be similar in MariaDb.

I am interested on the second path you mentioned.

I infact see the exceptions nested.


caused by java.sql.SQLTransientConnectionException: ServiceDB - Connection is not available, request timed out after 5000ms.
…g.jooq.impl.ProviderEnabledConnection.prepareStatement(ProviderEnabledConnection.java:109)
…g.jooq.impl.SettingsEnabledConnection.prepareStatement(SettingsEnabledConnection.java:73)
              org.jooq.impl.AbstractResultQuery.prepare(AbstractResultQuery.java:274)
                    org.jooq.impl.AbstractQuery.execute(AbstractQuery.java:338)
                org.jooq.impl.AbstractResultQuery.fetch(AbstractResultQuery.java:354)
            org.jooq.impl.AbstractResultQuery.fetchInto(AbstractResultQuery.java:1550)
                     org.jooq.impl.SelectImpl.fetchInto(SelectImpl.java:3746)
…impl.LanguageTaggingRepoImpl.getContentTagsByContentId(LanguageTaggingRepoImpl.java:190)
com.xxx.service.repository.impl.LanguageTaggingRepoImpl$$FastClassBySpringCGLIB$$50e1b143.invoke(<generated>)
com.xxx.service.repository.impl.LanguageTaggingRepoImpl$$EnhancerBySpringCGLIB$$5ac83399.getContentTagsByContentId(<generated>)
…ervice.impl.TaggingServiceImpl.getQuestionLanguageTags(TaggingServiceImpl.java:53)
…service.service.impl.TaggingServiceImpl.getContentTags(TaggingServiceImpl.java:35)
…g.service.resolver.TaggingQueryResolver.getContentTags(TaggingQueryResolver.java:23)
jdk.internal.reflect.GeneratedMethodAccessor219.invoke(Unknown Source)
         graphql.execution.ExecutionStrategy.fetchField(ExecutionStrategy.java:279)
…aphql.execution.ExecutionStrategy.resolveFieldWithInfo(ExecutionStrategy.java:210)
       graphql.execution.AsyncExecutionStrategy.execute(AsyncExecutionStrategy.java:60)
           graphql.execution.Execution.executeOperation(Execution.java:159)
                    graphql.execution.Execution.execute(Execution.java:105)
                                graphql.GraphQL.execute(GraphQL.java:613)
                graphql.GraphQL.parseValidateAndExecute(GraphQL.java:538)
                           graphql.GraphQL.executeAsync(GraphQL.java:502)
jdk.internal.reflect.GeneratedMethodAccessor218.invoke(Unknown Source)
caused by java.sql.SQLNonTransientConnectionException: Communications link failure with primary. No active connection found for master. 
….failover.AbstractMastersListener.throwFailoverMessage(AbstractMastersListener.java:559)
…er.impl.MastersReplicasListener.checkInitialConnection(MastersReplicasListener.java:350)
…over.impl.MastersReplicasListener.initializeConnection(MastersReplicasListener.java:179)
…rg.mariadb.jdbc.internal.failover.FailoverProxy.<init>(FailoverProxy.java:120)
     org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:608)
       org.mariadb.jdbc.MariaDbConnection.newConnection(MariaDbConnection.java:150)
                        org.mariadb.jdbc.Driver.connect(Driver.java:89)
caused by java.sql.SQLNonTransientConnectionException: Could not connect to HostAddress{host='xxxxxx.us-west-2.rds.amazonaws.com', port=3306}. (conn=538090) Access denied for user 'xxxxxx'@'xxxxxx' (using password: YES)Current charset is UTF-8. If password has been set using other charset, consider using option 'passwordCharacterEncoding'
…ernal.util.exceptions.ExceptionFactory.createException(ExceptionFactory.java:73)
….jdbc.internal.util.exceptions.ExceptionFactory.create(ExceptionFactory.java:185)
…jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:500)
org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:170)
…failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:213)
…over.impl.MastersReplicasListener.initializeConnection(MastersReplicasListener.java:176)
…rg.mariadb.jdbc.internal.failover.FailoverProxy.<init>(FailoverProxy.java:120)
     org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:608)
       org.mariadb.jdbc.MariaDbConnection.newConnection(MariaDbConnection.java:150)
                        org.mariadb.jdbc.Driver.connect(Driver.java:89)
caused by java.sql.SQLInvalidAuthorizationSpecException: (conn=538090) Access denied for user 'xxxx'@'xxxx' (using password: YES)Current charset is UTF-8. If password has been set using other charset, consider using option 'passwordCharacterEncoding'
…ernal.util.exceptions.ExceptionFactory.createException(ExceptionFactory.java:66)
….jdbc.internal.util.exceptions.ExceptionFactory.create(ExceptionFactory.java:189)
…protocol.AbstractConnectProtocol.authenticationHandler(AbstractConnectProtocol.java:769)
…rnal.protocol.AbstractConnectProtocol.createConnection(AbstractConnectProtocol.java:555)
…jdbc.internal.protocol.AbstractConnectProtocol.connect(AbstractConnectProtocol.java:498)
org.mariadb.jdbc.internal.protocol.AuroraProtocol.loop(AuroraProtocol.java:170)
…failover.impl.AuroraListener.reconnectFailedConnection(AuroraListener.java:213)
…over.impl.MastersReplicasListener.initializeConnection(MastersReplicasListener.java:176)
…rg.mariadb.jdbc.internal.failover.FailoverProxy.<init>(FailoverProxy.java:120)
     org.mariadb.jdbc.internal.util.Utils.retrieveProxy(Utils.java:608)
       org.mariadb.jdbc.MariaDbConnection.newConnection(MariaDbConnection.java:150)
                        org.mariadb.jdbc.Driver.connect(Driver.java:89)

The Access denied for user 'xxxx'@'xxxx' error has VendorCode= 1045 which is what the library is looking for.

Ok, I understand the issue now. Unwrapping the exception is not something the driver currently does, but we're reviewing this case and cases like it. We do accept PRs too if you'd like. In the meantime, switching to multi-user secret rotation is likely the most expedient fix.

@danmancuso Could you please help with review #114

in this PR, I have provided fix only for MariaDB driver. But i think this solution can be applied to other drivers too. If you are ok with that, i can update PR for other drivers too.

Resolved with #136