awslabs/amazon-kinesis-producer

ExpiredTokenException

boivie-at-sony opened this issue · 12 comments

I have set up the KPL to post to Kinesis under an assumed role, using STSAssumeRoleSessionCredentialsProvider as the credentials provider.

This worked well for some time, but after a few hours I started getting these errors:

17:28:54.774 [kpl-callback-pool-0-thread-0] WARN  c.s.bui.kinesis.KinesisOutput - Record failed to put, partitionKey=42, attempts:
Delay after prev attempt: 1508 ms, Duration: 4 ms, Code: 400, Message: {"__type":"ExpiredTokenException","message":"The security token included in the request is expired"}
17:28:54.774 [kpl-callback-pool-0-thread-0] ERROR c.s.bui.kinesis.KinesisOutput - Exception while posting to kinesis
com.amazonaws.services.kinesis.producer.UserRecordFailedException: null
    at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.onPutRecordResult(KinesisProducer.java:188) [amazon-kinesis-producer-0.10.2.jar:na]
    at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler.access$000(KinesisProducer.java:127) [amazon-kinesis-producer-0.10.2.jar:na]
    at com.amazonaws.services.kinesis.producer.KinesisProducer$MessageHandler$1.run(KinesisProducer.java:134) [amazon-kinesis-producer-0.10.2.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_72-internal]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_72-internal]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72-internal]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72-internal]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72-internal]

Does the Java code renew credentials often enough and hand them over to the native daemon?

Hmm... I think this can be attributed to clock drift. Closing it.

Now it has happened again on a production system, and it was definitely not clock drift:

[2016-06-02 03:48:40.434936] [0x00007fba4c9d9700] [error] [retrier.cc:59] PutRecords failed: {"__type":"ExpiredTokenException","message":"The security token included in the request is expired"}
03:48:40.439 [kpl-callback-pool-0-thread-0] WARN  c.s.bui.kinesis.KinesisOutput - Record failed to put, partitionKey=42, attempts:
Delay after prev attempt: 1501 ms, Duration: 8 ms, Code: 400, Message: {"__type":"ExpiredTokenException","message":"The security token included in the request is expired"}
03:48:40.439 [kpl-callback-pool-0-thread-0] WARN  c.s.bui.kinesis.KinesisOutput - Record failed to put, partitionKey=music-prod, attempts:
Delay after prev attempt: 1501 ms, Duration: 8 ms, Code: 400, Message: {"__type":"ExpiredTokenException","message":"The security token included in the request is expired"}

It doesn't recover even after a few hours.

As mentioned earlier, we're using the STSAssumeRoleSessionCredentialsProvider to get temporary, time-limited credentials, and I know IAM is very picky about refreshing them before they expire.

Is this really supported well enough?
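For context, the pre-expiry refresh an STS-backed provider is supposed to perform can be sketched in plain Java. Everything here (`ExpiringTokenCache`, the injected clock, the millisecond values) is an illustrative stand-in, not KPL or AWS SDK internals; the point is that the token must be re-fetched a safety window *before* its expiration time, never at it:

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative sketch of expiry-aware refresh: re-fetch the token
// while the old one is still valid, using a safety window.
class ExpiringTokenCache {
    private final Supplier<String> fetcher;  // stands in for an STS AssumeRole call
    private final LongSupplier clock;        // injected so the logic is testable
    private final long ttlMillis;            // how long a fetched token lives
    private final long refreshWindowMillis;  // refresh this long before expiry

    private String token;
    private long expiresAt = Long.MIN_VALUE; // forces a fetch on first use

    ExpiringTokenCache(Supplier<String> fetcher, LongSupplier clock,
                       long ttlMillis, long refreshWindowMillis) {
        this.fetcher = fetcher;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
        this.refreshWindowMillis = refreshWindowMillis;
    }

    synchronized String get() {
        long now = clock.getAsLong();
        // Refresh once we are inside the window before expiry.
        if (now + refreshWindowMillis >= expiresAt) {
            token = fetcher.get();
            expiresAt = now + ttlMillis;
        }
        return token;
    }
}
```

If the component consuming the token caches it independently and never re-reads from a cache like this, it will keep presenting the stale token after expiry, which is exactly the `ExpiredTokenException` symptom above.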

Bump. This also happens in 0.12.1 with DefaultCredentialsProvider.

I also see this in 0.12.1 with the DefaultCredentialsProvider, while including the STS libs in the classpath.

Thanks for reporting this. It appears that when the credentials are refreshed, they aren't making it into the client.

@jeremysears do you mean the DefaultAWSCredentialsProviderChain?

My suspicion is that the code that handles sending the credentials to the native component isn't seeing the credential change, allowing the credentials to expire. I still need to investigate this some more.
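The push-on-change flow described here can be sketched in plain Java. `CredentialForwarder` and both functional parameters are hypothetical stand-ins for the KPL's actual IPC to the native daemon; the sketch only shows the failure mode being suspected: if `poll()` stops being called, or the comparison never sees a change, the daemon keeps the old credentials until they expire:

```java
import java.util.Objects;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Illustrative sketch: periodically read credentials from the provider
// and forward them to the native component only when they change.
class CredentialForwarder {
    private final Supplier<String> provider; // stands in for AWSCredentialsProvider
    private final Consumer<String> daemon;   // stands in for the IPC write
    private String last;

    CredentialForwarder(Supplier<String> provider, Consumer<String> daemon) {
        this.provider = provider;
        this.daemon = daemon;
    }

    // Intended to run on a timer. If this stops running (e.g. the timer
    // thread dies), the daemon is stuck with stale credentials forever.
    void poll() {
        String current = provider.get();
        if (!Objects.equals(current, last)) {
            daemon.accept(current);
            last = current;
        }
    }
}
```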

Can everyone who is seeing this tell me what type of credentials you are using:

  • Instance Profile
  • STS Credentials
  • ECS Container Credentials
  • Other Credentials

Yes, the DefaultAWSCredentialsProviderChain. I'm not sure if this is related, but I included the STS libs so that I could run some utilities locally w/ the role we use on our servers. However, I see the error on our servers, where we're using EC2 Container Credentials (no STS, but the libs are now available). W/o the STS lib available in the classpath we don't see this issue on our servers, w/ no other code changes.

I think I have an idea of the cause. Both the ECS credentials and the STS credentials can throw an exception if something goes wrong on their remote side. If my suspicion is correct, the credential update thread is getting killed when an unhandled exception occurs.

From the reports it sounds like this doesn't happen all the time, but it does happen occasionally. I'll look into adding some checks around the credential retrieval, and providing some type of auto-restart should the thread be killed.

I also see this when the STS libs are not included in the classpath.

@jeremysears If my guess is correct, it's not related to the STS credentials specifically, but to any credentials provider that can throw an exception on retrieval. Regardless of whether this is the cause, strengthening the credential refresh thread would be beneficial.

Could those affected please comment, or add a reaction to assist us in prioritizing the change.

Thanks

We just experienced this in Production yesterday, in 2 of 6 servers in our data center using KPL 0.12.3 with STSAssumeRoleSessionCredentialsProvider. They both started at about the same time. Unfortunately we hadn't noticed until 16 hours after it started :-( and by then the KPL had ballooned from 15M to 3.5G. +1 for addressing this issue!

Facing a similar issue with InstanceProfileCredentials

Also experiencing with InstanceProfileCredentials