lazywithclass/winston-cloudwatch

InvalidSequenceTokenException: The given sequenceToken is invalid. The next expected sequenceToken is:


In our project we have 6 microservices running and we use Seneca for them.
We have a centralized logger implemented with winston.
Recently we tried to send logs to CloudWatch, so we added winston-cloudwatch.
But whenever I start my development server, each microservice starts up and writes a log to CloudWatch.
Every time I do this, I get an InvalidSequenceTokenException for the startup logs of at least 2 of the microservices.
I am not sure why this is happening.
Any help would be appreciated.

Because of this exception, some logs never reach CloudWatch. We are facing the same issue on staging as well as on prod.
Is there a way to retry these missing logs?

Are you using 6 different streams? If I remember correctly (it's been years since I've used this library in production), you should assign a different stream to each source of logs.

There should be a best practice somewhere in the AWS CloudWatch documentation explaining this.
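
For example, giving each microservice (and ideally each running instance) its own stream looks roughly like this - a minimal sketch using the usual winston-cloudwatch options, with placeholder names:

```js
const os = require('os');
const winston = require('winston');
const WinstonCloudWatch = require('winston-cloudwatch');

// One transport per service; appending something instance-unique (here the
// hostname) keeps multiple instances from racing for the same stream's token.
winston.add(new WinstonCloudWatch({
  logGroupName: 'my-app',                            // placeholder group name
  logStreamName: `orders-service-${os.hostname()}`,  // one stream per service instance
  awsRegion: 'us-east-1',
}));
```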

@nish20490 @lazywithclass

The same happens to me. We have just two services (a worker and an API) running in AWS, and each of them has 2 instances.
Initially the services used different streams, but the instances shared them and everything worked fine; after some time, though, I noticed that the services stop logging anything ~6-24 hours after the last deploy. A restart or deploy fixes it for the next ~6-24 hours.

I saw DataAlreadyAcceptedException and InvalidSequenceTokenException errors in the logs and decided that the cause was the shared streams, so I reconfigured them so that each instance now has its own stream - but unfortunately it didn't help; I continue to see exactly the same behaviour and the same errors in the logs.

I also think this closed ticket may be related: #59.
There is also a similar issue in another logging library, mirkokiefer/bunyan-cloudwatch#20 (they also explain why DataAlreadyAcceptedException may happen instead of InvalidSequenceTokenException).

UPD:
I've created a PR which should at least fix the issue with DataAlreadyAcceptedException: #158.
I also noticed that in the InvalidSequenceTokenException error handling we don't try to use expectedSequenceToken from the error payload. Maybe we should add this in a separate PR; it would allow us to get the correct token without an extra API call.
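
For illustration, recovering the token from the error could look roughly like this (just a sketch; whether the SDK exposes expectedSequenceToken as a field depends on the SDK version, so this also falls back to parsing the error message):

```js
// Hypothetical helper: recover the expected sequence token from an
// InvalidSequenceTokenException instead of calling describeLogStreams again.
function expectedTokenFromError(err) {
  if (err.expectedSequenceToken) return err.expectedSequenceToken;
  // The message looks like: "The given sequenceToken is invalid.
  // The next expected sequenceToken is: <token>"
  const match = /next expected sequenceToken is: (\S+)/.exec(err.message || '');
  return match ? match[1] : null;
}
```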

Did the proposed PR solve your issue? If so I will review and test it in my environment later this week.

@lazywithclass I tested it yesterday and unfortunately it did not :(.
Maybe I did something wrong in the PR, but after I deployed this fix I stopped seeing DataAlreadyAcceptedException or InvalidSequenceTokenException errors in the logs - yet the logs still suddenly stopped appearing after some time.

Ok I will have to give this a look, but I am studying for September's exam session, so I cannot guarantee a prompt fix.

Right now, upon InvalidSequenceTokenException the lib will try to get a new sequence token and retry sending the logs. However, if there are many resources racing for the same stream (e.g., 4 hosts running the same app behind a load balancer), the InvalidSequenceTokenException may simply reappear; in that case the error is thrown and the logs from that batch are discarded.

How about introducing an option such as retryCount where we could decide how many times the lib should try to get the correct token (for example, I'd be fine with it trying, say, 30 times)? Would that work? If so, I can submit a PR.
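
Roughly what I have in mind - a hypothetical sketch (not the library's actual code) using the AWS SDK v2 client directly; with retryCount = 1 it matches the current retry-once-then-discard behaviour described above, and a higher value would keep refreshing the token:

```js
const AWS = require('aws-sdk');
const cloudwatchlogs = new AWS.CloudWatchLogs({ region: 'us-east-1' });

// Send one batch of log events, refreshing the sequence token up to
// `retryCount` times when the token we hold turns out to be stale.
async function putWithRetry(params, retryCount = 1) {
  for (let attempt = 0; attempt <= retryCount; attempt++) {
    try {
      const res = await cloudwatchlogs.putLogEvents(params).promise();
      return res.nextSequenceToken; // remember this for the next batch
    } catch (err) {
      if (err.code !== 'InvalidSequenceTokenException' || attempt === retryCount) {
        throw err; // give up: the caller decides whether to drop the batch
      }
      // Ask CloudWatch for the stream's current uploadSequenceToken and retry.
      const { logStreams } = await cloudwatchlogs.describeLogStreams({
        logGroupName: params.logGroupName,
        logStreamNamePrefix: params.logStreamName,
      }).promise();
      const stream = logStreams.find((s) => s.logStreamName === params.logStreamName);
      params.sequenceToken = stream && stream.uploadSequenceToken;
    }
  }
}
```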

Any updates? Have the same issue.

I'm also experiencing this issue; it occurs when we have multiple tests running in parallel, which overloads the stream or causes the same sequence token to be reused.

Any news on that? I'm getting the same issue here.

Hey guys, has anyone worked on this issue yet? We're getting a lot of InvalidSequenceTokenException errors because, even though our services use separate streams, most of them run several instances, so we eventually hit the same race condition mentioned by @kubakrzempek.

@sowmenappd In our case we just migrated to pino, with the logs emitted to stdout and streamed to CloudWatch by AWS. I think it's a more reliable and stable approach, but in case you want to stream them yourself there's the pino-cloudwatch transport.

@Lisenish thank you, yes this definitely seems like a far better approach. However, wouldn't the pino-cloudwatch method also suffer from the same issue if there are multiple instances of the same service?

I completely agree with you! Streaming the logs is definitely the way to go, especially when it comes to scaling. It just makes everything so much easier to manage and maintain.

@sowmenappd I didn't actually use pino-cloudwatch, so I can't say for sure; I just used the default stdout and let AWS stream stdout to CloudWatch without any libraries. I didn't encounter any issues with this approach (although IIRC it creates a separate stream for each machine, which is actually the correct way too).
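
For reference, the setup I mean is roughly this - a minimal sketch where the application only writes to stdout and the container runtime ships it to CloudWatch (the group name and awslogs driver options below are just examples):

```js
// The app itself knows nothing about CloudWatch: pino writes JSON lines to stdout.
const pino = require('pino');
const logger = pino({ name: 'worker' });

logger.info({ orderId: 123 }, 'order processed');

// Shipping is delegated to the runtime, e.g. Docker's awslogs log driver,
// which creates one log stream per container:
//   docker run --log-driver=awslogs \
//     --log-opt awslogs-region=us-east-1 \
//     --log-opt awslogs-group=my-app \
//     my-image
```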