tools: traffic generation
nmoutschen opened this issue · 5 comments
This would add the capability to generate traffic to various services to simulate live traffic in the staging environment.
This would require supporting the entire order workflow to ensure that micro-services are in a consistent state with the requests (otherwise, this would generate errors).
Ideas
Here are two ways to solve this problem:
- Run tests-e2e at regular intervals, or multiple tests in parallel in an EC2 instance or CodeBuild.
- Run end-to-end tests with Step Functions.
The first solution has the benefit of using existing scripts, but is likely to be more costly or not support long delays between operations. The latter solution is more flexible and would allow waiting for random periods of time between steps, but would require creating these flows manually.
I used to build a SAR app for traffic gen, almost like one SAR app per functionality. Would that work?
- SAR application has a lambda, URL/Endpoint, payload, and rates
- Scale and growth per endpoint
- Use StepFunctions if you want a coordinated effort.
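A minimal sketch of what such a per-endpoint generator Lambda could look like, assuming a POST endpoint and a fixed request rate. The `TrafficConfig` fields, the `handler` event shape, and the helper names are hypothetical illustrations, not part of any existing SAR app.

```python
import json
import time
import urllib.request
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrafficConfig:
    """Hypothetical per-endpoint settings (field names are assumptions)."""
    endpoint: str
    payload: dict = field(default_factory=dict)
    requests_per_second: float = 1.0
    duration_seconds: float = 10.0


def post_json(endpoint: str, payload: dict) -> int:
    """Send one JSON POST request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def generate_traffic(
    config: TrafficConfig,
    send: Callable[[str, dict], int] = post_json,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Send requests at a roughly constant rate and collect status codes."""
    total = int(config.requests_per_second * config.duration_seconds)
    interval = 1.0 / config.requests_per_second
    statuses = []
    for _ in range(total):
        statuses.append(send(config.endpoint, config.payload))
        sleep(interval)  # simple pacing; ignores the time spent in send()
    return statuses


def handler(event, context):
    """Lambda entry point; the event is assumed to mirror TrafficConfig."""
    statuses = generate_traffic(TrafficConfig(**event))
    return {"sent": len(statuses)}
```

The `send` and `sleep` parameters are injectable so the pacing logic can be exercised without network calls. Growth per endpoint could be modelled by changing `requests_per_second` between invocations.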
I think this will likely depend on what you're trying to achieve.
If you're wanting to identify performance (p90,p99,p100) of various customer paths then something like Gatling can model these quite well. It has the added advantage of being able to be run locally and produce a nice pretty set of graphs.
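For reference, here is a small sketch of how p90/p99/p100 can be computed from raw latency samples using the nearest-rank method. This is only an illustration of the metric with made-up sample values; it is not how Gatling computes or reports percentiles.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of the samples (pct in (0, 100])."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative latencies in milliseconds (made-up values).
latencies_ms = [120, 80, 95, 300, 110, 105, 90, 1000, 100, 115]
p90 = percentile(latencies_ms, 90)    # 300
p99 = percentile(latencies_ms, 99)    # 1000
p100 = percentile(latencies_ms, 100)  # 1000, the slowest request
```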
If you're wanting to do something more involved, like capacity testing, preloading data, or failure testing, then it might be better to write e2e testing code.
@dgomesbr @msailes One thing that would be really nice here is the ability to simulate traffic realistically, especially on the overall flow. Calling a single API would be limiting because it would either be limited to get requests or would generate unhandled items (e.g. if you do createOrder on the GraphQL API, this will generate a packaging request in the warehouse service). Even then, completing packaging requests would require a valid order created in the first place, so having something that goes through the complete flow is useful.
There is already one complete flow defined as an end-to-end test. Adding unhappy paths (e.g. when a package cannot be created, when it's created but not with all items, etc.) would ensure that we hit all aspects of the workflow. My thoughts were that those could then be used in the staging environment as part of the CI/CD pipeline: deploy a new version of a service, these tests would be running either as a one-shot or continuously, then if all is good deploy to production.
Another thing that'd be nice is having the ability to generate constant traffic for the dashboards, making sure that they're working as expected, etc. where this would help as well.
Now, when talking about generating constant traffic or running tests continuously, I'm torn between having a fixed resource (e.g. EC2 or running the e2e tests from CodeBuild), which would be the easiest thing to do, and creating the end-to-end flows in Step Functions with a scheduled event to trigger new executions at a specific time interval. The good thing with Step Functions is that we could generate artificial delays between the different steps to simulate human behaviors (someone in the warehouse will not create a package immediately after a customer creates an order), but it's more work. However, a more realistic simulation of the flow would allow things like analytics on the staging environment, validating that they work as expected (e.g. tracking how many orders are pending doesn't work if they're fulfilled in less than a minute).
By the way, someone on Twitter mentioned that we could use CloudWatch canaries, which could be an alternative to Step Functions or running CodeBuild periodically.
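As a rough sketch of the Step Functions option, the snippet below builds an Amazon States Language definition in which a `Wait` state reads its delay from the execution input via `SecondsPath`, and the scheduled trigger picks a random delay for each execution. The state names, the `packagingDelaySeconds` key, the delay bounds, and the two Lambda ARNs are all hypothetical; a real flow would need more steps and error handling.

```python
import json
import random


def build_definition(create_order_arn: str, package_order_arn: str) -> dict:
    """Build a two-step order flow with a delay taken from the execution input."""
    return {
        "Comment": "Simulated order flow with a human-like delay (sketch)",
        "StartAt": "CreateOrder",
        "States": {
            "CreateOrder": {
                "Type": "Task",
                "Resource": create_order_arn,  # hypothetical Lambda ARN
                "ResultPath": "$.order",       # keep the original input alongside the result
                "Next": "WaitBeforePackaging",
            },
            "WaitBeforePackaging": {
                "Type": "Wait",
                "SecondsPath": "$.packagingDelaySeconds",
                "Next": "PackageOrder",
            },
            "PackageOrder": {
                "Type": "Task",
                "Resource": package_order_arn,  # hypothetical Lambda ARN
                "End": True,
            },
        },
    }


def build_execution_input(rng: random.Random, min_delay: int = 60, max_delay: int = 900) -> dict:
    """Pick a random delay (in seconds) for one execution; bounds are assumptions."""
    return {"packagingDelaySeconds": rng.randint(min_delay, max_delay)}


definition_json = json.dumps(
    build_definition(
        "arn:aws:lambda:REGION:ACCOUNT_ID:function:create-order",
        "arn:aws:lambda:REGION:ACCOUNT_ID:function:package-order",
    ),
    indent=2,
)
```

Choosing the delay in the trigger keeps the state machine itself deterministic, which makes individual executions easier to reason about when a run fails.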
I think there are two clear use cases here. The first is wanting to exercise as many paths in the workflow as possible to ensure there are no regressions as you release new code. I think this makes sense to be written as code which can be executed against any environment as part of a CI/CD pipeline.
The second is a dev tool: I want to write really good monitoring tooling, so I need to generate suitable data. The difference is that in this case you probably don't care as much about testing outcomes and you want more data. You probably wouldn't run this in production without some kind of separation because you wouldn't want to affect real customer traffic.
When I first read this issue I thought of CloudWatch canaries; however, I think they're a different use case again. I think they would be especially good for helping to alert on problems with dependencies. For example, has info sec changed a firewall rule? Has a third party made a breaking change in their API?
Indeed, the first case would be adding more end-to-end workflow tests as regression tests on the business-level flow. I think it makes sense to have another ticket to track this. That said, these types of tests could be run continuously to validate that no drift has happened (as you mentioned, third party changes, security policy changes, etc.). I'm wary of testing that in prod as there is no way to distinguish prod traffic from test traffic there.
However, for this case, it's important that this generates the right outcome. E.g. trying to package an item that doesn't exist will generate errors, which will not help on the monitoring flow. Maybe running the end-to-end tests over and over from a dev machine or allowing people to spin up an EC2 instance that will run them could be good enough.