openwallet-foundation/didcomm-mediator-service

Run the ACA-Py mediator in debug mode or tracing to determine what takes 3 seconds to connect

Closed this issue · 11 comments

Please figure how to get more data on what is happening in the ACA-Py mediator such that making a connection takes 3 seconds (at least one my machine) to establish a connection. In theory, it shouldn't take that long... what is happening?

Some more data points, and considerations. Running the ACA-Py Alice-Faber demo with mediation connects "instantly" -- no perceptible delay. Contrast that with 3 second delay on identical hardware/docker as when running a stock Aries Mediator Service.

The Aries Mediator Service has a controller involved that is essentially a no-op. As we look at the implementation we have deployed, please look into the interactions between the ACA-Py instance and the controller. Perhaps there is a delay there?

Based on this, I would say this is the highest priority -- understanding the purely docker local test case. It seems to take nearly as long as OpenShift.

@WadeBarnes -- recommend that you focus on getting 0.8.1-rc2 tested so that we can deploy it when necessary.
FYI -- @cvarjao @esune

To be clear on what was tested on my machine:

  • Ran docker compose to run Aries Mediator Service as configured in this repo, along with docker compose to run the Locust load testing tool
    • Ran various tests, with the most relevant one being a "1" and "1" test (a single run) where it took 2.8 seconds to complete the test.
    • This is comparable to the performance we are seeing on OpenShift.
  • Ran the Alice-Faber demo with mediation on Faber
    • The connection was more or less instantaneous.

My conclusion is that something is different between the stock Alice-Faber and Aries Mediator Service, but very little different between Aries Mediator Service on Docker and on OpenShift. Thus, we can focus on the Aries Mediator Service use case running locally, and trying to understand where the almost 3 seconds is happening.

Additional info -- Wade tells me we are NOT deploying the Mediator Controller on OpenShift, so that takes that out of the picture.

@WadeBarnes -- can you please add a note about where the OpenShift deployments/config parameters are set. One thing we might do is adjust this AMS configuration to better match that.

Mediator Agent configs and settings for OCP can be found here; https://github.com/bcgov/openshift-aries-mediator-service/tree/main/openshift/templates/aries-mediator-agent

You'll note the --webhook-url startup flag is defined, but it's not used. It ends up as --webhook-url ${MEDIATOR_CONTROLLER_WEBHOOK}. I've removed it from the configuration while working on the askar instance, and I've removed it from the existing instance running in the test environment.

@usingtechnology — can you please provide an update here on progress? Thanks!

Will return my focus to this today. Was helping Syro with Mediator + Redis, had a breakthrough yesterday so I think he's good to carry on.

I ran a lot of code locally (elk-stack, run_demo, playground, aries-mediator-service + von-network) and produced some numbers.
Exported CSV from ELK and then ran the files through a little script that would calculate the elapsed time for each thread_id, so the last column in these CSVs is not from ELK.

There are differences between the two (run_demo with mediation and playground with external mediator service), but nothing too far out of wack. The initial connection to mediator using playground was a second longer, but then we see in the run_demos that connection between alice and faber is a similar duration, so could just be down to timing and resources on machine. The most significant time in each thread is generally Dispatcher.handle_message.END but nothing consistent that would indicate a bug. We can see that similar messages on run_demo take longer that on playground but not always.

In the playground + aries-mediator-service we do have ngrok for all agents (faber, alice, mediator) and mediator has a caddy reverse proxy... so there are more hoops to jump through but not seeing anything significant. One difference for sure is run_demo does have webhook delivery, playground does not.

I can certainly run this more times to get a larger sample size, and I could playground with multitenancy - but for this initial look I wanted apples-to-apples.

You can review the breakdown.txt for a comparison of the longer operations.

playground-elapsed.csv
run_demo-elapsed.csv
breakdown.txt

Ran 3 scenarios (all locally, removed ngrok from the equation, use local ledger).
This is using load test. 5 mins, max out at 100 users, load 10 every 30 seconds and use 50% of the users every second. Once connected they'll ping.

loadtest-100-no-mediator-elapsed - this is using multi-tenant acapy with NO mediation. So just connecting directly to another agent and pinging each other.

loadtest-100-mediator-elapsed - this introduces a mediator, so connect to the mediator, then connect to another agent and ping.

loadtest-100-mediator-redis-elapsed - same as above, except the mediator service has redis persistent queue.

Will do a better analysis later, just getting this up here. Keep in mind all the infrastructure is in local docker containers, so no tweaking and refining and setting up resources properly. But apples to apples. Initial look through is multitenant acapy gets slow pretty quickly. Adding a mediator adds a slight bit of overhead, but adding redis - more stable - but a real bottleneck. Numbers near the end of the test for pings take a long time. Initial thought is that with 50 agents pinging it just takes time to work through the queued messages.

Anyway, I'll dig in a bit more, just leaving this here for now to share the information.

loadtest-100-mediator-elapsed.csv
loadtest-100-mediator-redis-elapsed.csv
loadtest-100-no-mediator-elapsed.csv

Adding some more load tests here.
This set is just blasting a multi-tenant acapy with a request for 50 users - 25 per second. Numbers are not great. But between all the testing I am going to now focus on what happens during Dispatcher.handle_message.END as that i always the largest trace number in a given thread_id. No idea what that trace encapsulates at the moment, so time to find out.

create-50-connect-mediator-elapsed.csv
create-50-connect-mediator-redis-elapsed.csv
create-50-connect-mediator-redis.csv

Quick update (I won't upload the CSVs for this...) if i add tracting timing in the Dispatcher.handle_message I can further breakdown the timing for that message. There are two areas of concern, one is finding the inbound connection which can sometimes account for 1/2 the duration. Timings are not consistent, but certainly a large percentage of the thread. The other is awaiting the handler (ex ConnectionResponseHandler). So now time to deep dive...

Ok, latest update. I think we can put this to rest for now. Running the demo and running a similar script using the playground I can get very similar numbers when logging levels are ERROR and when the demo uses an actual Postgres instance. The numbers I was seeing seemed to indicate some kind of overhead in the non-demo instances. I then realized one big difference is the demos do not open/close postgres db connections. Adding that to the demo and basically turning off logging seemed to sort that out. In fact, the playground version with logging off now performs better than demo.

demo version - faber connect to mediator thread elapsed time: 0.974344969
playground version - faber connect mediator thread elapsed: 0.626765966415405

faber-alice-mediator-connect-5-elapsed.csv
faber-alice-mediator-playground-2-elapsed.csv

Adding this here, not sure where to put this. But during the load testing (no mediator, using multi-tenant ACA-Py). This similar to the last load test comment - create 50 users, 10 per second. This had a lot more tracing in it as I was trying to narrow down what in the initial connection flow was taking the most time. So, very interesting to see that we end up with bottlenecks when we try to resolve the inbound connection. You can see that we end up with entries bunched up (that are for different wallets) in the ConnectionManager.find_inbound_connection resolve_inbound_connection . I think the basic takeaway is we are hitting bottlenecks and waiting, throughput degrades rapidly with load.

50-multi-mediator-1-elapsed.csv