openwallet-foundation/didcomm-mediator-service

Use Load Testing to Evaluate and Improve the Performance of ACA-Py as a Mediator

Closed this issue · 6 comments

Currently we are seeing intermittent issues with production ACA-Py as a Mediator (ACA-PyAAM) instances. The initial establishment of connections is slow and in some cases hangs.

This may lead us to switch to other approaches for implementing a mediator.

This epic will be used to track the overall progress of that issue. A variety of issues related to this activity will be tracked.

Leaving this note here for now.

Just running the mediator service locally, and the load agent locally. Followed the instructions to download aries-framework-javascript in place for load-testing.

In one terminal (mediator service)

docker compose up
... wait for mediation URL, copy it...

paste into /load-testing/.env
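
The .env entry is just that copied invitation URL, something like the following (the variable name is an assumption here; check the sample .env in load-testing):

MEDIATOR_URL=<mediation URL copied from the first terminal>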

In a second terminal (load test)

cd load-testing
docker compose build
docker compose up

Run a test (locustMediatorPing.py) with Number of users = 1 and Spawn Rate = 1, and I consistently see /load-agent/load-agent/locustMediatorPing.py_on_start take 5400ms. That is the time it takes to create an agent, get it initialized, and have an active connection with the mediator. The larger the number of users, the longer the median time is... Failures creep in when users > 20.
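
For context, the shape of that measurement is roughly the sketch below (a minimal, hypothetical simplification, not the real locustMediatorPing.py; start_agent and wait_for_active_connection are illustrative stand-ins for the real helpers). Everything from agent creation through the mediator connection becoming active is reported as one number:

```python
import time
from locust import User, task, events

def start_agent():
    """Illustrative stand-in: spawn and initialize the AFJ agent.
    Agent.initialize also establishes the mediator connection."""
    ...

def wait_for_active_connection(agent):
    """Illustrative stand-in: block until the mediator connection is active."""
    ...

class MediatorPingUser(User):
    def on_start(self):
        start = time.time()
        self.agent = start_agent()
        wait_for_active_connection(self.agent)
        # Report the whole setup block as a single "request" in Locust's stats.
        events.request.fire(
            request_type="setup",
            name="on_start",
            response_time=(time.time() - start) * 1000,  # milliseconds
            response_length=0,
            exception=None,
        )

    @task
    def ping_mediator(self):
        ...  # trust ping over the mediated connection (the actual test step)
```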

I tried to separate the logic for waiting on the active connection to the mediator into a separate locust file, but the timings become meaningless: once the connection is active, the time is effectively 0ms.

I did split the logic out so the agent start command called two functions, one to do some initialization, the other to wait for the connection to become active. I could only print the results to the terminal, so not a real test, but I could see that (depending on the load) the time for the connection to become active hovered between 1000 and 1200ms. On the mediator side, it is around 600ms. Splitting that logic out really affected the stability of the tests, so it was only useful as an investigation tactic.

I am not familiar with AFJ at all, but I notice that the mediation connection setup is buried in the Agent.initialize call. I am sure there are excellent reasons for that, but using it to test the mediator service makes it difficult to separate connection time from agent start-up time.

In that regard, I am not sure that load testing (at least not the /load-agent/load-agent/locustMediatorPing.py_on_start numbers) reflects the real world. In practice, the agent will already exist and be initialized (only needing the mediator connection to be created and become active), and each agent should not be constrained by N other agents fighting for the resources to initialize (that is what the higher per-user initialization times at higher user counts show).

From the current load test, it is tough to tell whether ACA-Py as a mediator is at fault. It appears that Locust creating and initializing all the agents is the greatest bottleneck. Once the mediator is reached and begins the connection process, it is reasonably (to me) quick, although the response rate for connecting degrades over time / under heavier load. Cycling through the logs shows us the following:

| connection number | created at | updated at | elapsed ms |
| --- | --- | --- | --- |
| 1 | 2023-04-12T00:27:20.266063Z | 2023-04-12T00:27:21.411436Z | 1145.373 |
| 25 | 2023-04-12T00:27:28.345704Z | 2023-04-12T00:27:31.178400Z | 2832.696 |
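
For reference, the elapsed ms column above is just updated at minus created at on the connection record; a quick sketch of that calculation (timestamp format as shown in the table):

```python
from datetime import datetime

def elapsed_ms(created_at: str, updated_at: str) -> float:
    """Milliseconds between a connection record's created_at and updated_at."""
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    return (datetime.strptime(updated_at, fmt)
            - datetime.strptime(created_at, fmt)).total_seconds() * 1000.0

# First row of the table above:
print(elapsed_ms("2023-04-12T00:27:20.266063Z",
                 "2023-04-12T00:27:21.411436Z"))  # -> 1145.373
```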

Not sure where to go from here.

Thanks, @usingtechnology — interesting stuff.

First, weird that simply splitting the logic into two functions affected the stability that much. That doesn’t bode well.

On where to go next, it would be interesting to see if we can instrument the ACA-Py component (mediator). ACA-Py has some tracing capabilities built in (see: https://aca-py.org/main/testing/AgentTracing/). Is there any way we can make use of that on the ACA-Py side to better understand what's happening there? It would be particularly useful while running larger load tests. I'm not quite ready to give up on the load testing tool.
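
For example (flag names are taken from the AgentTracing doc linked above, so worth double-checking against the ACA-Py version in use), the mediator could be started with something like:

aca-py start ... --trace --trace-target log --trace-tag acapy.events --trace-label mediator

where --trace-target can also be an HTTP endpoint, to post trace events to a collector instead of the log.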

@reflectivedevelopment — any thoughts on what Jason has found in doing this work? What steps to take next?

Just a comment about the instability: I believe that has more to do with "breaking up" the AFJ agent.initialize(), which encapsulates the connection to the mediator. I was calling initialize() in one function, then listening in another for the connection to the mediator to complete so I could check timings. Super hacky, and it meant that the initialized agent was being used in unintended ways. So I think we can ignore that for now.

I did add some timings within the Locust start-up that approximate the time spent connecting, but they are not 100% accurate. That still leaves the actual thing we want to test (how the ACA-Py mediator reacts) untested. I carved timings out of the log by looking at the "Connection promoted to active" statements and calculating the elapsed time of the connection record (updated_at - created_at).

So for a 10 user peak we see numbers like:

| Name | # reqs | # fails | Avg | Min | Max | Med | req/s | failures/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /load-agent/load-agent/locustMediatorPing.py_on_start | 10 | 0 (0.00%) | 16696 | 14847 | 17854 | 7000 | 0.15 | 0.00 |

| connection | locust afj elapsed time (ms) | acapy elapsed time (ms) |
| --- | --- | --- |
| 2 | 1713.528765976429 | 1158.381103515625 |
| 3 | 2092.8806179761887 | 1329.47802734375 |
| 4 | 2391.7039229869843 | 1485.55712890625 |
| 5 | 2132.138907968998 | 1527.626953125 |
| 6 | 2296.8131259679794 | 1546.7607421875 |
| 7 | 2202.5959919691086 | 1432.44873046875 |
| 8 | 2166.1356709599495 | 1496.35107421875 |
| 9 | 2505.3976070284843 | 1509.7099609375 |
| 10 | 2667.12097799778 | 1408.538818359375 |

(I missed the log statement for the first connection.)

The ACA-Py (mediator) side is roughly 1 second quicker than the Locust (AFJ agent) side, and the startup times are significantly higher than the time to connect. So there is a lot more involved in creating/initializing an agent than in connecting to a mediator.

After discussing with @swcurran, we are going to flip the testing to gather stats from the ACA-Py/mediator side. We can still use the Locust load tests for load, but we are less concerned about those timings; we need to gather them from ACA-Py. We will be looking at running an ELK stack and at how we can get the numbers from there.

The main focus of the Locust load testing wasn't to provide exact numbers for connection timing. It was simply to provide a "real world" onboarding of new users and see how many users we could get to.

The testing can be modified to re-use wallets: you start up x number of agents, then re-run the tests using the same wallets.

Simply modify https://github.com/hyperledger/aries-mediator-service/blob/main/load-testing/load-agent/agent.js#L33 to use a predetermined value for the wallet; for example, use a port number to determine the wallet id. We could then add a flag to choose between a predefined wallet and a random wallet, as sketched below.
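
To illustrate the idea (the actual change belongs in agent.js; this Python sketch just shows the selection logic, and the REUSE_WALLETS flag is hypothetical):

```python
import os
import uuid

def wallet_id_for(port: int) -> str:
    """Stable wallet id derived from the port when re-using wallets
    across runs; a random one otherwise."""
    if os.environ.get("REUSE_WALLETS"):
        return f"load-agent-{port}"      # same port -> same wallet on re-runs
    return f"load-agent-{uuid.uuid4()}"  # fresh wallet every run
```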

Typically speaking, I can get the load testing framework to run 2000+ agents on a single machine, but CPU utilization and memory need to be kept in consideration (a 128 GB VM should work).

I think from a big picture perspective, we can see a few things.

  1. Forming a new connection in ACA-Py is expensive. If we can reduce the cost of forming new connections, it would allow us to scale a single agent better.
  2. Mediation needs to scale across multiple processors and machines to scale effectively.
  3. Re-using a connection doesn't seem relevant when we can only get around 800-2000 users connected to a mediator. A large new-user event can simply overwhelm a mediator.

One additional thought. The best way to scale a single mediator is to increase single-core performance. Currently ACA-Py is mostly single-threaded, so by upgrading to a CPU with better single-core performance, we can find the best performance gains.