AxonIQ/axon-server-se

Memory swapping when eventstream is read at various positions

robinvandenbogaard opened this issue · 6 comments

We're experiencing swapping behaviour in our production application. We suspect it occurs when multiple applications read the event streams at different positions.
We've created a test application that reproduces this behaviour. When it is monitored, you can see memory swapping caused by Axon Server.

We're using axon-server-se 3.6.11 in production. We also tried 2023.1.0-dev, with the same results.
The attached tre-axon-poc.zip contains the test project, which uses 3.6.11.

It contains a docker-compose file; follow README.MD to start everything up. The setup:
- starts Axon Server;
- starts an application that creates 3 aggregates per second using a command, so the number of aggregates keeps growing;
- starts two applications that maintain the aggregates, using a saga to generate the events and commands that fill each aggregate. The setup generates 10,000 events per aggregate. One of these applications eventually falls far behind on the event stream;
- starts a query store application that only streams the events and does nothing with them.

We do this to generate a large event stream that gives Axon Server something to iterate over.

If desired, we can share statistics from a few test runs with the above setup, covering event intake and when the swapping occurred.

Thanks very much for this sample application. With it I was able to see the increase in virtual memory caused by references to older event files being kept open. Can you run this test again with the 2023.1.0-dev Docker image, adding one more environment parameter to the Axon Server container:
- axoniq.axonserver.event.force-clean=true

To automate the initialization of Axon Server, add:
- axoniq.axonserver.autocluster.first=axonserver
- axoniq.axonserver.autocluster.contexts=default

These settings require version 2023.1.0; they have no effect on 4.6.11.

We tried the settings as described, but they do not seem to resolve the issue. Thirty minutes into the new test run, the system starts swapping again.

Can you describe once again how you see that the system starts swapping? The values shown for the virtual memory used on the server may be misleading, as they include the total size of a memory-mapped file even when it is not currently in use (and not loaded into memory).
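The gap between virtual and resident memory for memory-mapped files can be seen outside Axon Server. The following is a minimal, Linux-only sketch: it maps a temporary sparse file (not an Axon Server event file) without reading it, and compares the `VmSize` (virtual) and `VmRSS` (resident) counters from `/proc/self/status`. The helper names and the 64 MB size are illustrative assumptions.

```python
import mmap
import os
import tempfile

STATUS = "/proc/self/status"  # Linux-specific per-process memory counters


def read_status_kb(field):
    """Return a /proc/self/status field such as 'VmSize' or 'VmRSS', in kB."""
    with open(STATUS) as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


def mapping_growth_kb(size_bytes):
    """Map a sparse file without touching it; return (VmSize growth, VmRSS growth) in kB.

    Returns None when /proc/self/status is unavailable (i.e. not on Linux).
    """
    if not os.path.exists(STATUS):
        return None
    with tempfile.TemporaryFile() as f:
        f.truncate(size_bytes)  # extend the file without writing data (sparse on most filesystems)
        vsz0, rss0 = read_status_kb("VmSize"), read_status_kb("VmRSS")
        with mmap.mmap(f.fileno(), size_bytes, access=mmap.ACCESS_READ):
            # The whole mapping now counts toward VmSize, but no pages have been read yet.
            vsz1, rss1 = read_status_kb("VmSize"), read_status_kb("VmRSS")
    return vsz1 - vsz0, rss1 - rss0


if __name__ == "__main__":
    growth = mapping_growth_kb(64 * 1024 * 1024)
    if growth is not None:
        print(f"VmSize grew by {growth[0]} kB, VmRSS grew by {growth[1]} kB")
```

On a typical Linux system, `VmSize` grows by roughly the full mapping size while `VmRSS` barely moves, because pages are only loaded into memory when they are accessed.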

We use Splunk to monitor the nodes on which Axon Server runs. This specific node is used only by Axon Server, and only when we run our tests. The screenshot below shows swap memory usage over a long period.

Starting on the left, the first test run increases swap usage to about 65%. At that point we removed the Axon Server container from the node, and the swap memory was cleared.

We then started a second run, and swap usage rose to about 40%. We then stopped all applications connected to Axon Server but left Axon Server running until the next day. Swap usage dropped by a few percentage points but remained stable throughout the night, and it was cleared only when we removed the Axon Server container.

The third, fourth and fifth runs were identical, except that we stopped the applications right after swap usage started rising.

[Screenshot: swap memory usage on the Axon Server node across the test runs, from Splunk]
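The swap figure above is a host-level metric. As a minimal sketch, assuming a Linux host (and without claiming this is how Splunk computes its value), a comparable percentage can be derived from the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`. The function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> numeric value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name:" prefix
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def swap_used_percent(meminfo_text):
    """Return swap usage as a percentage of SwapTotal, or 0.0 if no swap is configured."""
    info = parse_meminfo(meminfo_text)
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    used = total - info.get("SwapFree", 0)
    return 100.0 * used / total
```

On a host, pass it the contents of `/proc/meminfo`, e.g. `swap_used_percent(open("/proc/meminfo").read())`, and sample it periodically to see the trend.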

After much experimentation, I found that Docker starts using a swap file if one is available. Even when the machine has sufficient memory, it still writes to the swap file. I tried setting the swappiness to 0 on the host, but it still wrote to the swap file. Docker offers options to disable swapping when you create a container, but these are not supported in docker-compose.
My recommendation in this case is to remove the swap file from the host. Axon Server has more than enough memory, and swapping to a file will only decrease performance.
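Because setting swappiness to 0 did not stop the writes, it can help to confirm on each host node that no swap area is active at all. A minimal sketch, assuming a Linux host, that parses the `/proc/swaps` table (a header line, then one row per active swap area); the function name is hypothetical.

```python
def active_swap_areas(proc_swaps_text):
    """Return (filename, type, size_kb, used_kb) for each active swap area in /proc/swaps text."""
    areas = []
    lines = proc_swaps_text.splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed rows
        areas.append((fields[0], fields[1], int(fields[2]), int(fields[3])))
    return areas
```

After the swap file has been removed, `active_swap_areas(open("/proc/swaps").read())` should return an empty list.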

I ran your Docker Compose file on my local machine with 8 GB available to Docker, 3 GB assigned to Axon Server, and no swap file. Memory usage remained stable for more than half an hour while adding 10k events per second.

Sorry for the late reply; we have been working on a way to disable the use of swap. However, in our current setup (Docker Swarm) we are unable to do so through the compose files, as you already mentioned in your reply.

We did manage to disable and remove the swap file on the host nodes, and this has improved performance.

From our perspective, we have determined that there is no problem with Axon Server, and therefore we can close this issue.