eiffel-community/eiffel-intelligence

Weaker performance on newer releases

Closed this issue · 29 comments

Description

Hi, we have been using version 3.0.0 for quite a long time and decided to try upgrading to version 3.2.4. Everything seemed fine, but the performance is much weaker: around 4-5 times slower than 3.0.0. I saw that this could be because of the change to use only one thread pool, and the configuration parameters don't seem to help much. It looks like I have tried everything at this point.

I am not saying that it should be reverted to the previous implementation, but the current one isn't good either.

Motivation

The performance is quite bad, and because of that we have to stay on version 3.0.0.

Benefits

faster processing and possibly more areas it could be used in

Possible Drawbacks

possibly using more threads in total

Hi @domis322. In #475 the processing was changed due to problems with unlimited thread creation. I hope you are aware that you can change the number of threads that EI should use via the application properties (see https://github.com/eiffel-community/eiffel-intelligence/blob/master/src/main/resources/application.properties#L80). You could try increasing these settings to see if the performance improves.

@domis322 , out of curiosity, could you share what organization you work for? It would be interesting to hear about your use case for using Eiffel Intelligence

@m-linner-ericsson yes, I have experimented with them quite a bit, but they do not seem to affect the performance that much. Even when turned up to insane numbers, it still looks like it is limited by something. The number of PIDs is increasing, but it is not processing any faster.

Hi @domis322 ,
Could you please provide more details about the previous version you have used?
In the description you said that you observed the single thread pool in the latest version (3.2.4), but
the earlier version (3.0.0) also uses a single thread pool.

We have not observed that much deviation in performance between the two versions (2.2.4 and 3.2.4).
Could you provide the values you used for the parameters below in the application where you have seen the performance deviation:
threads.core.pool.size
threads.queue.capacity
threads.max.pool.size

@RajuBeemreddy1
I thought this was only changed in version 3.1.0, as per the release notes.
[screenshot of the release notes]

I just noticed that the Docker image we are using (below) seems to be even older than some 2.x.x versions. I guess the images on Docker Hub have been a bit abandoned, but this is the one we're using. I believe it still has the older threading implementation.
[screenshot of the Docker image in use]

Tested both versions with these settings at first, since this is what we were using:

threads.core.pool.size=200
threads.queue.capacity=7000
threads.max.pool.size=250
scheduled.threadpool.size=200

After noticing that the newer version was slower, I tried increasing everything proportionally, up to 10 times, but that didn't seem to increase the performance.

Hi @domis322 ,

With the below configuration:
threads.core.pool.size: 100
threads.queue.capacity: 5000
threads.max.pool.size: 150
scheduled.threadpool.size: 100

I have checked both versions 3.0.0 and 3.2.4, subscribed with the same subscriptions used in the mail,
and ran the following scenarios; the performance was the same for both versions:

  • Subscribed to SourceChangeCreatedEvent and posted 1000 events to the queue
  • Subscribed to SourceChangeCreatedEvent and posted 10000 events to the queue
  • Subscribed to 3 events and posted 10000 events in a single script
  • Subscribed to 3 events and posted 10000 events in separate scripts

Regards,
Jainad

@domis322 , Jainad has performed the tests as you can see above but has not been able to reproduce the loss of performance. Do you think there could be some other settings you have applied that could cause these issues? Are there some other environment settings we should take into consideration? Could there be network issues?

Hey
Thanks for your replies. I would say that networking is very unlikely to be the problem here, since everything is running on the same node on our test server. I think we have a mostly default setup for the other environment settings.

Is there a way to verify that the application is picking up an environment file?

Maybe a stupid question: Which aggregation rule are we talking about (just making sure that we are talking about the same)?


Hi @domis322, we meant whether your environment/setup could be the cause, for example network/firewall issues that could have resulted in the degraded performance.

Maybe a stupid question: Which aggregation rule are we talking about (just making sure that we are talking about the same)?

As shared in the mail, AllEventRules is the ruleset and the subscription is for the condition meta.type == SourceChangeCreatedEvent


Is it possible to provide some more information in this ticket rather than referring to a mail? Traceability would improve with the added information (maybe not for now, but for future reference).

@jainadc9 Are you aware that some of the files are empty?


The environment was exactly the same for both versions; in fact I tested them both on the exact same server, with RabbitMQ and EI running in separate Docker containers on the same machine.
I will try to go back and look into it a bit more. The only reason I can think of for it behaving differently is that it might not be picking up the environment file, or that some of the properties have changed.
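
One way to answer the earlier question about whether the application is picking up an environment file is to log the resolved values at startup. This is only a sketch using standard Spring facilities; the ThreadSettingsLogger class below is hypothetical and not part of Eiffel Intelligence:

```java
import org.springframework.boot.CommandLineRunner;
import org.springframework.core.env.Environment;
import org.springframework.stereotype.Component;

// Hypothetical helper, not part of Eiffel Intelligence: prints the resolved
// thread settings at startup so it is obvious whether the intended property
// source (environment file, system properties, defaults) was actually used.
@Component
public class ThreadSettingsLogger implements CommandLineRunner {

    private final Environment env;

    public ThreadSettingsLogger(Environment env) {
        this.env = env;
    }

    @Override
    public void run(String... args) {
        String[] keys = {
            "threads.core.pool.size",
            "threads.queue.capacity",
            "threads.max.pool.size",
            "scheduled.threadpool.size"
        };
        for (String key : keys) {
            System.out.println(key + " = " + env.getProperty(key));
        }
    }
}
```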

No @domis322, nothing has changed. We are following the same rules, subscriptions, and events referred to in the wiki:
https://github.com/eiffel-community/eiffel-intelligence/blob/master/wiki/templates.md. We used the default application properties, and no particular environment file/setup is being used.

@jainadc9 / @domis322 , are any of you looking into this issue at the moment? What do you expect should happen next to make it progress?

I tried rerunning all of the tests I have done before, and more, and it still looks like it has the same performance issue. I don't really see what could be changed at this point. Could you specify how you generated the events and what consumption numbers you are getting?

version 3.0.0:
[screenshot]

version 3.2.4: (second half of the timeframe)
[screenshot]

I tried to reproduce the issue. I compared version 3.0.0 with 3.2.6 and did not observe any performance degradation. In fact I observed the opposite result: version 3.2.6 was about 4% faster compared with 3.0.0.

This issue has been around for too long now without reaching any conclusion through its comments. Should we book a meeting to try to sort out any differences in environment setup or event generation or the like, so we could get to the bottom of this? Would that be ok with you @z-sztrom and @domis322?

@e-backmark-ericsson, yes, let's meet and discuss why the results differ so significantly.

Yeah, we can definitely do that!

Notes from today's meeting.

  • AllEvents ruleset is used
  • @domis322 uses dedicated Docker images for each version, while @z-sztrom uses the same image but replaces the war file

Actions:

  • @domis322 will try to just replace the war file in the existing image instead of building a new image for it
  • @z-sztrom will test with the official old image, and then also try to build an image for the new version and check the difference
  • @e-backmark-ericsson to find the Dockerfile used when previously pushing images to Docker Hub

Just one input: if someone wants to test the newer version with the same code as in earlier versions, they could try adding back the unlimited threading in the subscriptionHandler code and run a test to see if that solves the performance issue. The code that needs to be restored can be found in this commit diff:
0937cc2#diff-67101393a8229c6b8e5741ec64e1254bc2ca652f9262a805c5148d37cbc50c54

Also check the thread pool size property values that were used in previous Eiffel Intelligence versions.

But increasing the thread core pool and queue size properties should result in almost the same behavior, I think. According to the Java docs, setting the thread queue size property to zero will result in an unlimited thread pool queue, so that could be tested as well.

Just some more ideas that can be tested, though maybe what I wrote here has already been tested?
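
To make the interaction between these settings concrete, here is a minimal sketch assuming the EI thread properties map onto Spring's ThreadPoolTaskExecutor (an assumption based on the property names; the values below are only examples). One detail worth keeping in mind is that extra threads beyond the core size are only created once the queue is full, so with a large queue capacity raising the max pool size alone changes little:

```java
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class ThreadPoolDemo {
    public static void main(String[] args) {
        // Example values only; EI reads the corresponding settings from
        // application.properties (threads.core.pool.size etc.).
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(200);   // threads created and reused first
        executor.setMaxPoolSize(250);    // only reached once the queue is full
        executor.setQueueCapacity(7000); // tasks wait here before extra threads are added
        // Note: with a non-positive queue capacity, Spring's ThreadPoolTaskExecutor
        // uses a SynchronousQueue, so tasks hand off directly to threads
        // (up to maxPoolSize) rather than going into an unbounded queue.
        executor.initialize();

        for (int i = 0; i < 10; i++) {
            final int task = i;
            executor.execute(() ->
                System.out.println("task " + task + " on " + Thread.currentThread().getName()));
        }
        executor.shutdown();
    }
}
```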

Hi,

So I tried building the images myself for both versions and running them. I tried running the existing image, and also running that image after injecting the war from the new version into it.
In all cases the newer version seems to have worse performance.

@tobiasake I tried the values from the mentioned commit, both before the change and after; they seem to have similar performance in this case (so not that good). I tried increasing the values to over 10 000, which does seem to spawn more threads (PIDs in the picture below), but it doesn't help consume the events any faster.
[screenshot showing the container PIDs]
Setting the thread queue size to zero did not help either.

I think there might be a bottleneck outside the multithreaded part, maybe because in the new version getAllDocuments() is called on the main thread?
[screenshot]
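
To illustrate the kind of bottleneck being suspected here, a purely hypothetical sketch (not EI's actual code; fetchAllDocuments, onEventSuspected and process are made-up names standing in for the database call and the event handling): if a blocking query runs on the single consuming thread before work is handed to the pool, database latency serializes all processing no matter how large the pool is.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of the suspected pattern, not EI's actual code.
public class BottleneckSketch {

    private final ExecutorService pool = Executors.newFixedThreadPool(200);

    // Suspected pattern: the blocking database call happens on the single
    // consumer thread, so every event pays the query latency sequentially.
    void onEventSuspected(String event) {
        List<String> documents = fetchAllDocuments(); // blocking call on the consumer thread
        pool.execute(() -> process(event, documents));
    }

    // Alternative: move the blocking call into the pooled task, so the
    // consumer thread only dispatches and queries run concurrently.
    void onEventAlternative(String event) {
        pool.execute(() -> process(event, fetchAllDocuments()));
    }

    private List<String> fetchAllDocuments() {
        // Stand-in for a MongoDB query whose cost grows with the collection size.
        return List.of();
    }

    private void process(String event, List<String> documents) {
        // Stand-in for subscription/aggregation handling.
    }
}
```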

@domis322, please provide your test results when using a completely empty MongoDB.

I recently tried running both versions connected to a completely empty MongoDB. This seems to have fixed the issue completely: both versions now have the same performance. I couldn't figure out exactly what in the database caused the performance issues for the newer version, but it has now been running smoothly for over a month. I think the issue can therefore be closed.
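
For anyone hitting something similar later, a quick way to see which collections had grown before wiping the database is to list the document counts. This is only a sketch using the MongoDB Java driver; the connection string and database name are assumptions and need to be adjusted to your EI setup:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;

public class CollectionSizes {
    public static void main(String[] args) {
        // Assumed connection string and database name; adjust to your instance.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("eiffel_intelligence");
            // Print the document count per collection to spot the ones that grew.
            for (String name : db.listCollectionNames()) {
                long count = db.getCollection(name).countDocuments();
                System.out.printf("%s: %d documents%n", name, count);
            }
        }
    }
}
```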

Thanks! Closing the issue.