oracle/oci-typescript-sdk

CPU Leak

When using oci-sdk versions greater than ^1.5.2, including the latest, we see a slow but steady increase in CPU utilization, which eventually grows without bound and uses all available CPU on the instance. Reverting to ^1.5.2 fixes the issue for us. This occurs across multiple projects that use the oci-sdk and can be directly attributed to the oci-sdk version, as an otherwise identical build with the older ^1.5.2 does not exhibit the leak. A user of the jitsi-autoscaler (which uses oci-sdk) reported this leak to us and ran a profile; we have confirmed the behavior ourselves but have not done the profiling.

The user's profile shows that the system is overwhelmed by timers, in case that helps you debug:

46492.8 ms 43.13 % | 97726.0 ms 90.67 % | (anonymous) status.js:82 |
46492.8 ms 43.13 % | 97726.0 ms 90.67 % | ........listOnTimeout internal/timers.js:502 |
46492.8 ms 43.13 % | 97726.0 ms 90.67 % | ...............processTimers internal/timers.js:482 |
43874.9 ms 40.71 % | 43874.9 ms 40.71 % | (anonymous) status.js:96 |
43874.9 ms 40.71 % | 43874.9 ms 40.71 % | ........(anonymous) status.js:94 |
43874.9 ms 40.71 % | 43874.9 ms 40.71 % | ...............get stats status.js:93 |
43874.9 ms 40.71 % | 43874.9 ms 40.71 % | ......................(anonymous) status.js:82 |
43874.9 ms 40.71 % | 43874.9 ms 40.71 % | .............................listOnTimeout internal/timers.js:502 |
43874.9 ms 40.71 % | 43874.9 ms 40.71 % | ..................................processTimers internal/timers.js:482 |
4877.3 ms 4.53 % | 5978.8 ms 5.55 % | (anonymous) status.js:124 |
4877.3 ms 4.53 % | 5978.8 ms 5.55 % | .......get stats status.js:93 |
4877.3 ms 4.53 % | 5978.8 ms 5.55 % | ...............(anonymous) status.js:82 |
4877.3 ms 4.53 % | 5978.8 ms 5.55 % | ......................listOnTimeout internal/timers.js:502 |
4877.3 ms 4.53 % | 5978.8 ms 5.55 % | .............................processTimers internal/timers.js:502 |

Running into the same issue. It can be observed in any long-running service, such as a microservice, that uses oci-sdk. In my case the CPU usage was increasing by about 0.1% every 10 minutes on a 2 CPU host, as measured by "ps -p <pid> -o %cpu,%mem", seemingly indefinitely. This growth happens even though the service is not making any OCI calls.

The SDK probably starts some timer that constantly runs in the background?

Before this I was calling OCI APIs directly, and the server used close to 0% CPU. After the switch to oci-sdk, CPU always goes to 100% after 2 days or so. I have just removed the oci-sdk from the service, and the CPU usage pattern went back to normal, generally consuming only 0.x% and not increasing.

oci-sdk version used: 2.7.0.3
node.js: 18.18.2

Thanks for reporting this, @aaronkvanmeerten. We are working on the fix internally and will update here once it's fixed.

Setting the environment variable OCI_SDK_DEFAULT_CIRCUITBREAKER_ENABLED=false is a workaround to avoid this issue for now.
Docs: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/typescriptsdkconcepts.htm#typescriptsdkconcepts_topic_Retry_Circuit_Breakers

Hi @vpeltola, this issue seems to be caused by circuit breakers not shutting down after they are no longer needed. The most recent release of the SDK includes a method in each client that the user can call to shut down these circuit breakers as needed. Please see this example. Let us know if this seems to fix your issue, thanks!
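To illustrate the failure mode described above, here is a minimal sketch. It is not the SDK's actual code: a hypothetical `FakeBreaker` class starts an interval timer in its constructor, the way a circuit breaker might keep periodic statistics, and only stops it when `shutdown()` is called. All names here (`FakeBreaker`, `activeTimers`, `shutdown`) are made up for illustration.

```typescript
// Hypothetical stand-in for a circuit breaker that keeps periodic stats.
// This is NOT oci-sdk code; it only models the "timer never stopped" pattern.
let activeTimers = 0;

class FakeBreaker {
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(intervalMs: number = 1000) {
    // Each instance schedules its own recurring callback.
    this.timer = setInterval(() => {
      /* periodic bookkeeping would run here */
    }, intervalMs);
    activeTimers++;
  }

  // Stops the recurring callback; safe to call more than once.
  shutdown(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
      activeTimers--;
    }
  }
}

// Creating client-like objects repeatedly without shutting them down
// leaves one more live timer behind each time.
const leaked: FakeBreaker[] = [];
for (let i = 0; i < 5; i++) {
  leaked.push(new FakeBreaker());
}
const timersWhileLeaking = activeTimers;

// Shutting each one down releases its timer.
leaked.forEach((b) => b.shutdown());
const timersAfterShutdown = activeTimers;
```

If a service constructs such objects over and over and never stops them, the number of timer callbacks the event loop has to run keeps growing, which would be consistent with the timer-heavy profile posted earlier in this thread.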

Hmm, the solution shouldn't be to shutdown something (circuit breakers) that I didn't start in the first place. If they were automatically started without the user's knowledge, they should also shutdown automatically. And if/while they are running, they should not leak memory and use progressively more CPU. I think there is still a bug that needs fixing.

I agree completely with the sentiment. No other library I have ever used has required me to run extra code to shut down pieces in order to not leak CPU. Something is clearly wrong in this library. Especially because it did not happen before a certain version, I believe it must be some kind of bug that needs fixing.

Hi @vpeltola @aaronkvanmeerten, thank you for your feedback.
For all OCI SDKs, including TypeScript, we've decided to use circuit breakers in our clients by default. This helps prevent overloading OCI services during partial service outages, to improve availability.
There isn't an easy way for us to tell when the user is done with a TypeScript SDK client they've created. A client can be created, used to make a call, and then not needed. Or, it may need to be left open perpetually to be used for a number of API calls over time. Since we don't have a good way to know when a user is done with the client ourselves, we think it's best if the user closes the client themselves (by calling .shutdownCircuitBreaker()) when they know they're done using it.
To address @aaronkvanmeerten's comment specifically, manually closing a client that has been manually created is not uncommon across libraries. For example, streaming libraries such as grpc for TS ask users to create a streaming object, use it, and then manually close it with stream.cancel() to ensure it doesn't continue to use resources.
If you have suggestions as to how we could better handle circuit breakers in the SDK, we welcome them, as we do recognize this is an extra step the user has to take. Thanks!

As part of the latest TypeScript release, .close() has been added to each client to further address this issue, and to more closely resemble the behavior of clients from the Java SDK. In addition, this method's use is now shown in each of the TypeScript examples that use clients.
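For short-lived clients, one way to make sure the close call is never forgotten is to wrap creation and cleanup in a helper. The following is only a sketch under assumptions: `Closable`, `withClient`, and `DemoClient` are hypothetical names, and `close()` is assumed to be synchronous (if the real method returns a promise, it would need to be awaited in the `finally` block).

```typescript
// Minimal shape assumed for illustration: anything with a close() method.
interface Closable {
  close(): void;
}

// Runs `work` with a freshly created client and always closes it afterwards,
// even if `work` throws or rejects.
async function withClient<C extends Closable, R>(
  create: () => C,
  work: (client: C) => Promise<R>
): Promise<R> {
  const client = create();
  try {
    return await work(client);
  } finally {
    client.close();
  }
}

// Hypothetical stand-in client used only to demonstrate the helper.
class DemoClient implements Closable {
  closed = false;

  async listThings(): Promise<string[]> {
    return ["a", "b"];
  }

  close(): void {
    this.closed = true;
  }
}
```

With a real SDK client, `create` would construct the client and `work` would make the API calls. A long-lived service that reuses one client for its whole lifetime would instead call `.close()` once, during its own shutdown.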