High Wait Time and CPU Usage with Oracle.ManagedDataAccess in High-Concurrency .NET 8 Environment
Description:
We are experiencing high wait times and CPU usage in our .NET 8 application using Oracle.ManagedDataAccess (ODP.NET Managed Driver) under high concurrency. The issue is most evident in Kubernetes environments with many pods and a high minimum pool size.
Environment:
- .NET 8
- Oracle.ManagedDataAccess (please specify exact version)
- Connection string:
User Id=###;Password=###;Data Source=###;VALIDATE CONNECTION=True;Persist Security Info=True;Max Pool Size=100;Connection Lifetime=120;Min Pool Size=30;CONNECTION TIMEOUT=60;POOLING=True;
- 20+ pods (Kubernetes), each with its own pool
- ~2,500 requests/minute
Symptoms:
- Endpoints spend most of their time waiting for Oracle operations, not on SQL execution itself (confirmed by Dynatrace traces).
- SQL queries are fast when run directly on the database.
- High CPU usage in some pods, resolved only by restarting the pod.
- No apparent locks or slowdowns in the Oracle database itself.
What we have tried:
- Increased Connection Lifetime to 600s (testing).
- Monitored pool usage and resource consumption.
- Ensured queries are optimized.
To diagnose the problem, you'll need to collect at least an ODP.NET trace and probably performance profiling information. That will identify which part of the ODP.NET operations is slow: for example, the problem could be in creating connections, some queries could be executing slowly on the ODP.NET side, or result retrieval could be taking a while.
If most ODP.NET operations are behaving slowly, the pod itself could be starved for resources. You could then review if there's enough memory, CPU, etc. If there's not enough, you could add more hardware.
If there are enough resources for the load, we can review the code to determine if it's efficient for its use profile. The ODP.NET team can also review if there are ways to make ODP.NET Core more efficient internally.
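For reference, a minimal sketch of turning on ODP.NET Core tracing from application code, assuming the OracleConfiguration tracing properties available in recent releases (the directory path and trace level are illustrative values, not recommendations):

using Oracle.ManagedDataAccess.Client;

public static class OdpNetTracing
{
    // Sketch only: enable ODP.NET Core tracing at application startup.
    // The directory must be writable by the pod; the level controls verbosity
    // (consult the ODP.NET documentation for the exact values to use).
    public static void Enable()
    {
        OracleConfiguration.TraceFileLocation = "/tmp/odpnet-traces"; // assumed writable path
        OracleConfiguration.TraceLevel = 7;                           // assumed high-verbosity level
    }
}

Calling Enable() once during startup is enough; the trace files can then be collected from the pod's filesystem.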
Which ODP.NET Core version are you using?
Thanks for your quick response, Alex.
We are using "Oracle.ManagedDataAccess.Core" Version="23.8.0".
I can only show the hotspot at the time of high consumption, and it is in this method:
Top hotspot:
Method: ReaderStream+d_20.MoveNext
Overall contribution: 97.8%
Our pods do not time out or exceed their resource limits, and we will monitor resource usage during the next incident.
If you need any specific trace configuration or additional details, please let us know.
One way to improve your connection time is to remove "Validate Connection=true" from the connection string. That will eliminate a round trip for every connection open request. To ensure a valid connection from the pool, you can use Oracle Fast Connection Failover instead. Oracle Application Continuity is also an option; however, FCF is a lighter-weight solution as it primarily handles planned maintenance scenarios, which is the most common outage type.
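For illustration only, the same pool settings without the per-open validation round trip would look like this (a sketch derived from the connection string posted above; keep or adjust the other attributes as appropriate):

User Id=###;Password=###;Data Source=###;Persist Security Info=True;Max Pool Size=100;Connection Lifetime=120;Min Pool Size=30;Connection Timeout=60;Pooling=True;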
Did you run a similar workload concurrency with an earlier ODP.NET Core version that didn't have high CPU/wait time? If so, what version was it?
The ReaderStream MoveNext would suggest results being retrieved from the DB. Does this happen with specific data retrieval, such as LOBs or other non-scalar types?
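If LOB columns are part of the result set, the fetch-related settings are worth checking. Below is a hedged sketch assuming a hypothetical table with a CLOB column; the query, column layout, and numeric values are illustrative, not recommendations:

using System.Threading.Tasks;
using Oracle.ManagedDataAccess.Client;

public static class LobFetchSketch
{
    // Sketch: reader settings that influence how much data moves per database round trip.
    public static async Task DumpRowsAsync(string connectionString)
    {
        await using var conn = new OracleConnection(connectionString);
        await conn.OpenAsync();

        await using var cmd = new OracleCommand("SELECT id, payload_clob FROM demo_table", conn); // hypothetical table
        cmd.InitialLOBFetchSize = 8192;   // return small LOBs inline with the row, avoiding extra round trips per LOB

        await using var reader = (OracleDataReader)await cmd.ExecuteReaderAsync();
        reader.FetchSize = 1024 * 1024;   // bytes buffered per fetch; larger values mean fewer round trips
        while (await reader.ReadAsync())
        {
            var id = reader.GetDecimal(0);      // Oracle NUMBER maps to .NET decimal
            var payload = reader.GetString(1);  // CLOB read as string; watch memory for very large values
        }
    }
}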
A few more comments from my dev team:
- A full call stack captured at that moment for ReaderStream+d_20.MoveNext would tell us which operation is consuming so much CPU.
- Are you using async APIs? (A minimal sketch of the pattern we mean follows this list.)
- In addition to not setting Validate Connection, not setting a Connection Lifetime would also improve connection performance. The CL setting can create a connection storm situation if many connections have their lifetimes expire at about the same time.
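To make the async question concrete, here is a minimal sketch of the pattern being asked about, using the standard ADO.NET-style async methods exposed by ODP.NET Core (the query, table, and parameter names are placeholders):

using System;
using System.Threading.Tasks;
using Oracle.ManagedDataAccess.Client;

public static class AsyncUsageSketch
{
    // Sketch of end-to-end async ODP.NET usage: async open and async execution.
    // Error handling is omitted for brevity.
    public static async Task<int> CountOrdersAsync(string connectionString, int customerId)
    {
        await using var conn = new OracleConnection(connectionString);
        await conn.OpenAsync();                      // asynchronous pool checkout / physical connect

        await using var cmd = new OracleCommand(
            "SELECT COUNT(*) FROM orders WHERE customer_id = :customerId", conn); // hypothetical query
        cmd.Parameters.Add(new OracleParameter("customerId", customerId));

        var result = await cmd.ExecuteScalarAsync(); // asynchronous round trip to the database
        return Convert.ToInt32(result);
    }
}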
Closing this issue for now. I will reopen once we receive more diagnostics.
First, I apologize for the delay in responding, but I needed to be offline for a few days.
I have some recent insights from what we deployed to the production environment, and we are still experiencing high CPU usage as previously reported.
Our current connection string has the following parameters:
Validate Connection=True;Persist Security Info=True;Max Pool Size=100;Connection Lifetime=14400;Min Pool Size=30;Connection Timeout=60;Pooling=True;
According to recommendations from our DBA team, we have not removed the Validate Connection=true parameter, as it is important for us that each connection is validated before the pool hands it to the application.
Some characteristics of the service where we are facing this issue, and answering some of your previous questions:
- We are indeed using an asynchronous API. It has two integrations with other APIs, but their responses are fast enough.
A more complete trace that I was able to capture during a critical incident is as follows:
Potential root causes for requests that failed with HTTP 500 - Internal Server Error
2.38% likely failed due to an Oracle.ManagedDataAccess.Client.OracleException
Exception messages of Oracle.ManagedDataAccess.Client.OracleException
OR: Oracle Communication: Failed to connect to server or failed to parse connect string
Stacktraces of exceptions
OracleInternal.ConnectionPool.OracleConnectionDispenser`3+<GetAsync>d__36.MoveNext
System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop
System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1.MoveNext[+]
System.Threading.ThreadPoolWorkQueue.Dispatch
System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart
Oracle.ManagedDataAccess.Client.OracleConnectionInternal+<OpenInternalAsync>d__597.MoveNext
System.Threading.ExecutionContext.RunInternal
System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1.MoveNext
System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction[+]
System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.SetException

OracleInternal.ConnectionPool.OracleConnectionDispenser`3+<GetAsync>d__36.MoveNext
System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop
System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1.MoveNext[+]
System.Threading.ThreadPoolWorkQueue.Dispatch
System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw
Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker+<g__Logged|17_1>d.MoveNext
Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow[+]
Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow[+]
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification
Attached is a graph showing the moment the pod climbs above 1000 millicores.
We use Dynatrace for metrics in this .NET application; please let us know if you would like to suggest any instrumentation to improve our analysis.
Thanks for your time.
Fábio Dantas.
@alexkeh After the last report, is it possible to reopen the analysis?
@FabioDantasBmg I've reopened the issue. However, I don't see the pod graph you posted. Due to the difficulty in predicting when the error occurs, it may be time to move this diagnosis to an Oracle Support service request so that we can give you a diagnostic ODP.NET drop.
A diagnostic drop will let us capture enhanced traces to more easily identify the problem. You can deploy it to a single pod that manifests the error more often, or try to trigger the error if you know which events precede it. That way, you minimize the number of production pods running the diagnostic drop.
Thank you for reopening the issue. We will proceed to open a Service Request with Oracle Support to obtain the diagnostic ODP.NET drop as suggested.
Important clarification: The issue occurs in specific pods, but our pods are created dynamically by HPA (Horizontal Pod Autoscaler). All pods share the same base code, so we cannot isolate unique code to a single pod. However, we can deploy the diagnostic version to all pods in our environment and monitor which ones manifest the issue more frequently to capture the enhanced traces.
We have identified patterns where the issue tends to occur more often, so we can strategically monitor those instances when using the diagnostic drop.
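Independently of how the diagnostic drop itself is configured, ordinary ODP.NET tracing can be gated on an environment variable so that only selected pods write traces. A sketch, assuming the OracleConfiguration tracing properties and a hypothetical ODP_TRACE_LEVEL variable set via the pod spec:

using System;
using Oracle.ManagedDataAccess.Client;

public static class ConditionalOdpTracing
{
    // Sketch: enable tracing only when the (hypothetical) ODP_TRACE_LEVEL variable is set,
    // so individual pods or deployments can be targeted without a code change.
    public static void EnableIfRequested()
    {
        var level = Environment.GetEnvironmentVariable("ODP_TRACE_LEVEL");
        if (int.TryParse(level, out var traceLevel) && traceLevel > 0)
        {
            OracleConfiguration.TraceFileLocation = "/tmp/odpnet-traces"; // assumed emptyDir or PVC mount
            OracleConfiguration.TraceLevel = traceLevel;
        }
    }
}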
I will reference this GitHub issue (#455) in the Oracle Support SR and provide all the context we've discussed here, including:
- Our current connection string parameters
- The fact that we're using async APIs
- The high CPU usage patterns we're experiencing
- The enhanced traces we'll be able to collect with the diagnostic version
Thank you for your assistance with this investigation.
Hey @FabioDantasBmg
We have experienced something similar to what you have described. Do you have a service request ID that we can reference when approaching Oracle?
This issue will be handled via an SR.