microsoft/iomt-fhir

MeasurementCollectionToFhir timeout while connecting to FHIR server

mmacagno opened this issue · 2 comments

We've started experiencing a 99%+ failure rate of the MeasurementCollectionToFhir function.
The problem is that the function will time out (after ~20 seconds) on the Observation update in the FHIR server.
The FHIR server is not the problem: updates sent to it from other services complete in under 200 ms.

The web app machine seems unable to even curl the endpoint without timing out.

I currently have a Sev A support case open with Microsoft Support.

One possible issue could be related to SNAT exhaustion.

https://4lowtherabbit.github.io/blogs/2019/10/SNAT/

Having deployed with the IOMT template, we have no VNET between the IOMT function and the FHIR server.
One of the solutions suggested, besides creating a premium VNET, is to improve the app so that it reuses connections.
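
To make "reuse connections" concrete, here is a minimal Python sketch (not taken from the IOMT code base). It starts a throwaway local HTTP server and counts how many TCP connections a client opens with and without a shared connection. Each new outbound connection is what would consume a SNAT port. The names `CountingHandler` and `count_connections` are hypothetical and exist only for this illustration; in .NET the analogous practice is usually to share a single `HttpClient` instance rather than create one per request.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler that counts accepted TCP connections."""

    protocol_version = "HTTP/1.1"  # allow keep-alive so a connection can be reused
    connections = 0

    def setup(self):
        super().setup()
        # One handler instance is created per accepted TCP connection.
        type(self).connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def count_connections(reuse, requests=5):
    """Send `requests` GETs and return how many TCP connections the server saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port) if reuse else None
        for _ in range(requests):
            conn = shared if shared is not None else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
            if shared is None:
                conn.close()  # a fresh connection (and port) per request
        if shared is not None:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("with reuse:", count_connections(reuse=True))
    print("without reuse:", count_connections(reuse=False))
```

With a shared connection the server sees a single TCP connection for all requests; without it, every request opens a new one.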

My question is whether the IOMT MeasurementCollectionToFhir is reusing connections.
I just deployed the latest code from main and did not see any improvement.

Thanks

Heard back from Azure App Service support. There is apparently a limit of 128 ports per instance, so if one instance of the function app had more than 128 outstanding requests, that could explain hitting this issue.
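
One way to picture the limit: if a single instance never has more outbound requests in flight than it has ports, it cannot run into the condition described above. The Python sketch below caps in-flight calls with a semaphore. It is only an illustration of the idea and was not proposed in this thread; `OutboundLimiter` and `MAX_OUTBOUND` are hypothetical names, and the value 100 is an assumed safety margin below the 128 figure quoted by support.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Assumption: leave some headroom below the 128-port figure quoted by support.
MAX_OUTBOUND = 100


class OutboundLimiter:
    """Hypothetical helper that caps the number of concurrent outbound calls."""

    def __init__(self, limit=MAX_OUTBOUND):
        self._slots = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for diagnostics

    def run(self, operation):
        # Blocks while `limit` calls are already in flight.
        with self._slots:
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return operation()
            finally:
                with self._lock:
                    self.in_flight -= 1


if __name__ == "__main__":
    limiter = OutboundLimiter(limit=3)
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(lambda i: limiter.run(lambda: i * 2), range(20)))
    print(results[:5], "peak concurrency:", limiter.peak)
```

A cap like this trades throughput for stability on one instance, which is why scaling out to more instances is still the more direct way to get more ports.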

This would also explain why the switch to consumption mitigated the issue: call volume appears to be high enough to scale horizontally to additional instances, which results in more ports overall.

For now, the suggestion is to continue hosting on the consumption plan. In addition, we will evaluate adding logic in the FHIR Conversion step to catch these exceptions (see below), wait for a period of time, and retry. This should hopefully allow us to handle this scenario more gracefully in the future and prevent the whole batch from failing and being retried. Even with this fix, some form of horizontal scaling will be needed for optimal performance so that more SNAT ports are available.

System.Net.WebException : Only one usage of each socket address (protocol/network address/port) is normally permitted. Only one usage of each socket address (protocol/network address/port) is normally permitted. ---> System.Net.Http.HttpRequestException : Only one usage of each socket address (protocol/network address/port) is normally permitted.
System.Net.WebException : A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
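
The catch, wait, and retry idea described above could look roughly like the following Python sketch. It is not the planned implementation. `TransientConnectionError` is a hypothetical stand-in for the two socket exceptions quoted above, and `call_with_retry`, the attempt count, and the exponential delays are all assumptions made for illustration.

```python
import time


class TransientConnectionError(Exception):
    """Hypothetical stand-in for the socket exceptions quoted above."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run `operation`, retrying transient connection failures with backoff.

    Delays double on each retry (0.5 s, 1 s, 2 s with the defaults). The last
    failure is re-raised so the caller can still fail the batch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_update():
        # Simulated Observation update that fails twice before succeeding.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TransientConnectionError("simulated port exhaustion")
        return "updated"

    print(call_with_retry(flaky_update, base_delay=0.01))
```

Retrying only the failed call, rather than the whole batch, gives ports time to be released without redoing work that already succeeded.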

Adding a VNET is another mitigation option, but in addition to adding the VNET to your function app you would also need to switch to Stream Analytics dedicated clusters. See https://docs.microsoft.com/en-us/azure/stream-analytics/connect-job-to-vnet

This was resolved by switching the Azure function responsible for MeasurementCollectionToFhir to dynamic scaling.
Closing.