codedive2022

How to model stochastic behavior of failures in telco or IT systems using machine learning

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Abstract:

In large-scale IT or telco networks, preventive maintenance (PM) with minimal repair at failures is required to slow the degradation of the live system and reduce its impact on the quality of the end-user experience. Network nodes fail stochastically, and their failures are related to the alarms and health-check status they show beforehand: the more major or critical alarms a node generates, the higher its probability of failure. To predict failures and reduce financial and non-financial losses, a suitable approach and model are needed to prioritize likely failures for preventive maintenance.
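To make the idea concrete, here is a minimal sketch of ranking nodes for preventive maintenance by estimated failure probability. It is not the repository's model: it swaps in a plain logistic regression written with the standard library only. The data is synthetic, and the feature names, coefficients, and node names are assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_nodes(n, rng):
    """Generate hypothetical rows of (major_alarms, critical_alarms, failed).

    Assumption: the failure probability grows with both alarm counts,
    as the abstract describes. The coefficients below are invented.
    """
    rows = []
    for _ in range(n):
        major = rng.randint(0, 10)
        critical = rng.randint(0, 5)
        p_fail = sigmoid(-4.0 + 0.3 * major + 0.9 * critical)
        rows.append((major, critical, 1 if rng.random() < p_fail else 0))
    return rows


def fit_logistic(rows, lr=0.05, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for major, critical, failed in rows:
            err = sigmoid(b + w[0] * major + w[1] * critical) - failed
            grad_w[0] += err * major
            grad_w[1] += err * critical
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def failure_probability(w, b, major, critical):
    """Estimated probability that a node with these alarm counts fails."""
    return sigmoid(b + w[0] * major + w[1] * critical)


rng = random.Random(42)
weights, bias = fit_logistic(make_synthetic_nodes(400, rng))

# Hypothetical live nodes: name -> (major alarms, critical alarms).
nodes = {"node-a": (1, 0), "node-b": (8, 4), "node-c": (4, 1)}

# Highest estimated failure probability first: the PM priority list.
ranking = sorted(
    ((name, failure_probability(weights, bias, *alarms))
     for name, alarms in nodes.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, prob in ranking:
    print(f"{name}: estimated failure probability {prob:.2f}")
```

The sorted list is the prioritization step: maintenance effort goes first to the nodes the model considers most likely to fail. A real pipeline would replace the synthetic rows with alarm and health-check history collected from the live system.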

Prometheus diagram

image

Neural network

image

SHAP explainer

image