Elarm is an Alarm Manager for Erlang. It is designed to be easy to include in an Erlang-based system. Most functionality is implemented through plugins, so it is easy to change the behaviour if necessary.
Elarm is designed to be part of an Erlang system, but it can also be started quickly from an Erlang shell:
$ git clone git@github.com:esl/elarm.git
[...]
$ cd elarm
$ make
[...]
$ erl -pa ebin -pa deps/gproc/ebin
> application:start(gproc).
ok
> application:start(elarm).
ok
> elarm:raise(partition_full, "/dev/hda2", [{level,90}]).
ok
> elarm:get_alarms().
{ok,[{alarm,partition_full,undefined,"/dev/hda2",
{{2014,5,12},{10,46,45}},
{1399,891605,536270},
indeterminate,<<>>,<<>>,<<>>,
[{level,90}],
[],[],undefined,undefined,new,undefined}]}
> elarm:clear(partition_full, "/dev/hda2").
ok
> elarm:get_alarms().
{ok,[]}
>
Alarms and Events are often mixed up, but there are some important differences.
An Event is stateless. It just says "Something happened", e.g. "failed to open file" or "invoice created", and that is all. Logging tools like lager do event logging.
An Alarm, on the other hand, has state. It is an indicator that something is wrong in the system. Once an alarm is raised, it remains active until the error condition connected to the alarm no longer applies. When the error condition is removed, the alarm is cleared.
Elarm keeps a list of all currently active alarms. It is possible for a Manager application to subscribe to all changes in the alarm list.
A user can acknowledge an alarm, that is, tell Elarm that they are aware of the alarm. It is also possible to add comments to an alarm. Finally, it is possible to manually clear an alarm. Normally an alarm is cleared by the application that raised it, but in some cases that is not done, e.g. if the alarms are received as SNMP traps from another system.
All changes to an alarm are logged in an alarm log.
An application raising an alarm should not have to know too much about the alarm handling, so when an alarm is raised the application only has to supply a name for the alarm, the identity of the entity the alarm applies to, and optionally some additional information. As an example, the name could be 'partition_full', the entity "/dev/hda1", and the additional information "90%".
For an operator it is useful to have some additional information when handling the alarms. This information includes "severity" (how serious the alarm is), "probable_cause" (the likely reason for the alarm), and "proposed_repair_action" (how to fix the problem). In Elarm it is possible to add this information to the alarms by adding configuration data for an alarm. If no configuration data is found when an alarm is raised, default data is used. The default data is defined in the elarm.app file and can be overridden by data in sys.config.
To show which alarms need configuration data, Elarm records all alarms that have been raised for which there is no configuration. It is possible to query Elarm for a list of alarms that are missing configuration using elarm:get_unconfigured/0,1. Alarm configuration can be added via elarm:add_configuration/2.
When Elarm is started it starts one instance of the alarm manager, with the name elarm_server. It is possible to run several alarm managers at once. If an environment variable named servers is put in the system configuration file, with a list of tuples {ServerName, Opts} as its value, Elarm will start one alarm manager for each tuple. It is also possible to manually start a new alarm manager using elarm:start_server(Name, Opts).
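As a rough illustration of the servers variable described above, a sys.config fragment might look like the sketch below. The server name site_b_alarms is made up for this example, and the empty option lists are placeholders: which options Opts accepts is not covered here, so treat them as an assumption.

```erlang
%% sys.config (sketch, not taken from the Elarm sources)
[
 {elarm,
  [
   %% One alarm manager is started for each {ServerName, Opts} tuple.
   %% site_b_alarms is a hypothetical name; [] is a placeholder for Opts.
   {servers, [{elarm_server, []},
              {site_b_alarms, []}]}
  ]}
].
```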
An application wanting to raise an alarm only has to call
ok = elarm:raise(partition_full, "/dev/hda2", [{level,90}])
and to clear it
ok = elarm:clear(partition_full, "/dev/hda2")
So Name and Entity together uniquely identify an alarm.
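Because an alarm is identified by its name and entity, the code that raises it must clear it with the same pair. The sketch below shows one way an application could organise that. The module name, the threshold parameter and the {level, N} additional information are illustrative assumptions; only elarm:raise/3 and elarm:clear/2 come from the calls shown above.

```erlang
-module(partition_check_sketch).
-export([decide/3, apply_action/1]).

%% Pure decision step: pick the Elarm call that fits the measured level.
%% Threshold is a hypothetical parameter chosen by the application.
decide(Entity, Level, Threshold) when Level >= Threshold ->
    {raise, partition_full, Entity, [{level, Level}]};
decide(Entity, _Level, _Threshold) ->
    {clear, partition_full, Entity}.

%% Side-effect step: needs a running Elarm. The same Name and Entity
%% are used for raise and clear, so both refer to the same alarm.
apply_action({raise, Name, Entity, AdditionalInfo}) ->
    elarm:raise(Name, Entity, AdditionalInfo);
apply_action({clear, Name, Entity}) ->
    elarm:clear(Name, Entity).
```

Keeping the decision separate from the call makes the logic testable without Elarm running. How Elarm responds to clearing an alarm that is not currently active is not specified here, so a real application may want to track whether it has raised the alarm before clearing it.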
A Management application can request a list of all currently active alarms by calling elarm:get_alarms/1.
To subscribe to all alarm events use elarm:subscribe(Server, Filter). This will return a {Ref, AlarmList}, where Ref is a reference that will be included in all received messages and is also used to cancel the subscription. AlarmList is a list of all the currently active alarms that match the filter. A message will be received for every change to an alarm that matches the filter.
The filter is a list of filter elements. If one filter element matches, then the filter matches:
FilterElement = all | {type, alarm_type()} | {src, alarm_src()}
-
all, matches all alarms
-
{type, Type}, matches alarms with alarm_type == Type
-
{src, Src}, matches alarms with alarm_src == Src
The type and src filter elements may appear several times. In that case each one is tried, and if one matches then the filter matches.
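The "any element matches" rule can be sketched as a small standalone function. This is not Elarm's implementation: the alarm is represented here as a map with type and src keys purely for illustration, and the module and function names are made up.

```erlang
-module(elarm_filter_sketch).
-export([matches/2]).

%% Hypothetical alarm representation for this sketch only: a map with
%% type and src keys. Elarm itself uses its own alarm record.
-type alarm() :: #{type := term(), src := term()}.
-type filter_element() :: all | {type, term()} | {src, term()}.

%% The filter matches as soon as any one of its elements matches.
-spec matches([filter_element()], alarm()) -> boolean().
matches(Filter, Alarm) ->
    lists:any(fun(Element) -> element_matches(Element, Alarm) end, Filter).

element_matches(all, _Alarm) -> true;
element_matches({type, Type}, #{type := AlarmType}) -> AlarmType =:= Type;
element_matches({src, Src}, #{src := AlarmSrc}) -> AlarmSrc =:= Src;
element_matches(_Element, _Alarm) -> false.
```

With this sketch, a filter such as [{type, partition_full}, {src, "/dev/hda1"}] matches an alarm that has either that type or that source. An empty filter matches nothing in the sketch; whether Elarm treats an empty filter the same way is not stated here.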
The format of the messages is:
- new alarm: {elarm, Ref, alarm()}
- acknowledged alarm: {elarm, Ref, {ack, alarm_id(), alarm_src(), event_id(), ack_info()}}
- unacknowledged alarm: {elarm, Ref, {unack, alarm_id(), alarm_src(), event_id(), ack_info()}}
- cleared alarm: {elarm, Ref, {clear, alarm_id(), alarm_src(), event_id()}}
- manually cleared alarm: {elarm, Ref, {manual_clear, alarm_id(), alarm_src(), event_id(), user_id()}}
- comment added: {elarm, Ref, {add_comment, alarm_id(), alarm_src(), event_id(), comment()}}
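A subscriber typically pattern matches on these tuples. The sketch below classifies one received message against the subscription's Ref. The module name and the returned tags are made up for illustration, and it assumes that alarm() is a record tuple tagged alarm, as in the shell output at the top of this document.

```erlang
-module(elarm_msg_sketch).
-export([classify/2]).

%% Ref appears twice in each clause head, so a clause only matches
%% messages that carry the reference of this subscription.
classify(Ref, {elarm, Ref, {ack, AlarmId, Src, _EventId, _AckInfo}}) ->
    {acknowledged, AlarmId, Src};
classify(Ref, {elarm, Ref, {unack, AlarmId, Src, _EventId, _AckInfo}}) ->
    {unacknowledged, AlarmId, Src};
classify(Ref, {elarm, Ref, {clear, AlarmId, Src, _EventId}}) ->
    {cleared, AlarmId, Src};
classify(Ref, {elarm, Ref, {manual_clear, AlarmId, Src, _EventId, UserId}}) ->
    {manually_cleared, AlarmId, Src, UserId};
classify(Ref, {elarm, Ref, {add_comment, AlarmId, Src, _EventId, Comment}}) ->
    {comment_added, AlarmId, Src, Comment};
%% Assumption: a new alarm arrives as a record tuple tagged alarm.
classify(Ref, {elarm, Ref, Alarm}) when element(1, Alarm) =:= alarm ->
    {new_alarm, Alarm};
%% Anything else, including messages for another Ref, is ignored.
classify(_Ref, _Other) ->
    ignored.
```

In a subscriber process this function would be called from a receive loop, or from handle_info/2 in a gen_server, with the Ref obtained when the subscription was created.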
Alarm Summary gives a summary of the presence or absence of unacknowledged and acknowledged alarms of the various severities. This is useful, for example, for showing the status on maps or other overview user interfaces.
To start a subscription use elarm:summary_subscription(Server, Filter). Filter is the same as for alarm list subscriptions. Every time the summary changes a message will be received.
The format of the messages is:
{elarm, Ref, #alarm_summary{}}
An alarm has two states. Initially it is new, to show the user that it is a new alarm. When the user has seen the alarm, they can acknowledge it using elarm:acknowledge/3. This changes the alarm state to acknowledged, and a timestamp and the UserId are added to the alarm.
It is possible to add comments to alarms, e.g. to add notes on troubleshooting or corrective actions taken. Each comment is stored with a timestamp and the UserId of the user.
Normally alarms are cleared automatically by the application calling elarm:clear, but in some cases it may be necessary to clear an alarm manually. This can be done using elarm:manual_clear/3.