/Argus

Alert aggregator and notifications provider for all kinds of monitoring software

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

Argus

build badge codecov badge Code style: black

Argus is a platform for aggregating incidents across network management systems, and sending notifications to users. Users build notification profiles that define which incidents they subscribe to.

This repository hosts the backend built with Django, while the frontend is hosted here: https://github.com/Uninett/Argus-frontend.

Setup

Prerequisites

  • Python 3.7+
  • pip

Dataporten setup

  • Register a new application with the following redirect URL: {server_url}/oidc/complete/dataporten_feide/
    • {server_url} must be replaced with the URL to the server running this project, like http://localhost:8000
  • Add the following permission scopes:
    • profile
    • userid
    • userid-feide

Project setup

  • Create a Python 3.7+ virtual environment
  • pip install -r requirements.txt
  • python manage.py migrate
  • python manage.py initial_setup

Start the server with python manage.py runserver.

Alternative setup using Docker Compose

Site- and deployment-specific settings

Site-specific settings are set as per 12 factor, with environment variables. For more details, see the relevant section in the docs: Setting site-specific settings.

A recap of the environment variables that can be set by default follows.

Environment variables

  • ARGUS_DATAPORTEN_KEY, which holds the id/key for using dataporten for authentication.
  • ARGUS_DATAPORTEN_SECRET, which holds the password for using dataporten for authentication.
  • ARGUS_COOKIE_DOMAIN, the domain the cookie is set for
  • ARGUS_FRONTEND_URL, for redirecting back to frontend after logging in through Feide, and also CORS. Must either be a subdomain of or the same as ARGUS_COOKIE_DOMAIN
  • ARGUS_SEND_NOTIFICATIONS, True in production and False by default, to allow supressing notifications
  • DEBUG, 1 for True, 0 for False
  • TEMPLATE_DEBUG. By default set to the same as DEBUG.
  • DEFAULT_FROM_EMAIL, the email From-address used for notifications sent via email
  • EMAIL_HOST, smarthost (domain name) to send email through
  • EMAIL_HOST_USER, (optional) if the host in EMAIL_HOST needs authentication
  • EMAIL_HOST_PASSWORD, (optional) password if the smarthost needs that
  • EMAIL_PORT, in production by default set to 587
  • SECRET_KEY, used internally by django, should be about 50 chars of ascii noise (but avoid backspaces!)

There are also settings (not env-variables) for which notification plugins to use:

DEFAULT_SMS_MEDIA, which by default is unset, since there is no standardized way of sending SMSes. See Notifications and notification plugins.

DEFAULT_EMAIL_MEDIA, which is included and uses Django's email backend. It is better to switch out the email backend than replcaing this plugin.

A Gmail account with "Allow less secure apps" turned on, was used in the development of this project.

Production gotchas

The frontend and backend currently needs to be on either the same domain or be subdomains of the same domain (ARGUS_COOKIE_DOMAIN).

When running on localhost for dev and test, ARGUS_COOKIE_DOMAIN may be empty.

Running tests

  • python manage.py test src

Mock data

Generating
PYTHONPATH=src python src/argus/incident/fixtures/generate_fixtures.py

This creates the file src/argus/incident/fixtures/incident/mock_data.json.

Loading
python manage.py loaddata incident/mock_data

Running in development

The fastest is to use virtualenv or virtaulenvwrapper or similar to create a safe place to stash all the dependencies.

  1. Create the virtualenv
  2. Fill the activated virtualenv with dependencies:
$ pip install -r requirements/prod.txt
$ pip install -r requirements/dev.txt

Copy the cmd.sh-template to a new name ending with ".sh", make it executable and set the environment variables within. This file must not be checked in to version control, since it contains passwords. You must set DATABASE_URL, DJANGO_SETTINGS_MODULE and SECRET_KEY. If you want to test the frontend you must also set all the DATAPORTEN-settings. Get the values from https://dashboard.dataporten.no/ or create a new application there.

For the database we recommend postgres as we use a postgres-specific feature in the Incident-model.

DJANGO_SETTINGS_MODULE can be set to "argus.site.settings.dev" but we recommend having a localsettings.py in the same directory as manage.py with any overrides. This file also does not belong in version control since it reflects a specific developer's preferences. Smart things first tested in a localsettings can be moved to the other settings-files later on. If you copy the entire logging-setup from "argus.site.settings.dev" to "localsettings.py" remember to set "disable_existing_loggers" to True or logentries will occur twice.

Debugging tips

To test/debug notifications as a whole, use the email subsystem (Media: Email in a NotificationProfile). Set EMAIL_HOST to "localhost", EMAIL_PORT to "1025", and run a dummy mailserver:

$ python3 -m smtpd -n -c DebuggingServer localhost:1025

Notifications sent will then be dumped to the console where the dummy server runs.

Endpoints

/admin/ to access the project's admin pages.

All endpoints require requests to contain a header with key Authorization and value Token {token}, where {token} is replaced by a registered auth token; these are generated per user by logging in through Feide, and can be found at /admin/authtoken/token/.

Auth endpoints
  • GET to /api/v1/auth/user/: returns the logged in user

  • GET to /api/v1/auth/users/<int:pk>/: returns a user by PK

  • POST to /oidc/api-token-auth/: returns an auth token for the posted user

    • Note that this token will expire after 14 days, and can be replaced by posting to the same endpoint.
    • Example request body: { username: <username>, password: <password> }
  • /oidc/login/dataporten_feide/: redirects to Feide login

  • /api/v1/auth/phone-number/:

    • GET: returns the phone numbers of the logged in user

      Example response body:
      [
        {
          "pk": 2,
          "user": 1,
          "phone_number": "+4767676767"
        },
        {
          "pk": 1,
          "user": 1,
          "phone_number": "+4790909090"
        }
      ]
    • POST: creates and returns the phone numbers of the logged in user

      Example request body:
      {
        "pk": 2,
        "phone_number": "+4767676767"
      }
  • /api/v1/auth/phone-number/<int:pk>/:

    • GET: returns the specific phone number of the logged in user

      Example response body:
      {
        "pk": 2,
        "user": 1,
        "phone_number": "+4767676767"
      }
    • PUT: updates and returns one of the logged in user's phone numbers by PK

      • Example request body: same as POST to /api/v1/auth/phone-number/
    • DELETE: deletes one of the logged in user's phone numbers by PK

    The phone number is validated with a python version of the Google library libphonenumber. It will check that the number is in a valid number series. Using a random number with enough digits that is not in a valid series will not work.

Incident endpoints
  • /api/v1/incidents/:

    • GET: returns all incidents - both open and historic

      Query parameters: All query parameters are optional. If a query parameter is not included or empty, for instance `acked=`, then the rows returned are not affected by that filter and shows rows of all kinds of that value, for instance both "acked" and "unacked" in the case of `acked=`.

      Filtering parameters:

      acked=true|false
      Fetch only acked (true) or unacked (false) incidents.
      open=true|false
      Fetch only open (true) or closed (false) incidents.
      stateful=true|false
      Fetch only stateful (true) or stateless (false) incidents.
      source__id__in=ID1[,ID2,..]
      Fetch only incidents with a source with numeric id ID1 or ID2 or..
      source__name__in=NAME1[,NAME2,..]
      Fetch only incidents with a source with name NAME1 or NAME2 or..
      source_incident_id=ID
      Fetch only incidents with source_incident_id set to ID.
      tags=key1=value1,key1=value2,key2=value
      Fetch only incidents with one or more of the tags. Tag-format is "key=value". If there are multiple tags with the same key, only one of the tags need match. If there are multiple keys, one of each key must match.

      So: /api/v1/incidents/?acked=false&open=true&stateful&true&source__id__in=1&tags=location=broomcloset,location=understairs,problem=onfire will fetch incidents that are all of "open", "unacked", "stateful", from source number 1, with "location" either "broomcloset" or "understairs", and that is on fire (problem=onfire).

      Paginating parameters:

      cursor=LONG RANDOM STRING|null
      Go to the page of that cursor. The cursor string for next and previous page is part of the response body./dd>
      page_size=INTEGER
      The number of rows to return. Default is 100.

      So: api/v1/incidents/?cursor=cD0yMDIwLTA5LTIzKzEzJTNBMDIlM0ExNi40NTU4MzIlMkIwMCUzQTAw&page_size=10 will go to the page indicated by "cD0yMDIwLTA5LTIzKzEzJTNBMDIlM0ExNi40NTU4MzIlMkIwMCUzQTAw" and show the next 10 rows from that point onward. Do not attempt to guess the cursor string. null means there is no more to fetch.

      Example response body:
      {
          "next": "http://localhost:8000/api/v1/incidents/?cursor=cD0yMDIwLTA5LTIzKzEzJTNBMDIlM0ExNi40NTU4MzIlMkIwMCUzQTAw&page_size=10",
          "previous": null,
          "results": [
              {
                  "pk": 10101,
                  "start_time": "2011-11-11T11:11:11+02:00",
                  "end_time": "2011-11-11T11:11:12+02:00",
                  "source": {
                      "pk": 11,
                      "name": "Uninett GW 3",
                      "type": {
                          "name": "nav"
                      },
                      "user": 12,
                      "base_url": "https://somenav.somewhere.com"
                  },
                  "source_incident_id": "12345",
                  "details_url": "https://uninett.no/api/alerts/12345/",
                  "description": "Netbox 11 <12345> down.",
                  "ticket_url": "https://tickettracker.com/tickets/987654/",
                  "tags": [
                      {
                          "added_by": 12,
                          "added_time": "2011-11-11T11:11:11.111111+02:00",
                          "tag": "object=Netbox 4"
                      },
                      {
                          "added_by": 12,
                          "added_time": "2011-11-11T11:11:11.111111+02:00",
                          "tag": "problem_type=boxDown"
                      },
                      {
                          "added_by": 200,
                          "added_time": "2020-08-10T11:26:14.550951+02:00",
                          "tag": "color=red"
                      }
                  ],
                  "stateful": true,
                  "open": false,
                  "acked": false
              }
          ]
      }

      Pagination-support:

      `next`
      The link to the next page, according to the cursor, or `null` if on the last page.
      `previous`
      The link to the previous page, according to the cursor, or `null` if on the first page.
      `results`
      An array of the resulting subset of rows, or an empty array if no results.

      Refer to this section for an explanation of the other fields.

    • POST: creates and returns an incident

      Example request body:
      {
          "source": 11,
          "start_time": "2011-11-11 11:11:11.11111",
          "end_time": null,
          "source_incident_id": "12345",
          "details_url": "https://uninett.no/api/alerts/12345/",
          "description": "Netbox 11 <12345> down.",
          "ticket_url": "https://tickettracker.com/tickets/987654/",
          "tags": [
              {"tag": "object=Netbox 4"},
              {"tag": "problem_type=boxDown"}
          ]
      }

      Refer to this section for an explanation of the fields.

  • /api/v1/incidents/<int:pk>/:

    • GET: returns an incident by PK

    • PATCH: modifies parts of an incident and returns it

      Example request body:
      {
          "ticket_url": "https://tickettracker.com/tickets/987654/",
          "tags": [
              {"tag": "object=Netbox 4"},
              {"tag": "problem_type=boxDown"}
          ]
      }

      The fields allowed to be modified are:

      • details_url
      • ticket_url
      • tags
  • /api/v1/incidents/<int:pk>/ticket_url/:

    • PUT: modifies just the ticket url of an incident and returns it

      Example request body:
      {
          "ticket_url": "https://tickettracker.com/tickets/987654/",
      }

      Only ticket_url may be modified.

  • /api/v1/incidents/<int:pk>/events/:

    • GET: returns all events related to the specified incident

      Example response body:
      [
          {
              "pk": 1,
              "incident": 10101,
              "actor": {
                  "pk": 12,
                  "username": "nav.oslo.uninett.no"
              },
              "timestamp": "2011-11-11T11:11:11+02:00",
              "received": "2011-11-11T11:12:11+02:00",
              "type": {
                  "value": "STA",
                  "display": "Incident start"
              },
              "description": ""
          },
          {
              "pk": 20,
              "incident": 10101,
              "actor": {
                  "pk": 12,
                  "username": "nav.oslo.uninett.no"
              },
              "timestamp": "2011-11-11T11:11:12+02:00",
              "received": "2011-11-11T11:11:13+02:00",
              "type": {
                  "value": "END",
                  "display": "Incident end"
              },
              "description": ""
          }
      ]
      
      Note that `received` is set by argus on reception of an event. Normally,
      this should be the same as, or a little later, than `timestamp`. If there
      is a large gap (in minutes), or `received` is earlier `timestamp`, it
      is likely something wrong with the internal clock either on the argus
      server or the event source.
      
    • POST: creates and returns an event related to the specified incident

      Example request body:
      {
          "timestamp": "2020-02-20 20:02:20.202021",
          "type": "OTH",
          "description": "The investigation is still ongoing."
      }

      If posted by an end user (a user with no associated source system), the timestamp field is optional, and will be set to the time the server received it if omitted.

      The valid types are:

      • STA - Incident start
        • An incident automatically creates an event of this type when the incident is created, but cannot have more than one. In other words, it's never allowed to post an event of this type.
      • END - Incident end
        • Only source systems can post an event of this type, which is the standard way of closing an indicent. An incident cannot have more than one event of this type.
      • CLO - Close
        • Only end users can post an event of this type, which manually closes the incident.
      • REO - Reopen
        • Only end users can post an event of this type, which reopens the incident if it's been closed (either manually or by a source system).
      • ACK - Acknowledge
        • Use the /api/v1/incidents/<int:pk>/acks/ endpoint.
      • OTH - Other
        • Any other type of event, which simply provides information on something that happened related to an incident, without changing its state in any way.
  • GET to /api/v1/incidents/<int:pk>/events/<int:pk>/: returns a specific event related to the specified incident

  • /api/v1/incidents/<int:pk>/acks/:

    • GET: returns all acknowledgements of the specified incident

      Example response body:
      [
          {
              "pk": 2,
              "event": {
                  "pk": 2,
                  "incident": 10101,
                  "actor": {
                      "pk": 140,
                      "username": "jp@example.org"
                  },
                  "timestamp": "2011-11-11T11:11:11.235877+02:00",
                  received": "2011-11-11T11:11:11.235897+02:00",
                  "type": {
                      "value": "ACK",
                      "display": "Acknowledge"
                  },
                  "description": "The incident is being investigated."
              },
              "expiration": "2011-11-13T12:00:00+02:00"
          },
          {
              "pk": 20,
              "event": {
                  "pk": 20,
                  "incident": 10101,
                  "actor": {
                      "pk": 130,
                      "username": "ferrari.testarossa@example.com"
                  },
                  "timestamp": "2011-11-12T11:11:11+02:00",
                  "received": "2011-11-12T11:11:11+02:00",
                  "type": {
                      "value": "ACK",
                      "display": "Acknowledge"
                  },
                  "description": "The situation is under control!"
              },
              "expiration": null
          }
      ]
    • POST: creates and returns an acknowledgement of the specified incident

      Example request body:
      {
          "event": {
              "timestamp": "2011-11-11 11:11:11.235877",
              "description": "The incident is being investigated."
          },
          "expiration": "2011-11-13 12:00:00"
      }

      Only end users can post acknowledgements.

      The timestamp field is optional, and will be set to the time the server received it if omitted.

  • GET to /api/v1/incidents/<int:pk>/acks/<int:pk>/: returns a specific acknowledgement of the specified incident

  • GET to /api/v1/incidents/mine/: behaves like /api/v1/incidents/ except only showing the incidents added by the logged-in user, and no filtering on source or source type is possible.

  • GET to /api/v1/incidents/open/: returns all open incidents

  • GET to /api/v1/incidents/open+unacked/: returns all open incidents that have not been acked

  • GET to /api/v1/incidents/metadata/: returns relevant metadata for all incidents

Notification profile endpoints
  • /api/v1/notificationprofiles/:

    • GET: returns the logged in user's notification profiles

    • POST: creates and returns a notification profile which is then connected to the logged in user

      Example request body:
      {
          "timeslot": 1,
          "filters": [
              1,
              2
          ],
          "media": [
              "EM",
              "SM"
          ],
          "phone_number": 1,
          "active": true
      }

      The phone number field is optional and may also be null.

  • /api/v1/notificationprofiles/<int:pk>/:

    • GET: returns one of the logged in user's notification profiles by PK
    • PUT: updates and returns one of the logged in user's notification profiles by PK
      • Note that if timeslot is changed, the notification profile's PK will also change. This consequently means that the URL containing the previous PK will return a 404 Not Found status code.
      • Example request body: same as POST to /api/v1/notificationprofiles/
    • DELETE: deletes one of the logged in user's notification profiles by PK
  • GET to /api/v1/notificationprofiles/<int:pk>/incidents/: returns all incidents - both open and historic - filtered by one of the logged in user's notification profiles by PK

  • /api/v1/notificationprofiles/timeslots/:

    • GET: returns the logged in user's time slots

    • POST: creates and returns a time slot which is then connected to the logged in user

      Example request body:
      {
          "name": "Weekdays",
          "time_recurrences": [
              {
                  "days": [1, 2, 3, 4, 5],
                  "start": "08:00:00",
                  "end": "12:00:00"
              },
              {
                  "days": [1, 2, 3, 4, 5],
                  "start": "12:30:00",
                  "end": "16:00:00"
              }
          ]
      }

      The optional key "all_day" indicates that Argus should use Time.min and Time.max as "start" and "end" respectively. This also overrides any provided values for "start" and "end". An example request body:

      {
          "name": "All the time",
          "time_recurrences": [
              {
                  "days": [1, 2, 3, 4, 5, 6, 7],
                  "all_day": true
              }
          ]
      }

      which would yield the response:

      {
          "pk": 2,
          "name": "All the time",
          "time_recurrences": [
              {
                  "days": [1, 2, 3, 4, 5, 6, 7],
                  "start": "00:00:00",
                  "end": "23:59:59.999999",
                  "all_day": true
              }
          ]
      }
  • /api/v1/notificationprofiles/timeslots/<int:pk>/:

    • GET: returns one of the logged in user's time slots by PK
    • PUT: updates and returns one of the logged in user's time slots by PK
      • Example request body: same as POST to /notificationprofiles/timeslots/
    • DELETE: deletes one of the logged in user's time slots by PK
  • /api/v1/notificationprofiles/filters/:

    • GET: returns the logged in user's filters

    • POST: creates and returns a filter which is then connected to the logged in user

      Example request body:
      {
          "name": "Critical incidents",
          "filter_string": "{\"sourceSystemIds\": [<SourceSystem.pk>, ...], \"tags\": [\"key1=value1\", ...]}"
      }
  • /api/v1/notificationprofiles/filters/<int:pk>/:

    • GET: returns one of the logged in user's filters by PK
    • PUT: updates and returns one of the logged in user's filters by PK
      • Example request body: same as POST to /api/v1/notificationprofiles/filters/
    • DELETE: deletes one of the logged in user's filters by PK
  • POST to /api/v1/notificationprofiles/filterpreview/: returns all incidents - both open and historic - filtered by the values in the body

    Example request body:
    {
        "sourceSystemIds": [<SourceSystem.pk>, ...]
    }

Models

Explanation of terms

  • incident: an unplanned interruption in the source system.
  • event: something that happened related to an incident.
  • acknowledgement: an acknowledgement of an incident by a user, which hides the incident from the other open incidents.
    • If expiration is an instance of datetime, the incident will be shown again after the expiration time.
    • If expiration is null, the acknowledgement will never expire.
    • An incident is considered "acked" if it has one or more acknowledgements that have not expired.
  • start_time: the time the incident was created.
  • end_time: the time the incident was resolved or closed.
    • If null: the incident is stateless.
    • If "infinity": the incident is stateful, but has not yet been resolved or closed - i.e. open.
    • If an instance of datetime: the incident is stateful, and was resolved or closed at the given time; if it's in the future, the incident is also considered open.
  • source: the source system that the incident originated in.
  • object: the most specific object that the incident is about.
  • parent_object: an object that the object is possibly a part of.
  • problem_type: the type of problem that the incident is about.
  • tag: a key-value pair separated by an equality sign (=), in the shape of a string.
    • The key can consist of lowercase letters, numbers and underscores.
    • The value can consist of any length of any characters.

ER diagram

ER diagram

Notifications and notification plugins

A notification plugin is a class that inherits from argus.notificationprofile.media.base.NotificationMedium. It has a send(incident, user, **kwargs) static method that does the actual sending.

The included argus.notificationprofile.media.email.EmailNotification needs only incident and user, while an SMS medium in addition needs a phone_number. A phone_number is a string that includes the international calling code, see for instance Wikipedia: List of mobile telephone prefixes by country.