OperationCode/operationcode_backend

Make Sentry monitoring better

apex-omontgomery opened this issue · 10 comments

Feature

Why is this feature being added?

Sentry adds a lot of value to our monitoring, and we want to increase that value going forward.

What should your feature do?

  1. Configure Sentry to report the correct build environment (staging, production); see the sketch after this list.
  2. Tie together the build IDs from Travis, GitHub, and Sentry so we can relate build issues to Sentry issues.
  3. Send repeated messages to #chatops without spamming the channel.
  4. Identify routes that need elevated or de-escalated monitoring (some routes should report every error; others should only alert after a higher error count).
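
Item 1 could look something like the sketch below, assuming the servers expose an environment name in an env var (sentry-raven honors SENTRY_CURRENT_ENV; the rest is a sketch, not our current config):

```ruby
# config/initializers/raven.rb -- a sketch, not the current config.
Raven.configure do |config|
  config.dsn = "https://#{OperationCode.fetch_secret_with(name: 'sentry_credentials')}@sentry.io/147247"
  # Tag every event with the environment it was sent from.
  config.current_environment = ENV.fetch('SENTRY_CURRENT_ENV', Rails.env)
  # Only ship events from these environments; development/test stay local.
  config.environments = %w[staging production]
end
```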

That's not really how Raven works, though. It essentially alerts us when users encounter an error in their browser. The input errors aren't browser errors.

Some of the errors above are due to unknown coding/infra issues that were only surfaced when a user took some action. From looking at the docs, you can modify the Raven config in all sorts of ways:

https://docs.sentry.io/clients/ruby/config/

Currently we only have the default config:

```ruby
Raven.configure { |config| config.dsn = "https://#{OperationCode.fetch_secret_with(name: 'sentry_credentials')}@sentry.io/147247" }
```

There are some exceptions we don't currently log, but one helpful example would be capturing messages on specific routes we're concerned with and piping them to Slack: https://docs.sentry.io/clients/ruby/usage/#reporting-messages
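
A minimal sketch of what that could look like, using `Raven.capture_message` from the linked docs; the action, model, and extra keys here are hypothetical:

```ruby
# Hypothetical controller action on a route we care about (e.g. user signups).
def create
  user = User.new(user_params)
  if user.save
    render json: user, status: :created
  else
    # Report a handled (non-exception) event so it reaches Sentry and,
    # via the Slack integration, #chatops.
    Raven.capture_message(
      'User signup failed validation',
      level: 'warning',
      extra: { errors: user.errors.full_messages }
    )
    render json: user.errors, status: :unprocessable_entity
  end
end
```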

Right. I'm saying that Raven is client-side only. There's no server-side implementation of Sentry that I'm aware of. #chatops is only Raven dumps.

After chatting a bit more, this does seem like an issue. Clearly some services are reporting via Raven. Perhaps it's an aspect of an e2e implementation with Raven.

That aside, most of Raven's errors in #chatops seem very cryptic and unhelpful.

Yeah, I think we could really benefit from identifying the routes that matter most, including context and breadcrumbs, and ensuring those items get properly logged.
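
For breadcrumbs, here's a minimal sketch with the Ruby Raven client (the category and message are hypothetical); a breadcrumb is buffered and attached to the next event that gets captured, giving the error a trail of what happened beforehand:

```ruby
# Leave a trail before risky operations; attached to the next captured event.
Raven.breadcrumbs.record do |crumb|
  crumb.category = 'auth'                      # hypothetical category
  crumb.message  = 'Password reset requested'  # hypothetical message
  crumb.level    = 'info'
end
```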

This has been corrected.
Will close the ticket once we document the fix and corrective actions somewhere, along with what we expect Sentry to do.

The cause of this was twofold:

  • The Slack app was not enabled.
  • Event filtering was too aggressive and only surfaced events the first time they happened.

@nellshamrell and leads, please add your expectations for logging in #chatops to this thread.

After playing with Sentry, here's what I think.

  1. Connect Sentry and GitHub so that errors have the relevant commits tied to them.
  2. Tie together the build IDs from Sentry, Travis, GitHub, and the server so we have the correct release ID.
  3. Include error context (user signups can include the user's email; the environment can be set to production, staging, or development; arbitrary context can also be set to aid in debugging). See the sketch after this list.
  4. Send repeated messages to #chatops without spamming the channel. Perhaps reset the deduplication on each new build or release?
  5. If Sentry prompts a user with a feedback form, we should know about that fairly quickly.
  6. Specific routes should get elevated monitoring (user signups, password resets, mentor requests?), especially routes where a failure means the user couldn't notify us of the issue.
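
Items 2 and 3 could look roughly like this sketch, assuming the deploy exposes the commit SHA in an env var (SOURCE_VERSION is a hypothetical name) and that a `current_user` helper exists:

```ruby
# config/initializers/raven.rb -- sketch for items 2 and 3.
Raven.configure do |config|
  config.dsn = "https://#{OperationCode.fetch_secret_with(name: 'sentry_credentials')}@sentry.io/147247"
  # Item 2: report the deployed commit SHA as the release, so Sentry,
  # Travis, and GitHub all agree on which build an error came from.
  config.release = ENV['SOURCE_VERSION'] # hypothetical variable name
end

# Item 3: attach per-request context, e.g. in a controller before_action.
Raven.user_context(email: current_user&.email)   # who hit the error
Raven.tags_context(feature: 'signup')            # searchable tag (hypothetical)
Raven.extra_context(signup_source: 'web')        # arbitrary extra context (hypothetical)
```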

I think of Sentry as monitoring application behavior, so we shouldn't be using it to monitor CI/CD.

Okay, so Sentry is actually working well. Now that we have staging, I think we need to prioritize targeting the correct environment. Currently api.staging.... is showing up in Sentry as production.
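
A hedged guess at the cause: the staging box probably runs with RAILS_ENV=production, and sentry-raven falls back to RAILS_ENV when nothing else is set, so both servers report as production. The gem honors a SENTRY_CURRENT_ENV variable, so exporting that per server (or setting the environment explicitly, as below) should separate them:

```ruby
# Sketch: make the environment explicit per server rather than letting
# Raven infer it from RAILS_ENV (which is likely "production" on staging too).
# SENTRY_CURRENT_ENV should be exported as "staging" on the api.staging server.
Raven.configure do |config|
  config.current_environment = ENV.fetch('SENTRY_CURRENT_ENV', Rails.env)
end
```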