mozilla-releng/balrog

balrog change notifier no longer working

We discovered today that we never configured e-mail sending when we migrated to GCP. I want to see if these notifications are still necessary before we spend any effort fixing them.

It's been 2.5 years, presumably we should just drop this code by now...

> It's been 2.5 years, presumably we should just drop this code by now...

For some additional context, these were added in response to an RRA ages ago. IMO the notifications were always too spammy to be useful, and no person or machine ever monitored the list we sent them to.

> It's been 2.5 years, presumably we should just drop this code by now...
>
> For some additional context, these were added in response to an RRA ages ago. IMO the notifications were always too spammy to be useful, and no person or machine ever monitored the list we sent them to.

@moz-hwine do you have an opinion on this?

hwine commented

@jcristau We have 2 separate issues here:

  1. The email process broke.
  2. We're no longer in compliance with an RRA-recommended action.

email: by itself, this is not a security concern.

RRA: @bhearsum do you happen to remember which RRA might have had this recommendation? Absent a reference, my guess is that this was, in part, to provide an audit log for incident response. (We do have a recommendation to manually audit access to several related systems, but I don't know if the email group addressed that concern.)

Recommendation:

  • Get in touch with Risk Assessment about what to do with the old RRA recommendation.
  • At the least, Risk Assessment will want to update the old recommendation.
  • There may be newer tools to address this issue.

@mper0 - anything to add?

Keeping an audit trail of changes to rules and permissions is good practice, whatever tech is used. On the alerting side, given that there seemed to be many changes a day (around 10-15), I'm wondering if it would be useful/feasible to identify a set of more critical permissions/rules that should trigger an alert.

I have some context and historical notes to add on a few things...I'll try to roll them up in this comment.

> @jcristau We have 2 separate issues here:
>
> RRA: @bhearsum do you happen to remember which RRA might have had this recommendation? Absent a reference, my guess is that this was, in part, to provide an audit log for incident response. (We do have a recommendation to manually audit access to several related systems, but I don't know if the email group addressed that concern.)

@jcristau found this - it's https://docs.google.com/spreadsheets/d/1Rya5ObK51nAYGd_kwFoiH2zh_KD6edKF9w4S_bdzQbg/edit#gid=0

I want to emphasize that, frankly, even when we had these notifications they were not useful. They were too noisy for anyone to make any sense of (and we made them as noisy as the RRA required).

I also want to note that this RRA happened prior to having multiple signoffs here. Now that we have that, there is no known way for anyone to modify a crucial Balrog rule or release through the API without another human looking at it first -- so in some ways, we already have this notification system, it just happens prior to the changes going live. (It's still theoretically possible for IT/Ops to mess with live rules without oversight, but those same people would also be able to kill any notification system we have, so there's not much that can be done about that.)

> Keeping an audit trail of changes to rules and permissions is good practice, whatever tech is used. On the alerting side, given that there seemed to be many changes a day (around 10-15), I'm wondering if it would be useful/feasible to identify a set of more critical permissions/rules that should trigger an alert.

Balrog already has a full audit log in its history tables (going back to its earliest existence). I agree that if we are to have alerts, the crux of the problem is keeping the signal-to-noise ratio useful. Even if we only had one "noise" email per week, it's easy to start ignoring them when the "signal" is literally zero (we have never had a change to these tables that would be considered "signal" in this context).
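
To make that concrete, here is a minimal sketch of what reviewing that audit log could look like. The table and column names (`rules_history`, `change_id`, `changed_by`, `timestamp`), the millisecond timestamps, and the connection string are assumptions for illustration, not verified details of the production schema:

```python
# Minimal sketch: pull the last 24 hours of entries out of a
# Balrog-style history table. Table/column names and the DSN are
# assumptions, not verified schema details.
import time

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://balrog_ro@db.example/balrog")  # placeholder DSN

# Assuming Balrog-style millisecond timestamps.
since = int((time.time() - 86400) * 1000)

with engine.connect() as conn:
    rows = conn.execute(
        text(
            "SELECT change_id, changed_by, timestamp, rule_id "
            "FROM rules_history WHERE timestamp > :since ORDER BY timestamp"
        ),
        {"since": since},
    )
    for row in rows:
        print(row.change_id, row.changed_by, row.rule_id)
```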

We've brainstormed with @jcristau about what could possibly go wrong:

The worst-case scenario is that an attacker blocks an update, which could be bad in the case of a critical security update. They would generally need to gain elevated privileges, though:

  • release: peer review is required for changes, and QA manually verifies once per release that the update works. The window to modify rules without being noticed would be 2 weeks (which is a bit long). However, it would require somebody with maximum privileges modifying the rules directly on the live server, so the bar is high.
  • nightly/beta: no peer review is required. For nightly, the rollout happens every 12 hours, so a missing update would be noticed quickly. Beta ships every 2 days.

This leaves us with two questions:

  1. Should we detect suspicious live changes for release when we have no other trace (like a peer review)?
    I would ideally say yes, but I'm not sure we can come up with a likely enough scenario that would justify the cost of the added friction here. And we do not want to reproduce a notification solution that no one checks because it is not useful enough. We could think of (see the sketches after this list):
  • automating the check that QA currently does manually so it can be performed more regularly (I'm not sure how technically feasible it is).
  • Julien suggested diff'ing the changes of the history tables if we have some sensitive patterns we could detect - to be identified.
    @hwine what would be your take on this?
  2. How are the history tables currently backed up? We should have audit logs even if an attacker is able to delete the history on Balrog. @jbuck could you tell us how this is configured right now and how long we keep database dumps?
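
On the first bullet (automating the QA check), a rough sketch of the shape it could take: poll Balrog's public update endpoint and alert when no update is offered on a channel where one is expected. The exact URL path, version, and buildID below are illustrative placeholders, not a verified query:

```python
# Rough sketch: ask Balrog's public endpoint whether an update is offered
# for a given (old) build, and alert when nothing comes back. The URL
# structure and version/buildID values are illustrative assumptions.
import requests
import xml.etree.ElementTree as ET

URL = (
    "https://aus5.mozilla.org/update/3/Firefox/85.0/20210118153634/"
    "Linux_x86_64-gcc3/en-US/release/default/default/default/update.xml"
)

resp = requests.get(URL, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.text)  # response root is an <updates> element
update = root.find("update")
if update is None:
    # An empty <updates/> means Balrog offered nothing; for an old release
    # build, that is the situation we'd want to alert on.
    raise SystemExit("ALERT: no update offered on the release channel")
print("OK: update offered ->", update.get("appVersion"))
```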
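
And on Julien's diff'ing suggestion, a sketch of what it could look like: walk consecutive revisions of each rule in the history table and flag changes to fields deemed sensitive. The column names and the choice of sensitive fields are assumptions for illustration:

```python
# Sketch of diff'ing the history tables: compare consecutive revisions of
# each rule and flag changes to "sensitive" fields. Column names and the
# sensitive set are illustrative assumptions about the schema.
from sqlalchemy import create_engine, text

SENSITIVE = {"mapping", "backgroundRate", "channel"}  # hypothetical choice

engine = create_engine("mysql+pymysql://balrog_ro@db.example/balrog")  # placeholder DSN

with engine.connect() as conn:
    rows = conn.execute(
        text("SELECT * FROM rules_history ORDER BY rule_id, change_id")
    ).mappings().all()

last_seen = {}  # rule_id -> most recent revision seen so far
for row in rows:
    prev = last_seen.get(row["rule_id"])
    if prev is not None:
        changed = {f for f in SENSITIVE if prev.get(f) != row.get(f)}
        if changed:
            print(
                f"ALERT: rule {row['rule_id']} changed {sorted(changed)} "
                f"by {row['changed_by']} at {row['timestamp']}"
            )
    last_seen[row["rule_id"]] = row
```
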
jbuck commented

> How are the history tables currently backed up? We should have audit logs even if an attacker is able to delete the history on Balrog. @jbuck could you tell us how this is configured right now and how long we keep database dumps?

The MySQL database has 7 days of point-in-time recovery and 30 days of daily backups. This is done by CloudSQL's managed service. The audit logs would record any changes made to the Balrog CloudSQL instance, but I don't think they would record things like the SQL statements used to delete the history table, for example.
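
For completeness, a small sketch of checking that configuration via the Cloud SQL Admin API. The project and instance names are placeholders, and this assumes application default credentials are available:

```python
# Sketch: verify the backup window via the Cloud SQL Admin API.
# Project/instance names are placeholders; assumes application
# default credentials.
from googleapiclient import discovery

PROJECT = "moz-fx-balrog-example"  # placeholder
INSTANCE = "balrog-db-example"     # placeholder

sqladmin = discovery.build("sqladmin", "v1beta4")

instance = sqladmin.instances().get(project=PROJECT, instance=INSTANCE).execute()
cfg = instance["settings"]["backupConfiguration"]
print("daily backups enabled:", cfg.get("enabled"))
# For MySQL, point-in-time recovery is driven by binary logging;
# pointInTimeRecoveryEnabled is the Postgres/SQL Server equivalent.
print("binary logging (PITR):", cfg.get("binaryLogEnabled"))

# List the retained automated backup runs.
runs = sqladmin.backupRuns().list(project=PROJECT, instance=INSTANCE).execute()
for run in runs.get("items", []):
    print(run.get("id"), run.get("status"), run.get("windowStartTime"))
```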