artsy/README

[RFC] Distinct Auctions On-Call

dleve123 opened this issue · 20 comments

This document is the collective effort of the engineers on the Auctions Product Team.
We are hoping to resolve this RFC by February 3, 2020.

Problem

The auctions business team feels unfairly supported by the engineering team's current policy, which offers no guarantee of support coverage for critical auctions that happen after hours or over the weekend. Historically and currently, the engineers on the auctions product team do provide support for critical auctions; however, this arrangement is informal and hence a bit tense.

Members of Auctions Business and Product leadership have been discussing this problem for a few months. We discovered that over 2019, there were 36 sales that were deemed critical and after hours with the following distribution across months (January 2019 == 1; December 2019 == 12). There were 169 total sales during 2019 that occurred after hours.

[screenshot: distribution of 2019 critical after-hours sales by month]

During conversations at the Engineering Code Huddle, attendees outside of the Auctions Product team suggested that they wouldn't feel equipped to provide the timely support needed to effectively mitigate issues occurring during an auction.

Proposal

Given these problems, this RFC proposes that Artsy:

  1. Create a distinct auctions on-call rotation staffed by engineers on the Auctions team
  2. Remove engineers from the Auctions team from the larger on-call rotation
  3. Continue operating Artsy's general engineering on-call for non-critical auctions

Distinct On-Call Rotation

Shifts on this rotation would only exist during the span of a single critical auction, which differs from the larger engineering on-call rotation, whose shifts span fixed periods of time.
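
As a purely illustrative sketch: if the auctions rotation lived in OpsGenie as its own schedule, a per-auction shift could be modeled as a time-bounded schedule override rather than a recurring weekly rotation. The "Auctions" schedule name, the engineer's username, and the API key below are assumptions, and exact field names and date formats should be verified against OpsGenie's schedule-override API docs.

```python
import requests

OPSGENIE_API_KEY = "..."  # hypothetical API key with schedule-edit rights
HEADERS = {"Authorization": f"GenieKey {OPSGENIE_API_KEY}"}

# Cover a single critical auction (e.g. 9pm-1am EST) with a one-off override
# on a hypothetical "Auctions" schedule, instead of a fixed weekly shift.
resp = requests.post(
    "https://api.opsgenie.com/v2/schedules/Auctions/overrides",
    params={"scheduleIdentifierType": "name"},
    headers=HEADERS,
    json={
        "user": {"type": "user", "username": "auctions-engineer@artsy.net"},
        "startDate": "2020-02-20T02:00:00Z",  # auction start (UTC)
        "endDate": "2020-02-20T06:00:00Z",    # auction end plus a buffer (UTC)
    },
    timeout=10,
)
resp.raise_for_status()
```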

Considering these shifts are only for critical auctions that are after hours, there is an expectation that the on-call engineer take a fixed amount of time off after each shift.

This RFC proposes that the decision on the amount of time off not be made in this RFC itself, but amongst the product, design, and engineering stakeholders within the auctions team, and that it remain open to change given evolving product development needs.

Considering the time-sensitive nature of critical auctions, and to optimize for an information-push (not information-poll) system, the support expectations for auctions on-call are to:

  • Be available by cell phone when on-call ✅
  • Not be awake and checking Slack/metrics during your shift ❌

Removal From Larger On-Call Rotation

We want to ensure that any changes made with respect to auction on-call are seen as "different but fair", aligned with the inherently different support needs of an auction. To that effect, this RFC proposes that we remove the members of the Auctions team from the larger rotation.

There will be 4 engineers on the auctions team as of January 24, 2020, and 29 active engineers within OpsGenie 🔒. Removing the auctions engineers shrinks the rotation by roughly 13.8% (4 / 29), which works out to about 16% more on-call time for each remaining engineer (29 / 25). Artsy engineering is planning on expanding and, as such, this increase will shrink as we add new engineers.
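
A quick sketch of the arithmetic behind those two figures, using the head counts quoted above:

```python
total_engineers = 29    # active engineers in OpsGenie as of January 24, 2020
auctions_engineers = 4  # auctions team headcount

# Share of the current rotation represented by the auctions team.
share_removed = auctions_engineers / total_engineers  # ~0.138

# Extra on-call load per remaining engineer once the auctions team leaves.
per_engineer_increase = (
    total_engineers / (total_engineers - auctions_engineers) - 1
)  # ~0.16

print(f"{share_removed:.1%} of the rotation removed")
print(f"{per_engineer_increase:.1%} more on-call time per remaining engineer")
```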

General On-Call During Non-Critical Auction

During non-critical auctions that occur after hours, Artsy's existing engineering on-call process applies. Per that process, there is an expectation of only limited support outside of 9am-5pm EST Monday through Friday, and on weekends.

Exceptions / Open Items / Things of Note:

  • Timing: Determination of "critical" will be driven by the Auctions Business team in partnership with the Auctions Product team. It will be vital that this determination occur with sufficient time to plan around the Auction team's schedule.
  • Auctions Load: This plan might need to change if the number of critical auctions after hours becomes untenable for a team of 4. As of January 24, 2020, the load will be 9 auctions/engineer over the course of a year.
  • Standard Incident Operations: Incidents handled by the proposed auctions on-call rotation will be managed and operationalized wherever possible with existing tools and technologies: OpsGenie, the #incidents Slack channel, and our incident response process. (A rough sketch of raising such an incident in OpsGenie follows this list.)
  • Potential For Support Spikes: Considering the seasonal nature of auctions, we expect spikes of critical auctions: many critical auctions occurring within short periods of time. This could increase the risk of burnout for those on auctions on-call. We'll mitigate this with careful scheduling, by ensuring all critical auctions really are critical, and potentially by changing our plan.
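
To make the Standard Incident Operations bullet concrete, here is a minimal sketch of raising a critical-auction incident via OpsGenie's Alerts API. The "Auctions" responder team, the P1 priority, the message text, and the API key are assumptions for illustration; in practice this would plug into our existing incident response process.

```python
import requests

OPSGENIE_API_KEY = "..."  # hypothetical integration API key

def raise_critical_auction_alert(summary: str, details: str) -> str:
    """Open a P1 OpsGenie alert routed to a hypothetical Auctions team."""
    resp = requests.post(
        "https://api.opsgenie.com/v2/alerts",
        headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
        json={
            "message": summary,
            "description": details,
            "responders": [{"type": "team", "name": "Auctions"}],  # assumed team name
            "priority": "P1",
            "tags": ["auctions", "critical-auction"],
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["requestId"]

# Example: a lot can't be advanced during a live sale.
raise_critical_auction_alert(
    "Prediction erroring during critical auction",
    "Clerks report lot 42 cannot be advanced; see #incidents for details.",
)
```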

Additional Context

What comes next?

  • Assuming positive resolution
    • Determine "start date" within auctions team (likely sometime in February)
    • Determine hours off after shift expectation within the Product Auctions team
    • Determine process for managing critical auctions schedule cross-functionally
    • On / right before start date
      • Create a new Auctions team in OpsGenie (a rough API sketch follows this list)
      • Remove active auction team members from the Engineering team
    • Retrospect on this process ~6 months after its introduction
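
For the OpsGenie team step above, a minimal sketch against OpsGenie's Teams API; the team name, member usernames, and API key are placeholders, and removing auctions engineers from the existing Engineering team's rotation would be separate calls against that team:

```python
import requests

OPSGENIE_API_KEY = "..."  # hypothetical admin API key
HEADERS = {"Authorization": f"GenieKey {OPSGENIE_API_KEY}"}

# Create a dedicated Auctions team to own the critical-auction rotation.
resp = requests.post(
    "https://api.opsgenie.com/v2/teams",
    headers=HEADERS,
    json={
        "name": "Auctions",
        "description": "On-call for critical after-hours auctions",
        "members": [
            {"user": {"username": "auctions-engineer-1@artsy.net"}, "role": "user"},
            {"user": {"username": "auctions-engineer-2@artsy.net"}, "role": "user"},
        ],
    },
    timeout=10,
)
resp.raise_for_status()
```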

I'm cautiously in favor of this.

It sounds as though it's asking a lot either to offer no official support (as at present) for critical after-hours auctions, or to require that members of the wider org be proficient enough to reliably offer auction support.

It also sounds like this might currently be handled in practice by auction devs supporting these critical events in addition to their normal on-call shifts, or maybe by attempting to set the on-call rotations such that an auction dev is scheduled during a shift that overlaps with critical events.

Both of those seem less than ideal, and I think the proposed solution covers this.

I am generally concerned about people being siloed (in combination with a focus on more stable teams), and this is an example of that. But at the same time, that's deserving of its own conversation, and in this case siloing seems like it might help with some of the issues raised, so I don't think we should be afraid to do it.

This makes sense. I have 2 related concerns with this but don't have specific solutions for them:

  • This will make the auctions team even more specialized, in the sense that it will make it harder for auctions team members to switch teams, and possibly for others to join the team.
  • In the same vein, one of the benefits of the on-call process to me was that it created an opportunity for everyone to get out of their comfort zone and get familiar with systems they don't work with on a daily basis. This will make the auctions team less involved in anything but auctions.

Maybe the solution for this is to do more knowledge-sharing or more rotations in and out of the auctions team, but it's something we need to be aware of.

Agreed with everything in the RFC, and the responses above. I'm on-board with this proposal, but it will definitely silo the Auctions team from the rest of Artsy Engineering. I think it might be worth it. I mean, it's all trade-offs, right?

Auctions are a weird case because they demand such a rapid, real-time response. We could train all our engineers to provide the real-time support that the Auctions business team needs during these auctions, but at a very high cost. On the other hand, the RFC as proposed still leaves non-critical auctions covered only by the normal on-call rotation, which (as pointed out in the RFC) engineers don't currently feel equipped to do.

I'm fine going forward with this, but I agree with Ashkan that we need to be super proactive about knowledge-sharing, in both directions.

A suggestion I have here is that if we move forward with this RFC (and I'm plus one on that count), then I think we should build a retro into the process of adoption. I'd want us to reflect as an engineering org whether the tradeoffs have been worth it maybe six months from now.

Re: #286 (comment)

This will make the auctions team even more specialized, in the sense that it will make it harder for auctions team members to switch teams, and possibly for others to join the team.

I don't necessarily think this has to be the case. One mitigation strategy here could be to pair on someone's first few on-call shifts when joining the team.

In the same vein, one of the benefits of the on-call process to me was that it created an opportunity for everyone to get out of their comfort zone and get familiar with systems they don't work with on a daily basis. This will make the auctions team less involved in anything but auctions.

Undoubtedly, removing the auctions team from the larger on-call rotation would remove some exposure to other systems. With that said, in my experience, incident response is not the best way to provide this exposure: urgently triaging/fixing an issue in a lesser-known system is not the most conducive to learning, nor is that person best equipped to provide the response needed.

So while I'm all for knowledge sharing, I think we should validate that the engineering team at large thinks incident response is an effective way to share knowledge.

cc @ashkan18 @ashfurrow @mzikherman


Re: #286 (comment)

A suggestion I have here is that if we move forward with this RFC (and I'm plus one on that count), then I think we should build a retro into the process of adoption. I'd want us to reflect as an engineering org whether the tradeoffs have been worth it maybe six months from now

Absolutely. Added a line item to the "What comes next" section.

cc @jonallured

My initial bias was strongly against this, but after talking with @dleve123 and weighing the pros and cons, I'm in favor of creating more specialization for auctions engineers with this dedicated rotation. As mentioned by other folks above, sharing operational responsibilities can be a forcing function to learn more about our various systems, but it's not the only way. I think tech reviews, knowledge shares, and pairing can be more effective, given that they're done in much less stressful / reactive conditions.

What are examples of the kinds of "incidents" handled for critical auctions currently? Knowing what the Auctions team could handle that on-call engineers struggle with today might help justify the change.

If the issue isn't that on-call engineers are unable to handle them, but just that they're unavailable during off-hours, this seems like a particular instance of a general problem we will continue to have: how to alert on-call engineers to truly critical incidents (e.g., site down). In that case, I'd be interested in trying to solve that with existing tools, such as by allowing critical auctions incidents to be logged directly in OpsGenie by an appropriate Auction Ops person (which could be routed to engineers' phones if desired).

If the issue is that these incidents tend to require Auction team members, this seems like an instance of another general problem we'll still have: escalating to internal experts. Again I'd be curious if we're fully leveraging what OpsGenie provides. I imagine that on-call engineers could route auctions-related incidents to designated auction team members serving as "2nd level" support.

I recognize that our on-call SLA doesn't fully enable these^ yet, but given the general nature of this challenge I'd hope any "special case" for Auctions could be temporary while we bring our tools and practices into alignment.

This has already been mentioned to an extent, but benefits of our current, shared on-call responsibility include:

  • staffing flexibility over time
  • shared knowledge over time
  • shared practices by necessity
  • uniform expectations for system reliability and support responsiveness

If we accept that these are goals, a separate rotation is non-ideal but might still be justified for business reasons or temporarily while we improve reliability, document support playbooks, or invest in management or monitoring tools. However, I'd expect those projects to be part of the arrangement.

Separate observation: This proposes carving the auctions team members out of the general support rotation, but what about auctions-related systems? It sounds like they might continue to be supported by the general rotation, which creates a strange dynamic in which the team doesn't participate in the on-call support for its own systems (except in the context of critical after-hours auctions).

What I'm generally curious about is what types of incidents are being considered here. I can think of:

  • Devops-y incidents: a server that impacts the auction flow is down or experiencing issues (things like Causality, Pulse, Prediction, Force). Dealing with these kinds of incidents doesn't require auction-specific knowledge, and they can be handled by any on-call engineer.
  • Business-logic incidents during a live auction: a bid needs to be updated, or lots need to be skipped. To me these are the type of incidents that can only be handled by the Auctions team's engineers, and as someone not on that team I would be scared of handling them, but:
    • These are also less of an incident and more of a product/feature need for admins, so eventually we should build admin interfaces to deal with them
    • In the meantime we could use https://github.com/artsy/potential/wiki to educate/help non-auction-team engineers.

That said, I'm still open to trying this as an experiment while we stay aware of the possible tradeoffs.

Re parts of #286 (comment) and #286 (comment) on the types of auctions incidents:

It's worth noting that we don't have too many examples of critical auction incidents. It's important to clarify that there isn't a pressing stability problem with auctions at the moment. The main problem to solve here is providing support to our business partners in the rare event that incidents do happen during critical auctions at odd hours.

With that said, a few recent incidents of this nature include:

  1. A lot breaking Prediction because the lot lacked an artwork image (source)
  2. Prediction acting erratically, blocking our ability to effectively clerk auctions (source)
  3. Bidder unable to transact a work during a Buy Now sale (source)
    • It's a bit debatable whether this was an "incident", but it's notable that our business team thought that it was.

What I'm generally curious about is what types of incidents are being considered here. I can think of:

So, I would generally classify these types of issues as "business logic" / data-y rather than devops-y.

If the issue isn't that on-call engineers are unable to handle them, but just that they're unavailable during off-hours, this seems like a particular instance of a general problem we will continue to have: how to alert on-call engineers to truly critical incidents (e.g., site down). In that case, I'd be interested in trying to solve that with existing tools, such as by allowing critical auctions incidents to be logged directly in OpsGenie by an appropriate Auction Ops person (which could be routed to engineers' phones if desired).

My take is that Artsy Engineering, at large, is not interested in extending our general on-call rotation to support odd hours. That said, this is where more feedback from you, reader, engineer at Artsy, would be much appreciated.

Furthermore, even if there were interest from the engineering team at large, I think the qualitative data (conversations I've had during the Engineering Core meeting) and the objective data (the list of recent incidents above) suggest that non-auction engineers would not be as proficient as auction engineers at mitigating incidents during high-profile auctions.

cc @joeyAghion @ashkan18

Re another part of #286 (comment):

If the issue is that these incidents tend to require Auction team members, this seems like an instance of another general problem we'll still have: escalating to internal experts. Again I'd be curious if we're fully leveraging what OpsGenie provides. I imagine that on-call engineers could route auctions-related incidents to designated auction team members serving as "2nd level" support.

I recognize that our on-call SLA doesn't fully enable these^ yet, but given the general nature of this challenge I'd hope any "special case" for Auctions could be temporary while we bring our tools and practices into alignment.

I challenge the notion that it's inherently sub-optimal to have structures/processes where engineers become more specialized.

In my very personal opinion: Artsy is a global business with very diverse products and clients. Effectively supporting the business at this scale may well require significant depth of knowledge, and given the constraints of time, that depth of knowledge could come at the expense of other knowledge.

Processes / automation often lag behind changing features / requirements, and the perspective that processes / automation will allow engineers to remain sufficiently generalized just doesn't match my experience working at Artsy or other technology companies.

In short, I think we should be open to this possible evolution, instead of approaching it with the attitude that specialization is always sub-optimal.

With that said, we might learn that a specialized on-call just doesn't work, and we can change our plan at that time.

However, I think it's critical that if we adopt this RFC, we approach it with the mentality that it could be the best solution. Otherwise, I think we would be dooming the plan to fail from the get-go and could be wasting resources attempting to operationalize it :)

cc @joeyAghion

Separate observation: This proposes carving the auctions team members out of the general support rotation, but what about auctions-related systems? It sounds like they might continue to be supported by the general rotation, which creates a strange dynamic in which the team doesn't participate in the on-call support for its own systems (except in the context of critical after-hours auctions).

A concrete example might be helpful here, but I don't think we should couple any rotation to systems; rather, to workflows.

It sounds like you would not want to see this as temporary, even if we could eventually improve off-hour alerting and escalation in general. That's fine; I suppose it would still be "experimental" in the sense that everything is subject to ongoing refinement.

Artsy Engineering, at large, is not interested in extending our general on-call rotation to support odd hours

This^ might be where some of my resistance originates. Ultimately, I am interested in extending our support coverage, if only for the most critical of alerts (e.g., I'd bristle at the product being unavailable between 5pm and 9am on a weekend). However we haven't had the practical tools for this until recently. If and when we do, we should obviously consider any opportunities for reconsolidation.

The examples were very helpful! They triggered these questions/points of clarification:

  • This is more about a specialized support SLA than specialized engineering skills, right? Certainly in terms of the off-hours, and possibly also in terms of the criteria for an "incident."
  • There's potential for overlap and interactions between general and auctions-specific "incidents." (E.g., mild metaphysics latency compounding to impact Prediction availability, or nightly crons interfering with an off-hour auction event.) What do you have in mind for those situations? For clarity and simplicity, should the auctions rotation take "primary" responsibility during these off-hour incidents?
  • You propose that non-critical auctions would continue to be handled by the general rotation, but I'm unclear about critical auction incidents during regular hours.

By my "...what about auctions-related systems?" question, I just meant that if, say, Prediction generated availability alerts or Causality generated error-rate alerts outside of critical auction periods, responsibility for addressing these would fall to the general support rotation without participation from the relevant team. It's not that important; just a case of odd incentives.

It sounds like you would not want to see this as temporary, even if we could eventually improve off-hour alerting and escalation in general. That's fine; I suppose it would still be "experimental" in the sense that everything is subject to ongoing refinement.

Correct! I think this could be long-term stable / optimal for Artsy and propose that reviewers assess this RFC as such. If after implementing we learn that it isn't, then we can try another approach.


This^ might be where some of my resistance originates. Ultimately, I am interested in extending our support coverage, if only for the most critical of alerts (e.g., I'd bristle at the product being unavailable between 5pm and 9am on a weekend). However we haven't had the practical tools for this until recently. If and when we do, we should obviously consider any opportunities for reconsolidation.

Agreed. I do think that the relatively unique time-scoped nature of benefit and live auctions makes this support extension a bit "easier". To be super clear, I feel the same way about us not having coverage on weekends at large. However, I want the actions proposed by this RFC to be in line with our current stated policy, even if that stated policy isn't seen as the best by everyone.

This is more about a specialized support SLA than specialized engineering skills, right? Certainly in terms of the off-hours, and possibly also in terms of the criteria for an "incident."

Yes, I think so if I understand your question. This RFC isn't an attempt to define an "auctions engineer" or anything like that. It's trying to find a creative yet fair solution to the problem of supporting our auctions business.

There's potential for overlap and interactions between general and auctions-specific "incidents." (E.g., mild metaphysics latency compounding to impact Prediction availability, or nightly crons interfering with an off-hour auction event.) What do you have in mind for those situations? For clarity and simplicity, should the auctions rotation take "primary" responsibility during these off-hour incidents?

This is a good question! To role-play: let's imagine a critical auction starting at 10 PM EST, significant performance degradation in MP impacting Prediction, Joey on general on-call, and Daniel on auctions on-call. I would expect that Daniel would be the first responder, but would tag Joey in as soon as it made sense to. So yes, I think the "primary" language makes sense, but with the caveat that experience will best inform an answer here.
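
To illustrate the "primary" idea from that role-play, a purely hypothetical sketch; the shared critical-auctions calendar and rotation names below are placeholders, not real Artsy systems:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CriticalAuction:
    name: str
    start: datetime  # UTC
    end: datetime    # UTC

# Hypothetical shared calendar of critical auctions (see "Next Steps").
CRITICAL_AUCTIONS = [
    CriticalAuction(
        name="Example benefit auction",
        start=datetime(2020, 4, 15, 2, 0, tzinfo=timezone.utc),
        end=datetime(2020, 4, 15, 5, 0, tzinfo=timezone.utc),
    ),
]

def primary_rotation(now: datetime) -> str:
    """Auctions on-call is primary while a critical auction is live;
    otherwise the general engineering rotation stays primary."""
    for auction in CRITICAL_AUCTIONS:
        if auction.start <= now <= auction.end:
            return "auctions-on-call"
    return "general-on-call"

print(primary_rotation(datetime.now(timezone.utc)))
```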

You propose that non-critical auctions would continue to be handled by the general rotation, but I'm unclear about critical auction incidents during regular hours.

Yes, thanks for asking! The auctions product team is synced with the auctions business team here, so we're thinking about keeping things informal for critical auctions that happen during regular hours. With that said, this very much might evolve.

By my "...what about auctions-related systems?" question, I just meant that if, say, Prediction generated availability alerts or Causality generated error-rate alerts outside of critical auction periods, responsibility for addressing these would fall to the general support rotation without participation from the relevant team. It's not that important; just a case of odd incentives.

Ah, gotcha. Yeah, I'm not totally sure! In my opinion, this isn't so much a problem to solve yet, so I'm willing to let this evolve naturally for now.

I think I've caught up, but apologies if I'm asking something that's already come up.

As @joeyAghion alluded to, the goal with our current on-call system was never to not have out-of-hours support; it was just to recognize that without proper tooling it wasn't feasible. It was better to be explicit than to leave on-call engineers in a liminal state.

However, the work that we've been doing recently (cc @dblandin and @eessex) to route alerts through OpsGenie is making this possible and much closer than it was before. If there is a need (and it sounds like there is), what's stopping us from investing in making sure that these auction systems are alerting properly and that we have a clear escalation path to auction engineers in cases where an on-call person is not equipped to solve the problem?

If there are incidents which need to be manually alerted about, there are ways to do that as well.

As you say, as our business grows it will become more and more important for us to provide support in off-hours for critical incidents. This goes for all parts of our business, not just auctions, so I would rather see this as an opportunity to improve our alerting/on-call infrastructure vs. creating a separate system. I could imagine a similar escalation path for the purchase team re: BNMO, for example.

If that's of interest, I'm more than happy to help tease out how this might work!

However, the work that we've been doing recently (cc @dblandin and @eessex) to route alerts through OpsGenie is making this possible and much closer than it was before. If there is a need (and it sounds like there is), what's stopping us from investing in making sure that these auction systems are alerting properly and that we have a clear escalation path to auction engineers in cases where an on-call person is not equipped to solve the problem?

I would love to get a more detailed understanding of this work stream. Is there a sense of what this looks like tactically? Is there a timeline for such a project? It's worth noting that the auctions high season is in April and we want to implement a solution here in advance of that.

If there are incidents which need to be manually alerted about, there are ways to do that as well.

Almost all of these types of incidents are alerted upon manually.


As you say, as our business grows it will become more and more important for us to provide support in off-hours for critical incidents. This goes for all parts of our business, not just auctions, so I would rather see this as an opportunity to improve our alerting/on-call infrastructure vs. creating a separate system. I could imagine a similar escalation path for the purchase team re: BNMO, for example.

Absolutely agreed. In fact, such a "forked" on-call could be a way to get data on "forked" on-call rotations in general.

To call it out again, a key difference between auctions and the rest of Artsy's workflows is that the auctions workflow is uniquely time-bound (auctions start and end, and those times are pretty well known). As such, staffing off-hours for auctions is a much smaller step than staffing off-hours for the rest of the business.

Also of note, there was a sense that mitigating a critical issue during a critical auction would be best handled by the engineers closest to the auctions systems anyway. So instead of escalating (very quickly), we might as well have that engineer on-call and avoid dead time due to communication.


In summary, I am very excited about continuing to automate alerting and escalation tooling. With that said, I think such improvements will take time, and in the meantime, I would rather make the informal formal with respect to how the auctions team handles critical auction support, and optimize for fairness across the team.

Do you have objections or foresee any negative outcome to be avoided here? If not, I would rather try out this process, learn and iterate from there.

I believe April is a very reasonable timeframe for investigating how to use/expand on our infrastructure for this.

Would you be interested in meeting up to discuss how we might go about that? I do worry, in general, about creating a system that's difficult for us to roll back or change (or scale) so am curious what we can put in place now.

I realized I'd chimed in on our team's initial discussions but not here. In short I think we should accept this RFC as-is and schedule a time to retro and revisit the process during Q2.

I definitely hear the concerns several people have voiced around siloing, specialization, and, conversely, losing the opportunity to work on other systems. I personally think this plays more to the ideal than to my actual experience of being on call: in any incident that I handle, I'm most often looking for an expert in the affected service immediately after verifying a report, then focusing on communications, OpsGenie paperwork, and so on, with a bit of looking over the shoulder. That said, I would be happy to stay in the existing on-call rotation while picking this up, provided there's some kind of time off to compensate for being online during strange hours.

My reason for supporting this is practical. Our Auction Ops team has specifically asked for the reassurance of some off-hours support for these auctions. These are our coworkers and primary users of many auction team products, and it's important that they feel supported, to say nothing of the potential impact of an incident during a live sale. I feel like we owe it to them, and that our current on-call rotation isn't structured to address this request.

Finally, I would just say that I would be open to trying to adjust the existing on-call processes by making sure all engineers' contact info is easily available and making sure that each rotation is aware of any live auctions happening on its watch. There would be other questions to work out in any case.

@SamRozen @joeyAghion @erikdstock @yuki24 @williardx and @dleve123 just met to discuss this RFC and reached resolution.


Resolution

We decided to implement dedicated support for critical auctions, staffed by the auctions engineering team.

Level of Support

3: Majority acceptance, with conflicting feedback.

Additional Context:

"conflicting" might not be the most accurate term for this feedback, but there were a lot of points raised about concerns regarding:

  1. The general fact that our on-call coverage isn't closer to 24/7
  2. That knowledge sharing might be lost with more siloing of on-call
  3. Such a separate on-call might lead to undesirable tooling and process drift

Next Steps

  • @dleve123 and stakeholders from the Auctions Product Team will work with their business counterparts to establish a process for adding critical auctions to a shared calendar and managing that schedule.
  • @dleve123, @dblandin, @sweir27, and the rest of the on-call working group will align on, tactically, how to realize the vision of very quick escalation.
  • @dleve123 to schedule a retro for 2 quarters from now.

Exceptions

We're going to retrospect on this in ~2 quarters time. @joeyAghion (and others) can leave comments on this issue for topics that they want to ensure are covered during the retrospective.

Woot! Props to @dleve123 for being a great steward for this effort!! 🎉