Improving Your Reliability through Modern Operations Practices

OPS10: Building the Foundation for Modern Ops: Monitoring

You are concerned about the reliability of your systems, services, and products. Where should you start?

In this session, you’ll get an introduction to modern operations disciplines and a framework for reliability work. We’ll then jump into monitoring: the foundational practice you must tackle before you can make any headway with reliability. Using Tailwind Traders as an example, we’ll demonstrate how to monitor your environment, including the right (and wrong) things to monitor – and why. You’ll leave with the crucial tools and knowledge you need to discuss and improve reliability using objective data.

OPS20: Responding to Incidents

Your systems are down!

Customers are calling. Every moment counts.

What do you do?

Handling incidents well is core to meeting your reliability goals.

In this session, we’ll explore incident management best practices that will help you triage, remediate, and communicate as effectively as possible.

We’ll also explore the tools Azure provides to get you back into a working state when time is of the essence.

OPS30: Learning from Failure

Incidents will happen—there’s no doubt about that. The key question is whether you will treat them as a learning opportunity to make your operations practice better or just as a loss of time, money, and reputation.

In this session, you’ll dive into one of the most important topics for improving reliability: how to learn from failure. We’ll listen in on one of Tailwind Traders’ post-incident reviews, often called a postmortem, so we can see how it is done. You’ll learn how to shape and run this process so it actually yields value from something that would ordinarily be just a failure. After this session, you’ll be able to build a key feedback loop in your organization that turns unplanned outages into opportunities.

OPS40: Deployment Practices for Greater Reliability

Infrastructure and software delivery methods have a direct and material impact on reliability. Manual service deployment and provisioning is slow, error-prone, and can result in incidents. Using modern continuous deployment practices and provisioning methods can reduce overhead while preventing incidents before they happen.

In this session, we will see how continuous delivery pipelines have helped Tailwind Traders and the rest of the industry deploy tested software to production environments to increase reliability. We’ll also explore modern methods for environment provisioning using infrastructure as code. As a result of attending this session, you will gain practical information on automated deployment and provisioning solutions using Azure-based technology.

OPS50: Preparing for Growth: Capacity Planning and Scaling

When growth or demand for your systems exceeds, or is projected to exceed, your current capacity, that’s a “good” problem to have. However, it can be just as much of a threat to your system’s reliability as any other factor.

In this session, dive into capacity planning and cost estimation basics, including the tools Azure provides to help with both. We wrap up with a discussion and demonstration of how Tailwind Traders judiciously applied Azure scaling features. Learn how to satisfy your customers and meet growing demand, even when “challenged” by success.