fix: broken links and make external links open in new tab #22

Merged · 3 commits · Dec 16, 2024
2 changes: 1 addition & 1 deletion docs/defining.md
@@ -9,7 +9,7 @@ For your organization, a key first step is to come to a shared understanding of

Take the following steps to help create that shared understanding:

-1. **Define the team that will own the service.** Start by considering who is responsible for the service you are defining. A service should be wholly owned by the team that will be supporting it via an on-call rotation. If multiple teams share responsibility for a service, it's better to split up that service into separate services (if possible). Some organizations call this "service mitosis"—splitting one cell into two separate cells, each looking very similar to the former whole. There are several methods for deciding how to separate services like, for example, splitting them up based on team size or volume of code they manage. You can read more about [how we did that at PagerDuty.](https://www.pagerduty.com/blog/well-formed-delivery-teams/)
+1. **Define the team that will own the service.** Start by considering who is responsible for the service you are defining. A service should be wholly owned by the team that will be supporting it via an on-call rotation. If multiple teams share responsibility for a service, it's better to split up that service into separate services (if possible). Some organizations call this "service mitosis"—splitting one cell into two separate cells, each looking very similar to the former whole. There are several methods for deciding how to separate services like, for example, splitting them up based on team size or volume of code they manage. You can read more about [how we did that at PagerDuty.](https://www.pagerduty.com/blog/well-formed-delivery-teams/){:target="_blank" }

1. **Set up the on-call rotation for this service.** Ensure that the people on the team share responsibility for ensuring availability of the service in production. Create on-call schedules that rotate individuals and back-up responders on a regular cadence, as well as policies that include escalation contacts.

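The `{:target="_blank" }` suffix added to these links is an inline attribute list, which only takes effect when the site's Markdown renderer has an attribute-list extension enabled. A minimal configuration sketch, assuming the docs are built with MkDocs on Python-Markdown (the build tooling itself is not shown in this diff):

```yaml
# mkdocs.yml (hypothetical excerpt): with attr_list enabled,
# [text](https://example.com){:target="_blank" } renders as
# <a href="https://example.com" target="_blank">text</a>
markdown_extensions:
  - attr_list
```
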
6 changes: 3 additions & 3 deletions docs/digital_transformation.md
@@ -6,11 +6,11 @@
## Digital Transformation
Why are you here and why are you reading this documentation now? Chances are, it's because your organization is undergoing an initiative to modernize your technical practices—also known as digital transformation.

-Digital transformation is now an imperative initiative for many executives and organizations because speed is what matters when competing for customers. The PagerDuty [State of Digital Operations 2017 UK report](https://www.pagerduty.com/resources/reports/digital-operations-uk/) showed that when a digital service is unresponsive or slow, 86.6% of consumers will switch to an alternative way (read: not your company) to complete their task. And when they do, 66.8% said they would probably never return to try your service again. Those are pretty stark numbers.
+Digital transformation is now an imperative initiative for many executives and organizations because speed is what matters when competing for customers. The PagerDuty [State of Digital Operations 2017 UK report](https://www.pagerduty.com/newsroom/uk-digital-operations-report/){:target="_blank" } showed that when a digital service is unresponsive or slow, 86.6% of consumers will switch to an alternative way (read: not your company) to complete their task. And when they do, 66.8% said they would probably never return to try your service again. Those are pretty stark numbers.

Services will fail; it's a fact of operating. How your company responds when that happens is what makes all the difference between consumers staying with you or abandoning your services in favor of a competitor's. Reducing the number of handoffs by empowering engineers to own their services in production can significantly mitigate that risk by reducing resolution time whenever incidents occur. Placing subject matter experts with direct knowledge of the systems they support in the role of first responders will decrease the inevitable chaos and panic that arise from uncertainty, leading to lower incident resolution times.

-Faster incident resolution times are not the only benefits of implementing full-service ownership ([other benefits include](https://cloud.google.com/devops/state-of-devops/) increased market share, higher profitability, and better team satisfaction, among others), but it is one of the easiest to quickly measure to demonstrate ROI during a transformational initiative that may otherwise take several months or years to achieve measurable results and key desired outcomes.
+Faster incident resolution times are not the only benefits of implementing full-service ownership ([other benefits include](https://cloud.google.com/devops/state-of-devops/){:target="_blank" } increased market share, higher profitability, and better team satisfaction, among others), but it is one of the easiest to quickly measure to demonstrate ROI during a transformational initiative that may otherwise take several months or years to achieve measurable results and key desired outcomes.


## Demonstrating Value
@@ -20,7 +20,7 @@ Measuring the impact of major incidents can be done with a simple graph where X

![incidentimpact chart](/assets/images/incident_impact.png)

-_For more information on how PagerDuty classifies incidents, check out the [Incident Response Guide](https://response.pagerduty.com/before/severity_levels/)._
+_For more information on how PagerDuty classifies incidents, check out the [Incident Response Guide](https://response.pagerduty.com/before/severity_levels/){:target="_blank" }._

How your organization reacts and mitigates the impact of an incident can be measured several ways. Arguably, there may be more sophisticated ways of measuring incident impact, but a good first step for many organizations looking to quantifiably demonstrate the success of shifting full-service ownership models is to focus on measuring Mean-Time-To-Acknowledge (MTTA) and Mean-Time-To-Resolution (MTTR).

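As a rough illustration of the two metrics named above: MTTA and MTTR are simply averages of per-incident acknowledge and resolve intervals. A minimal sketch with hypothetical incident timestamps (not PagerDuty API data):

```python
from datetime import datetime, timedelta

# Hypothetical incidents with triggered/acknowledged/resolved timestamps.
incidents = [
    {"triggered": datetime(2024, 12, 1, 9, 0),
     "acknowledged": datetime(2024, 12, 1, 9, 4),
     "resolved": datetime(2024, 12, 1, 9, 50)},
    {"triggered": datetime(2024, 12, 3, 14, 30),
     "acknowledged": datetime(2024, 12, 3, 14, 36),
     "resolved": datetime(2024, 12, 3, 16, 0)},
]

def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return sum(deltas, timedelta()).total_seconds() / 60 / len(deltas)

mtta = mean_minutes([i["acknowledged"] - i["triggered"] for i in incidents])
mttr = mean_minutes([i["resolved"] - i["triggered"] for i in incidents])
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 5.0 min, MTTR: 70.0 min
```
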
2 changes: 1 addition & 1 deletion docs/escalations.md
@@ -16,7 +16,7 @@ The first level of an escalation policy in an organization that has adopted full

Some teams treat the first level of the escalation policy as the primary point of contact for questions about the service. When paged, that person has become the "interrupt handler" and the role of "standby interrupt handler" shifts to another person within the on-call rotation. Additionally, whenever the service they own is impacted, the service they own is affected by an incident within another service, or if they're considered a stakeholder, the first-level responder will also be brought in as the subject matter expert (SME).

-Often, there are multiple people on the first level—the primary responder and a shadow, who should be observing and learning from the primary responder. As new people join the team, they should have the opportunity to shadow so they can ramp up and be prepared to deal with incidents on their own. At PagerDuty, we practice [shadowing and reverse shadowing](https://www.pagerduty.com/blog/on-call-shadow-practice/).
+Often, there are multiple people on the first level—the primary responder and a shadow, who should be observing and learning from the primary responder. As new people join the team, they should have the opportunity to shadow so they can ramp up and be prepared to deal with incidents on their own. At PagerDuty, we practice [shadowing and reverse shadowing](https://www.pagerduty.com/blog/on-call-shadow-practice/){:target="_blank" }.

### Second Level
The second level of escalation is meant to catch notifications that weren't acknowledged. Most organizations understand that everyone is human. Sometimes the person on call can't work on the current problem. If the first-level responder is unavailable, escalation to the second level typically activates within a time period determined by the service's tier level and the SLOs.
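
To make the two levels described above concrete, here is a purely illustrative sketch (not the PagerDuty API or data model) of an escalation policy in which the second level activates only after a page has gone unacknowledged for a delay chosen from the service's tier and SLOs:

```python
from typing import Dict, List

# Illustrative only: level 1 pages immediately (primary plus shadow); level 2
# activates if the page is still unacknowledged after an assumed 10-minute delay.
ESCALATION_LEVELS: List[Dict] = [
    {"delay_minutes": 0, "targets": ["oncall-primary", "oncall-shadow"]},
    {"delay_minutes": 10, "targets": ["team-lead"]},
]

def active_level(minutes_unacknowledged: int) -> Dict:
    """Return the deepest level that should have been notified by now."""
    eligible = [lvl for lvl in ESCALATION_LEVELS
                if lvl["delay_minutes"] <= minutes_unacknowledged]
    return eligible[-1]

print(active_level(3)["targets"])   # ['oncall-primary', 'oncall-shadow']
print(active_level(12)["targets"])  # ['team-lead']
```
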
12 changes: 6 additions & 6 deletions docs/functions.md
@@ -66,7 +66,7 @@ Every service should set realistic expectations for its availability with a set

The goal with SLI/SLO/SLAs is to set both internal and external expectations for the behavior of your service. Internally, service owners must have a measure for gauging whether or not their service is behaving as expected so that they can make decisions around investing time and effort toward improving availability if expectations are unmet. Externally, service consumers must also understand your expected availability so that they can make decisions around investing time and effort in managing failures that may occur as a result of consuming your service.

-SLAs should be consistent for each service. If you have more than one SLA for a service, consider splitting up the service. Each service should only be accountable to one SLA, though it may have multiple SLOs. For more about how to define service availability, we recommend reading [Chapter 4 of the Google SRE book](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/).
+SLAs should be consistent for each service. If you have more than one SLA for a service, consider splitting up the service. Each service should only be accountable to one SLA, though it may have multiple SLOs. For more about how to define service availability, we recommend reading [Chapter 4 of the Google SRE book](https://landing.google.com/sre/sre-book/chapters/service-level-objectives/){:target="_blank" }.
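
One way to make an availability SLO tangible for a service team is to translate it into an error budget. A minimal sketch, assuming a 30-day window (the figures are illustrative, not PagerDuty's own targets):

```python
def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} SLO -> {downtime_budget_minutes(slo):.1f} min of downtime per 30 days")
```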

### Service Tiers
Defining service tiers may be an alternate way to set organization-wide expectations to reduce the overhead of individual service teams. By agreeing on the definition of what services in a particular tier will provide, it can more quickly and effectively accomplish the task of setting expectations in a standardized way.
@@ -77,10 +77,10 @@ At PagerDuty, a Tier 1 service has the strictest expectation of latency, uptime,

We also have Tier 2 and Tier 3 services. Tier 2 services, for example, have support hours that only span across weekdays. These types of services may provide additional functionality that is not critical or may be new services that have not yet been rolled out to general availability in production.

-Your tier definitions may be different (or sometimes, [even our own!](https://www.pagerduty.com/blog/how-to-manage-a-tier-zero-service/)). For example, some organizations have a strictly binary sense of tier levels: critical path or not critical path. Your organizational definitions may vary, but establishing them can quickly help set expectations across teams.
+Your tier definitions may be different (or sometimes, [even our own!](https://www.pagerduty.com/blog/how-to-manage-a-tier-zero-service/){:target="_blank" }). For example, some organizations have a strictly binary sense of tier levels: critical path or not critical path. Your organizational definitions may vary, but establishing them can quickly help set expectations across teams.

## Runbooks
-Things will go wrong. Services will fail for any variety of reasons. As you learn about the different nuances of your service, you should keep a record of what you've successfully done in the past that might resolve common issues in the future. In an ideal world, you have the ability to resolve these common issues at the source of the problem within your code. But sometimes, a procedural fix may be necessary until you reach that ideal state. [Runbooks](https://www.pagerduty.com/resources/learn/what-is-a-runbook/) are a great place to store these types of procedural fixes.
+Things will go wrong. Services will fail for any variety of reasons. As you learn about the different nuances of your service, you should keep a record of what you've successfully done in the past that might resolve common issues in the future. In an ideal world, you have the ability to resolve these common issues at the source of the problem within your code. But sometimes, a procedural fix may be necessary until you reach that ideal state. [Runbooks](https://www.pagerduty.com/resources/learn/what-is-a-runbook/){:target="_blank" } are a great place to store these types of procedural fixes.

Always remember that a runbook is not comprehensive — it's not possible to consider every type of incident or issue that a service may encounter, and due to the nature of complex systems, there isn't always a common procedure for every incident that occurs. However, it can be helpful to capture repeated processes or initial approaches to troubleshooting.

@@ -89,17 +89,17 @@ When changes are made to how the service functions or is implemented, or when pr
## Production Operations
Owning a service also means operating it in production. The service team is responsible for setting up monitoring and alerts (focus on ensuring your SLOs are met), setting up and maintaining supporting tools, and ensuring an appropriate level of robustness and reliability.

-Some organizations use a combination of a centralized infrastructure support team (e.g., SRE) in tandem with support that comes from within the service ownership team. Some organizations have service ownership teams that own the entire stack soup to nuts. The level of responsibility here will vary. However, a number of effective guides to running software in production are available, and we recommend turning to the [Google SRE book](https://landing.google.com/sre/sre-book/toc/) for in-depth guidance when operating in production.
+Some organizations use a combination of a centralized infrastructure support team (e.g., SRE) in tandem with support that comes from within the service ownership team. Some organizations have service ownership teams that own the entire stack soup to nuts. The level of responsibility here will vary. However, a number of effective guides to running software in production are available, and we recommend turning to the [Google SRE book](https://landing.google.com/sre/sre-book/toc/){:target="_blank" } for in-depth guidance when operating in production.

## Project Management
Full-service ownership extends well beyond the practices of software development and operating software in production. Software is neither written nor run in a vacuum. How the business supports ongoing development is a critical function of enabling full-service ownership.

-How does your service handle unforeseen circumstances? There's an element of unpredictability in full-service ownership. When things go wrong and an incident occurs, you will discover new things via outcomes and action items from your [postmortems](https://postmortems.pagerduty.com). Hopefully, your team might discover necessary changes before an incident occurs. But how and when does your development team perform proactive maintenance? Project managers are an integral part of full-service ownership because they have to be mindful about building in necessary buffers for unplanned work.
+How does your service handle unforeseen circumstances? There's an element of unpredictability in full-service ownership. When things go wrong and an incident occurs, you will discover new things via outcomes and action items from your [postmortems](https://postmortems.pagerduty.com){:target="_blank" }. Hopefully, your team might discover necessary changes before an incident occurs. But how and when does your development team perform proactive maintenance? Project managers are an integral part of full-service ownership because they have to be mindful about building in necessary buffers for unplanned work.

Project managers can help make full-service ownership manageable for their development teams in several ways:

- Defining what "done" is
-- Being aware how much stress the team is under and shielding them from [executive swoop-and-poop](https://response.pagerduty.com/training/courses/incident_response/#executive-swoop)
+- Being aware how much stress the team is under and shielding them from [executive swoop-and-poop](https://response.pagerduty.com/training/courses/incident_response/#executive-swoop){:target="_blank" }
- Understanding and mitigating dependencies by doing connective tissue work between different teams and features
- Bringing awareness of what it means to pull people away from other projects to solve a problem
