Runbooks

An on-call & event management platform for SREs to manage, investigate & solve incidents

Data Visualization
User Research
UX/UI
Prototyping
Web Accessibility

Outcome

Reduced SREs' Incident-solving Time By 29%

In dogfooding tests with Splunk internal designers and engineers, participants saved 29% of their incident-solving time on average because they no longer had to switch between tabs, log in to different platforms, and read through various runbook contents.

Gained Immediate Customer Value Through V1 MVP

I interviewed seven customers, facilitated two workshops, and developed mockups while collaborating with a cross-functional team at Splunk. Through my design solution, we were able to demonstrate a potential V1 that provided immediate value to Splunk customers in just two months.

Drove Future Enhancements In The Product Roadmap

By integrating runbooks directly into the incident responder journey, we can deliver a better and more comprehensive platform to our customers. This will serve to achieve Splunk's goals of providing all incident responders with the tools they need to respond to incidents more efficiently.

Context

What Is Incident Intelligence (AIOps)?

Splunk Incident Intelligence, an on-call and event management platform (Perseus), aims to reduce outage time by escalating the right alerts.
Typically, incident responders use runbooks to standardize their approaches for resolving incidents.

What Are Runbooks?

A runbook is a documented set of procedures or instructions that guide engineers or operators in handling specific tasks to resolve incidents. It provides step-by-step guidance to ensure consistent and efficient execution of processes.

Why Are Runbooks Needed?

Runbooks can be utilized for operational purposes, such as rolling back a code deployment, troubleshooting issues causing downtime in a user's infrastructure, and providing directives on which team or team member should escalate specific issues.

Who Needs Runbooks?

Runbooks are essential for Team Leads (Admins) and Incident Responders (SREs) involved in incident response, as they ensure consistent, efficient, and effective handling of incidents across teams and individuals.

Team Leads (Admins)
• Manage small to large teams of SREs;
• Are often overburdened with administrative tasks;
• Try to maintain broad awareness of service infrastructure but may have knowledge gaps.

Incident Responders (SREs)
• Need to know when they are on-call;
• Are often woken late at night for critical incidents;
• Need context, guidance, and support on the best way to resolve incidents.

Problems

Understanding The Customers Through Interviews

My manager and I interviewed 7 SREs to gain a better understanding of runbooks. Based on these interviews, we formed an opinion on how runbooks should be defined, their usefulness to teams, and the value of integrating them into the incident intelligence experience.


One Core Problem For Team Leads

After an affinity mapping exercise following the interviews, my manager and I identified four core problem areas affecting our customers: one for team leads and three for SREs.

1. How can we make runbooks easy to create, update, share and correctly attach to a related incident?

"I would love a one-stop-shop with everything in it - managing on-call rotation, handling the incident itself (moving away from Jira), and having runbooks attached to particular services..."


Three Core Problems For SREs

1. How can we ensure runbooks have clear contents and easy-to-follow steps?

"I must ensure that my team and I are taking appropriate steps to resolve specific issues to ensure speedy MTTR and adherence to my org’s SLA."

2. How can we align communication methods between incident responders?

"When an incident is triggered, a Jira ticket is created and a new Slack channel is created with the title of the incident number (unique ID). Every stakeholder joins the channel and communicates with the team."

3. How can we provide helpful context and supporting data to incident responders to help them take action as well as to see the impact of their actions?

“It should be integrated with Incident Response platform. I’m sensitive to context switching between different tools...”

“Should be somewhere between a dashboard with some instructions; system health dashboard; recipes for debugging issues; automated slack notifications - consolidate these all together in a single pane...”

Generative Workshop


Refining Ideas With Customer Values In Mind

After the generative workshop, I refined the ideas with a focus on empathy and customer value, ultimately selecting the ones that received the highest consensus. The selected ideas are presented here.



Evaluative Workshop

Evaluating Ideas With A Cross-functional Team

After gathering ideas from the generative workshop, I invited PMs, designers, and internal engineers for an evaluative workshop.

The purpose was to further evaluate these ideas, identify the highest-priority features, and discuss their technical feasibility so that I could proceed with the design process.


Impact-effort Feasibility Matrix

My manager and I created an impact-effort matrix to encourage PMs and designers to highlight features with the highest customer or product value in the impact section.

Additionally, we aimed to provide engineers with more insights into the effort section, helping them identify features that should not be designed at this time.

Our focus was on defining the high-value features as part of our MVP definition.
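The sorting logic behind an impact-effort matrix can be sketched in a few lines of Python. This is only an illustration, not the tool we used in the workshop: the `Idea` class, the 1-to-5 scale, the threshold, the quadrant labels, and the scores below are all hypothetical. Only the outcome for the flow view (high impact, high effort, deferred) comes from the workshop itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idea:
    name: str
    impact: int  # hypothetical scale: 1 (low) to 5 (high)
    effort: int  # hypothetical scale: 1 (low) to 5 (high)


def quadrant(idea: Idea, threshold: int = 3) -> str:
    """Place an idea in one of four impact-effort quadrants (labels are assumptions)."""
    high_impact = idea.impact >= threshold
    high_effort = idea.effort >= threshold
    if high_impact and not high_effort:
        return "MVP candidate"
    if high_impact and high_effort:
        return "future roadmap"
    if not high_impact and not high_effort:
        return "nice to have"
    return "deprioritize"


# Hypothetical scores, for illustration only.
ideas = [Idea("automated dialogue", impact=5, effort=2), Idea("flow view", impact=5, effort=5)]
placements = {idea.name: quadrant(idea) for idea in ideas}
```

Splitting on a single threshold keeps the matrix easy to read in a workshop; a real scoring exercise would likely rely on discussion rather than fixed numbers.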

Synthesis

Scope Of Functionality For PoC

Due to the limited two-month time frame, my goal was to prioritize the development of V1 MVP for both Admin and SRE views, while ensuring the implementation of the following proof-of-concept functionalities:

• Users can view and take action on runbook steps.
• Users can view related observability (o11y) resource data and relevant metadata in runbooks.
• Users can assign a task that directs incident responders to an APM dashboard.
• Users can monitor transactions running through a service.
• Users can create, save, and assign runbooks to a service across the observability (o11y) experience.
• SREs are paged when an incident is created.
• Users can communicate with others at any time through integrated applications.

Scope Of Errors

There are two scopes of errors for runbooks:

• An error might occur in the configuration of runbook steps (for example, the next step cannot be loaded).
• Routing might overlap between runbooks (runbooks will not create multiple incidents).

Design Development

Design Direction

Providing standardization to incidents for SREs, including automated guidance on initial strategic steps that incident responders can take to resolve incidents as quickly as possible.

Design Iteration

Design isn't a one-and-done process. I explored various design directions to ensure an inclusive approach, iterating on my wireframes through critiques with PMs and other designers.

I started with the runbook writing screen as it provides a good overview of all the features. In my previous draft, I split the runbook steps and displayed them one by one in Perseus, creating simple connections between each step. However, during the design critique with senior designers, I received feedback that this version of the design was too basic to follow and lacked automation. Additionally, the steps tracing view only showed the action status, raising concerns about the ability to go back to a previous step if a user clicked the wrong option.


I appreciate all the feedback provided, as it informed my final design presented here. I made iterations such as using automated dialogue to provide clear directions for SREs to follow and adding a kebab icon next to each step's status, giving users more options such as going back to the previous step.
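The go-back behavior described above can be sketched as a small state model. This is an illustrative Python sketch, not Perseus code: the `Step` and `RunbookRun` names, the status strings, and both methods are hypothetical and exist only to show how a step's status and the current position could move together.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    instruction: str
    status: str = "pending"  # hypothetical statuses: "pending" or "done"


@dataclass
class RunbookRun:
    steps: List[Step]
    current: int = 0  # index of the step the SRE is working on

    def complete_current(self) -> None:
        """Mark the current step as done and move to the next one."""
        if self.current >= len(self.steps):
            raise IndexError("all steps are already done")
        self.steps[self.current].status = "done"
        self.current += 1

    def go_back(self) -> None:
        """Return to the previous step and reopen it (the kebab-menu option)."""
        if self.current == 0:
            raise ValueError("already at the first step")
        self.current -= 1
        self.steps[self.current].status = "pending"


run = RunbookRun([Step("Acknowledge the page"), Step("Review impacted services in APM")])
run.complete_current()  # the SRE finishes step 1
run.go_back()           # the SRE picked the wrong option and reopens step 1
```

Keeping the status on each step, rather than only tracking a position, is what lets the tracing view show more than the action status alone.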




Design Challenge 01

For team leads, how can we make runbooks in Perseus easy to create, update, share, and correctly attach to related incidents?

Team leads need runbooks that are easy to create, update, share, and attach to incidents in order to ensure value, team adoption, and adherence to the team’s incident response processes.



Design Solution 01

A customer-centric smart runbook admin view that enables team leads to effortlessly create and manage events.

The admin's journey begins by navigating to the incident intelligence settings page. From there, they click on the runbooks tab and proceed to create a new runbook. The editor provides a seamless experience for adding runbook steps and text. The admin gives the runbook a name and saves their work.

2. Uses automated dialogue to create clear directions for SREs to follow.

The automated dialogue provides clear instructions to SREs, enabling them to read runbooks quickly and increase efficiency.

3. Flow view shows the relationship between steps clearly.

The flow view makes runbook steps clear and easy to revise and manage. However, because the evaluative workshop rated it as high impact but also high effort, this idea is reserved for future work and is not included in the MVP.

Design Challenge 02

For SREs, how can we ensure runbooks have easy-to-follow steps, aligned communication methods, and provide helpful context to enable efficient action and visibility of impact?

At the admin-to-SRE handoff stage, we can kick off the SRE experience by bringing them into the incident warroom, where they can select the runbook that was just created from the list and begin following its steps.



Design Solution 02

A straightforward smart runbook incident responder view that enables SREs to resolve incidents quickly.

Once an incident is triggered, the admin embarks on their journey by arriving at the incident warroom. There, they select a runbook from the list of previously created runbooks under their administration. As the admin selects the runbook, it automatically populates the runbook tab for easy reference. The SRE then takes charge, embarking on their investigator journey, diligently following the steps outlined in the runbook. If the step instructs them to do so, the SRE navigates to APM to review impacted services. Step by step, the SRE progresses through the runbook until they successfully resolve the incident.

Key Features:

1. Simplifies the incident warroom by reducing it to a panel with impacted services display.

If the step instructs SREs to review impacted services in APM, a button can be provided that, upon clicking, opens APM, pre-sets the appropriate filter to display only impacted services, and reduces the incident warroom to a right-hand oriented panel. This allows SREs to quickly access the relevant information they need, focus on the impacted services, and take prompt actions, ultimately leading to faster incident resolution and improved operational efficiency.

2. Easy-to-follow steps that allow SREs to see the impact of their actions.

This provides a structured and guided approach for SREs during the incident resolution process, enabling them to effectively navigate through the investigation journey in a consistent, efficient, and accurate manner. Ultimately, this leads to faster incident resolution times, reduced downtime, and improved overall service reliability.

Dogfooding Tests

Reduced SREs' Incident-solving Time By 29%

Due to the short timeline, I couldn't conduct a full usability test. Instead, I ran dogfooding tests to validate the flow's coherence. I shared prototype links in a Slack channel for Splunk designers and engineers and asked them to compare the time taken to resolve incidents using my design solution against the time taken using current tools. The results showed an average time saving of 29%, gained by eliminating the need to switch between tabs, log in to different platforms, and read through various runbook contents.
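For readers curious how an average time saving like this can be computed, here is a minimal Python sketch. It assumes each participant has one baseline time and one prototype time and that the average is taken over per-participant percentages; the write-up above does not record the exact method, so treat that as an assumption. The function names and the sample timings are hypothetical and are not the real test data.

```python
from statistics import mean
from typing import Iterable, Tuple


def percent_time_saved(baseline_minutes: float, prototype_minutes: float) -> float:
    """Percentage of the baseline time that the prototype saved."""
    if baseline_minutes <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_minutes - prototype_minutes) / baseline_minutes * 100


def average_percent_saved(timings: Iterable[Tuple[float, float]]) -> float:
    """Mean of per-participant savings; each item is (baseline, prototype)."""
    return mean(percent_time_saved(base, proto) for base, proto in timings)


# Hypothetical timings in minutes, for illustration only.
sample_timings = [(20.0, 15.0), (30.0, 21.0), (10.0, 7.0)]
average_saving = average_percent_saved(sample_timings)
```

Averaging per-participant percentages weights every tester equally; dividing total time saved by total baseline time would instead weight longer sessions more heavily.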

Testing Feedback

After the testing, the teams appreciated the key features I implemented, as mentioned above. However, there are two areas that need improvement:

1. Runbook lacks dynamism and collaboration.

Currently, my design solution lacks dynamism and collaboration. The current MVP runbook is designed for one person to solve one incident with a single runbook. However, in practice, it is common for one incident to be related to multiple runbooks and involve different SREs who are assigned various tasks and steps.

2. Runbook was created linearly.

The runbook is created in a linear manner, and it becomes excessively large when the task is complex. This linear runbook can become unwieldy, especially if it contains over 50 steps.

Next Steps

1. Designing Runbook Modules To Improve Dynamism & Collaboration.

Based on user testing feedback, the introduction of a runbook module emerges as a potential solution to enhance the dynamism and specificity of runbooks.

Here is a wireframe showcasing how the runbook module makes the runbook experience more dynamic, intelligent, and aligned with the evolving needs of incident resolution.

Closing Thoughts

1. Design is never-ending.

All of these elements were what I shipped at Splunk. However, after the initial shipment, there were multiple versions as my team added new features and the system improved based on user feedback along the journey. Consequently, my team constantly revisited and iterated on the design. It turned out that the scalable manner in which I built it allowed them to pivot and adapt for the long term.

2. Design is about collaboration.

A designer's job is not solely about creating great designs all day long. It primarily involves collaborating with others to ensure that the design makes sense and can be successfully implemented.

At Splunk, with the help of my managers and teammates, I was able to solve the SRE problem of using runbooks in just 2 months. Although you may not be familiar with them, I want to acknowledge their valuable contributions. They are truly exceptional individuals, and I thoroughly enjoyed working with them. I am immensely grateful for all that I have learned from them, and it is thanks to their assistance that I was able to achieve this feat within such a short timeframe.

3. Design is about how things work.

Design is not just about how things look; it is also about how they work. As a designer, it is my responsibility to ensure that everything I do at this company prioritizes creating designs that can be built, will be built, and align with the customer's tone.

4. Turning design into full business products.

While it is not my first time designing B2B products, this particular B2B experience has provided me with a realistic view of how design supports a product team. It has given me a deeper understanding of the problems we are solving and the collaborative ways we work with other stakeholders to address those challenges. It has been fulfilling to witness the transformation of design into full-fledged business products and the ability to simplify complex matters.
