Source: GitLab Blog | Author: Sarah Waldner
GitLab Incident Management helps response teams focus on the problem instead of losing time to the process itself, so they can shorten their mean time to respond.
Managing incidents can be stressful! While you’re busy trying to restore service for your customers, you are also likely juggling several competing priorities: digging through multiple tools to understand the problem, communicating with stakeholders, and updating tickets in different systems. Did you know that you can use GitLab to help manage the chaos?
GitLab Incident Management, which recently became a viable category, aims to decrease the overhead of managing an incident so response teams can spend more time actually resolving problems. We do this by enabling teams to quickly gather resources in one central, aggregated view. We also facilitate communication and capture the resulting discussion in the same tool teams already use to collaborate on development. Ultimately, GitLab Incident Management can help response teams shorten MTTR.
Why Incident Management within GitLab?
GitLab is a complete DevOps platform, delivered as a single application. As such, we believe there are additional benefits for DevOps users to manage incidents within GitLab.
- Co-location of code, CI/CD, monitoring tools, and incidents reduces context switching and enables GitLab to correlate what would otherwise be disparate events or processes within a single control pane.
- Using the same interface for development and incident response streamlines the process. Developers who are on call work in the interface they already use every day, so they do not have to respond to an incident in an unfamiliar tool that would hamper their ability to respond.
GitLab Incident Management Capabilities
Available today, GitLab Incident Management includes the following highlighted capabilities:
- Incident issues as the one place to capture all data and information related to the incident.
- Integration with Slack to facilitate intuitive team communication.
- Link Zoom calls to GitLab issues to facilitate synchronous communication.
- Embed GitLab-managed Kubernetes metrics directly within the GitLab Incident Issue.
- Embed generic Grafana metrics directly within the GitLab Incident Issue.
- The GitLab alerts endpoint can accept alerts from any source via a generic webhook receiver.
- Prometheus recovery alerts can automatically close the issue that an alert created once GitLab is notified that the alert has resolved.
How to use GitLab Incident Management
There are numerous entry points to a potential incident. As an incident responder, once you are aware of an ongoing incident, you can manually create an incident issue by simply tagging the issue with the
Alternatively, you can configure GitLab to automatically create incidents based on alerts from your monitoring tool. When an alert is posted to the GitLab Alerts endpoint, GitLab can create an incident from an issue template and populate it with information useful to the incident response team.
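As a rough illustration of posting an alert to the generic webhook receiver, here is a minimal Python sketch. The URL and authorization key are placeholders (copy the real values from your project's alert integration settings), and the payload field names (`title`, `description`, `severity`, `monitoring_tool`) are assumptions based on the generic alert format, so check them against the Incident Management documentation before relying on them.

```python
import json
import urllib.request

# Hypothetical values: replace with the URL and authorization key shown in
# your project's alert integration settings.
ALERTS_URL = "https://gitlab.example.com/my-group/my-project/alerts/notify.json"
AUTH_KEY = "replace-with-your-authorization-key"


def build_alert_request(title, description, severity="critical"):
    """Build (but do not send) a POST request carrying a JSON alert payload."""
    payload = {
        "title": title,                      # short summary of the alert
        "description": description,          # longer, human-readable detail
        "severity": severity,                # assumed field name and value
        "monitoring_tool": "custom-script",  # assumed field name; free-form
    }
    return urllib.request.Request(
        ALERTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {AUTH_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_alert(request, timeout=10):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_alert(build_alert_request("Checkout latency high", "p95 latency above 2s"))` from a monitoring script would post the alert. Building and sending are kept separate so the payload can be inspected without touching the network.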
The incident issue template can be customized using quick actions to label, mention team members, or assign to specific people automatically. Doing so will help create incidents that have a consistent baseline set of information to help jumpstart the incident response.
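To make the template idea concrete, the following Python sketch generates the text of an incident issue template that ends with quick actions. The section headings, the label name, and the usernames are hypothetical; `/cc`, `/assign`, and `/label` are GitLab quick actions. The resulting text could be saved as a Markdown file under `.gitlab/issue_templates/` in the project.

```python
def build_incident_template(labels, assignees, mentions):
    """Return issue-template text: a description skeleton plus quick actions."""
    lines = [
        "## Summary",
        "",
        "## Impact",
        "",
        "## Timeline",
        "",
    ]
    # Quick actions go on their own lines; GitLab applies them when the
    # issue is created from the template.
    if mentions:
        lines.append("/cc " + " ".join(f"@{name}" for name in mentions))
    if assignees:
        lines.append("/assign " + " ".join(f"@{name}" for name in assignees))
    if labels:
        lines.append("/label " + " ".join(f'~"{label}"' for label in labels))
    return "\n".join(lines) + "\n"


# Hypothetical label and usernames, for illustration only.
template = build_incident_template(
    labels=["incident"],
    assignees=["oncall-dev"],
    mentions=["sre-lead"],
)
```

Generating the template from a function is only one option; writing the same lines by hand in a Markdown file works just as well.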
As more details about the ongoing incident emerge, you can embed GitLab-managed Kubernetes cluster metrics and application metrics directly in the incident. You can also embed other Grafana metrics in the incident if Grafana is a critical tool for your team. Sharing up-to-date information in a central location helps build a shared understanding and lets the incident response team move forward armed with the latest information. Embedded charts can also make retrospectives more effective by keeping relevant information within the same view.
As the firefight progresses, the incident response team is encouraged to add timeline events, updates, questions, and answers to the incident. These interactions help create an audit trail and enable shared understanding across the team.
At the end, an incident can be automatically closed once GitLab receives a recovery alert via the enabled Prometheus recovery alert integration. As the team reconvenes to determine actionable next steps, it can leverage the completed incident ticket to find improvement areas instead of relying on a separate tool. Furthermore, a team can directly create and link action items to the incident issue in the form of related issues and merge requests to improve the resiliency of the system.
Get started by visiting the Incident Management documentation page and creating an issue template. Adopt a new process or amend your existing incident management process to take advantage of the capabilities within GitLab.
Incident Management is a focus area for GitLab in 2020. We plan to continue iterating and improving this category. We’d love your help in prioritizing work on the most valuable improvements to the incident management solution. Keep an eye on Incident Management Issues and upvote or share your experiences in relevant issues.
To report a bug or request a feature or enhancement, follow these steps:
- Open an issue in the GitLab project.
- Describe the bug, feature, or enhancement and, if possible, include examples.
- Add these labels to the issue: Category:Incident Management, devops::monitor, group::health
- Tag @sarahwaldner on the issue.