Source: GitLab Blog | Author: Olena Horal-Koretska
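A Status Page is a tool for communicating incident status and maintenance windows. We recently started building one at GitLab to provide the best incident management experience for both our internal team and our customers.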
Current Status Update Approach
Incident handling at GitLab happens inside an issue in a dedicated public project. The team discusses the incident and posts updates there. Public updates are manually published to status.gitlab.com by the engineer on call every 15 minutes. This is not ideal: responders lose precious time during a firefight switching tools and duplicating information. Also, having a public project for incident management means:
- massive load on your instance in the “hard times”
- higher monetary cost
- no access to status updates if your GitLab instance is down
- sensitive information that comes up in a discussion is public and may lead to a vulnerability being exploited while it is still being fixed
Our first customer was our internal team. We dogfood everything, and the Status Page wasn’t an exception. So the requirements were based on the needs of our internal team:
- No tool switching for incident updates: people who handle incidents already have enough on their shoulders fixing them. We should spare them from responding to pings about what happened, what the status of the incident is, and how it is progressing. At the same time, the impatient ones should get updates as soon as they are available. Incident status should be updated in one place, both for fellow problem-solvers and for the public.
- Ability to control the visibility level, that is, which updates are published and which are not: when you have an issue in your product, you do not necessarily want to shout: “Hey, you malicious hacker, we’ve got a problem – go exploit it”. You want to let your team handle it calmly and in a timely manner. At the same time, you want to send reassuring messages to the public without distracting the firefighting team.
- Display all types of data from the GitLab incident description and comments on the Status Page: markdown, images, embedded charts. As incidents are handled in GitLab issues, a variety of data representations is available to showcase and communicate a problem or solution. This rich data has to be available in public updates.
Building Status Page
Our new Status Page is designed to address all of the concerns mentioned above.
SPIKE: an investigation of available solutions and an estimation of risk. It is necessary so that the team can align on a general direction before starting the implementation itself.
Initially, we considered leveraging one of the many open-source Status Page implementations. But none of them could satisfy all our requirements, so eventually we decided to build our own.
Backend and Data Scraping
When we started, we first brainstormed all the different solutions we could use to collect data from incident issues and publish it automatically to the Status Page:
Option 1: (GitLab) Webhooks – user sets up the endpoint to which GitLab will post incident updates
Option 2: Alerts coming directly from Prometheus Alertmanager
Option 3: Status page itself monitoring other services
Option 4: Human beings pushing a markdown file to git or calling the API with some utility. e.g.
Option 5: CI job running manually or scheduled to run during certain intervals
Those approaches required either manual user input, additional CI resources, or building a sophisticated piece of software that was unnecessary for this case. So we refined the list down to a solution that avoids over-engineering while still providing instant feedback on the Status Page: the incident issue is converted to JSON and published to the Status Page by a background job.
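To make the chosen approach more concrete, here is a minimal Python sketch of what such a background job could do. It is not GitLab's actual implementation: the `IncidentIssue` fields, the `data/incident/<id>.json` key layout, and the `publish` callable are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable


@dataclass
class IncidentIssue:
    # Hypothetical subset of issue fields; the real payload is not
    # specified in this post.
    id: int
    title: str
    description: str
    status: str
    updated_at: str  # ISO 8601 timestamp


def incident_to_json(issue: IncidentIssue) -> str:
    """Serialize one incident issue into a JSON document."""
    return json.dumps(asdict(issue), sort_keys=True, ensure_ascii=False)


def run_publish_job(issue: IncidentIssue, publish: Callable[[str, str], None]) -> None:
    """Body of the background job: convert the issue, then hand the
    result to whatever publishes files to the Status Page storage."""
    # The key layout below is an assumption, not GitLab's real one.
    key = f"data/incident/{issue.id}.json"
    publish(key, incident_to_json(issue))
```

Passing the publisher in as a callable keeps the conversion step independent of where the files end up, so the same job body could be exercised with an in-memory dictionary during testing.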
Here at GitLab we love VueJS so much that we contribute to it, so the team has great expertise with it. Consequently, our component library, GitLab UI, and our styling utilities are based on VueJS. You might have guessed that we didn’t have much doubt about which frontend framework to use! Besides the UI library as a dependency, GitLab provides stylelint, and SVGs as npm packages as well. It was very convenient to have them at hand, as any new project setup raises lots of questions about best practices and best tools. With all of this, the Status Page could be GitLab-branded. Feel free to use GitLab utilities in your own project too!
NOTE: Status Page is a stand-alone application hosted in a separate GitLab repository; it consumes JSON files generated by a background job. It is distributed under the MIT license and can be used apart from GitLab, provided that a correct data source is provisioned. That said, you’ll get the best experience by using our Status Page with GitLab.
The frontend, along with the generated JSON data sources, is published to cloud storage. Currently we support only Amazon S3, because we are hosted on Google Cloud and want our Status Page to remain available if Google Cloud is down. Credentials are provided by the user when setting up the incident tracking project for the Status Page.
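As a rough illustration of the publishing step, the Python sketch below uploads a set of files through any object that exposes a `put_object` method taking `Bucket`, `Key`, `Body`, and `ContentType` keyword arguments, which is the shape of the boto3 S3 client call. The function name `publish_files`, the bucket name, and the file keys are hypothetical; this is not the code GitLab runs.

```python
import os
from typing import Iterable, List, Tuple

# Content types for the kinds of files a static status page is likely
# to contain (an assumption; extend as needed).
CONTENT_TYPES = {
    ".html": "text/html",
    ".js": "application/javascript",
    ".css": "text/css",
    ".json": "application/json",
    ".svg": "image/svg+xml",
}


def publish_files(store, bucket: str, files: Iterable[Tuple[str, bytes]]) -> List[str]:
    """Upload (key, body) pairs to object storage and return the keys.

    `store` is any object with a boto3-style
    put_object(Bucket=..., Key=..., Body=..., ContentType=...) method.
    """
    uploaded = []
    for key, body in files:
        extension = os.path.splitext(key)[1].lower()
        content_type = CONTENT_TYPES.get(extension, "application/octet-stream")
        store.put_object(Bucket=bucket, Key=key, Body=body, ContentType=content_type)
        uploaded.append(key)
    return uploaded
```

With boto3 installed and the user's credentials configured, `boto3.client("s3")` could be passed as `store`; in tests, a small fake object that records its calls is enough.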
Once an incident issue is created or updated in GitLab (manually or via an alert), its description (with all types of data), along with the comments that were marked as public, is picked up by a background job, converted to JSON, and mirrored on the Status Page.
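The visibility rule described here can be sketched in a few lines of Python. The `is_public` flag, the `Comment` fields, and the output shape are assumptions for illustration only; the post does not say how a comment is marked as public or what the generated JSON looks like.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Comment:
    body: str
    created_at: str  # ISO 8601 in UTC, so plain string order is time order
    is_public: bool  # hypothetical marker for "publish this comment"


def build_incident_document(description: str, comments: List[Comment]) -> Dict:
    """Keep the description plus only the comments marked as public,
    ordered from oldest to newest."""
    public = sorted(
        (c for c in comments if c.is_public),
        key=lambda c: c.created_at,
    )
    return {
        "description": description,
        "comments": [
            {"note": c.body, "created_at": c.created_at} for c in public
        ],
    }
```

Because private comments are dropped before anything is serialized, sensitive discussion never reaches the published files, which is the point of the visibility requirement.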
Give it a try
Here’s a great step-by-step guide on how to set up a Status Page for your project with GitLab. Enjoy, and may all your systems be operational!
I do not want to tire you, my dear reader, with more technical details. Much more was discussed, and much is still to be implemented. But none of this would have been possible without great collaboration from the Monitor:Health Team, to whom I’m thankful for all the heated discussions, great insights, quick iterations, and fast fails, and for the privilege of working with them.