Presentation: How We are Dealing with Metrics at Scale on GitLab.com
As GitLab.com has grown, the number of metrics generated by the application has grown exponentially. Ensuring our team has good quality dashboards and alerting rules was becoming an ever more challenging task. There’s no worse time than experiencing an outage that you expected to have been warned of, only to find out that the alert had been inoperable for months. As an engineer on the infrastructure team supporting GitLab.com, sometimes it felt, during an incident, that we were drowning in data while at the same time struggling to access the most pertinent indicators of the underlying issue. This talk discusses how we are addressing this problem by building up a catalog of key metrics for each component within our application, and then using this definition to automatically generate beautiful Grafana dashboards, rock-solid alerting rules and high-quality SLA indicators. This talk is primarily aimed at Prometheus users, but the fundamentals could be applied to any other metrics system.