Grafana is a web-based visualization tool for observability, and also part of a whole stack of related technologies, all based on open source. You can configure various data sources — time series sources like Prometheus, databases, cloud providers, Loki, Tempo, Jaeger — and use or even combine them for your observability needs. As of April 2022, that means metrics, logs, traces, and visualizations based on those concepts. That is where dashboards come into play.
Dashboards are the typical first solution that even small companies or hobbyists can use to quickly get insights of their running software, infrastructure, network, and other data-generating things such as edge devices. In case of incidents, or for just glancing over them at will, dashboards are supposed to give a good overview to find and solve problems. Great dashboards make the difference of understanding an incident in 10 seconds vs. digging into red herrings for over 5 minutes. Meaningful dashboards also ease the path to setting correct alerts that do not wake you up unnecessarily.
On top of dashboards, you can leverage alerts, logs and tracing which together form a simple and helpful look at your stuff, if done right — or a complicated mess like your legacy software code, if done wrong 😏. Those other concepts are out of scope in this article. I only focus on how to achieve helpful, up-to-date dashboards, using private research and my production experience, where I extensively used Grafana dashboards — and also logs — to resolve incidents fast, and often with ease.
Typical examples on the internet present dashboards as a huge screen full of graphs, sometimes a collection of numbers or percentages. A lot to look at for human eyes. I will show you how to visualize better, with realistic examples. This article showcases solutions for Grafana and Prometheus, but can also be applied generically for other platforms.