Can Grafana Adaptive Metrics Help Slash Observability Costs?
It seemed like a good problem to have, at first. The goal was to extend and deepen visibility across Teletracking’s environments as the company accelerated its transition to a cloud native infrastructure. Oren Lion, director of software engineering at healthcare services platform provider Teletracking, was leading the company’s development teams through a migration from a self-managed open source stack, built on tools like Prometheus, Thanos and Grafana, to Grafana Cloud.
The transition revealed an overwhelming volume of metrics, many of which were underutilized but still drove up costs. Initially, the team was managing over 1 million time series, eventually reaching 2 million, highlighting how quickly metrics can scale. Lion described the metrics explosion as an “informing but costly learning — we knew we weren’t using more than a half a million time series.”
“It was astonishing to realize how we were hardly using all these metrics and yet our spending was skyrocketing,” Lion said. “I was blown away by the volume of metrics that started flooding in.”
Metrics costs began to escalate rapidly. The goal was to onboard teams to Grafana Labs to better manage telemetry data, but allocated monthly observability costs quickly more than doubled, and it became increasingly difficult to rein them in fast enough. The process of identifying, addressing and validating changes across PRs and deployments was slow and labor-intensive. “This led our executive team to question why we were so over budget, to which we responded that we were still adapting to the new environment and working to get costs under control,” Lion said.
Upon closer analysis, two primary sources were identified: custom metrics, which measure the business events being processed, and dependency metrics, which cover the resources needed to support business services. Within these dependencies, tools like kube-state-metrics, node-exporter and Java Management Extensions (JMX) exporters were frequently employed, contributing to the vast volume of metrics managed across the infrastructure.
When migrating to another cluster, especially during blue-green deployments, the number of time series can quickly double, going from 300,000 to 600,000 because the same microservices temporarily run on two clusters, Lion said. “This growth due to custom metrics is easy to see, but it becomes more complex when considering dependencies,” he said. For instance, a tool like Promtail, which pushes logs to Grafana Cloud, generates 275 time series per pod. Run it as a DaemonSet on 40 nodes and that is roughly 11,000 time series from just one dependency. Multiply that by 30 dependencies and you can quickly reach a million time series between custom metrics and dependencies, Lion explained.
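To see how those numbers compound, here is a quick back-of-the-envelope calculation in Python using the figures Lion cites; the node and dependency counts are the illustrative values from his example, not an exact inventory.

```python
# Back-of-the-envelope estimate of time series from dependency metrics.
# Figures are illustrative, taken from the example above.

series_per_pod = 275        # e.g., Promtail exposes ~275 series per pod
nodes = 40                  # a DaemonSet runs one pod per node
dependencies = 30           # dependencies with a similar footprint

per_dependency = series_per_pod * nodes            # 11,000 series
all_dependencies = per_dependency * dependencies   # 330,000 series

# A blue-green migration temporarily doubles everything by running the
# same workloads on two clusters at once; add custom metrics on top and
# the total quickly approaches a million series.
during_blue_green = all_dependencies * 2           # 660,000 series

print(per_dependency, all_dependencies, during_blue_green)
```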
At the root of the challenge, Lion said, is that high cardinality in metrics is akin to verbosity in logs, yet there is no easy way to “dial down” metrics the way you can with log levels. This lack of control leads to a surplus of metrics that aren’t always necessary, driving up costs. Teams had to manually detect, diagnose and resolve metric-related issues, often reacting to cost spikes caused by unexpected surges in metrics during deployments.
Additionally, teams often plan for monitoring their services and dependencies but fail to estimate and track the costs associated with these metrics. This oversight leads to an excess of time series and cost overruns. In observability, controlling verbosity is crucial, Lion said. Logs have levels (such as error, info and debug) that help manage verbosity, but metrics lack a similar mechanism, making it difficult to filter and reduce them effectively, he added.
After much effort and only small dents in cost control, Lion and his team opted for Grafana Adaptive Metrics, which serves as a “log level but for metrics.” Adaptive Metrics reduces the verbosity of metrics, Lion said. Because each label on a metric can generate many time series, dialing back the number of labels on a metric means far fewer time series are produced.
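To make the idea concrete, here is a minimal sketch, independent of Grafana’s actual implementation, of how dropping a single high-cardinality label such as pod collapses many time series into one aggregated series.

```python
from collections import defaultdict

# Each time series is identified by its metric name plus a unique label set.
series = [
    {"__name__": "http_requests_total", "service": "orders",
     "status": "200", "pod": f"orders-{i}"}
    for i in range(50)
]

def aggregate(series, drop_labels):
    """Group series together after removing the given labels (conceptual sketch only)."""
    buckets = defaultdict(int)
    for s in series:
        key = tuple(sorted((k, v) for k, v in s.items() if k not in drop_labels))
        buckets[key] += 1
    return buckets

print(len(series))                      # 50 distinct series before aggregation
print(len(aggregate(series, {"pod"})))  # 1 aggregated series after dropping "pod"
```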
The recommendations published by Adaptive Metrics let organizations automate the process of assessing which metrics are needed and where. Teletracking quickly slashed its metrics costs using an early iteration of Grafana Cloud’s Adaptive Metrics: within a few weeks, after the early defects were addressed and Adaptive Metrics was fully operational, the team realized a 50% reduction in costs.
“With a lower run rate on metrics we looked to further lower overall observability costs,” Lion said. “Without increasing our yearly budget with Grafana Labs we were able to discontinue more expensive renewals with other providers for IRM [incident response management] and front-end monitoring and consolidate under Grafana.”
Adaptive What?
In this case, Grafana’s Adaptive Metrics were used successfully, resulting in net cost savings. At the same time, other observability providers, including Datadog, Elastic, Honeycomb and New Relic, offer their own cost optimization features. While they may not use the same terminology, these features serve the same purpose: ingesting only the metrics that are actually needed, rather than opening the floodgates.
Adaptive Metrics can be likened to video monitoring of traffic in a large metropolitan area. Rather than permanently recording everything the cameras capture, the system can be adapted, and in many cases probably is, to distill all that imagery into just the data that affects traffic conditions: issuing alerts when jams occur and supporting analysis of why they happened. As an analogy, that maps directly to troubleshooting a network or a set of distributed applications.
According to Grafana’s documentation, Grafana Adaptive Metrics helps optimize Prometheus cardinality and reduce observability costs by identifying and eliminating unused metrics, ensuring you only pay for what you use. The capability analyzes usage patterns across dashboards, alerts, recording rules and query history to generate recommended aggregations. You control which recommendations to apply, so only the necessary data is persisted while dashboards, alerts and queries keep working. As usage patterns change, the recommendations are updated.
Adaptive Metrics identifies which labels are actually used and suggests safe aggregations to reduce cardinality. When a recommendation is applied, the aggregated version retains only the necessary labels, significantly reducing the number of series while ensuring dashboards, alerts and queries continue to work. If a metric isn’t being used at all, the recommendation still reduces its high-cardinality labels rather than removing the metric entirely. Users can see which labels have been dropped and revert the aggregation if needed.
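For a sense of what a recommendation looks like in practice, the sketch below shows the general shape of an aggregation rule: the metric, the labels to drop and how the remaining series should be merged. The field names follow the general pattern of Grafana’s rules format but may not match the current schema exactly; treat them as illustrative.

```python
# Illustrative shape of an Adaptive Metrics aggregation recommendation.
# Field names are assumptions; check Grafana's documentation for the exact schema.
recommendations = [
    {
        "metric": "kube_pod_container_status_restarts_total",
        "drop_labels": ["instance", "pod"],   # unused, high-cardinality labels
        "aggregations": ["sum:counter"],      # how to merge the remaining series
    },
]

# Applying a rule like this keeps the metric queryable by the labels that
# dashboards and alerts actually use, while the dropped labels no longer
# multiply the series count.
```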
Hard Work
Adaptive Metrics is not a magic wand. In Teletracking’s case, the issue of metric flooding persists, even though solutions like Grafana have helped reduce some of the burden. Detecting spikes and addressing them over time is still a Whac-A-Mole effort, including setting up alerts to catch these issues early. “When my team gets alerted, we have to go through a process to resolve the problem, starting with identifying and diagnosing the issue,” Lion said.
One challenge is tracing back the origin of the spike to a specific team or service, which often stems from a seemingly innocent custom metric, like a histogram showing latency. “However, if this histogram has a high-cardinality label and multiple buckets, it can cause the time series count to skyrocket,” Lion said.
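A rough calculation, with hypothetical numbers rather than Teletracking’s actual figures, shows why such a histogram is a common culprit.

```python
# Why a latency histogram with a high-cardinality label explodes the series count.
# All numbers here are hypothetical.
buckets = 12          # histogram "le" buckets
extra_series = 2      # plus _sum and _count per label combination
label_values = 500    # e.g., an "endpoint" or "customer_id" label
pods = 40             # pods emitting the metric

series = (buckets + extra_series) * label_values * pods
print(series)         # 280,000 time series from a single histogram
```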
To manage this, Lion created a metric spend dashboard that tracks the total number of time series and costs incurred by each team. “I can drill down into service lines or groups of services to identify which service caused a spike,” Lion said. “Once identified, I can contact the responsible team to update outdated configurations in their service monitor and redeploy the service through Spinnaker to Kubernetes in both staging and production environments. Only then can we start reducing the metrics, but the effort involved in this process is substantial.”
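A dashboard like that can be driven by a standard cardinality query against the Prometheus-compatible query API. The sketch below assumes a team label exists on the series (an assumption; any consistent ownership label, such as namespace, would do) and uses placeholder endpoint and credentials.

```python
import requests

# Count active time series per team via the Prometheus-compatible query API.
# PROM_URL, the token and the "team" label are placeholders/assumptions.
PROM_URL = "https://prometheus.example.net/api/v1/query"
QUERY = 'count by (team) ({__name__=~".+"})'  # expensive; scope it down in practice

resp = requests.get(
    PROM_URL,
    params={"query": QUERY},
    headers={"Authorization": "Bearer <token>"},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"].get("team", "unowned"), result["value"][1])
```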
Finally, to debug metrics, Lion and his team rely on tools like the cardinality management dashboard in Grafana. “I can easily access it from the homepage, where metrics are already sorted by the highest number of time series, the highest percentage of total series, and whether those metrics are in use or not,” Lion said. “I can also drill down into specific metrics to examine values and labels for context, which helps identify the root cause. The same process applies to labels in the cardinality dashboard, as some labels are high cardinality and are spread across multiple metrics.”
More Wanted
In conversations with Grafana engineer Patrick Oyarzun about how much effort to invest in identifying the root cause of a surge in metrics, and whether to apply client-side relabeling or let Adaptive Metrics aggregate away the problem, Oyarzun described the approach Grafana takes internally to manage its own metrics, Lion said: a scheduled job in GitHub pulls the latest Adaptive Metrics recommendations and applies them. “We’re choosing to go down this ‘GitOps’ path and let Adaptive Metrics save us time and money,” Lion said.
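A minimal version of that scheduled job might look like the following sketch: fetch the current recommendations and commit them as the rule set for review. The endpoint path and authentication details are assumptions rather than confirmed API specifics, and a real job would typically run on a schedule (via GitHub Actions, for example) and open a pull request rather than apply changes blindly.

```python
import json
import os
import requests

# Pull the latest Adaptive Metrics recommendations and write them to a rules
# file that a GitOps pipeline can review and apply. The base URL and the
# /aggregations/recommendations path are assumptions; consult Grafana's
# Adaptive Metrics API documentation for the exact routes and auth scheme.
BASE_URL = os.environ["GRAFANA_METRICS_URL"]   # e.g., the stack's Prometheus endpoint
AUTH = (os.environ["GRAFANA_USER"], os.environ["GRAFANA_API_KEY"])

recs = requests.get(f"{BASE_URL}/aggregations/recommendations", auth=AUTH, timeout=30)
recs.raise_for_status()

# Committing the file (rather than POSTing rules directly) keeps a reviewable
# history in Git, matching the "GitOps" path described above.
with open("aggregation_rules.json", "w") as f:
    json.dump(recs.json(), f, indent=2)
```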
“Looking ahead, we’re also excited to explore how we can save time and money by piloting Adaptive Logs,” Lion said. “So far, we’ve had a really positive experience, but it’s still early days. With these features, we’ll be able to reinvest in other parts of our observability stack and get the most out of Grafana Cloud.”