From 2715333a6c66c9eaa648c13e10c02481f47280aa Mon Sep 17 00:00:00 2001 From: Jerod Santo Date: Tue, 1 Oct 2024 09:49:42 -0500 Subject: [PATCH] Big Tent 14, 15 --- bigtent/big-tent-14.md | 300 +++++++++++++++++++++-------------------- bigtent/big-tent-15.md | 293 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 446 insertions(+), 147 deletions(-) create mode 100644 bigtent/big-tent-15.md diff --git a/bigtent/big-tent-14.md b/bigtent/big-tent-14.md index 620b4cf5..e200eca5 100644 --- a/bigtent/big-tent-14.md +++ b/bigtent/big-tent-14.md @@ -1,293 +1,299 @@ -**Mat Ryer:** Hello. I'm Mat Ryer, and welcome to Grafana's Big Tent. It's a podcast all about the people, community, tools and tech around observability. Today we're talking about monitoring Kubernetes. Why do we need to do that? Why do we need an episode on that? Well, we're gonna find out. And joining me today, it's my co-host, Tom Wilkie. Hi, Tom. +**Mat Ryer:** Hello and welcome to Grafana's Big Tent, the podcast all about the people, community, tools and tech around observability. I'm Mat Ryer. Today we're discussing metrics, specifically adaptive metrics. We'll learn about what that is and why that is. -**Tom Wilkie:** Hello, Mat. How are you? +Helping me dig into this subject, I'm joined by some guests... I'm joined by Patrick, Mauro and Oren. Welcome to Grafana's Big Tent. Perhaps you could give us a quick intro of yourselves... Patrick, starting with you. -**Mat Ryer:** Pretty good. Where in the world are you doing this from today? +**Patrick Oyarzun:** Sure, yeah. My name is Patrick Oyarzun, I'm a Principal Engineer at Grafana. I've been working on Mimir and Hosted Metrics for about two and a half years, joined Adaptive Metrics about two years ago, and I've spent most of the time working on the Recommendations Engine. -**Tom Wilkie:** Well, I guess for all the podcast listeners, you can't see the video... But out the window you can see the Space Needle in Seattle. +**Mat Ryer:** Cool. Hopefully we'll learn a bit more about some of the work you've been doing today. Mauro, welcome. -**Mat Ryer:** Okay, that's a clue. So from that, can we narrow it down? Yeah, we'll see if we can narrow that down with our guests. We're also joined by Vasil Kaftandzhiev. Hello, Vasil. +**Mauro Stettler:** Hi. Yeah, my name is Mauro. Just like Patrick, I'm also a Software Engineer at Grafana. I'm working on Adaptive Metrics as well, and... Yeah, I think I started Adaptive Metrics as a hackathon project roughly around three years ago. -**Vasil Kaftandzhiev:** Hey, Mat. How are you doing today? +**Mat Ryer:** Very cool. Okay, and also our final guest, Oren. Oren, you're not from Grafana, are you? Where are you from? And welcome to the show. -**Mat Ryer:** Oh, no bad. Thank you. And we're also joined by \[unintelligible 00:01:05.12\] Hello, Deo. +**Oren Lion:** Thank you, Mat. I'm super-excited to be here. I work at Teletracking, and you probably have never heard of us. -**Deo:** Hey, Mat. It's nice to be here. +**Mat Ryer:** Well, we have now. -**Mat Ryer:** It's an absolute pleasure to have you. Do you have any ideas where Tom could be then, if he's got the Seattle Needle out his window? Any ideas? +**Oren Lion:** Now you have. And we've got a tagline, maybe that'll help make it more memorable... And it's "No patient waits for the care they need." And just to help convey what that means - to me as a patient, for example, I get care in a smooth and timely way. 
As a patient, I don't need to wait to get admitted, and I don't need to stay longer than I need to.

-**Vasil Kaftandzhiev:** He's definitely not in Bulgaria, where I'm from. +And from the healthcare system's perspective, they need to run their business efficiently. By increasing their capacity to care, healthcare systems can lower their operating costs and run sustainably. For example, by improving bed turnaround time, reducing ED wait times, and reducing length of stay.

-**Tom Wilkie:** I am not, no. +And by the way, Mat, you may have been an indirect customer, hopefully not too often... We are implemented in a number of trusts in the NHS. So my name is Oren Lion, and I run productivity engineering at TeleTracking. And my interest in adaptive metrics is that I'm looking for ways to provide a high quality of service - I need monitoring for that, but also, I need to reduce the cost of running services.

-**Mat Ryer:** Yeah. Okay, is that where you're dialing in from? +**Mat Ryer:** Excellent. And you're an engineer, you're not in marketing?

-**Vasil Kaftandzhiev:** I'm dialing from Sofia, Bulgaria. This is a nice Eastern European city. +**Oren Lion:** I actually work through my engineers. I'm a manager of engineers, so I get to take credit for the work they do.

-**Mat Ryer:** Oh, there you go. Advert. Deo, do you want to do a tourist advert for your place? +**Mat Ryer:** Yeah, fair enough. Okay, good. So maybe we could then -- let's start with the problem that adaptive metrics is solving. What's the problem? Too many metrics? How can it be that many?

-**Deo:** Yeah, I'm based in Athens. \[unintelligible 00:01:31.28\] almost 40 degrees here. It's a bit different. +**Oren Lion:** Yeah, I was thinking about this, Mat... And as I'm looking back on how I migrated all the development teams from our open source observability stack, Prometheus, Thanos, Trickster and Grafana, to Grafana Cloud, I was blown away by the volume of metrics flooding in. There's no way we're using them, and spend is just skyrocketing. And we're pushing over 1 million time series. And in case my team ever listens to this, when I say "I am taking credit for your work" - I just want to say thank you a few times. So thank you, productivity engineering team at TeleTracking.

-**Mat Ryer:** Athens sells itself really, doesn't it? Alright. +But back to the question - so how did I get to a million time series? And I'm asking myself, "Where do they come from?" And I'm thinking of two camps. Custom metrics, and dependencies. So custom metrics are what we write to measure the processing of business events, and dependencies are the resources we use to support our business services.

-**Tom Wilkie:** Well, I can assure you it's not 40 degrees C in Seattle. It's a bit chilly here. It's very welcoming to a Brit. +\[00:04:17.06\] So in Kubernetes - think kube-state-metrics, EC2, node_exporter, Kafka, JMX... And what I've found is that 40% of the time series come from custom metrics, and the other 60% of time series come from dependency metrics.

-**Mat Ryer:** I don't know how the ancient Greeks got all that work done, honestly, with it being that hot there. It's that boiling... How do you invent democracy? It's way too hot. +Now, to get a picture of this, here's a concrete example of custom metrics. I've got 200 microservices. Each service produces 500 time series, and they're scaled to three pods. So that's 300,000 time series pushing from that cluster, and I'm well on my way to a million. 
And I'm wishing this was capital gains on my stocks... But let's say we migrate to a new cluster and we blue/green. Now we're pushing 600,000 time series for a time. And just turning to dependencies - so take PromTail; it runs as a daemon set, so there's one pod per node. It publishes 275 time series per pod, on a 40-node cluster. That's 10,000 time series. -**Tom Wilkie:** Is that a global warming joke, is it, Mat? I don't think thousands of years ago it was quite that warm. +So back to the question, "How did I get to a million time series?" It's fast and easy. When you work with the teams, it's like they're thinking ahead about design, and about how to monitor their service and dependencies, but fall short of estimating and tracking the cost to monitor a service. -**Mat Ryer:** Oh, really? No... It must have still been. Actually, that's a great point. I don't know. Okay, well... Right. Tell me. Why do we need to do a podcast episode on monitoring Kubernetes. Aren't traditional techniques enough? What's different? Deo, why do we need to have this particular chat? +**Mat Ryer:** Yeah. It's a very tough problem to solve, so I don't really blame them... You take credit for their work. Do you also share some of the blame? Is that how it works? -**Deo:** Alright, that's interesting. First of all, I'm leading a DevOps team, so I have a DevOps background; I come like out of both ways. It can even be like an engineering background, or a sysadmin one. Now, if we're talking about the old way of doing monitoring things, I don't know. So I'm based on engineering. In my past positions I was writing code, and I still am... But the question "Why do we need monitoring?" is if we are engineers, and we deploy services, and we own those services, monitoring is part of our job. And it should come out of the box. It should be something that -- how we could do it in the past, how we can do it now... It's part of what we're doing. So it's part of owning your day to day stuff. It's part of our job. +**Oren Lion:** Oh yeah, definitely. -**Tom Wilkie:** I mean, that's a really interesting kind of point, where like, who's responsible for the observability nowadays? I definitely agree with you, kind of in the initial cloud generation, the responsibility for understanding the behavior of your applications almost fell to the developers, and I think that explains a lot of why kind of APM exists. But do you think -- I don't know, maybe leading question... Do you think in the world of Kubernetes that responsibility is shifting more to the platform, more to kind of out of the box capabilities? +**Mat Ryer:** Great. -**Deo:** It should be that 100%. Engineers who deploy code, post code, they shouldn't care about where this leaves, how it's working, what it does, does it have basic knowledge... But everything else should come out of the box. They should have enough knowledge to know where the dashboards are, how to set up alerts... But in our example, most of the times we just deploy something, and then you have a ton of very good observability goodies out of the box. Maybe it wasn't that easy in the past; it's very easy to do now. The ecosystem is in a very, very good position to be able, with a very small DevOps team, to be able to support a big engineering team out of the box. +**Oren Lion:** Yeah. Yeah. I could have prevented a lot of this, but I just let it go. -**Tom Wilkie:** I guess what is it about Kubernetes in particular that's made that possible? 
What is it about the infrastructure and the runtime and all of the goodies that come with Kubernetes that mean observability can be more of a service that a platform team offers, and not something every individual engineer has to care about? +**Mat Ryer:** Well, it's a tough problem, because you don't really always know what you're going to need later. So there's definitely this attitude of "We'll just record everything, because we're better safe than sorry." Was it those sorts of motivations driving it? -**Deo:** Alright, so if we talk about ownership, for me it shouldn't be different. It should be owned by the team who writes this kind of stuff. Now, why Kubernetes? It's special. Maybe it's not. Maybe the technology is just like going other ways. But I think now we're in a state where the open source community is very passionate about this, people know that you should do proactive monitoring, you should care... And now Kubernetes - what it did that was very nice, and maybe spinned up, like maybe make this easier - healing. Auto-healing now is a possibility. So as an engineer, maybe you don't need to care that much about what's going on. You should though know how to fix it, how to fix it in the future... And if you own it, by the end of the day things will be easier tomorrow. +**Oren Lion:** Yeah... I mean, when I wondered why this was happening - it's a verbosity problem. And controlling verbosity is an observability problem. Look at logs and metrics. With logs, you solve the problem of controlling verbosity, with log level. -So what we can have -- we'll have many dashboards, many alerts, and it's easy for someone to pick this up. By the end of the day, it's like a million different services and stack underneath. But all this complexity somehow has been hidden away. So engineers now they're supposed to know a few more things, but not terribly enough \[unintelligible 00:05:59.03\] Maybe that was not like in the past. But it is possible now. And partially, it's because of the community over there. How passionate the community is lately when it comes to infra and observability and monitoring. +So you have error, info \[unintelligible 00:06:28.18\] debug, trace, and you set the log level according to your need. Usually error... Don't forget to dial it back from debug to error when you're done debugging. And every log has a cost. And I'm thinking about Ed Welch, in his great podcast, "All Things Logs", where he points that out... -**Vasil Kaftandzhiev:** \[06:15\] It's really interesting how on top of the passion and how Kubernetes have evolved in the last 10 years, something more evolved, and this is the cloud provider bills for resources, which is another topic that comes to mind when we're talking about monitoring Kubernetes. It is such a robust infrastructure phenomena, that touches absolutely every part of every company, startup, or whatever. So on top of everything else, developers now usually have the responsibility to think about their cloud bill as well, which is a big shift from the past till now. +And then turning our attention to metrics. Metrics have verbosity, but it comes in the form of cardinality. So we're generally not as good at filtering metrics in relabel configs as we are at setting a log level for logs. And so there are ways to identify high cardinality, but nothing like a log level. There's no simple way to just dial it down. And with a nod to Ed Welch, every time series has a cost... -**Deo:** You're right, Vasil. However, it's a bit tricky. 
One of the things we'll probably talk about is it's very easy to have monitoring and observability out of the box. But then cost can be a difficult pill to swallow in the long run. Many companies, they just -- I think there are many players now in observing the observability field. The pie is very, very big. And many companies try to do many things at the same time, which makes sense... But by the end of the day, I've seen many cases where it's getting very, very expensive, and it scales a lot. So cost allocation and cost effectiveness is one of the topics that loads of companies are getting very worried about. +**Mat Ryer:** Yeah, fair enough. Well, you're actually on a podcast with a couple of the old engineers here... So if you've got ideas to pitch them, by all means. Mauro, what do you think? Label levels for metrics? -**Tom Wilkie:** Yeah, I think understanding the cost of running a Kubernetes system is an art unto itself. I will say though, there are certain bits, there's certain aspects of Kubernetes, certain characteristics that actually make this job significantly easier. I'm trying to think about the days when I deployed jobs just directly into EC2 VMs, and attributing the cost of those VMs back to the owner of that service was down to making sure you tagged the VM correctly, and then custom APIs and reporting that AWS provided. And let's face it, half the teams didn't tag their VMs properly, there was always a large bucket of other costs that we couldn't attribute correctly... And it was a nightmare. +**Mauro Stettler:** Yeah, I agree with everything that Oren said. I would like to add one more thing regarding where the high cardinality comes from. I think in some cases the cardinality is simply an artifact of how the metrics got produced. So depending on what the metric is -- usually when the metrics are produced, there are many service instances producing metrics about themselves. And very often you have multiple service instances doing the same thing. They're basically just replicas of the same service. And due to the fact how they are produced and collected, each of the metrics that are -- or each of the time series that are produced by each of those instances get a unique label assigned to them. -And one of the things I definitely think has got better in the Kubernetes world is everything lives in a namespace. You can't have pods outside of namespaces. And therefore, almost everything, every cost can be attributed to a namespace. And it's relatively easy, be it via convention, or naming, or extra kind of labeling and extra metadata, to attribute the cost of a namespace back to a service, or a team. And I think that for me was the huge unlock for Kubernetes cost observability, was just the fact that this kind of attribution is easier. I guess, what tools and techniques have you used to do that yourselves? +\[00:08:09.26\] But depending on what the metric represents, you might not actually need to know which of those instances has, I don't know, increased the counter. So in those cases, I think it's also common that the cardinality is higher than what's actually necessary, simply because of the way how the metrics pipeline works. And in those cases where you don't actually want to pin down which service instance has increased some counter or whatever, it's useful to have something in the metrics pipeline, which can drop those labels that you don't need. -**Deo:** Right. So I don't want to sound too pessimistic, but unfortunately it doesn't work that nicely in reality. 
So first of all, cloud providers, they just - I think they enable this functionality to be supported out of the box (maybe it's been a year) \[unintelligible 00:09:22.25\] And GCP just last year enabled cost allocation out of the box. So it means you have your deployment in a namespace, and then you're wondering, "Okay, how much does this deployment cost? My team owns five microservices. How much do we pay for it?" And in the past, you had to track it down by yourself. Now it's just only lately that cloud providers enable this out of the box. +**Mat Ryer:** So that was going to be my question... How do people solve this today then? Is it a case of you sort of have to go through all your code, look at all the places you're producing metrics, and try and trim it down? Does that ever work? -So if you have these nice dashboards there, and then you see "My service costs only five pounds per month", which is very cheap, there is an asterisk that says "Unfortunately, this only means your pod requests." Now, our engineers, like everyone else, it takes a lot more effort to have your workloads having the correct requests, versus limits, so it's very easy by the end of the day to have just a cost which is completely a false positive. Unfortunately, for me at least, it has to do with ownership. And this is something that comes up again and again and again in our company. +**Mauro Stettler:** It kind of works, but... So in my experience of working with customers -- so now you're basically talking about metrics that are application observability metrics, not the dependencies that you just talked about, Oren, right? So in those cases, very often what I've seen is that in larger organizations it can happen that there are strict rollout policies which make it impossible to deploy fixes quickly. So it can happen that, for example, a new change gets deployed, which blows up the cardinality because a new label has been added, which has a really high cardinality... Then the team of developers maintaining that application realizes the problem, they want to fix it, but it takes weeks to get the fix into production. And that's also another one of those situations where it's really useful that if in the metrics pipeline we can just drop the label, even if it's just a bridge until the fix has been deployed. -\[10:25\] Engineers need to own their services, which means they need to care about requests, and limits. And if those numbers are correct - and it's very difficult to get them right - then the cost will be right as well. It's very easy just to have dashboards for everyone, and then these dashboards will be false positives, and then everyone will wonder why dashboards \[unintelligible 00:10:44.11\] amount of money, they will pay 10x... It's very difficult to get it right, and you need a ton of iterations. And you need to push a lot, then you need to talk about it, be very vocal, champion when it comes to observability... And again, it's something that comes up again and again. +**Patrick Oyarzun:** Yeah. And there's been existing solutions for a while to try to help control cardinality. Typically, it takes one of two forms. One is just dropping some metrics entirely. It's common, for example - say you're monitoring Kubernetes; you might find a list somewhere of metrics you don't really need, according to somebody's opinion or some philosophy that they've applied. And you might decide to just drop all of those outright. That still requires changing, relabel configs, which like Mauro said, sometimes it's hard... 
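\[Editor's note: the scrape-time filtering Patrick is describing is typically expressed as Prometheus `metric_relabel_configs`. The sketch below is illustrative only - the job name, targets, metric patterns and the `request_id` label are made up - and it shows both flavours: dropping whole metrics, and dropping a single label, which is where the collision problem discussed next comes from.\]

```yaml
# Illustrative sketch only - names and targets are hypothetical.
scrape_configs:
  - job_name: my-app
    static_configs:
      - targets: ["my-app:8080"]
    metric_relabel_configs:
      # Drop whole metrics you have decided you never use.
      - source_labels: [__name__]
        regex: "go_memstats_.*|go_gc_duration_seconds.*"
        action: drop
      # Or drop one high-cardinality label from everything this job scrapes.
      # If the remaining label sets are no longer unique, the series collide -
      # exactly the "duplicate sample" problem discussed below.
      - regex: "request_id"
        action: labeldrop
```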
-If you are a champion of observability, sometimes it's going to be cost allocations, sometimes it's going to be requests and resources, sometimes it's going to be dashboards, sometimes it's going to be alerts, and then they keep up. Because when you set up something right, then always there is like the next step you can do - how can we have those cost allocations proactively? How can we have alerts based on that? How can we measure them, and get alerted, and then teach people, and engage people? It's very difficult questions, and it's really difficult to answer them correctly. And I don't think still cloud providers are there yet. We're still leading. +The other way that people try to do this is by dropping individual labels off of their metrics. And so maybe you realize you have some redundant label, or you have a label that is increasing cardinality, but you don't care about... The issue with doing that though is a lot of times you'll run into issues in the database that you're sending to. So lots of time series databases, including Prometheus - and anything that's trying to look like Prometheus - require that every series has a unique label set. And so if you try to, say, just drop a pod label for example on a metric, usually you'll start getting errors saying "duplicate sample received", "same timestamp, but a different value" kind of thing.

-**Tom Wilkie:** I'm not disagreeing at all, but a lot of this doesn't rely on the cloud provider to provide it for you, right? A lot of these tools can be deployed yourself. You can go and take something like OpenCost, run that in your Kubernetes cluster, link it up to a Prometheus server and build some Grafana dashboards. You don't have to wait for GCP or AWS to provide this report for you. That's one of the beauties of Kubernetes, in my opinion. This ability to build on and extend the platform, and not have to wait for your service provider to do it for you is like one of the reasons why I'm so passionate about Kubernetes. +And so it's not as simple as just don't send the data. You need to figure out -- now you have two samples with the same labels. Do you want to persist one of them, or the other, or maybe some combination? And so generally, what you really need to do is aggregate the data. You need to add it up, or maybe you want to store the max, or something like that, in order to actually do it.

-**Deo:** Completely agree. And again, I don't want to sound pessimistic. OpenCost is an amazing software. The problem starts when not everything is working with OpenCost. So for example, buckets - you don't get \[unintelligible 00:12:27.22\] Auto-scaling is a very good example. So if you say "I have this PR on Terraform, and then it will just increase auto-scaling from three nodes to 25 nodes. OpenCost, how much will this cost?" OpenCost will say "You know what? You are not introducing any new costs, so it's zero." Fair enough. \[unintelligible 00:12:48.12\] is going to be deployed. But then if your deployment auto-scales to 23 nodes, it's going to be very expensive. So while the technology is there, it's still -- at least in my experience, you need to be very vocal about how you champion these kinds of things. And it's a very good first step, don't get me wrong. It's an amazing first step, okay? We didn't have these things in the past, and they come from the open source community, and they're amazing. And when you link all of these things together, they make perfect sense. 
And they really allow you to scale, and like deploy stuff in a very good way, very easily. But we still -- it needs a lot of effort to make them correct. +**Oren Lion:** Yeah. I mean, just building on that, before adaptive metrics we'd have to detect a problem, and it takes effort just like detecting any defect for a team, as Mauro was saying, and it distracts them from other things they're doing, in a deployment that causes a surge in metrics. As a customer, I kind of have some buffer, because Grafana has 95th percentile billing, so I've got a spike budget. And as long as I catch the problem quickly, I won't get billed for it. So I'm sure I'll flood Grafana again with metrics I don't care to pay for. So Mat, and Grafana, and team, my future self thanks you, but we still need alerts to like catch those things. -**Vasil Kaftandzhiev:** I really love the effort reference. And at the present point, if we're talking about any observability solutions, regardless if it is OpenCost for cost, or general availability, or health etc. we start to blend the knowledge of the technology quite deep into the observability stack and the observability products that are there. And this is the only way around it. And with the developers and SREs wearing so many superhero capes, the only way forward is to provide them with some kind of robust solutions to what they're doing. I'm really amazed of the complexity and freedom and responsibilities that you people have. It's amazing. As Peter Parker's uncle said, "With a lot of power, there is a lot of responsibility." So Deo, you're Spiderman. +\[00:12:03.11\] And then after you've caught it, like Patrick was saying, it can be tricky to diagnose the problem, so now the team is spending time trying to diagnose the problem. And we need to find it, we have to diagnose it, and now we need to resolve it, and that takes time, too. So people aren't super-familiar with relabel configs in Prometheus, and so they'll have to read the docs, and make the changes, and PR the change, and get it merged, and deploy it through a pipeline... And it's probably an orchestration pipeline, because it might be a Prometheus operator going to five clusters. So it's a lot of effort. Every time you have a cost overrun, it's expensive from a billing perspective, and it's expensive from a resource perspective. -**Deo:** \[14:25\] I completely agree. One of the things that they really work out, and I've seen it -- because you're introducing all these tools, and engineers can get a bit crazy. So it's very nice when you hide this complexity. So again, they don't need to know about OpenCost, for example. They don't need to know about dashboards with cost allocation. You don't need to do these kinds of things. The only thing they need to know is that if they open a PR and they add something that will escalate cost, it will just fail. This the only thing they need to know. That you have measures in there, policies in there that will not allow you to have a load of infra cost. +**Mat Ryer:** Yeah. And I suppose it's a kind of ongoing cost too, because as things are changing in the system and as people are adding things, you might well have these problems again, and then you have to go and do the same thing again. -Or, then something else we're doing is once per month we just have some very nice Slack messages about "This is a team that spent less money this quarter, or had this very big saving", and then some it could champion people... Because they don't need to know what is this dashboard. 
By the way, it's a Grafana dashboard. They don't need to know about these kinds of things. They only need to know "This spring I did something very good, and someone noticed. Okay. And then I'm very proud for it." So if people are feeling proud about their job, then the next thing, without you doing anything, they could try to become better at it. And then they could champion it to the rest of the engineers. +**Oren Lion:** It's unintentional. We'll create custom metrics, maybe a new histogram, and we may have more buckets than we need... So out of no one's desire to increase costs, it can just skyrocket at any point in time. \[unintelligible 00:13:15.27\] were always faced with that surge problem. -**Vasil Kaftandzhiev:** There is an additional trend that I'm observing tied to what you're saying, and this is that engineers start to be focused so much on cost, that this damages the reliability and high availability, or any availability of their products... Which is a strange shift, and a real emphasis on the fact that sometimes we neglect the right thing, and this is having good software produced. Yeah, +**Mat Ryer:** Yeah. That sounds quite scary, and I think we've done quite a good job of establishing the problem for adaptive metrics. So could somebody explain to me what is adaptive metrics? If you've never heard of this feature, what actually is it? Who wants to have a go? -**Tom Wilkie:** Yeah, you mentioned a policy where you're not allowed to increase costs. We have the exact opposite policy. The policy in Grafana Labs is like in the middle of an incident, if scaling up will get you out of this problem, then do it. We'll figure out how to save costs in the future. But in the middle of like a customer impacting problem, spend as much money as you can to get out of this problem. It's something we have to actively encourage our engineering team to do. But 100%, the policy not to increase costs is like a surefire way to over-optimize, I think. +**Mauro Stettler:** Okay, so basically adaptive metrics consists of two parts. The first part is what we call the recommendations engine, which analyzes a user's series index. So it looks at all of the series that the user currently has, and it looks at the usage on those series. Based on that information, it then tries to identify labels that the user has, which raises the series cardinality, and which the user actually doesn't need, because they don't use them, according to their usage patterns. So it generates recommendations saying "Label X could be dropped", and this will reduce your active series count by some number. -**Deo:** We have a reason, we have a reason. So you have a very good point, and it makes perfect sense. In our case though, engineers, they have free rein. They completely own their infrastructure. So this means that if there's a bug, or something, or technical debt, it's very easy for them to go and scale up. If you're an engineer and you have to spend like two days fixing a bug or add a couple of nodes, what do you do? Most of the times people will not notice. So having a policy over there saying "You know what? You're adding 500 Euros of infrastructure in this PR. You need someone to give an approval." It's not like we're completely blocking them. +Then the second part is what we call the aggregator, which is a part in the metrics ingestion pipeline when you send data to the Grafana cloud. 
The aggregator allows the user to apply and implement those recommendations by defining rule sets which say "Okay, I want to drop this label from metric X", and the aggregator then performs the necessary aggregation on that data in order to generate an aggregate with a reduced cardinality, according to the recommendation.

-And by the way, we caught some very good infrastructural bugs out of these. Engineers wanted to do something completely different, or they said "You know what? You're right. I'm going to fix it in my way." Fix the memory leak, instead of add twice the memory on the node \[unintelligible 00:17:35.22\] Stuff like that. But if we didn't have this case, if engineers were not completely responsible for it, then what you say makes perfect sense. +**Patrick Oyarzun:** Yeah. So basically, what this is doing is -- it's very similar to kind of what I would do as a person. Like, if I wanted to do this myself, I might start by checking out "What are the main dashboards I care about? What are my SLOs based on?" And you try to develop kind of some idea of "What data is important?" And we actually have a tool that has existed for a while, called mimirtool, that can automate a lot of that. And it's open source, anybody can use it. It'll tell you which metrics are used, basically.

-**Mat Ryer:** Yeah, this is really interesting. So just going back, what are the challenges specifically? What makes it different monitoring Kubernetes? Why is it a thing that deserves its own attention? +What adaptive metrics does is it goes a step further. So instead of just telling you that the Kubernetes API server request duration seconds bucket is used, it'll also tell you whether or not every label is used. And it'll also know that in the places it is used, it's only ever used in a sum-of-a-rate PromQL expression. And because we know all of that at once, we can actually tell you with confidence "Hey, if you drop the pod label on that, it's not going to affect anything." All of your dashboards will still work, you'll still get paged for your SLOs.

-**Tom Wilkie:** \[18:01\] I would divide the problem in two, just for simplicity. There's monitoring the Kubernetes cluster itself, the infrastructure behind Kubernetes that's providing you all these fabulous abstractions. And then there's monitoring the applications running on Kubernetes. So just dividing it in two, to simplify... When you're looking at the Kubernetes cluster itself, this is often an active part of your application's availability. Especially if you're doing things like auto-scaling, and scheduling new jobs in response to customer load and customer demand. The availability of things like the Kubernetes scheduler, things like the API servers and the controller managers and so on. This matters. You need to build robust monitoring around that, you need to actively enforce SLOs around that to make sure that you can meet your wider SLO. +And so it bridges the gap between what we've had for a long time with something like mimirtool, to just know "Hey, this metric is unused", now going from there all the way through to implementing a real solution that'll save cardinality. And then even more than that, this is a thing that adapts over time. So like I said before, it's been common for a long time to have these kinds of public datasets of like "These are all the metrics that you probably want to keep for Kubernetes, or for Kafka, or for Redis", or any popular technology. 
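\[Editor's note: as a rough illustration of the kind of recommendation being described - if a counter is only ever queried as a sum of a rate, the pod label can be aggregated away without changing any dashboard. Adaptive Metrics applies the equivalent aggregation at ingest time; expressed as a plain Prometheus recording rule, with a made-up metric name, it would look something like this.\]

```yaml
# Illustrative sketch only - the metric name is hypothetical.
groups:
  - name: adaptive-metrics-style-aggregation
    rules:
      # Dashboards only ever query sum(rate(http_requests_total[5m])),
      # so pre-aggregate the rate without the pod label.
      - record: job:http_requests:rate5m
        expr: sum without (pod) (rate(http_requests_total[5m]))
```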
-We've had outages at Grafana Labs that have been caused by the Kubernetes control plane and our aggressive use of auto-scaling. So that's one aspect that I think maybe doesn't exist as much if you're just deploying in some VMs, and using Amazon's auto-scalers, and so on. +\[00:16:18.22\] What adaptive metrics does is basically it'll find all of that dynamically, and then over time, as you start using more, or stop using some of it, or anything like that, it'll start generating new recommendations. Maybe you can aggregate a little bit more aggressively, because you transitioned from float histograms to native histograms, which is a feature related to the issue Oren was talking about with histogram cardinality. Maybe once you do that migration, adaptive metrics might notice "Hey, that old normal histogram is unused now, and you can get rid of it", even if you don't have control of the application that's generating that data. -I think the second aspect though is where the fun begins. The second aspect of using all of that rich metadata that the Kubernetes server has; the Kubernetes system tells you about your jobs and about your applications - using that to make it easier for users to understand what's going on with their applications. That's where the fun begins, in my opinion. +So it's really this feedback loop that I think has made adaptive metrics start to stand out... And we've had it internally. At this point, we apply the latest recommendations every weekday morning, and we've been doing that for, I don't know, quite a long time now. Nobody reviews them... It generally has worked pretty well. -**Deo:** Completely agree. If you say to engineers "You know what? Now you can have very nice dashboards about the CPU, the nodes, and throughput, and stuff like that", they don't care. If you tell them though that "You know what? You can't talk about these kinds of things without Prometheus." So if you tell them "You know what? All of these things are Prometheus metrics. And we just expose \[unintelligible 00:19:49.05\] metrics, and everything is working", they will look there. If you tell them though that "You know what? If you expose your own metrics, that says scale based on memory, or scale based on traffic." Either way, they become very intrigued, because they know the bottleneck of their own services; maybe it is how many people visit the service, or how well it can scale under certain circumstances, based on queues... There are a ton of different services. So if you tell them that "You know what? You can scale based on the services. And by the way, you can have very nice dashboards that go with CPU memory, and here is your metric as well", this is where things become very interesting as well. +**Oren Lion:** Yeah, I just wanted to build on what Patrick was saying about being able to aggregate a pod. So that's a degree of verbosity we generally don't need, so we can aggregate it up. But we couldn't publish those time series to Grafana, because we would get those errors that Patrick mentioned earlier. There's no disambiguating label, so we get errors, but Grafana can do that at the front door. And it's a level of detail we don't use commonly, but if there's an issue, we could disaggregate. So in a way, it's giving me a log level, but for metrics. I can dial it down and get a higher degree of aggregation, and then I can dial it up and disaggregate, and get that detail at the pod level. And we use this for custom metrics, too. We may have event-level labels. 
However, we don't need that label data commonly, but if there's an issue, we can disaggregate and basically increase the log level to debug, pay for it, get what we need, and then re-aggregate back down to error for metrics. So that helps us control costs and also control having data at our use, that we're actually using, and then when we stop using it, we can just make it all go away using adaptive metrics. -And then you start implementing new things like a pod autoscaler, or a vertical pod autoscaler. Or "You know what? This is the service mesh, what it looks like, and then you can scale, and you can have other metrics out of the box." And we'll talk about golden metrics. +**Mat Ryer:** Yeah, that sounds great. -So again, it would take a step back... Most engineers, they don't have golden metrics out of the box. And that is a very big minus for most of the teams. Some teams, they don't care. But golden metrics means like throughput, error rate, success rate, stuff like that... Which, in the bigger Kubernetes ecosystem you can have them for free. And if you scale based on those metrics, it's an amazing, powerful superpower. You can do whatever you want as an engineer if you have, and you don't even need to care where those things are allocated, how they're being stored, how they're being served, stuff like that. You don't need to care. You only need some nice dashboards, some basic high-level knowledge about how you can expose them, or how you can use them, and then just to be a bit intrigued, so you can make them the next step and like scale your service. +**Patrick Oyarzun:** Yeah. And one important thing to remember is that - we talked before about this difference between custom metrics like your business data, and dependency metrics. A lot of times both of these can actually be exported by the same process. And so, for example, say you're running a Mimir pod; it's going to export information about the queries that it's serving, or the series that it's loading. Those more like businessy things about the job of Mimir. It's also going to export all kinds of data about the Go runtime. And both of those things end up getting mixed together in your collection pipeline. And so it can be really tricky without something like adaptive metrics to do this kind of dynamic disaggregating when you need it. -**Tom Wilkie:** You said something there, like, you can get these golden signals for free within Kubernetes. How are you getting them for free? +Maybe for your business metrics you don't really care which pod served the most queries this month. It's not a contest. You just want to know how your queries are going. But when something goes wrong, maybe you actually do need to look into the Go garbage collector. And so it's possible now with something like adaptive metrics to just disaggregate this one garbage collector metric that you need, or even 10 of them, whatever it is you need, without needing to change the pipeline in a way where you now disaggregate for everything coming from that pod. It allows you to kind of think of these as different things. -**Deo:** Okay, so if you are on TCP, or if you are in Amazon, most of those things have managed service mesh solutions. +**Oren Lion:** I would just add one distinction. So when the Go profiler metrics get published to Grafana, we can just drop them using adaptive metrics. They're not even getting aggregated, because no one uses them, and that's a really great feature of adaptive metrics. It's like a big red button, and you go "I don't need these. 
Just drop them at the front door." But other metrics, like the business metrics that we care about, we're just aggregating and then disaggregating, just like a log level. -**Tom Wilkie:** I see. So you're leveraging like a service mesh to get those signals. +\[00:20:22.14\] And a crazy idea for you, Patrick, is what if there were some standard for metrics where we could have something like a log level set in a label, so that a tool like Prometheus maybe would even suppress sending them? I don't know. I just want to throw that out there. It occurred to me. -**Deo:** Yes, yes. But now with GCP it's just one click, in Amazon it's one click, Linkerd is just a small Helm deploy... It's no more different than the Prometheus operator, and stuff like that. +**Patrick Oyarzun:** Yeah, I mean, there's so many cool ideas that I think we've had, and we want to build on top of adaptive metrics... Generally, I think -- the word I keep using when I think about it is just-in-time metrics. It's something that I think we're just scratching the surface of, but I can imagine a future state where you generally are paying very little for your metric storage, and then something goes wrong and you can turn on the firehose, so to speak, of like "I want to know everything." And I think it's not just metrics. Grafana in general is developing adaptive telemetry across all observability signals. And I think there could be a future state where when you turn on that firehose, it includes all signals, and you're saying all at once, "In this one region, I know I'm having an issue, and I want to stop dropping metrics, logs, traces, profiles... I want to just get everything for the next hour." And then I can do my investigation, I can do my forensics, and then I can start saving money again. I think it's realistic that we could get to that point eventually. -**Tom Wilkie:** \[22:09\] Yeah. I don't think it's a badly kept secret, but I am not a big fan of service meshes. +**Mat Ryer:** Yeah. I quite like the idea that you declare an incident and then it just starts automatically... It levels up everything because there's an incident happening just for that period. -**Deo:** I didn't know that... Why? +**Mauro Stettler:** Just to extend on that... I think it's even possible that we will get to a point where you cannot only turn on the firehose right now, but basically say that you want the firehose for the last hour. Because I think that would be the coolest feature that we could build. Because I think the biggest blocker for people which prevents them from adaptive metrics is often that they're afraid that they're going to drop information which they later regret to have dropped. Because they realize later that they actually would have needed it. If they would be able to build a feature that allows you to go back in time by just one hour or two hours, even if it's not very long, I think that would help really a lot to make people not worry about dropping labels of which we say that they don't need them. -**Tom Wilkie:** Yeah, I just don't think the cost and the benefits work out, if I'm brutally honest. There's definitely -- and I think what's unique, especially about my background and the kind of applications we run at Grafana Labs, is honestly like the API surface area for like Mimir, or Loki, or Tempo, is tiny. We've got two endpoints: write some data, run a query. 
So the benefit you get from deploying a service mesh - this auto instrumentation kind of benefit that you describe is really kind of what is trivial to instrument those things on Mimir and Loki. And the downside of running a service mesh, which is the overheads, it's the complexity, the added complexity, the increased unreliability... There's been some pretty famous outages triggered by outages in the service mesh... For an application like Mimir and Loki and a company like Grafana Labs I don't think service meshes are worth the cost. So we've tended to resort to things like baking in the instrumentation into the shared frameworks that we use in all our applications. But I just wanted to -- I want to be really clear, I don't think Kubernetes gives you golden signals out of the box, but I do agree with you, service meshes do, a hundred percent. +**Mat Ryer:** Yeah, this is cool. I mean, I don't think we're allowed to do podcast-driven roadmap planning, but it feels like that's where we are. Mauro, just to be clear, that feature would use some kind of buffer to store the data. You wouldn't actually try and solve time travel and send people back an hour, would you? That's too far. -**Deo:** It is an interesting approach. So it's not the first time I'm hearing these kinds of things, and one of the reasons - we were talking internally about service mesh for at least six months... So one of the things I did in order to make the team feel more comfortable, we spent half of our Fridays, I think for like three or four months, reading through incident reports around service meshes. It was very interesting. So just in general, you could see how something would happen, and how we would react, how we would solve it... And it was a very interesting case. And then we've found out that most of the times we could handle it very nicely. +**Mauro Stettler:** That's one possibility, but I don't have a design doc for that one. But we do have a design doc for the solution with the buffer, because we actually do have the buffer already. We just need to use it. -Then the other thing that justified a service mesh for us is that most of our -- we are having an engineering team of 100 people. And still, people could not scale up. They could not use Prometheus stuff. They could not use HPA properly, because they didn't have these metrics. So this is more complex... Anyway, we're using Linkerd, which is just -- it's not complex. We are a very small team. It's not about complex. It's not more complex than having a Thanos operator, or handling everything else. Again, it has an overhead, but it's not that much more complex. However, the impact it had to the engineering team having all those things out of the box - it was enormous. +**Mat Ryer:** I see, cool. Yeah, no, that does make sense. Do the easier one first, and then later -- save it for a different hackathon. -And one last thing - the newest Kubernetes version, the newest APIs will support service mesh out of the box. So eventually, the community will be there. Maybe it's going to be six months, maybe it's going to be one year. I think that engineering teams that are familiar with using those things, that embrace these kinds of services, they will be one step ahead when Kubernetes supports them out of the box... Which is going to be very, very soon. Maybe next year, maybe sooner than that. +We'll do the time travel... -**Tom Wilkie:** Yeah. I mean, I don't want to \*bleep\* on service meshes from a great distance. 
There are teams in Grafana Labs that do use service meshes, for sure. Our global kind of load balancing layer for our hosted Grafana service uses Istio, I believe. And that's because we've got some pretty complicated kind of routing and requirements there that need a sophisticated system. So no, I do kind of generally take the point, but I also worry that the blanket recommendation to put everything on a service mesh - which wasn't what you were saying, for sure... But I've definitely seen that, and I think it's more nuanced than that. +**Oren Lion:** Yeah, next hackathon. -**Mat Ryer:** \[25:59\] But that is a good point. Deo, if you have like a small side project, is the benefits of Kubernetes and service mesh and stuff, is it's so good that you will use that tech even for smaller projects? Or do you wait for there to be a point at which you think "Right now it's worth it"? +**Patrick Oyarzun:** We have a hackathon next week, actually. -**Deo:** Obviously, we'll wait. We don't apply anything to it just because of the sake of applying stuff. We just take what the engineering teams need. In our case, we needed \[unintelligible 00:26:21.07\] we really needed these kinds of things. We needed \[unintelligible 00:26:26.26\] metrics. We needed people to scale based on throughput. We needed people to be aware about the error rates. And we needed to have everything in dashboards, without people having to worry about these kinds of things. But the good is that we got out of the box, they're amazing. So for example, now we can talk about dev environments, because we have all this complexity away to the service mesh. We're using traffic split, which again, is a Kubernetes-native API now. +**Mat Ryer:** That's very true. Well, so I mentioned hackathons because that's where this idea came from, isn't it? It came out of a hackathon. What was the story there? -So probably this is where the community will be very, very soon, but I think \[unintelligible 00:27:03.13\] DevOps on our team, it's in a state where -- we're in a very good state. So we need to work for the engineering needs one year in advance. And people now struggle with dev environments, releasing stuff sooner. Observability - we have solved it in the high level, but in the lower level still people struggle to understand when they should scale, how they can auto-heal, stuff like that. And service meshes give you a very good out of the box thing. But again, we don't implement things unless we really need them... Because every bit of technology that you add, it doesn't matter how big or small your team is, it adds complexity. And you need to maintain it, and to have people passionate about it; you have to own it. +**Mauro Stettler:** Yeah, basically we saw that this is a feature that customers were asking for. We had multiple requests where customers said that they wanted it... And I just ended up being on a bunch of support calls with one of those customers asking for this feature, and I thought "Yeah, actually, this sounds like a pretty good idea, so why don't we build that?" And the hackathon came up, so I started prototyping it. And then the company picked up, and after a few months everybody at the company told me that we have to promote this, and this is going to be the next big feature. -One other thing that I have found out that it's not working maybe, at least in the engineering department, is that people, they often change positions. And Grafana is a very big player, so it has some very powerful, passionate people... 
But the rest of the engineering teams it's not the same. So you may have engineers jump every year, a year and a half... So sometimes it's not easy to find very good engineers, who are very passionate. \[unintelligible 00:28:14.11\] own it, and then help scale it further. So it is challenging, I completely agree. +**Mat Ryer:** \[00:24:00.04\] Which is a really cool thing at Grafana, that we have these regular hackathons where you can really just do that. Just a brand new thing happens and it comes from anywhere. But hang on though - but customers were asking for it... What were they saying? "I'd like to pay you less..." Why are you helping them? -**Tom Wilkie:** Yeah, I 100% agree. We encourage our engineers to move around teams a lot as well. And I think all really strong engineering teams have that kind of mobility internally. I think it's very important. I just want to -- you talked a lot about auto-scaling, and I do think auto-scaling is a great way, especially with the earlier discussion about costs... It's a great way to achieve this kind of infrastructure efficiency. But two things I want to kind of pick up on here. One is auto-scaling existed before Kubernetes. Right? I think everyone who's kind of an expert in EC2 and load balancers and auto-scaling groups are sitting there, shouting at the podcast, going "We used to do this before Kubernetes!" So what is it about Kubernetes that makes us so passionate about auto-scaling? Or is it just the standard engineering thing that everything old is new again, and this all cyclical? +**Mauro Stettler:** Pretty much. \[laughs\] Well, basically, customers were saying "Look, I have those labels which are blowing up my cardinality, which makes my bill really expensive. I want to get rid of those labels. But then I have the problem with the colliding label sets, because then I have multiple series which have the same label set", and that makes the data basically useless. That's the problem that Patrick described earlier. "Is there no way to solve this problem?" And the solution to solve this problem is to aggregate them correctly, to process the data at the time when you receive them in such a way that the relevant information that you want to get out of the data remains while we can drop the information that you don't want to keep. -**Deo:** Could you in the past auto-scale easily based on the throughput, and stuff? I'm not sure. +**Mat Ryer:** And what was the reaction like from the business people in Grafana? Was there concern that you're fighting against what they're trying to do? -**Tom Wilkie:** Yeah. Auto-scaling groups on Amazon were fantastic at that. +**Mauro Stettler:** Yeah, the reaction was diverse. I think the PM became a big fan of this feature, and started to do a great job at convincing all the salespeople that this is something that we really need, and that this is actually going to help them in the long term. So I didn't have to convince the salespeople myself, but... Yeah, I'm sure that at least in the beginning there were a lot of people suspicious that this is a good idea, because it's actually going to reduce our revenue effectively. -**Deo:** Alright. And what about the rest of the cloud providers? +**Mat Ryer:** Yeah, yeah. But I kind of love that it still happened. Don't you, Oren? -**Tom Wilkie:** Yeah, I mean... Are there other cloud providers? That's a bad joke... Yeah, no, you know, Google has equivalent functionality in their VM platform, for sure. I do think -- you do kind of make a good point... 
I think it's kind of similar to the OpenCost point we made earlier, of like Kubernetes has made it so that a lot of these capabilities are no longer cloud provider-specific. You don't have to learn Google's version of auto-scaling group, and Azure's version of auto-scaling group, and Amazon's auto-scaling groups. There is one way -- the auto-scaling in Kubernetes, there's basically one way to configure it, and it's the same across all of the cloud providers. I think that's one of the reasons why auto-scaling is potentially more popular, for sure. +**Oren Lion:** Yeah... I was wondering, Mauro, because when this first came out, I couldn't wait to get to it, because I was getting invoiced every month because of all these overruns. And it was a lot of painful development time to deploy all the -- relabel configs, and do the client-side filtering... And then Grafana said "Don't worry about the client-side. We'll just take care of that at our front door." And we're like "Great." So we stopped updating Prometheus and client-side relabel configs... And now, how does Grafana make it a sustainable feature if customers just flood Grafana with their metrics, and don't pay any care to filtering at the client-side? -**Deo:** Very good point. +**Mauro Stettler:** Yeah, so the key to it is just that we need to be able to run the service really cheaply. And there are a bunch of ways how we do that. For example, we decode the received write requests without redundancy, and... There are a bunch of different tricks that we use to reduce the TCO of running the service. And obviously, we are still paying some money for it. It's not free to run it, but it's basically cheap enough for us to be able to offer it for free. And we think that in the end it's probably worth it, because by doing this, we are able to provide the customer a lot of value, because the customer is going to be able to get as much value out of the metrics as they would be able to without this feature, and they're going to be happy because the bill is going to be lower, so that's probably going to create a customer who likes to with us and who is hopefully also going to expand into other products that we are selling. Patrick, do you want to add something about that? -**Tom Wilkie:** \[30:15\] But I would also say, you've talked a lot about using custom metrics for auto-scaling, and using like request and latency and error rates to influence your auto-scaling decisions... There's a bit of like accepted wisdom, I guess, that actually, I think CPU is the best signal to use for auto-scaling in 99% of use cases. And honestly, the number of times -- even internally at Grafana Labs, the number of times people have tried to be too clever, and tried to second-guess and model out what the perfect auto-scaling signals are... And at the end of the day, the really boring just CPU consumption gives you a much better behavior. +**Patrick Oyarzun:** Yeah, I think just generally the experience working with the finance and sales teams has been a learning experience. I think from this vantage point, looking back, it's obvious that it was the right decision. We have a commercial channel internally at Grafana, and anytime a new customer or a new deal is won by the sales team, they'll post in there... And they usually give a short message about "Why did they choose Grafana? Why were they looking for an observability solution at all in the first place?" If there was any competitive kind of aspect to it with our competitors, they'll post about that... 
And over the last two years, we've seen quite a lot of customers come through that only signed because Adaptive Metrics exists. And we wouldn't have those customers otherwise. And so I think from this vantage point, it's clear that, for the company overall, this is a good thing. -And I'll finish my point - again, not really a question, I guess... But I think the reason why CPU is such a good signal to use for auto-scaling... Because if you think what auto-scaling actually achieves is adding more CPU. You're basically scaling up the number of CPUs; you're giving a job by scaling up the replicas of that job. And so by linking those two directly, like CPU consumption to auto-scaling, and just giving the auto-scaler a target CPU utilization, you actually can end up with a pretty good system, without all of this complexity of like custom metrics, and all of the adapters needed to achieve that. But again, I'm just kind of being -- I'm kind of causing argument for the fun of the listeners, I guess... But what do you think, Deo? +\[00:28:03.09\] I think on the individual sales rep basis it has been a tough pill to swallow sometimes, because you have customers that say "I'm sending 10 million series to one of your competitors, and I want to switch." And then they switch, and it's actually less with us. And trying to understand ahead of time what might their bill be after Adaptive Metrics and all of that kind of stuff, so that we can properly compensate our sales reps, so that they don't feel like they're getting the short end of the stick, has been a learning experience, I think, for all of us. -**Deo:** I think it's a very good point. I want to touch a bit on vertical auto-scaling. Are you aware about vertical auto-scaling? +**Mat Ryer:** Yeah. Wow. It's a cool project. Were there sort of big technical challenges involved in this? What was the hardest thing technically to do here? -**Tom Wilkie:** Yeah, I think it's evil. +**Patrick Oyarzun:** I feel like there's a new one every month. Just to give you a peek behind the scenes - are we on v3 now, Mauro? We're on aggregations v3, right? You could think of it like that. -**Mat Ryer:** It's where the servers stack up on top of each other, instead of -- +**Mauro Stettler:** Yeah, it's v3. -**Deo:** So vertical pod auto-scaling has to do with memory. So CPU is the easy stuff. So if you have a pod, and then it reaches 100% CPU, it just means that your calls will be a bit slower. With memory though, it's different. Memory pods will die out of memory, and then you'll get like many nice alerts, and your on-call engineers will add more memory. And then most engineering teams do the same. So pod dies of memory, you add more memory. Pod dies of memory, you add more memory. And then sometimes, like after years, you say "You know what? My costs on GCP, they are enormous. Who is using all those nodes?" And then people see that the dashboards, they see that they don't use a lot of memory, so they \[unintelligible 00:32:34.19\] and then you come out of memory alerts. +**Patrick Oyarzun:** I think there might be a v4 coming soon... But each of these versions, I think, you can think of as a complete re-architecture. And I think in each of them, so far at least, the reason for these things has been primarily cost. v2, for example, we built the whole aggregation pipeline and then looked at how much it cost to run, and it was basically the same price as just ingesting the data in the first place. 
And so if we're going to do that, we might as well just give you a discount on the bill and not do anything with the data, because then we're not maintaining this whole other system. -So vertical pod auto-scaling does something magical. So you have your pod, with a specific amount of memory; if it dies, then the auto-scaler will put it back in the workload, with a bit more memory. And then if it dies again, it will just increase memory. So if your service can survive pods dying, so it goes back to the queue or whatever, it means that overall you have a better cost allocation, because your pods go up and down constantly with memory consumption. So this is a good thing to discuss when it comes to an auto-scaler. +So we've gone through a bunch of iterations just kind of architecturally to figure out how do you make this something that can be free. And I don't think we're done. We're at a point where it's healthy, it's stable, the business is happy... -**Tom Wilkie:** No, a hundred percent. And to add to that, I actually think it's poorly named. I think calling it the vertical pod auto-scaler -- what it's actually doing is just configuring your memory requests for you. And classing that as auto-scaling is kind of -- you know, in my own head at least, I think that's more about just automatically configured requests. So we do that, we use that; of course we do. +But we're always talking about all these new features we want to build, and new features tend to add cost... And so there will come a day when I think we'll need a v4. -I do generally think, to your earlier point, teams that really, really care about their requests, and about the shape and size of their jobs, we've seen internally they shy away from using the vertical pod auto-scaler, because they -- and it is a huge factor in dictating the cost of their service. And then there's the teams that just want to put a service up, and just want it to run, and the service is tiny and doesn't consume a lot of resources... 100% pod vertical auto-scaling is the perfect technology for them. +**Mauro Stettler:** Yeah. And it doesn't only need to be cheap. It needs to be cheap and stable as well. Because for example the first iteration was cheap, but it wasn't stable. -**Deo:** \[34:07\] You're right. It's fine. For most of the engineers, having something there that auto-scales up and down, it's fine. But for the few services that will really benefit for your spikes in traffic, or in any Prometheus metric that you expose, it makes a very big difference. Sometimes it makes a very big difference. But this is something that most people don't care about. They'll have hundreds of microservices in there, and they will work out of the box with everything you're having in a nice way, so you don't have to worry about it. But the most difficult part is to have people know that this is possible if they want to implement it, to know where they will go and have the monitoring tools to see what's going on, their dashboards, and where they can see stuff... And then if you are cheeky enough, you can have alerts there as well, to say "You know what? You haven't scaled enough", or "Your services have a lot of CPU utilization. Maybe you should adjust it", and stuff like that, in order for them to be aware that this is possible. And again, you need people that are very passionate. If people are not passionate about what they're doing, or they don't own the services, it's very difficult. It doesn't scale very well with engineering teams. 
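As an aside, to make Tom's "boring CPU target" point concrete: this is also the thing Kubernetes lets you express in one portable way, regardless of cloud provider. Below is a minimal, illustrative sketch using the Go API types; the Deployment name, namespace and numbers are made up, and a plain YAML manifest would do exactly the same job.

```go
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Target 75% average CPU utilization across the pods; the autoscaler
	// adds or removes replicas to keep consumption near this target.
	target := int32(75)
	minReplicas := int32(3)

	hpa := autoscalingv2.HorizontalPodAutoscaler{
		TypeMeta:   metav1.TypeMeta{APIVersion: "autoscaling/v2", Kind: "HorizontalPodAutoscaler"},
		ObjectMeta: metav1.ObjectMeta{Name: "my-service", Namespace: "default"}, // illustrative names
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "my-service",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 30,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: "cpu",
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &target,
					},
				},
			}},
		},
	}

	out, _ := yaml.Marshal(hpa)
	fmt.Print(string(out)) // the manifest you would apply to any conformant cluster
}
```

The whole configuration is a single utilization target driving the replica count, with no custom metrics adapter involved, which is exactly why it travels so well between providers.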
+**Mat Ryer:** So I kind of love -- honestly, just as an aside, I love this story, because I spend a lot of time advocating for build, build something; you can design it, but your design is probably going to be wrong. You really find out when you build it, and you're faced with then the realities of whatever that thing is you've built. And then - yeah, you take it from there. So I kind of like that arc, because I think that's a very healthy way of doing engineering. I don't know if you work like that. Well, you are, Oren, and it's obviously a little bit different maybe even in the healthcare space; you have to be a bit more careful. -**Mat Ryer:** Alright, let's get into the hows, then. How do we do this? Tell me a bit about doing this in Kubernetes. Has this changed a lot, or was good observability baked in from the beginning, and it's evolving and getting better? +**Oren Lion:** Yeah, I mean, we definitely have controls in place. And a lot of alerts firing off, and the features take more time to design... But features that save customers money, like this one, actually helps the customers expand into other areas in your business. So after I started using adaptive metrics, I had the headroom to move into IRM, and \[unintelligible 00:30:55.28\] and I got sites on using Tempo... And it's the same budget with Grafana that I had two years ago; I can just expand out and do more. And there's always some uplift. We also started using enterprise plugins for Snowflake. -**Tom Wilkie:** I think one of the things I'm incredibly happy about with Kubernetes is like all of the Kubernetes components effectively from the very early days were natively instrumented with really high-quality Prometheus metrics. And that relationship between Kubernetes and Prometheus dates all the way back to kind of their inception from kind of that heavy inspiration from the internal Google technology. So they're both inspired by -- you know, Kubernetes by Borg, and Prometheus by Borgmon... They both heavily make use of this kind of concept of a label... And things like Prometheus was built for these dynamically-scheduled orchestration systems, because it heavily relies on this kind of pull-based model and service discovery. So the fact that these two go hand in hand -- I one hundred percent credit the popularity of Prometheus with the popularity of Kubernetes. It's definitely a wave we've been riding. +So it's been slowly increasing, but it does help kind of keep us coming back to Grafana, because we know it's an affordable way to do monitoring. And as you make adaptive metrics better, it just makes it even more interesting to keep working with Grafana. -But yeah, that's like understanding the Kubernetes system itself, the stuff running Kubernetes, the services behind the scenes... But coming back to this kind of thought, this concept of Kubernetes having this rich metadata about your application -- you know, your engineers have spent time and effort describing the application to Kubernetes in the form of like YAML manifests for deployments, and stateful sets, and namespaces, and services, and all of this stuff gets described to Kubernetes... One of the things that I think makes monitoring Kubernetes quite unique is that description of the service can then be effectively read back to your observability system using things like kube-state-metrics. So this is an exporter for Kubernetes API that will tell Prometheus "Oh, this deployment is supposed to have 30 replicas. 
This deployment is running on this machine, and is part of this namespace..." It'll give you all of this metadata about the application. And it gives it to you as metrics, as Prometheus metrics. This is quite unique. This is incredibly powerful. And it means the subsequent dashboards and experiences that you build on top of those metrics can actually just natively enrich, and -- you know, things like CPU usage; really boring. But you can actually take that CPU usage and break it down by service, really easily. And that's what I think gets me excited about monitoring Kubernetes. +**Mat Ryer:** Yeah, that's great to hear. And honestly, one of the things I like about Grafana is we really are trying to do things the right way. It's not all about just maximizing revenue, or anything like that. And yeah, I think that's -- it's also something for the future. Like Patrick said, we're looking at other ways of doing this adaptive stuff for the telemetry as well. So hopefully - yeah, we're on that journey with you, that's the idea.

-**Deo:** \[38:11\] I agree one hundred percent. Yeah. And the community has stepped up a lot. I had an ex-colleague of mine who was saying DevOps work is so easy these days, because loads of people in the past have made such a big effort to give to the community all those nice dashboards and alerts that they want out of the box. +**Mauro Stettler:** \[00:31:55.03\] You asked for what the technical difficulties were... If you like, we can talk more about technical difficulties. For example, one thing which I think many of the people I talk to don't realize is that -- so when we aggregate those series in order to drop certain labels, we need to choose the right aggregation functions. Because some series you want to query by sum, some you want to query by max; then there are also sums of counters, which are different again... And for each of those different aggregation functions, we need to internally basically generate an aggregate. So it's possible that for one aggregated series, which to the user looks like it's just one series, internally it's actually multiple series, that we aggregated with multiple aggregation functions, and at the time when it gets queried, we need to select the right one. I think that's one thing which often surprises people when I talk to them.

-Now, I just want to add to what Tom said that even though kube-state-metrics and Prometheus are doing such a very good job, like native integration with Kubernetes, it's not enough, in most cases. I have a very good example to showcase this. Let's say one of the nodes goes down, and they get an alert, and then you know that a few services are being affected... And then you ask engineers to drop in a call, and \[unintelligible 00:39:00.07\] and then start seeing what's wrong... Unless you give them easily the tools to figure out what is wrong, it's not enough. And in our case, -- actually, I think in most cases - you need a single place to have dashboards, from kube-state-metrics, Prometheus metrics, but also logs. You need logs. And then you need performance metrics, you need your APM metrics... +Or then another one is that - Patrick mentioned the feedback loop of the recommendations engine... One problem which we have had for a long time was that once we dropped a label, we actually didn't know anymore what the cardinality of that label was that we are dropping, because we are not indexing it anymore. 
So sometimes this has led to imperfect recommendations, because once we dropped the label, we didn't know anymore whether the cardinality inside that label has changed in the meantime. So that's something that we are in the process of fixing right now, but it's also something that you need to be aware of.

-So I think the Grafana ecosystem is doing a very, very good job. And I'm not doing an advertisement, I'm just saying what we're using there. But in our case, we have very good dashboards, that have all the Prometheus metrics, and then they have Loki metrics, and then traces. You have your traces in there, and then you can jump from one to another... And then we have Pyroscope now as well... So dashboards that people are aware of, and then they can jump in and right out of the box find out what is wrong - it's very powerful. And they don't need to know about what is Pyroscope, and what's profiling. They don't need to know these kinds of things. You just need to give them the ability to explore the observability bottlenecks in their applications. +**Oren Lion:** Yeah, I had a question about the proactive roadmap, I guess you could say... Today I'll get an alert, and then I check recommendations, and we implement recommendations... There's some spike budget within Grafana to absorb that. But do you have some plans for an autopilot where you're like "Okay, Grafana, just apply these as they happen, find some way to inform us, and then if we need to disaggregate or get metrics back, we know where to find the place to do that"?

-**Tom Wilkie:** Oh, I one hundred percent agree. I would add like this extra structure to your application that's metadata that can be exposed into your metrics. This makes it possible to develop a layer of abstraction, and therefore common solutions on top of that abstraction. And I'm talking in very general terms, but I specifically mean there's a really rich set of dashboards in the Kubernetes mixin, that like work with pretty much any Kubernetes cluster, and give you the structure of your application running in Kubernetes. And you can see how much CPU each service uses, what hosts they're running on. You can drill down into this really, really straightforwardly and easily. And there's a project internally at Grafana Labs to try and effectively do the same thing, but for EC2, and the metadata is just not there. Even if we use something like YACE, the Yet Another Cloudwatch Exporter, to get as many metrics out of the APIs as we can, you're not actually teaching EC2 about the structure of your application, because it's all just a massive long list of VMs. And it makes that -- you know, the application that we've developed to help people understand the behavior of the EC2 environment is nowhere near as intuitive and easy to use as the Kubernetes one, because the metadata is not there. +**Patrick Oyarzun:** There's a few different things we've talked about. So as far as the immediate question of an autopilot type thing... So the way we do that is basically as part of our normal CI pipelines. We have a cron job that pulls the latest recommendations and updates our source of truth, which is a file in a Git repo containing our current rules. And then the pipeline rolls that out. 
-So I just really want to -- I think this is like the fifth time I've said it, that that metadata that Kubernetes has about your application, if you use that in your observability system, it makes it easier for those developers to know what the right logs are, to know where the traces are coming from, and it gives them that mental model to help them navigate all of the different telemetry signals... And if there's one thing you take away from this podcast, I think that's the thing that makes monitoring and observing Kubernetes and applications running in Kubernetes easier, and special, and different, and exciting. +We've so far avoided building a UI flow for autopilot... And the reason is mostly because - you know, when you really start to think about it, there's a lot that I would want, at least, and I think most of our customers would want with that, which is knowing when things changed, who changed them, what the impact was, maybe even being able to roll back to a certain known good point... All of these things are already solved by a GitOps-style approach. And so we have this debate internally of, you know, we could spend some amount of time building all of that out and making a really great, first-class experience... Or we could invest that same resource into other features, and basically leave it as "Well, GitOps." -**Mat Ryer:** \[42:05\] These things also paid dividends, because for example we have the Sift technology - something I worked on in Grafana - which essentially is only really possible... You know, the first version of it was built for Kubernetes, because of all that metadata. So essentially, when you have an alert fire, it's a bot, really, that goes and just checks a load of common things, like noisy neighbors, or looks in the logs for any interesting changes in logs, and things, and tries to surface them. And the reason that we chose Kubernetes first is just because of that metadata that you get. And we're making it work -- we want to make it work for other things, but it's much more difficult. So yeah, I echo that. +We have a Terraform provider, we have an API... There are a few ways you could set up a pipeline around this. So far we've taken the second approach, of telling customers GitOps is really the best way. But obviously, different organizations are going to find that more or less challenging... So I'm curious what you think, actually, while you're here, about that whole thing. -**Vasil Kaftandzhiev:** It's really exciting how Kubernetes is making things complex, but precise. And on top of everything, it gives you the -- not the opportunity, but the available tools and possibilities to actually manage it precisely. If you have either a good dashboard, good team, someone to own it etc. so you can be precise with Kubernetes. To actually check that you should be precise. +**Oren Lion:** Yeah, I think exactly the same way you do. Even though I like the idea of an autopilot, we also have everything stored in Bitbucket in our Git repository. So we push out changes, and they're all version controlled, we know when they happened, there's traceability... So there's a lot more control. And maybe what we should implement is a cron job that just goes and pulls down recommendations and applies them. We'd still have the version history there, but it takes away the alerting factor that you're now over some threshold that you set. So yeah, I agree with you. Maybe autopilot isn't even needed. -**Tom Wilkie:** Yeah. A hundred percent. 
And just to build on what Mat said - no podcasts in this day and age would be complete without a mention of Gen AI and LLMs. We've also found in our internal experiments with these kinds of technologies that that meta data is key to helping the AI understand what's going on, and make reasonable inferences and next steps. So giving the metadata to ChatGPT before you ask a question about what's going on in your Kubernetes cluster has been an unlock, right? There's \[unintelligible 00:43:45.20\] a whole project built on this as well, that's actually seen some pretty impressive use cases. +**Patrick Oyarzun:** \[00:36:07.23\] I think there's one situation that we've kind of, or at least in the back of my mind I'm feeling like we might need to figure out. So you mentioned before that we do 95th percentile billing... So what that translates to is you can spike your time series to pretty much any level for up to 36 hours. And as long as it drops back down before the end of that 36 hours, you're not billed for that. So it can be one event, or it can be 36-hour long events over the month. The point is you kind of get a 36-hour budget to spike. Right now, recommendations are generated based on a few signals, like for example dashboards changing, or new queries that we haven't seen before, things like that. One of the things that we don't trigger new recommendations on right now is those spikes in cardinality. So there's a chance that you get this alert and the recommendation engine hasn't yet had a chance to analyze the new state. Typically it's running often enough that it will, but when we're talking about a 36-hour budget, what if we could get to a state where you have this auto-applied pipeline, and within the span of a few minutes you're tamping down on this cardinality explosion. I think in order to get to that level, we would need to actually trigger recommendations on the spike itself... And I think that's possible; it's just a thing that, like anything else, we have to weigh against everything else we're thinking about building. -So yeah, I think this metadata is more than just about observability. Like, it's actually -- the abstraction and that unlock is one of the reasons why Kubernetes is so popular, I think. And you said something, Deo, which I thought was really interesting... You started to talk about logs and traces. How are you using this metadata in Kubernetes to make correlating between those signals easier for your engineers? +**Oren Lion:** It does make me wonder, do I need to do client-side filtering with relabel configs, or can we just always open the floodgates and let Adaptive Metrics deal with the flood? -**Deo:** \[unintelligible 00:44:15.18\] So one of the bottlenecks is not having any labels, not having anything. The other can be having too many of them. So you have many clusters, you have hundreds of applications... It's very often the case where it's very busy, and people cannot find quickly what's going on. So we cannot have this conversation without talking about exemplars. So exemplars is something that we -- it unlocked our engineering department for really figuring out what was wrong and what they really needed. So exemplars, they work with traces. And -- +**Patrick Oyarzun:** Yeah, that's a really interesting question. I actually -- so I posted in our engineering channel just a few days ago, because I noticed we have around 25 million time series in our ops cluster that are unused right now. 
And the approach -- this is like getting into the design philosophy behind Adaptive Metrics a little bit, that so far we've taken the approach that it's better to keep some data, even if it's heavily aggregated, than to drop metrics completely. And the reason is mostly because we're kind of making these decisions for our engineers... Because no human is in the loop, we're not asking anybody permission before we aggregate their new metric that they added last month, or something. And so we've always taken the approach of it's better to aggregate than to drop. But even with an extreme amount of aggregation on our unused metrics, we still have 25 million time series.

-**Mat Ryer:** Deo, before you carry on, can you just explain for anyone not familiar, what is an exemplar? +And so I posted that in our engineering channel and was like "Hey, this is kind of expensive. Maybe we should drop some of these." And I picked all the metrics that were more than 100,000 time series, I put them in a list and tagged the teams that owned them, and said "What do you think?" And even just the act of bringing that up, a bunch of them went and started modifying their relabel configs, and deleting recording rules they didn't need... We've seen with adaptive metrics -- it's interesting, a lot of times just giving people the visibility is enough to create the behavior change, even without necessarily the technology to do it all automatically. I think giving insights into your usage patterns, and how big this thing is that you're not using is, tends to inspire, I think, a lot of people, especially when you put it in terms of dollars, not series; then they really start to get motivated.

-**Deo:** Yeah, sure. So an exemplar is basically a reference to a specific trace, attached to a metric sample - it's how you carry high-cardinality detail, like a trace ID, alongside your aggregated metrics. So when something is wrong, when you have let's say a microservice that has thousands of requests per second, how can you find which request is a problematic one? So you have latency. Clients are complaining that your application is slow. But then you see in your dashboards most of them are fine. How can you find the trace, the code that was really problematic? This is where exemplars come in. And out of the box, it means that you can find the traces behind those problematic requests, and then you have a nice dashboard, and then you have \[unintelligible 00:45:34.25\] and then when you click this node, you can right away go to the trace. +So yeah, I mean, that's how we do it. We kind of just let everybody send whatever, and then after the fact we either aggregate, or maybe we go back and say "Hey, that's kind of expensive." Sometimes that's enough.

-And then after the trace, everything is easy. With the trace, you can go to Pyroscope and see the profiling, or you can go to the pod logs, or you can go to the node with Prometheus metrics... So everything is linked. So as long as you have this problematic trace, everything else is easy. +**Mat Ryer:** Yeah, that's very cool. So how do people get this? Is it available now? How long has it been around?

-And this is what really unblocked us, because it means that when something goes wrong, people don't have to spend time figuring out where the problematic case can be, especially if you have a ton of microservices, a chain of microservices. +**Patrick Oyarzun:** Yeah, it's been around for a while. I think we -- do you remember which OpsCon we announced at? It was, I think, two years ago.

-\[46:11\] So yeah, exemplars were something that really, really unblocked us. 
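For anyone curious what this looks like on the instrumentation side, here's a minimal Go sketch using the Prometheus client library's exemplar support. The trace ID header is a stand-in for however your tracing system actually propagates IDs, and exemplars are only exposed when the OpenMetrics format is enabled.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "Request latency.",
	Buckets: prometheus.DefBuckets,
})

func handle(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() {
		traceID := r.Header.Get("X-Trace-Id") // stand-in for real trace propagation
		// Attach the trace ID as an exemplar, so a latency spike on a dashboard
		// can link straight to an example trace behind it.
		if eo, ok := requestDuration.(prometheus.ExemplarObserver); ok && traceID != "" {
			eo.ObserveWithExemplar(time.Since(start).Seconds(), prometheus.Labels{"trace_id": traceID})
		} else {
			requestDuration.Observe(time.Since(start).Seconds())
		}
	}()
	w.Write([]byte("ok"))
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(requestDuration)
	http.HandleFunc("/", handle)
	// Exemplars are only exposed in the OpenMetrics exposition format.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{EnableOpenMetrics: true}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```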
Because it's really easy to have many dashboards; people are getting lost in there. You don't need many dashboards. You just give some specific ones, and people should know, and you should give them enough information to be able to do their job when they need to, very, very easily. And exemplars was really extremely helpful for us. +**Mauro Stettler:** In London... -**Tom Wilkie:** Yeah, I'm a huge fan of exemplars. I think it's a big unlock, especially in those kinds of debugging use cases that you describe. I will kind of - again, just to pick you up there... There's nothing about exemplars that's Kubernetes-specific. You can 100% make that work in any environment. Because the linkage between the trace and the metrics is just a trace ID. You're not actually leveraging the Kubernetes environment. I mean, there are things about Kubernetes that make distributed tracing easier, especially if you've got like a service mesh, again. I definitely get that. But yeah, exemplars work elsewhere. +**Patrick Oyarzun:** London OpsCon. -It's the ones that I think -- the places that are kind of Kubernetes-enhanced, if you like, in observability, is making sure that your logs and your metrics and your traces all contain consistent metadata identifying the job that it came from. So this was actually the whole concept for Loki. Five years ago, when I wrote Loki, it was a replacement for Kubernetes logs, for kubectl logs. That was our inspiration. So having support for labels and \[unintelligible 00:47:40.07\] and having that be consistent -- I mean, not just consistent, but literally identical to how Prometheus does its labeling and metadata, was the whole idea. And having that consistency, having the same metadata allows you to systematically guarantee that for any PromQL query that you give me, any graph, any dashboard, I can show you the logs for the service that generated that graph. And that only works on Kubernetes, if I'm honest. Trying to make that work outside of Kubernetes, where you don't have that metadata is incredibly challenging, and ad hoc. +**Mauro Stettler:** I think in October; it was October of last year we announced it. And I think we said that it's going to be available in a few weeks. I think it was available in November. -**Deo:** Exactly. When everything fits together, it's amazing. When it works, it's amazing. Being an engineer and being able to find out what is wrong, how you can fix it, find the pod... And by the way, auto-scaling can fix it; it's a superpower in your engineer. And you don't need to own all these technologies. You just need to know what your service is doing and how you can benefit out of it. +**Patrick Oyarzun:** \[00:39:57.10\] Right. So it's been available since around that time. It's available in every tier of Grafana, even the free tier. So you could be sending 10,000 series and you'll potentially get recommendations to reduce that even further... Which is funny, because it basically just means you can pack even more into the free tier. But we've seen that the free tier is a really useful way for people to try Grafana Cloud out, and this is such a big, important feature we don't want to put it behind kind of a paywall, so to speak. -One other thing as well is that those things are cheap. You may have seen, there are a ton of similar solutions out there. Some of them may be very expensive \[unintelligible 00:48:52.10\] The thing with Loki, and stuff is they are very cheap as well, so they can scale along with your needs... Which is critical. 
Because lately -- I hear this all the time; "efficiency", it's the biggest word everyone is using. You need to be efficient. So all these things are very nice to have, but if you are not efficient with your costs, eventually -- if they're not used enough, or they're very expensive, people eventually will not use them. So efficiency is a key word here. How cheap it can be, and how very well \[unintelligible 00:49:22.03\] +**Mauro Stettler:** By the way, we also see that the usage really picks up. I have some story about that, because just yesterday I noticed that -- so for a long time, the ops cluster, which Patrick already mentioned, which is our internal monitoring cluster that we use to monitor our cloud, used to be the biggest user of Adaptive Metrics. It aggregates -- so the usage fluctuates, but it usually aggregates at peak roughly around 230 million series down to 50 million. And yesterday I discovered that actually we have been overtaken by one of our users. They are now aggregating 330 million down to 30 million. So they're actually saving 300 million series using Adaptive Metrics. That's quite impressive. I was surprised by it. -**Tom Wilkie:** Yeah. And I don't want to be that guy that's always saying "Well, we always used to be able to do this." If you look at like the traditional APM vendors and solutions, they achieved a lot of the experience that they built through consistently tagging their various forms of telemetry. The thing, again, I think Kubernetes has unlocked is it's not proprietary, right? This is done in the open, this is consistent between different cloud providers and different tools, and has raised the level of abstraction for a whole industry, so that this can be done even between disparate tools. It's really exciting to see that happen and not just be some proprietary kind of vendor-specific thing. That's what's got me excited. +And what I really liked when I saw this was that - this happened like one or two weeks ago, and we didn't even notice it. So everything just worked without an issue. So that was the most satisfying. -**Deo:** \[50:09\] Okay. Now, Tom, you got me curious - what's your opinion about multicloud? +**Mat Ryer:** That's nice. -**Tom Wilkie:** Grafana Labs runs on all three major cloud providers. We don't ever have a Kubernetes cluster span multiple regions or providers. Our philosophy for how we deploy our software is all driven by minimizing blast radius of any change. So we run our regions completely isolated, and effectively therefore the two different cloud providers, or the three different cloud providers in all the different regions don't really talk to each other... So I'm not sure whether that kind of counts as multicloud in proper, but we 100% run on all three cloud providers. We don't use any cloud provider-specific features. So that's why I like Kubernetes, because it's that abstraction layer, which means -- honestly, I don't think our engineers in Grafana Labs know which cloud provider a given region is running on. I don't actually know how to find out. I'm sure it's baked into one of our metrics somewhere... But they just have like 50-60 Kubernetes clusters, and they just target them and deploy them. And again, when we do use cloud provider services beyond Kubernetes, like S3, GCS, these kinds of things, we make sure we're using ones that have commonality and similar services in all cloud providers. So pretty much like we use hosted SQL databases, we use S3, we use load balancers... But that's about it. 
We don't use anything more cloud provider-specific than that, because - having that portability between clouds. +**Patrick Oyarzun:** Just a humble brag from Mauro... -**Deo:** And have you tried running and having dashboards for multiple cloud providers, for example for cost stuff? +**Mat Ryer:** Humble brag, but I didn't sense any of the humble bits. That was just a pure brag of how good this stuff is. Yeah, that's really cool. Wow. And do you think -- I mean, that's quite a big drop. That's like -- it's only 10% of the things they're storing. Do you think they're overdoing it? Do you think that you'll see them pull that back a little bit? -**Tom Wilkie:** Yeah, it's hard to show you a dashboard on a podcast... But yeah, 100%. +**Mauro Stettler:** No, I don't think that they're overdoing it. I think we just have to make sure that performing these aggregations is really cheap. I think that they'll keep doing what they're doing. I'm pretty happy to -- I'm happy to see my code running like this. -**Mat Ryer:** You can just describe it, Tom. +**Patrick Oyarzun:** Yeah, and I think, you know, knowing this customer - this is an organization that is running at a massive scale, obviously... And so you end up with these kinds of problems with kind of modern cloud-native architectures, where every time you add a replica, you now have - however many series are being emitted by that type of thing, you've now added another bucket of that. And they've got that problem in spades, and so simple things like dropping some unique label here and there can really have a massive impact. -**Tom Wilkie:** Our dashboard for costs is cloud-agnostic, is cloud provider-agnostic. So we effectively take all the bills from all our cloud providers, load them into a BigQuery instance, render the dashboard off of that, and then we use Prometheus and OpenCost to attribute those costs back to individual namespaces, jobs, pods... And then aggregate that up to the team level. And if you go and look on this dashboard, it will tell you how much the Mimir team or how much the Loki team is spending. And that is an aggregate across all three cloud providers. +**Mat Ryer:** Yeah. That's so good. I don't want to keep -- obviously, this is not meant to be a marketing podcast. This is meant to just be technical stuff. But I just -- that's great. What are the downsides? There must be some. -The trickier bit there, as we've kind of talked about earlier, OpenCost doesn't really do anything with S3 buckets. But we use -- I forgot what it's called... We use Crossplane to provision all of our cloud provider resources... And that gives us the association between, for instance, S3 bucket and namespace... And then we've just built some custom exporters to get the cost of those buckets, and do the join against that metadata so we can aggregate that into the service cost. But no, 100% multicloud at Grafana Labs. +**Patrick Oyarzun:** Yeah, I think there are. I think the main thing that we see often is -- so when we aggregate, say you aggregate by the sum. You did some gauge metrics, say like disk usage or something, and you store the sum of all the disk usage across all your pods, as a random example. And now you go to query the maximum disk usage. If you didn't specify that in your aggregation config, then you'll get back an error. It'll say "We can't aggregate that way, we can only do a sum", basically. Now, importantly, the recommendations engine will notice that you made that query, and now will suggest that you go fix the rule. 
And so if you have an automatic process that's applying these things, then conceivably sometime soon you'll get that data back, and you can start querying it how you want to.

-**Vasil Kaftandzhiev:** Talking about costs and multicloud, there are so many dimensions of cost in Kubernetes. This is the cloud resources, this is the observability cost, and there is an environmental cost that no one talks about... Or at least there is not such a broad conversation about it. Having in mind how quickly Kubernetes can scale on its own, what do you think about all of these resources going to waste, and producing not only a waste for your bill, but a waste for the planet as well, in terms of CO2 emissions, energy going to waste, and stuff like that? +But there is this limitation. I mean, part of the design philosophy of adaptive metrics is that you don't have to change your dashboards. In order to do that, we need it to be kind of like transparent. We don't want the shape of the line on your dashboard to change because you started using adaptive metrics... Because that just kind of creates this issue of trust. Now you aren't sure, "What did we really do?" if it looks different afterwards. And so it doesn't change. Like, if you see these graphs before and after aggregation, there might be like a small gap or something right at the moment when you would turn it on, but then generally it looks normal. And the cost of being able to know that we're serving queries accurately all the time is that we sometimes have to not serve those queries.

-**Deo:** That's a very good question. I'm not sure I have thought about this that much, unfortunately. As a team, we try not to use a ton of resources, so we'll scale down a lot. We don't over-provision stuff... We try to reuse whatever is possible, and using \[unintelligible 00:53:52.04\] and stuff... But mostly this is for cost effectiveness, not about anything else. But this is a very good point. I wish more people were vocal about this. As with everything, if people are passionate, things can change, one step at a time... But yeah, that's an interesting point. +\[00:44:06.25\] But I did a check the other day of our ops cluster, and we had -- I'll underestimate, because I don't want to lie. I think we had five nines of queries succeeding when it comes to adaptive metrics. So the number of queries that were getting this error of "You used the wrong aggregation function and you need to go change it" was like - there were three zeros after the point before you get to that percent. So it really does seem to mostly work. Most metrics are only ever queried through a dashboard, the usage patterns don't change much, and so for those cases it actually works great.

-**Vasil Kaftandzhiev:** \[54:10\] For me it's really interesting how almost all of us take Kubernetes for granted... And as much as we are used to VMs, as much as we're used to bare metal, as much as we can imagine in our heads that this is something that runs in a datacenter, with a guard with a gun on his belt, we think of Kubernetes as solely an abstraction. And we think about all of the different resources that are going to waste as just digits in the Excel table, or into the Grafana Cloud dashboard. +**Mat Ryer:** So if somebody was just poking around in Explore, and they were doing lots of different queries, they're just interested - are they likely to impact things by doing that? Could they make a mess? 
At the end of the day, I should be right here, but approximately 30% of all of the resources that are going into powering Kubernetes are going to waste, according to the CNCF... Which is a good maybe conversation to have down the road, and I'm sure that it's going to come to us quicker than we expect. +**Patrick Oyarzun:** It's possible, yeah. And over time we've gotten better at handling it. The current state is basically any query that we see - with a few exceptions, any query is considered kind of like important. They're all treated the same. We don't try to guess, like, "This one is something we need to work on, and this one we can ignore." There's a few exceptions... So for example, in our docs we document a few ways that you can run a query that won't be considered important... Like, there's a fake label called ignore usage you can use. And if you use that label, then we ignore that query for recommendations. So if you're poking around and you want to make sure you're not going to break something, that's an option. But generally, I don't expect people to use that.

-**Tom Wilkie:** I think the good news here is -- and I agree, one of the things that happens with these dynamically-scheduled environments is like a lot of the VMs that we ask for from our cloud provider have a bit of unallocated space sitting at the top of them. We stack up all of our pods, and they never fit perfectly. So there's always a little bit of wastage. And in aggregate, 100% agree, that wastage adds up. +The bigger exceptions are, for example -- one of the most common things we see is a customer will write, for example, a recording rule, and the point of that recording rule is to try to basically calculate what their bill is going to be. It's like a meta cost optimization thing. And so they'll write a recording rule that does like a count of all the time series. That doesn't necessarily mean that they want to store a count aggregation for every metric in the database, because then we're just going to end up making adaptive metrics kind of useless. And if their goal in the first place was cost optimization, it's kind of this -- it's just a bad situation.

-I think the 30% number from the CNCF survey - I think internally at Grafana Labs that's less than 20%. We've put a lot of time and effort into optimizing how we do our scheduling to reduce that wastage... But the good news here is like incentives align really nicely. We don't want to pay for unallocated resources. We don't want to waste our money. We don't want to waste those resources. And that aligns with not wasting the energy going into running those resources, and therefore not producing wasted CO2. +So that actually was happening a lot... When we first were in the private preview, we saw tons of customers where they would say "Hey, why don't I have any recommendations? It says all of my metrics are used. How is that possible?" And we would dig in and it was almost always somebody had written some homemade cost tool, and typically it was running a count on every metric... So we ignore -- there's a few patterns like that where if you run the same exact query on every metric in the system, we tend to ignore that.

-So I think the good news is incentives align, and it's in users' and organizations' interest not to waste this, because at the end of the day if I'm paying for resources from a cloud provider, I want to use them. I don't want to waste them. But that's all well and good, saying incentives align... 
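For context, the kind of "homemade cost tool" Patrick describes usually boils down to a count-of-everything query grouped by a team label, multiplied by a price. A rough Go sketch is below; the `team` label, the per-series price and the query endpoint are illustrative assumptions, not real Grafana Cloud rates.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"}) // your Prometheus/Mimir query endpoint (assumption)
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Count active series per team label - exactly the "run a count over
	// everything" pattern that the recommendations engine has learned to ignore.
	const query = `count by (team) ({__name__!=""})`
	result, _, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}

	const dollarsPerSeriesPerMonth = 0.01 // illustrative price, not a real rate
	if vec, ok := result.(model.Vector); ok {
		for _, sample := range vec {
			cost := float64(sample.Value) * dollarsPerSeriesPerMonth
			fmt.Printf("team=%s series=%d est. monthly cost=$%.2f\n",
				sample.Metric["team"], int64(sample.Value), cost)
		}
	}
}
```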
I will say, this has been a project at Grafana Labs to drive down unallocated resources to a lower percentage. It has been a project for the last couple of years, and it's hard. And it's taken a lot of experimentation, and it's taken a lot of work to just get it down to 20%... And ideally, it would be even lower than that. +**Mat Ryer:** Oh, wow. That is really complicated.

-**Mat Ryer:** And I suppose it keeps changing, doesn't it? +**Patrick Oyarzun:** There's a few different things, yeah.

-**Tom Wilkie:** Yeah, the interesting one we've got right now is - I think we've said this publicly before... The majority of Grafana Cloud used to be deployed on Google, and over the past couple of years we've been progressively deploying more and more on AWS. And we've noticed a very large difference in the behavior of the schedulers between the two platforms. So one of the benefits I think of GCP is Google's put a lot of time and effort into the scheduler, and we were able to hit like sub 20% unallocated resources on GCP. Amazon has got a brilliant scheduler as well, and they've put a lot of time and effort into the Karpenter project... But we're just not as mature there, and our unallocated resources on EKS are worse. It's like up in the 30% at the moment. But there's a team in Grafana Labs who literally, their day-to-day job is like to optimize this... Because it really moves the needle. We spend millions of dollars on AWS, and 10% of millions of dollars is like more than most of our engineers' salaries. So it really does make a difference. +**Mat Ryer:** It reminds me of like Google's PageRank stuff. It's sort of like magic, and it just sort of -- yeah.

-**Vasil Kaftandzhiev:** This is a really good touch on salaries, and things... I really see monitoring Kubernetes costs currently as the ROI of an engineering team towards their CTO. So effectively, teams can just now say "Hey, we've cut 10% or 15% of our Kubernetes costs, and now we're super-performers and stars." +**Patrick Oyarzun:** Yeah.

-\[57:59\] One question again to you, Deo. We have talked a lot about namespaces... But can you tell me your stance about resource limits, and automated, for example, recommendations at the container level? I know that everyone is talking about namespaces, but the little \[unintelligible 00:58:13.26\] of ours, the containers don't get so much love. What's your stance on that? How do you do container observability? +**Mat Ryer:** I like that as a user you don't have to worry about it, honestly.

-**Deo:** Alright, so Containerd? Like, compare it to both? Or what do you mean? +**Patrick Oyarzun:** Exactly. Yeah. And it's tricky though... We take it really seriously being good stewards of your data, because ultimately we're recommending that you delete things. And to know that we can do that and it'll be safe, and we're not going to cause you to miss an SLO firing or something like that, we take that very seriously. So every time we introduce these new rules to say "We're going to ignore this class of usage", it's a whole process. Understanding how common this happens in the first place, let's go kind of manually check a bunch of tenants and see what would it do if we do this...

+And so part of the recommendations engine that I've been working on over the last couple of years is we have a whole capability where behind the scenes we can run the entire recommendations engine with different configuration, and the results are not visible to customers. 
And that has allowed us to basically A/B test across billions of time series, and see "What happens if we add this rule, and ignore this kind of usage? What does that do to the global savings count?" and things like that. We've done tons of tests like that. And like Mauro mentioned before, we're working on a feature to be able to over time estimate better what the impact will be even after we've aggregated some data away... That right now is, again, running as like one of these background jobs where we're collecting data and trying to see what the impact will be before we roll it out. -**Deo:** So we're in a state where other than the service mesh, everything else is like one container equals one pod... Which means -- but it's difficult to get it right. So what we advise people is just have some average CPU and memory, \[unintelligible 00:58:44.09\] and then keep it there for a week. And then by the end of the first week, change the requests and limits based on usage. We just need to be a bit pushy, try to ping them again and again, because people tend to forget... And they always tend to over-provision; they're always afraid that something will break, or something will happen... And as I've said before, most of the times people just, I think, they abuse infrastructure, where they just add more memory, add more CPU to fix memory leaks, and stuff... So you need to be a bit strict, and educate people again and again about what means in terms of cost, in the terms of building, and stuff like that. But yeah, what we say most of the time is that have some average stuff, what do you think, and then adjust after the end of the first week. +**Oren Lion:** \[00:48:08.08\] It'd be interesting if there were some metric that shows us cost savings per recommendation, just to see kind of like how things have been going up or down. Not sure what I'd do with it, but it's just an interesting metric to have. And then separately - it's pretty funny... Leading up to this, I was like "Hey, maybe I should create my own Spend Metrics by Team dashboard." And I did exactly that, just account by some labels that represent a team... And so for every team I can say "Hey, here's your monitoring costs", because I can multiply that by what we pay. And when people see the cost for all these time series, it's a wake-up call. They need to go and do something about it, like update their relabel configs... But they don't; they can not do that. We can just do a better job with adaptive metrics. That's what I'm hearing. And that's less work for us, so... It's a great thing. -Now, we don't have a lot of containers in our pod, so this makes our life a bit easier. If that wasn't the case, I'm not sure. I think though that this is something that maybe will change in the future, but you will -- I don't remember where I was reading about this, or if it's just like from \[unintelligible 00:59:54.02\] mind, but I think in the newest version of Kubernetes, requests and limits will be able to support containers as well. But again, I'm not sure if I just read about it or just \[unintelligible 01:00:05.26\] I'm not sure. +**Patrick Oyarzun:** Yeah, it's super-interesting. And you know, Mimir is a Prometheus-compatible database, and so I think a lot of times what happens is people will do what everybody does, which is google "How do I save, how do I reduce my time series?" And you find solutions that are kind of tailor-made for the Prometheus ecosystem, and they totally make sense of like "I'm going to count by team." 
But now that adaptive metrics exists, we're in this position where running that query tells us that you're using the team label, and that you want to count everything. -**Vasil Kaftandzhiev:** I think it's already available, that. +So we've had to kind of rethink what does usage really mean, what does cardinality really mean, and to try to do this without you noticing. We want to be able to ignore the right stuff, but not too much, so that way your dashboards don't break, but... Yeah, it's a delicate balance. -**Tom Wilkie:** Yeah, I'd add a couple of things there, sorry. Firstly, it's worth getting technical for a minute. The piece of software you need to get the data into Kubernetes, to get that telemetry, it's called cAdvisor. Most systems just have this baked in. But it's worth -- especially if you want to look up like the references on what all of these different metrics means, go and look at cAdvisor. That's going to tell you per pod CPU usage, memory usage, these kinds of things. It's actually got nothing to do with pods or containers; it's based on cGroups. But effectively, cAdvisor's the thing you need. +**Mat Ryer:** While we're on this subject, what are some other cool things that we can get into the roadmap while we're doing this new form of podcast-driven roadmap management without the PM? Although they're a great PM. Is it Steven Dungan, by the way? Is that the PM? -In Grafana Labs, we're moving towards a world where actually we mandate that limits equals requests, and everything's basically thickly provisioned. And part of this is to avoid a lot of the problems that Deo talked about at the beginning of the podcast, where if people -- there's no incentive in a traditional Kubernetes cluster to actually set your requests reasonably. Because if you set your requests low and just get billed for a tiny little amount, and then actually set your limit really high and optimistically use a load of resources, you've really misaligned incentives. So we're moving to a world where they have to be the same, and we enforce that with a pre-submission hook. +**Patrick Oyarzun:** He's on logs. He's on adaptive logs. -And then the final thing I'll say here is, I'm actually not sure how much this matters. Again, controversial topic, but we measure how much you use versus how much you ask for. So we measure utilization, and not just allocation. And we bill you for the higher of the two. We either bill you for how much you ask for, or how much you use. When I say bill, obviously, I mean internally, like back to the teams. +**Oren Lion:** Jen Villa? -\[01:01:49.13\] And because of that approach, the teams inside Grafana Labs, they all have KPIs around unit costs for their service. So they're not penalized, I guess, if their service costs more, as long as there's also more traffic and therefore more revenue to the service as well. But we measure -- like a hawk, we monitor these unit costs. And if a team wants to optimize their unit costs by tweaking and tuning their requests and limits on their Kubernetes jobs, and bringing unit costs down like that, or if they want to optimize a unit cost by using Pyroscope, to do CPU profiling, and bringing down the usage, or rearchitecting, or any number of ways -- I actually don't mind how they do it. All I mind is that they keep, like a hawk, an eye on this unit cost, and make sure it doesn't change significantly over time. 
So I'm not sure -- I think this is like down in the details of "How do I actually maintain a good unit cost and maintain that kind of cost economics over a long term?" And I think these are all just various techniques. +**Patrick Oyarzun:** Jen was. She got promoted. -**Deo:** So Tom, is this always the case? Are teams always supposed to have the same requesting limits? Is this always a best practice internally, at Grafana? +**Oren Lion:** Okay. -**Tom Wilkie:** It's not at the moment. It's something I think we're moving towards, at least as a default. And again, there's a big difference between -- again, our Mimir team literally spends millions of dollars a year on cloud resources. And they do have different limits and requests on their pods, and their team is very sophisticated and know how to do this, and have been doing this for a while. But that new service that we just spun up with a new team, that hasn't spent the time and effort to learn about this, those kinds of teams are \[unintelligible 01:03:30.16\] to have limits and requests the same, and therefore it's a massive simplification for the entire kind of reasoning about this. And again, these new teams barely use any resources, so therefore we're not really losing or gaining anything. +**Patrick Oyarzun:** So she's technically still over adaptive metrics, but her report, Nick Fuser, is our direct PM. -And I will say, there's that long-standing issue in Kubernetes, in the Linux Kernel, in the scheduler, where if you don't set them the same, you actually can find your job is like frequently paused, and your tail latencies go up... And that's just like an artifact of like how the Linux scheduler works. +**Mat Ryer:** Yeah. Credit there, because they don't have an easy job with this project, I think. -**Deo:** This is very interesting. It actually solves many of the things we have talked earlier. So you don't have to worry about cost allocation, because most -- like, GCP at least, they will tell you how much it costs based on requests. But if your requests and limits are the same, you have an actual number. +**Patrick Oyarzun:** Yeah, no, it's definitely -- I mean, anytime you're building something that's as unique as this, it's hard. I think a lot of product management is understanding the ecosystem, and kind of building - not necessarily what other people have already built, but like you definitely use that as a part of the calculus. And there's not a lot. There's not a lot of existing art for this kind of thing, and so we've kind of had to figure it out. -**Tom Wilkie:** Exactly. Yeah. +**Oren Lion:** Do you work with the adaptive logs team? -**Deo:** I think Grafana is a bit of a different company, because everyone \[unintelligible 01:04:21.24\] so they know their stuff. I think most of the engineering teams, at least for our case, having requests and limits the same, even though it would be amazing, it would escalate cost... Because people, they always -- +**Patrick Oyarzun:** Yeah. So the adaptive metrics team is staffed by Mimir maintainers, adaptive logs is staffed by Loki maintainers, and so we're technically separate teams, but... Yeah, I mean, we just had like a company offsite and there was a breakout for adaptive telemetry, and a few of us from each of the signals got together to talk, and they're all the time coming into our team channel and saying -- because they're a little bit newer, and so they're maybe a little bit further behind on the kind of maturity curve as a product, naturally... 
And so they'll come into our channel and ask "Hey, how did you handle this thing?" -**Tom Wilkie:** \[01:04:40.13\] Yeah, so the downside. The downside of this approach 100% is you lose the ability to kind of burst, and you're basically setting your costs to be your peak cost. Right? But I'd also argue -- it wouldn't be a podcast with me if I didn't slip in the term statistical multiplexing. A lot of random signals, when they're multiplexed together, become actually very, very predictable. And that's a philosophy we take to heart in how we architect all of our systems. And at small scale, this stuff really matters. At very large scale, what's really interesting is it matters less, because statistical multiplexing makes things like resource usage, unit costs, scaling - it makes all of these things much more predictable. And it's kind of interesting, some things actually get easier at scale. +\[00:51:25.20\] And the cool thing though is that we're actually finding a lot of the same problems. Customers are worried about needing data that they dropped, they want to know "Okay, well, you said it was used, but by who?" So these questions kind of naturally come up across the signals, which I think is a really good sign that we're kind of addressing a real problem. -**Deo:** Yeah, it's very interesting. So are teams internally responsible? Do they own their cost as well? Or no? +**Oren Lion:** Yeah... Curious, because they work in different ways, but have the same end goal. -**Tom Wilkie:** Yeah, 100%. And you mentioned earlier you have like Slack bots, and alerts, and various things... We've moved away from doing this kind of \[unintelligible 01:05:45.16\] We don't like to wake someone up in the middle of the night because their costs went up by a cent. We think that's generally a bad pattern. So we've moved to... We use -- there's a feature in Grafana Enterprise where you can schedule a PDF report to be sent to whomever you want, and it's rendering a dashboard. And so we send a PDF report of our cost dashboard, which has it broken down by team, with unit costs, and growth rates, and everything... That gets sent to everyone in the company, every Monday morning. And that really promotes transparency. Everyone can see how everyone else is doing. It's a very simple dashboard, it's very easy to understand. We regularly review it at the kind of senior leadership level. Once a month we will pull up that dashboard and we'll talk about trends, and what different projects we've done to improve, and what's gone wrong, and what's caused increases. +**Patrick Oyarzun:** Exactly, yeah. It's been really interesting to see like how -- because yeah, logs don't have a concept of like aggregation necessarily. And similarly, metrics don't have a concept of sampling necessarily. You have to look at them differently. Yeah. -And this is, again, the benefit of like Grafana and observability in our Big Tent strategy, is that everyone's using the same data. No one's going "Well, my dashboard says that the service costs this, and my dashboard says that the service costs this." Everyone is singing off that same hymn sheet. It gets rid of a lot of arguments, it drives accountability... Yeah, having that kind of one place to look, and proactively rendering that dashboard and emailing it to people... Like, it literally -- I get that dashboard every morning at 9am, and it is almost without fail the first thing I look at every day. +**Mat Ryer:** Well, I'm afraid that's all the time we have...
Unless there's any final thoughts. -**Mat Ryer:** There's also that thing where we have a bot that replies to your PR, and says like \[unintelligible 01:07:15.06\] +**Patrick Oyarzun:** So I'll just say, one big feature that's going into private preview right now - we just deployed it to our ops cluster - we're calling it rule segmentation. The idea is basically - we've seen, especially with larger organizations, that having a single configuration for adaptive metrics is not enough. And this is true for us internally, too. We've got, I think, around 6,000 rules supplied in our ops cluster, and with rule segmentation what that's allowed us to do is have a separate configuration for the Mimir team, for the Loki team, for the Tempo team... Our machine learning team has one. Basically, there are around 50 or so teams, each gets their own configuration. And I'm excited about that because that means we can start to treat them in different ways. Maybe you're a brand new team and you kind of don't really know yet how you're going to define your SLOs, what you're going to monitor. Well, maybe you don't want to apply adaptive metrics at all. -**Tom Wilkie:** Oh yeah. Cost. Yeah. +Versus say the hosted Grafana team - they've been around a long time, they have very mature operational practices, and they have a huge amount of time series they're generating, and so they may want to dial it up. And so allowing especially larger organizations to break down their adaptive metrics config into these segments or teams or whatever you want to call them, cost centers - I'm really excited about that. I think it'll open the door for kind of a different way to think about it across all of these features that we've been talking about so far. -**Mat Ryer:** Yeah. So that's amazing. It's like, "Yeah, this great feature, but it's gonna cost you. You're gonna add this much to your bill." And yeah, you really then -- it is that transparency, so you can make those decisions. It's good, because when you're abstracted from it, you're kind of blind to it. You're just off in your own world, doing it, and you build up a problem for later... But nice to have it as you go, for sure. +**Oren Lion:** And did Mauro have any ideas he wanted to share before we head out? Just curious... -Thank you so much. I think that's all the time we have today. I liked it, by the way, Tom, earlier when you said "I think it's worth getting technical for a minute", like the rest of the episode was about Marvel movies, or something... \[laughter\] But you are my Marvel heroes, because I learned a lot. I hope you, our listener, did as well. Thank you so much. Tom, Vasil, Deo, this was great. I learned loads. If you did, please tell your friends. I'll be telling all of mine. Gap for banter from Tom... +**Mauro Stettler:** Yeah, in my opinion I think one thing that we should solve, which currently is something that's missing, is that we should -- when we generate recommendations, we currently tell the user something like "If you drop label X, you're going to reduce your active series count by Y." But I think what the user actually wants to see is not by how much their series count is going to be reduced, but by how many dollars. Right now, as it is, that's actually not very easy to predict. And I think that's one thing that we should solve, because what the user wants to see is "How many dollars are we going to save with the recommendation?" Yeah. I think once we have this, this is going to help a lot. -**Tom Wilkie:** Oh, you know, insert banter here...
Like, no, thank you, Vasil, thank you, Deo. Great to chat with you. I really enjoyed that. +**Patrick Oyarzun:** I mean, if you have the billing rate, you'll be able to do some quick math to give the dollar amount... -**Mat Ryer:** Yup. Thank you so much. Join us next time, won't you, on Grafana's Big Tent. Thank you! +**Mauro Stettler:** So the problem right now is that the bill isn't always linear with the active series, because there is another factor, which is the DPM, the data points per minute inside the series, which can also affect the bill. And how the recommendation is going to affect the DPM is complicated. So that's one problem that we are currently trying to solve. + +**Patrick Oyarzun:** Good point. Thank you. + +**Mat Ryer:** Yeah. Well, good luck with that. And that brings us nicely to the end. That is unfortunately all the time we have today, but thank you so much for joining me and digging into adaptive metrics. We learned about what they are, what the problem is - too many metrics - learned about how adaptive metrics can automatically, but with your permission, help you address that problem, reduce costs and stream things down a bit. + +Excellent stuff. Thank you very much Patrick, Mauro and Oren. Thank you for joining us, and thanks for listening. We'll see you next time on Grafana's Big Tent. diff --git a/bigtent/big-tent-15.md b/bigtent/big-tent-15.md new file mode 100644 index 00000000..620b4cf5 --- /dev/null +++ b/bigtent/big-tent-15.md @@ -0,0 +1,293 @@ +**Mat Ryer:** Hello. I'm Mat Ryer, and welcome to Grafana's Big Tent. It's a podcast all about the people, community, tools and tech around observability. Today we're talking about monitoring Kubernetes. Why do we need to do that? Why do we need an episode on that? Well, we're gonna find out. And joining me today, it's my co-host, Tom Wilkie. Hi, Tom. + +**Tom Wilkie:** Hello, Mat. How are you? + +**Mat Ryer:** Pretty good. Where in the world are you doing this from today? + +**Tom Wilkie:** Well, I guess for all the podcast listeners, you can't see the video... But out the window you can see the Space Needle in Seattle. + +**Mat Ryer:** Okay, that's a clue. So from that, can we narrow it down? Yeah, we'll see if we can narrow that down with our guests. We're also joined by Vasil Kaftandzhiev. Hello, Vasil. + +**Vasil Kaftandzhiev:** Hey, Mat. How are you doing today? + +**Mat Ryer:** Oh, no bad. Thank you. And we're also joined by \[unintelligible 00:01:05.12\] Hello, Deo. + +**Deo:** Hey, Mat. It's nice to be here. + +**Mat Ryer:** It's an absolute pleasure to have you. Do you have any ideas where Tom could be then, if he's got the Seattle Needle out his window? Any ideas? + +**Vasil Kaftandzhiev:** He's definitely not in Bulgaria, where I'm from. + +**Tom Wilkie:** I am not, no. + +**Mat Ryer:** Yeah. Okay, is that where you're dialing in from? + +**Vasil Kaftandzhiev:** I'm dialing from Sofia, Bulgaria. This is a nice Eastern European city. + +**Mat Ryer:** Oh, there you go. Advert. Deo, do you want to do a tourist advert for your place? + +**Deo:** Yeah, I'm based in Athens. \[unintelligible 00:01:31.28\] almost 40 degrees here. It's a bit different. + +**Mat Ryer:** Athens sells itself really, doesn't it? Alright. + +**Tom Wilkie:** Well, I can assure you it's not 40 degrees C in Seattle. It's a bit chilly here. It's very welcoming to a Brit. + +**Mat Ryer:** I don't know how the ancient Greeks got all that work done, honestly, with it being that hot there. It's that boiling... How do you invent democracy? 
It's way too hot. + +**Tom Wilkie:** Is that a global warming joke, is it, Mat? I don't think thousands of years ago it was quite that warm. + +**Mat Ryer:** Oh, really? No... It must have still been. Actually, that's a great point. I don't know. Okay, well... Right. Tell me. Why do we need to do a podcast episode on monitoring Kubernetes? Aren't traditional techniques enough? What's different? Deo, why do we need to have this particular chat? + +**Deo:** Alright, that's interesting. First of all, I'm leading a DevOps team, so I have a DevOps background; I come like out of both ways. It can even be like an engineering background, or a sysadmin one. Now, if we're talking about the old way of doing monitoring things, I don't know. So I'm based on engineering. In my past positions I was writing code, and I still am... But the question "Why do we need monitoring?" is if we are engineers, and we deploy services, and we own those services, monitoring is part of our job. And it should come out of the box. It should be something that -- how we could do it in the past, how we can do it now... It's part of what we're doing. So it's part of owning your day to day stuff. It's part of our job. + +**Tom Wilkie:** I mean, that's a really interesting kind of point, where like, who's responsible for the observability nowadays? I definitely agree with you, kind of in the initial cloud generation, the responsibility for understanding the behavior of your applications almost fell to the developers, and I think that explains a lot of why kind of APM exists. But do you think -- I don't know, maybe leading question... Do you think in the world of Kubernetes that responsibility is shifting more to the platform, more to kind of out of the box capabilities? + +**Deo:** It should be that 100%. Engineers who deploy code, push code, they shouldn't care about where this lives, how it's working, what it does, does it have basic knowledge... But everything else should come out of the box. They should have enough knowledge to know where the dashboards are, how to set up alerts... But in our example, most of the times we just deploy something, and then you have a ton of very good observability goodies out of the box. Maybe it wasn't that easy in the past; it's very easy to do now. The ecosystem is in a very, very good position to be able, with a very small DevOps team, to be able to support a big engineering team out of the box. + +**Tom Wilkie:** I guess what is it about Kubernetes in particular that's made that possible? What is it about the infrastructure and the runtime and all of the goodies that come with Kubernetes that mean observability can be more of a service that a platform team offers, and not something every individual engineer has to care about? + +**Deo:** Alright, so if we talk about ownership, for me it shouldn't be different. It should be owned by the team who writes this kind of stuff. Now, why Kubernetes? It's special. Maybe it's not. Maybe the technology is just like going other ways. But I think now we're in a state where the open source community is very passionate about this, people know that you should do proactive monitoring, you should care... And now Kubernetes - what it did that was very nice, and maybe sped up, like maybe made this easier - healing. Auto-healing now is a possibility. So as an engineer, maybe you don't need to care that much about what's going on. You should though know how to fix it, how to fix it in the future...
And if you own it, by the end of the day things will be easier tomorrow. + +So what we can have -- we'll have many dashboards, many alerts, and it's easy for someone to pick this up. By the end of the day, it's like a million different services and stack underneath. But all this complexity somehow has been hidden away. So engineers now they're supposed to know a few more things, but not terribly enough \[unintelligible 00:05:59.03\] Maybe that was not like in the past. But it is possible now. And partially, it's because of the community over there. How passionate the community is lately when it comes to infra and observability and monitoring. + +**Vasil Kaftandzhiev:** \[06:15\] It's really interesting how on top of the passion and how Kubernetes has evolved in the last 10 years, something more has evolved, and this is the cloud provider bills for resources, which is another topic that comes to mind when we're talking about monitoring Kubernetes. It is such a robust infrastructure phenomenon, that touches absolutely every part of every company, startup, or whatever. So on top of everything else, developers now usually have the responsibility to think about their cloud bill as well, which is a big shift from the past till now. + +**Deo:** You're right, Vasil. However, it's a bit tricky. One of the things we'll probably talk about is it's very easy to have monitoring and observability out of the box. But then cost can be a difficult pill to swallow in the long run. Many companies, they just -- I think there are many players now in the observability field. The pie is very, very big. And many companies try to do many things at the same time, which makes sense... But by the end of the day, I've seen many cases where it's getting very, very expensive, and it scales a lot. So cost allocation and cost effectiveness is one of the topics that loads of companies are getting very worried about. + +**Tom Wilkie:** Yeah, I think understanding the cost of running a Kubernetes system is an art unto itself. I will say though, there are certain bits, there's certain aspects of Kubernetes, certain characteristics that actually make this job significantly easier. I'm trying to think about the days when I deployed jobs just directly into EC2 VMs, and attributing the cost of those VMs back to the owner of that service was down to making sure you tagged the VM correctly, and then custom APIs and reporting that AWS provided. And let's face it, half the teams didn't tag their VMs properly, there was always a large bucket of other costs that we couldn't attribute correctly... And it was a nightmare. + +And one of the things I definitely think has got better in the Kubernetes world is everything lives in a namespace. You can't have pods outside of namespaces. And therefore, almost everything, every cost can be attributed to a namespace. And it's relatively easy, be it via convention, or naming, or extra kind of labeling and extra metadata, to attribute the cost of a namespace back to a service, or a team. And I think that for me was the huge unlock for Kubernetes cost observability, was just the fact that this kind of attribution is easier. I guess, what tools and techniques have you used to do that yourselves? + +**Deo:** Right. So I don't want to sound too pessimistic, but unfortunately it doesn't work that nicely in reality.
So first of all, cloud providers, they just - I think they enable this functionality to be supported out of the box (maybe it's been a year) \[unintelligible 00:09:22.25\] And GCP just last year enabled cost allocation out of the box. So it means you have your deployment in a namespace, and then you're wondering, "Okay, how much does this deployment cost? My team owns five microservices. How much do we pay for it?" And in the past, you had to track it down by yourself. Now it's just only lately that cloud providers enable this out of the box. + +So if you have these nice dashboards there, and then you see "My service costs only five pounds per month", which is very cheap, there is an asterisk that says "Unfortunately, this only means your pod requests." Now, our engineers, like everyone else, it takes a lot more effort to have your workloads having the correct requests, versus limits, so it's very easy by the end of the day to have just a cost which is completely a false positive. Unfortunately, for me at least, it has to do with ownership. And this is something that comes up again and again and again in our company. + +\[10:25\] Engineers need to own their services, which means they need to care about requests, and limits. And if those numbers are correct - and it's very difficult to get them right - then the cost will be right as well. It's very easy just to have dashboards for everyone, and then these dashboards will be false positives, and then everyone will wonder why dashboards \[unintelligible 00:10:44.11\] amount of money, they will pay 10x... It's very difficult to get it right, and you need a ton of iterations. And you need to push a lot, then you need to talk about it, be very vocal, champion when it comes to observability... And again, it's something that comes up again and again. + +If you are a champion of observability, sometimes it's going to be cost allocations, sometimes it's going to be requests and resources, sometimes it's going to be dashboards, sometimes it's going to be alerts, and then they keep up. Because when you set up something right, then always there is like the next step you can do - how can we have those cost allocations proactively? How can we have alerts based on that? How can we measure them, and get alerted, and then teach people, and engage people? They're very difficult questions, and it's really difficult to answer them correctly. And I don't think still cloud providers are there yet. We're still leading. + +**Tom Wilkie:** I'm not disagreeing at all, but a lot of this doesn't rely on the cloud provider to provide it for you, right? A lot of these tools can be deployed yourself. You can go and take something like OpenCost, run that in your Kubernetes cluster, link it up to a Prometheus server and build some Grafana dashboards. You don't have to wait for GCP or AWS to provide this report for you. That's one of the beauties of Kubernetes, in my opinion. This ability to build on and extend the platform, and not have to wait for your service provider to do it for you is like one of the reasons why I'm so passionate about Kubernetes. + +**Deo:** Completely agree. And again, I don't want to sound pessimistic. OpenCost is amazing software. The problem starts when not everything is working with OpenCost. So for example, buckets - you don't get \[unintelligible 00:12:27.22\] Auto-scaling is a very good example. So if you say "I have this PR on Terraform, and then it will just increase auto-scaling from three nodes to 25 nodes.
OpenCost, how much will this cost?" OpenCost will say "You know what? You are not introducing any new costs, so it's zero." Fair enough. \[unintelligible 00:12:48.12\] is going to be deployed. But then if your deployment auto-scales to 23 nodes, it's going to be very expensive. So while the technology is there, it's still -- at least in my experience, you need to be very vocal about how you champion these kinds of things. And it's a very good first step, don't get me wrong. It's an amazing first step, okay? We didn't have these things in the past, and they come from the open source community, and they're amazing. And when you link all of these things together, they make perfect sense. And they really allow you to scale, and like deploy stuff in a very good way, very easily. But we still -- it needs a lot of effort to make them correct. + +**Vasil Kaftandzhiev:** I really love the effort reference. And at the present point, if we're talking about any observability solutions, regardless of whether it is OpenCost for cost, or general availability, or health etc., we start to blend the knowledge of the technology quite deep into the observability stack and the observability products that are there. And this is the only way around it. And with the developers and SREs wearing so many superhero capes, the only way forward is to provide them with some kind of robust solutions to what they're doing. I'm really amazed by the complexity and freedom and responsibilities that you people have. It's amazing. As Peter Parker's uncle said, "With a lot of power, there is a lot of responsibility." So Deo, you're Spiderman. + +**Deo:** \[14:25\] I completely agree. One of the things that really works out, and I've seen it -- because you're introducing all these tools, and engineers can get a bit crazy. So it's very nice when you hide this complexity. So again, they don't need to know about OpenCost, for example. They don't need to know about dashboards with cost allocation. You don't need to do these kinds of things. The only thing they need to know is that if they open a PR and they add something that will escalate cost, it will just fail. This is the only thing they need to know. That you have measures in there, policies in there that will not allow you to have a load of infra cost. + +Or, then something else we're doing is once per month we just have some very nice Slack messages about "This is a team that spent less money this quarter, or had this very big saving", and then that could champion people... Because they don't need to know what is this dashboard. By the way, it's a Grafana dashboard. They don't need to know about these kinds of things. They only need to know "This sprint I did something very good, and someone noticed. Okay. And then I'm very proud of it." So if people are feeling proud about their job, then the next thing, without you doing anything, they could try to become better at it. And then they could champion it to the rest of the engineers. + +**Vasil Kaftandzhiev:** There is an additional trend that I'm observing tied to what you're saying, and this is that engineers start to be focused so much on cost, that this damages the reliability and high availability, or any availability of their products... Which is a strange shift, and a real emphasis on the fact that sometimes we neglect the right thing, and this is having good software produced. Yeah. + +**Tom Wilkie:** Yeah, you mentioned a policy where you're not allowed to increase costs. We have the exact opposite policy.
The policy in Grafana Labs is like in the middle of an incident, if scaling up will get you out of this problem, then do it. We'll figure out how to save costs in the future. But in the middle of like a customer impacting problem, spend as much money as you can to get out of this problem. It's something we have to actively encourage our engineering team to do. But 100%, the policy not to increase costs is like a surefire way to over-optimize, I think. + +**Deo:** We have a reason, we have a reason. So you have a very good point, and it makes perfect sense. In our case though, engineers, they have free rein. They completely own their infrastructure. So this means that if there's a bug, or something, or technical debt, it's very easy for them to go and scale up. If you're an engineer and you have to spend like two days fixing a bug or add a couple of nodes, what do you do? Most of the times people will not notice. So having a policy over there saying "You know what? You're adding 500 Euros of infrastructure in this PR. You need someone to give an approval." It's not like we're completely blocking them. + +And by the way, we caught some very good infrastructural bugs out of these. Engineers wanted to do something completely different, or they said "You know what? You're right. I'm going to fix it in my way." Fix the memory leak, instead of add twice the memory on the node \[unintelligible 00:17:35.22\] Stuff like that. But if we didn't have this case, if engineers were not completely responsible for it, then what you say makes perfect sense. + +**Mat Ryer:** Yeah, this is really interesting. So just going back, what are the challenges specifically? What makes it different monitoring Kubernetes? Why is it a thing that deserves its own attention? + +**Tom Wilkie:** \[18:01\] I would divide the problem in two, just for simplicity. There's monitoring the Kubernetes cluster itself, the infrastructure behind Kubernetes that's providing you all these fabulous abstractions. And then there's monitoring the applications running on Kubernetes. So just dividing it in two, to simplify... When you're looking at the Kubernetes cluster itself, this is often an active part of your application's availability. Especially if you're doing things like auto-scaling, and scheduling new jobs in response to customer load and customer demand. The availability of things like the Kubernetes scheduler, things like the API servers and the controller managers and so on. This matters. You need to build robust monitoring around that, you need to actively enforce SLOs around that to make sure that you can meet your wider SLO. + +We've had outages at Grafana Labs that have been caused by the Kubernetes control plane and our aggressive use of auto-scaling. So that's one aspect that I think maybe doesn't exist as much if you're just deploying in some VMs, and using Amazon's auto-scalers, and so on. + +I think the second aspect though is where the fun begins. The second aspect of using all of that rich metadata that the Kubernetes server has; the Kubernetes system tells you about your jobs and about your applications - using that to make it easier for users to understand what's going on with their applications. That's where the fun begins, in my opinion. + +**Deo:** Completely agree. If you say to engineers "You know what? Now you can have very nice dashboards about the CPU, the nodes, and throughput, and stuff like that", they don't care. If you tell them though that "You know what? 
You can't talk about these kinds of things without Prometheus." So if you tell them "You know what? All of these things are Prometheus metrics. And we just expose \[unintelligible 00:19:49.05\] metrics, and everything is working", they will look there. If you tell them though that "You know what? If you expose your own metrics, that says scale based on memory, or scale based on traffic." Either way, they become very intrigued, because they know the bottleneck of their own services; maybe it is how many people visit the service, or how well it can scale under certain circumstances, based on queues... There are a ton of different services. So if you tell them that "You know what? You can scale based on the services. And by the way, you can have very nice dashboards that go with CPU memory, and here is your metric as well", this is where things become very interesting as well. + +And then you start implementing new things like a pod autoscaler, or a vertical pod autoscaler. Or "You know what? This is the service mesh, what it looks like, and then you can scale, and you can have other metrics out of the box." And we'll talk about golden metrics. + +So again, to take a step back... Most engineers, they don't have golden metrics out of the box. And that is a very big minus for most of the teams. Some teams, they don't care. But golden metrics means like throughput, error rate, success rate, stuff like that... Which, in the bigger Kubernetes ecosystem you can have them for free. And if you scale based on those metrics, it's an amazing, powerful superpower. You can do whatever you want as an engineer if you have them, and you don't even need to care where those things are allocated, how they're being stored, how they're being served, stuff like that. You don't need to care. You only need some nice dashboards, some basic high-level knowledge about how you can expose them, or how you can use them, and then just to be a bit intrigued, so you can make them the next step and like scale your service. + +**Tom Wilkie:** You said something there, like, you can get these golden signals for free within Kubernetes. How are you getting them for free? + +**Deo:** Okay, so if you are on GCP, or if you are in Amazon, most of those things have managed service mesh solutions. + +**Tom Wilkie:** I see. So you're leveraging like a service mesh to get those signals. + +**Deo:** Yes, yes. But now with GCP it's just one click, in Amazon it's one click, Linkerd is just a small Helm deploy... It's no different than the Prometheus operator, and stuff like that. + +**Tom Wilkie:** \[22:09\] Yeah. I don't think it's a badly kept secret, but I am not a big fan of service meshes. + +**Deo:** I didn't know that... Why? + +**Tom Wilkie:** Yeah, I just don't think the cost and the benefits work out, if I'm brutally honest. There's definitely -- and I think what's unique, especially about my background and the kind of applications we run at Grafana Labs, is honestly like the API surface area for like Mimir, or Loki, or Tempo, is tiny. We've got two endpoints: write some data, run a query. So the benefit you get from deploying a service mesh - this auto-instrumentation kind of benefit that you describe - is really kind of small, because it's trivial to instrument those things on Mimir and Loki. And the downside of running a service mesh, which is the overheads, it's the complexity, the added complexity, the increased unreliability... There's been some pretty famous outages triggered by outages in the service mesh...
For an application like Mimir and Loki and a company like Grafana Labs I don't think service meshes are worth the cost. So we've tended to resort to things like baking in the instrumentation into the shared frameworks that we use in all our applications. But I just wanted to -- I want to be really clear, I don't think Kubernetes gives you golden signals out of the box, but I do agree with you, service meshes do, a hundred percent. + +**Deo:** It is an interesting approach. So it's not the first time I'm hearing these kinds of things, and one of the reasons - we were talking internally about service mesh for at least six months... So one of the things I did in order to make the team feel more comfortable, we spent half of our Fridays, I think for like three or four months, reading through incident reports around service meshes. It was very interesting. So just in general, you could see how something would happen, and how we would react, how we would solve it... And it was a very interesting case. And then we've found out that most of the times we could handle it very nicely. + +Then the other thing that justified a service mesh for us is that most of our -- we are having an engineering team of 100 people. And still, people could not scale up. They could not use Prometheus stuff. They could not use HPA properly, because they didn't have these metrics. So this is more complex... Anyway, we're using Linkerd, which is just -- it's not complex. We are a very small team. It's not about complex. It's not more complex than having a Thanos operator, or handling everything else. Again, it has an overhead, but it's not that much more complex. However, the impact it had to the engineering team having all those things out of the box - it was enormous. + +And one last thing - the newest Kubernetes version, the newest APIs will support service mesh out of the box. So eventually, the community will be there. Maybe it's going to be six months, maybe it's going to be one year. I think that engineering teams that are familiar with using those things, that embrace these kinds of services, they will be one step ahead when Kubernetes supports them out of the box... Which is going to be very, very soon. Maybe next year, maybe sooner than that. + +**Tom Wilkie:** Yeah. I mean, I don't want to \*bleep\* on service meshes from a great distance. There are teams in Grafana Labs that do use service meshes, for sure. Our global kind of load balancing layer for our hosted Grafana service uses Istio, I believe. And that's because we've got some pretty complicated kind of routing and requirements there that need a sophisticated system. So no, I do kind of generally take the point, but I also worry that the blanket recommendation to put everything on a service mesh - which wasn't what you were saying, for sure... But I've definitely seen that, and I think it's more nuanced than that. + +**Mat Ryer:** \[25:59\] But that is a good point. Deo, if you have like a small side project, is the benefits of Kubernetes and service mesh and stuff, is it's so good that you will use that tech even for smaller projects? Or do you wait for there to be a point at which you think "Right now it's worth it"? + +**Deo:** Obviously, we'll wait. We don't apply anything to it just because of the sake of applying stuff. We just take what the engineering teams need. In our case, we needed \[unintelligible 00:26:21.07\] we really needed these kinds of things. We needed \[unintelligible 00:26:26.26\] metrics. 
We needed people to scale based on throughput. We needed people to be aware of the error rates. And we needed to have everything in dashboards, without people having to worry about these kinds of things. But the goodies that we got out of the box - they're amazing. So for example, now we can talk about dev environments, because we have moved all this complexity away to the service mesh. We're using traffic split, which again, is a Kubernetes-native API now. + +So probably this is where the community will be very, very soon, but I think \[unintelligible 00:27:03.13\] DevOps on our team, it's in a state where -- we're in a very good state. So we need to work for the engineering needs one year in advance. And people now struggle with dev environments, releasing stuff sooner. Observability - we have solved it at the high level, but at the lower level still people struggle to understand when they should scale, how they can auto-heal, stuff like that. And service meshes give you a very good out of the box thing. But again, we don't implement things unless we really need them... Because every bit of technology that you add, it doesn't matter how big or small your team is, it adds complexity. And you need to maintain it, and to have people passionate about it; you have to own it. + +One other thing that I have found out that's maybe not working, at least in the engineering department, is that people, they often change positions. And Grafana is a very big player, so it has some very powerful, passionate people... But for the rest of the engineering teams it's not the same. So you may have engineers jump every year, a year and a half... So sometimes it's not easy to find very good engineers, who are very passionate. \[unintelligible 00:28:14.11\] own it, and then help scale it further. So it is challenging, I completely agree. + +**Tom Wilkie:** Yeah, I 100% agree. We encourage our engineers to move around teams a lot as well. And I think all really strong engineering teams have that kind of mobility internally. I think it's very important. I just want to -- you talked a lot about auto-scaling, and I do think auto-scaling is a great way, especially with the earlier discussion about costs... It's a great way to achieve this kind of infrastructure efficiency. But two things I want to kind of pick up on here. One is auto-scaling existed before Kubernetes. Right? I think everyone who's kind of an expert in EC2 and load balancers and auto-scaling groups is sitting there, shouting at the podcast, going "We used to do this before Kubernetes!" So what is it about Kubernetes that makes us so passionate about auto-scaling? Or is it just the standard engineering thing that everything old is new again, and this is all cyclical? + +**Deo:** Could you in the past auto-scale easily based on the throughput, and stuff? I'm not sure. + +**Tom Wilkie:** Yeah. Auto-scaling groups on Amazon were fantastic at that. + +**Deo:** Alright. And what about the rest of the cloud providers? + +**Tom Wilkie:** Yeah, I mean... Are there other cloud providers? That's a bad joke... Yeah, no, you know, Google has equivalent functionality in their VM platform, for sure. I do think -- you do kind of make a good point... I think it's kind of similar to the OpenCost point we made earlier, of like Kubernetes has made it so that a lot of these capabilities are no longer cloud provider-specific. You don't have to learn Google's version of auto-scaling group, and Azure's version of auto-scaling group, and Amazon's auto-scaling groups.
There is one way -- the auto-scaling in Kubernetes, there's basically one way to configure it, and it's the same across all of the cloud providers. I think that's one of the reasons why auto-scaling is potentially more popular, for sure. + +**Deo:** Very good point. + +**Tom Wilkie:** \[30:15\] But I would also say, you've talked a lot about using custom metrics for auto-scaling, and using like request and latency and error rates to influence your auto-scaling decisions... There's a bit of like accepted wisdom, I guess, that actually, I think CPU is the best signal to use for auto-scaling in 99% of use cases. And honestly, the number of times -- even internally at Grafana Labs, the number of times people have tried to be too clever, and tried to second-guess and model out what the perfect auto-scaling signals are... And at the end of the day, really boring, plain CPU consumption gives you much better behavior. + +And I'll finish my point - again, not really a question, I guess... But I think the reason why CPU is such a good signal to use for auto-scaling... Because if you think what auto-scaling actually achieves is adding more CPU. You're basically scaling up the number of CPUs you're giving a job by scaling up the replicas of that job. And so by linking those two directly, like CPU consumption to auto-scaling, and just giving the auto-scaler a target CPU utilization, you actually can end up with a pretty good system, without all of this complexity of like custom metrics, and all of the adapters needed to achieve that. But again, I'm just kind of being -- I'm kind of causing an argument for the fun of the listeners, I guess... But what do you think, Deo? + +**Deo:** I think it's a very good point. I want to touch a bit on vertical auto-scaling. Are you aware of vertical auto-scaling? + +**Tom Wilkie:** Yeah, I think it's evil. + +**Mat Ryer:** It's where the servers stack up on top of each other, instead of -- + +**Deo:** So vertical pod auto-scaling has to do with memory. So CPU is the easy stuff. So if you have a pod, and then it reaches 100% CPU, it just means that your calls will be a bit slower. With memory though, it's different. Pods will die out of memory, and then you'll get like many nice alerts, and your on-call engineers will add more memory. And then most engineering teams do the same. So pod dies of memory, you add more memory. Pod dies of memory, you add more memory. And then sometimes, like after years, you say "You know what? My costs on GCP, they are enormous. Who is using all those nodes?" And then people see that the dashboards, they see that they don't use a lot of memory, so they \[unintelligible 00:32:34.19\] and then you come out of memory alerts. + +So vertical pod auto-scaling does something magical. So you have your pod, with a specific amount of memory; if it dies, then the auto-scaler will put it back in the workload, with a bit more memory. And then if it dies again, it will just increase memory. So if your service can survive pods dying, so it goes back to the queue or whatever, it means that overall you have a better cost allocation, because your pods go up and down constantly with memory consumption. So this is a good thing to discuss when it comes to an auto-scaler. + +**Tom Wilkie:** No, a hundred percent. And to add to that, I actually think it's poorly named. I think calling it the vertical pod auto-scaler -- what it's actually doing is just configuring your memory requests for you.
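Tom's CPU-target argument maps directly onto the rule the horizontal pod autoscaler documents: desired replicas scale with the ratio of observed to target utilization. Here is a minimal Go sketch of that rule; the tolerance value matches the controller's default, but the example numbers are just illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas is the core of the horizontal pod autoscaler's rule for a
// CPU utilization target, per the Kubernetes documentation:
//   desired = ceil(current * currentUtilization / targetUtilization)
// Because adding replicas adds CPU more or less linearly, the loop converges
// on the target without any custom-metrics machinery.
func desiredReplicas(current int, currentUtilization, targetUtilization float64) int {
	if current == 0 || targetUtilization <= 0 {
		return current
	}
	ratio := currentUtilization / targetUtilization
	// The real controller applies a tolerance (10% by default) so small
	// fluctuations don't cause constant churn.
	if math.Abs(ratio-1.0) <= 0.10 {
		return current
	}
	return int(math.Ceil(float64(current) * ratio))
}

func main() {
	// Six pods averaging 90% CPU against a 60% target scale out to nine.
	fmt.Println(desiredReplicas(6, 0.90, 0.60))
}
```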
And classing that as auto-scaling is kind of -- you know, in my own head at least, I think that's more about just automatically configured requests. So we do that, we use that; of course we do. + +I do generally think, to your earlier point, teams that really, really care about their requests, and about the shape and size of their jobs, we've seen internally they shy away from using the vertical pod auto-scaler, because they -- and it is a huge factor in dictating the cost of their service. And then there's the teams that just want to put a service up, and just want it to run, and the service is tiny and doesn't consume a lot of resources... 100% pod vertical auto-scaling is the perfect technology for them. + +**Deo:** \[34:07\] You're right. It's fine. For most of the engineers, having something there that auto-scales up and down, it's fine. But for the few services that will really benefit for your spikes in traffic, or in any Prometheus metric that you expose, it makes a very big difference. Sometimes it makes a very big difference. But this is something that most people don't care about. They'll have hundreds of microservices in there, and they will work out of the box with everything you're having in a nice way, so you don't have to worry about it. But the most difficult part is to have people know that this is possible if they want to implement it, to know where they will go and have the monitoring tools to see what's going on, their dashboards, and where they can see stuff... And then if you are cheeky enough, you can have alerts there as well, to say "You know what? You haven't scaled enough", or "Your services have a lot of CPU utilization. Maybe you should adjust it", and stuff like that, in order for them to be aware that this is possible. And again, you need people that are very passionate. If people are not passionate about what they're doing, or they don't own the services, it's very difficult. It doesn't scale very well with engineering teams. + +**Mat Ryer:** Alright, let's get into the hows, then. How do we do this? Tell me a bit about doing this in Kubernetes. Has this changed a lot, or was good observability baked in from the beginning, and it's evolving and getting better? + +**Tom Wilkie:** I think one of the things I'm incredibly happy about with Kubernetes is like all of the Kubernetes components effectively from the very early days were natively instrumented with really high-quality Prometheus metrics. And that relationship between Kubernetes and Prometheus dates all the way back to kind of their inception from kind of that heavy inspiration from the internal Google technology. So they're both inspired by -- you know, Kubernetes by Borg, and Prometheus by Borgmon... They both heavily make use of this kind of concept of a label... And things like Prometheus was built for these dynamically-scheduled orchestration systems, because it heavily relies on this kind of pull-based model and service discovery. So the fact that these two go hand in hand -- I one hundred percent credit the popularity of Prometheus with the popularity of Kubernetes. It's definitely a wave we've been riding. + +But yeah, that's like understanding the Kubernetes system itself, the stuff running Kubernetes, the services behind the scenes... 
But coming back to this kind of thought, this concept of Kubernetes having this rich metadata about your application -- you know, your engineers have spent time and effort describing the application to Kubernetes in the form of like YAML manifests for deployments, and stateful sets, and namespaces, and services, and all of this stuff gets described to Kubernetes... One of the things that I think makes monitoring Kubernetes quite unique is that description of the service can then be effectively read back to your observability system using things like kube-state-metrics. So this is an exporter for the Kubernetes API that will tell Prometheus "Oh, this deployment is supposed to have 30 replicas. This deployment is running on this machine, and is part of this namespace..." It'll give you all of this metadata about the application. And it gives it to you as metrics, as Prometheus metrics. This is quite unique. This is incredibly powerful. And it means the subsequent dashboards and experiences that you build on top of those metrics can actually just natively enrich, and -- you know, things like CPU usage; really boring. But you can actually take that CPU usage and break it down by service, really easily. And that's what I think gets me excited about monitoring Kubernetes. + +**Deo:** \[38:11\] I agree one hundred percent. Yeah. And the community has stepped up a lot. I had an ex-colleague of mine who was saying DevOps work is so easy these days, because loads of people in the past have made such a big effort to give to the community all those nice dashboards and alerts that they want out of the box. + +Now, I just want to add to what Tom said that even though kube-state-metrics and Prometheus are doing such a good job, like native integration with Kubernetes, it's not enough, in most cases. I have a very good example to showcase this. Let's say one of the nodes goes down, and you get an alert, and then you know that a few services are being affected... And then you ask engineers to drop in a call, and \[unintelligible 00:39:00.07\] and then start seeing what's wrong... Unless you give them the tools to easily figure out what is wrong, it's not enough. And in our case -- actually, I think in most cases - you need a single place to have dashboards, from kube-state-metrics, Prometheus metrics, but also logs. You need logs. And then you need performance metrics, you need your APM metrics... + +So I think the Grafana ecosystem is doing a very, very good job. And I'm not doing an advertisement, I'm just saying what we're using there. But in our case, we have very good dashboards, that have all the Prometheus metrics, and then they have Loki metrics, and then traces. You have your traces in there, and then you can jump from one to another... And then we have Pyroscope now as well... So dashboards that people are aware of, where they can jump in and right out of the box find out what is wrong - it's very powerful. And they don't need to know about what is Pyroscope, and what's profiling. They don't need to know these kinds of things. You just need to give them the ability to explore the observability bottlenecks in their applications. + +**Tom Wilkie:** Oh, I one hundred percent agree. I would add - like, this extra structure to your application, that's metadata that can be exposed into your metrics. This makes it possible to develop a layer of abstraction, and therefore common solutions on top of that abstraction.
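To make the kube-state-metrics point concrete, here is a small Go sketch using the Prometheus API client to ask exactly the kind of "desired vs. actual" question Tom describes. The Prometheus address is a placeholder, and the query assumes the standard kube-state-metrics series names; treat it as an illustration rather than a recommended production query.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// The address is a placeholder - point it at whatever Prometheus scrapes
	// your kube-state-metrics deployment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// kube-state-metrics exposes the desired state as series, e.g. how many
	// replicas each Deployment is supposed to have. Comparing that with what
	// is actually available is the "read the spec back" idea in one query.
	const query = `kube_deployment_spec_replicas != kube_deployment_status_replicas_available`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// Every sample returned is a deployment that is short of its desired replicas.
	fmt.Println(result)
}
```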
And I'm talking in very general terms, but I specifically mean there's a really rich set of dashboards in the Kubernetes mixin, that like work with pretty much any Kubernetes cluster, and give you the structure of your application running in Kubernetes. And you can see how much CPU each service uses, what hosts they're running on. You can drill down into this really, really straightforward and easily. And there's a project internally at Grafana Labs to try and effectively do the same thing, but for EC2, and the metadata is just not there. Even if we use something like YACE, the Yet Another Cloudwatch Exporter, to get as many metrics out of the APIs as we can, you're not actually teaching EC2 about the structure of your application, because it's all just a massive long list of VMs. And it makes that -- you know, the application that we've developed to help people understand the behavior of the EC2 environment is nowhere near as intuitive and easy to use as the Kubernetes one, because the metadata is not there. + +So I just really want to -- I think this is like the fifth time I've said it, that that metadata that Kubernetes has about your application, if you use that in your observability system, it makes it easier for those developers to know what the right logs are, to know where the traces are coming from, and it gives them that mental model to help them navigate all of the different telemetry signals... And if there's one thing you take away from this podcast, I think that's the thing that makes monitoring and observing Kubernetes and applications running in Kubernetes easier, and special, and different, and exciting. + +**Mat Ryer:** \[42:05\] These things also paid dividends, because for example we have the Sift technology - something I worked on in Grafana - which essentially is only really possible... You know, the first version of it was built for Kubernetes, because of all that metadata. So essentially, when you have an alert fire, it's a bot, really, that goes and just checks a load of common things, like noisy neighbors, or looks in the logs for any interesting changes in logs, and things, and tries to surface them. And the reason that we chose Kubernetes first is just because of that metadata that you get. And we're making it work -- we want to make it work for other things, but it's much more difficult. So yeah, I echo that. + +**Vasil Kaftandzhiev:** It's really exciting how Kubernetes is making things complex, but precise. And on top of everything, it gives you the -- not the opportunity, but the available tools and possibilities to actually manage it precisely. If you have either a good dashboard, good team, someone to own it etc. so you can be precise with Kubernetes. To actually check that you should be precise. + +**Tom Wilkie:** Yeah. A hundred percent. And just to build on what Mat said - no podcasts in this day and age would be complete without a mention of Gen AI and LLMs. We've also found in our internal experiments with these kinds of technologies that that meta data is key to helping the AI understand what's going on, and make reasonable inferences and next steps. So giving the metadata to ChatGPT before you ask a question about what's going on in your Kubernetes cluster has been an unlock, right? There's \[unintelligible 00:43:45.20\] a whole project built on this as well, that's actually seen some pretty impressive use cases. + +So yeah, I think this metadata is more than just about observability. 
Like, it's actually -- the abstraction and that unlock is one of the reasons why Kubernetes is so popular, I think. And you said something, Deo, which I thought was really interesting... You started to talk about logs and traces. How are you using this metadata in Kubernetes to make correlating between those signals easier for your engineers? + +**Deo:** \[unintelligible 00:44:15.18\] So one of the bottlenecks is not having any labels, not having anything. The other can be having too many of them. So you have many clusters, you have hundreds of applications... It's very often the case where it's very busy, and people cannot find quickly what's going on. So we cannot have this conversation without talking about exemplars. So exemplars is something that we -- it unlocked our engineering department for really figuring out what was wrong and what they really needed. So exemplars, they work with traces. And -- + +**Mat Ryer:** Deo, before you carry on, can you just explain for anyone not familiar, what is an exemplar? + +**Deo:** Yeah, sure. So an exemplar is just a trace, but it's a trace with a high cardinality. So when something is wrong, when you have let's say a microservice that has thousands of requests per second, how can you find which request is a problematic one? So you have latency. Clients are complaining about your application is slow. But then you see in your dashboards most of them are fine. How can you find the trace, the code that was really problematic? This is where exemplars are coming. And out of the box, it means that it can find the traces with the biggest throughput, and then you have a nice dashboard, and then you have \[unintelligible 00:45:34.25\] and then when you click this node, you can right away go to the trace. + +And then after the trace, everything is easy. With the trace, you can go to Pyroscope and see the profiling, or you can go to the pod logs, or you can go to the node with Prometheus metrics... So everything is linked. So as long as you have these problematic trace, everything else is easy. + +And this is what really unblocked us, because it means that when something goes wrong, people don't have to spend time figuring out where can be the problematic case, especially if you have a ton of microservices, a chain of microservices. + +\[46:11\] So yeah, exemplars was something that really, really unblocked us. Because it's really easy to have many dashboards; people are getting lost in there. You don't need many dashboards. You just give some specific ones, and people should know, and you should give them enough information to be able to do their job when they need to, very, very easily. And exemplars was really extremely helpful for us. + +**Tom Wilkie:** Yeah, I'm a huge fan of exemplars. I think it's a big unlock, especially in those kinds of debugging use cases that you describe. I will kind of - again, just to pick you up there... There's nothing about exemplars that's Kubernetes-specific. You can 100% make that work in any environment. Because the linkage between the trace and the metrics is just a trace ID. You're not actually leveraging the Kubernetes environment. I mean, there are things about Kubernetes that make distributed tracing easier, especially if you've got like a service mesh, again. I definitely get that. But yeah, exemplars work elsewhere. 
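For readers who want to see what wiring up exemplars looks like in practice, here is a minimal Go sketch using prometheus/client_golang and OpenTelemetry trace context. The metric name and port are made up, and note that Prometheus only ingests exemplars over the OpenMetrics exposition format, which is why it is enabled on the handler.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/trace"
)

// Latency histogram; the exemplar (a trace ID) is attached per observation.
var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "myapp_request_duration_seconds", // hypothetical metric name
	Help:    "Request latency.",
	Buckets: prometheus.DefBuckets,
})

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() {
		elapsed := time.Since(start).Seconds()
		// If the request carries a sampled trace, record its ID as an
		// exemplar so a dashboard can jump from this latency bucket straight
		// to the trace that produced it.
		if sc := trace.SpanContextFromContext(r.Context()); sc.IsSampled() {
			if eo, ok := requestDuration.(prometheus.ExemplarObserver); ok {
				eo.ObserveWithExemplar(elapsed, prometheus.Labels{"trace_id": sc.TraceID().String()})
				return
			}
		}
		requestDuration.Observe(elapsed)
	}()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	// Prometheus only scrapes exemplars via the OpenMetrics format, so it has
	// to be enabled on the /metrics handler.
	http.Handle("/metrics", promhttp.HandlerFor(prometheus.DefaultGatherer,
		promhttp.HandlerOpts{EnableOpenMetrics: true}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```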
+ +It's the ones that I think -- the places that are kind of Kubernetes-enhanced, if you like, in observability, is making sure that your logs and your metrics and your traces all contain consistent metadata identifying the job that it came from. So this was actually the whole concept for Loki. Five years ago, when I wrote Loki, it was a replacement for Kubernetes logs, for kubectl logs. That was our inspiration. So having support for labels and \[unintelligible 00:47:40.07\] and having that be consistent -- I mean, not just consistent, but literally identical to how Prometheus does its labeling and metadata, was the whole idea. And having that consistency, having the same metadata allows you to systematically guarantee that for any PromQL query that you give me, any graph, any dashboard, I can show you the logs for the service that generated that graph. And that only works on Kubernetes, if I'm honest. Trying to make that work outside of Kubernetes, where you don't have that metadata is incredibly challenging, and ad hoc. + +**Deo:** Exactly. When everything fits together, it's amazing. When it works, it's amazing. Being an engineer and being able to find out what is wrong, how you can fix it, find the pod... And by the way, auto-scaling can fix it; it's a superpower in your engineer. And you don't need to own all these technologies. You just need to know what your service is doing and how you can benefit out of it. + +One other thing as well is that those things are cheap. You may have seen, there are a ton of similar solutions out there. Some of them may be very expensive \[unintelligible 00:48:52.10\] The thing with Loki, and stuff is they are very cheap as well, so they can scale along with your needs... Which is critical. Because lately -- I hear this all the time; "efficiency", it's the biggest word everyone is using. You need to be efficient. So all these things are very nice to have, but if you are not efficient with your costs, eventually -- if they're not used enough, or they're very expensive, people eventually will not use them. So efficiency is a key word here. How cheap it can be, and how very well \[unintelligible 00:49:22.03\] + +**Tom Wilkie:** Yeah. And I don't want to be that guy that's always saying "Well, we always used to be able to do this." If you look at like the traditional APM vendors and solutions, they achieved a lot of the experience that they built through consistently tagging their various forms of telemetry. The thing, again, I think Kubernetes has unlocked is it's not proprietary, right? This is done in the open, this is consistent between different cloud providers and different tools, and has raised the level of abstraction for a whole industry, so that this can be done even between disparate tools. It's really exciting to see that happen and not just be some proprietary kind of vendor-specific thing. That's what's got me excited. + +**Deo:** \[50:09\] Okay. Now, Tom, you got me curious - what's your opinion about multicloud? + +**Tom Wilkie:** Grafana Labs runs on all three major cloud providers. We don't ever have a Kubernetes cluster span multiple regions or providers. Our philosophy for how we deploy our software is all driven by minimizing blast radius of any change. So we run our regions completely isolated, and effectively therefore the two different cloud providers, or the three different cloud providers in all the different regions don't really talk to each other... 
So I'm not sure whether that counts as multicloud proper, but we 100% run on all three cloud providers. We don't use any cloud provider-specific features. So that's why I like Kubernetes, because it's that abstraction layer, which means -- honestly, I don't think our engineers at Grafana Labs know which cloud provider a given region is running on. I don't actually know how to find out. I'm sure it's baked into one of our metrics somewhere... But they just have like 50-60 Kubernetes clusters, and they just target them and deploy to them. And again, when we do use cloud provider services beyond Kubernetes, like S3, GCS, these kinds of things, we make sure we're using ones that have commonality and similar services in all cloud providers. So pretty much we use hosted SQL databases, we use S3, we use load balancers... But that's about it. We don't use anything more cloud provider-specific than that, because we want that portability between clouds.
+
+**Deo:** And have you tried running dashboards for multiple cloud providers, for example for cost stuff?
+
+**Tom Wilkie:** Yeah, it's hard to show you a dashboard on a podcast... But yeah, 100%.
+
+**Mat Ryer:** You can just describe it, Tom.
+
+**Tom Wilkie:** Our dashboard for costs is cloud provider-agnostic. So we effectively take all the bills from all our cloud providers, load them into a BigQuery instance, render the dashboard off of that, and then we use Prometheus and OpenCost to attribute those costs back to individual namespaces, jobs, pods... And then aggregate that up to the team level. And if you go and look at this dashboard, it will tell you how much the Mimir team or how much the Loki team is spending. And that is an aggregate across all three cloud providers.
+
+The trickier bit there, as we kind of talked about earlier, is that OpenCost doesn't really do anything with S3 buckets. But we use -- I forgot what it's called... We use Crossplane to provision all of our cloud provider resources... And that gives us the association between, for instance, S3 bucket and namespace... And then we've just built some custom exporters to get the cost of those buckets, and do the join against that metadata so we can aggregate that into the service cost. But no, 100% multicloud at Grafana Labs.
+
+**Vasil Kaftandzhiev:** Talking about costs and multicloud, there are so many dimensions to cost in Kubernetes. There's the cloud resource cost, there's the observability cost, and there's an environmental cost that no one talks about... Or at least there is not such a broad conversation about it. Bearing in mind how quickly Kubernetes can scale on its own, what do you think about all of these resources going to waste, and producing not only waste on your bill, but waste for the planet as well, in terms of CO2 emissions, energy going to waste, and stuff like that?
+
+**Deo:** That's a very good question. I'm not sure I have thought about this that much, unfortunately. As a team, we try not to use a ton of resources, so we'll scale down a lot. We don't over-provision stuff... We try to reuse whatever is possible, using \[unintelligible 00:53:52.04\] and stuff... But mostly this is for cost-effectiveness, not about anything else. But this is a very good point. I wish more people were vocal about this. As with everything, if people are passionate, things can change, one step at a time... But yeah, that's an interesting point.
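+
+*Going back for a moment to Tom's point about custom exporters: here is a rough sketch of what one could look like in Go - a tiny exporter publishing a per-bucket cost gauge labeled with the owning namespace. The metric name, bucket names, prices and the hard-coded owner map are invented for illustration; in the setup Tom describes, the bucket-to-namespace association would come from Crossplane-managed metadata and the costs from the provider's billing data, so don't read this as Grafana Labs' actual exporter.*
+
+```go
+package main
+
+import (
+    "log"
+    "net/http"
+    "time"
+
+    "github.com/prometheus/client_golang/prometheus"
+    "github.com/prometheus/client_golang/prometheus/promhttp"
+)
+
+// bucketCost is a hypothetical metric, not one shipped by any real exporter.
+var bucketCost = prometheus.NewGaugeVec(
+    prometheus.GaugeOpts{
+        Name: "object_storage_bucket_hourly_cost_dollars",
+        Help: "Hourly cost of an object storage bucket, attributed to a namespace.",
+    },
+    []string{"bucket", "namespace"},
+)
+
+// fetchBucketCosts stands in for a call to the cloud provider's billing data.
+func fetchBucketCosts() map[string]float64 {
+    return map[string]float64{"mimir-blocks-prod": 12.40, "loki-chunks-prod": 8.75}
+}
+
+// bucketOwners stands in for the bucket-to-namespace association that would
+// normally be derived from provisioning metadata (e.g. Crossplane claims).
+var bucketOwners = map[string]string{
+    "mimir-blocks-prod": "mimir",
+    "loki-chunks-prod":  "loki",
+}
+
+func main() {
+    prometheus.MustRegister(bucketCost)
+
+    // Refresh the gauge periodically so the dashboard always has a recent cost.
+    go func() {
+        for {
+            for bucket, cost := range fetchBucketCosts() {
+                bucketCost.WithLabelValues(bucket, bucketOwners[bucket]).Set(cost)
+            }
+            time.Sleep(15 * time.Minute)
+        }
+    }()
+
+    http.Handle("/metrics", promhttp.Handler())
+    log.Fatal(http.ListenAndServe(":9090", nil))
+}
+```
+
+*Once a series like that carries the same `namespace` label the rest of the cost metrics use, rolling bucket costs up into the per-team totals Tom mentions is just an aggregation over that label in the dashboard.*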
+
+**Vasil Kaftandzhiev:** \[54:10\] For me it's really interesting how almost all of us take Kubernetes for granted... And as much as we are used to VMs, as much as we're used to bare metal, as much as we can imagine in our heads that this is something that runs in a datacenter, with a guard with a gun on his belt, we think of Kubernetes as solely an abstraction. And we think about all of the different resources that are going to waste as just digits in an Excel table, or in a Grafana Cloud dashboard.
+
+At the end of the day - I should be roughly right here - approximately 30% of all of the resources that go into powering Kubernetes are going to waste, according to the CNCF... Which is maybe a good conversation to have down the road, and I'm sure that it's going to come to us quicker than we expect.
+
+**Tom Wilkie:** I think the good news here is -- and I agree, one of the things that happens with these dynamically-scheduled environments is like a lot of the VMs that we ask for from our cloud provider have a bit of unallocated space sitting at the top of them. We stack up all of our pods, and they never fit perfectly. So there's always a little bit of wastage. And in aggregate, 100% agree, that wastage adds up.
+
+I think the 30% number from the CNCF survey - I think internally at Grafana Labs that's less than 20%. We've put a lot of time and effort into optimizing how we do our scheduling to reduce that wastage... But the good news here is like incentives align really nicely. We don't want to pay for unallocated resources. We don't want to waste our money. We don't want to waste those resources. And that aligns with not wasting the energy going into running those resources, and therefore not producing wasted CO2.
+
+So I think the good news is incentives align, and it's in users' and organizations' interest not to waste this, because at the end of the day if I'm paying for resources from a cloud provider, I want to use them. I don't want to waste them. But that's all well and good, saying incentives align... I will say, this has been a project at Grafana Labs to drive down unallocated resources to a lower percentage. It has been a project for the last couple of years, and it's hard. And it's taken a lot of experimentation, and it's taken a lot of work to just get it down to 20%... And ideally, it would be even lower than that.
+
+**Mat Ryer:** And I suppose it keeps changing, doesn't it?
+
+**Tom Wilkie:** Yeah, the interesting one we've got right now is - I think we've said this publicly before... The majority of Grafana Cloud used to be deployed on Google, and over the past couple of years we've been progressively deploying more and more on AWS. And we've noticed a very large difference in the behavior of the schedulers between the two platforms. So one of the benefits I think of GCP is Google's put a lot of time and effort into the scheduler, and we were able to hit like sub-20% unallocated resources on GCP. Amazon has got a brilliant scheduler as well, and they've put a lot of time and effort into the Karpenter project... But we're just not as mature there, and our unallocated resources on EKS are worse. It's up in the 30% range at the moment. But there's a team at Grafana Labs whose literal day-to-day job is to optimize this... Because it really moves the needle. We spend millions of dollars on AWS, and 10% of millions of dollars is like more than most of our engineers' salaries. So it really does make a difference.
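+
+*For anyone wondering what "unallocated" means concretely: it is roughly the capacity the scheduler could hand out (node allocatable) minus the sum of pod requests. The throwaway client-go sketch below computes that percentage for CPU across a cluster; the kubeconfig handling is the bare minimum, it ignores init containers, and in practice you would more likely derive the same number from kube-state-metrics on a dashboard rather than from a one-off program.*
+
+```go
+package main
+
+import (
+    "context"
+    "fmt"
+    "log"
+
+    corev1 "k8s.io/api/core/v1"
+    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+    "k8s.io/client-go/kubernetes"
+    "k8s.io/client-go/tools/clientcmd"
+)
+
+func main() {
+    // Build a client from the default kubeconfig location.
+    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
+    if err != nil {
+        log.Fatal(err)
+    }
+    clientset, err := kubernetes.NewForConfig(config)
+    if err != nil {
+        log.Fatal(err)
+    }
+    ctx := context.Background()
+
+    nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
+    if err != nil {
+        log.Fatal(err)
+    }
+    pods, err := clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
+    if err != nil {
+        log.Fatal(err)
+    }
+
+    var allocatable, requested int64 // CPU in millicores
+    for _, n := range nodes.Items {
+        allocatable += n.Status.Allocatable.Cpu().MilliValue()
+    }
+    for _, p := range pods.Items {
+        // Only count pods that currently hold (or are about to hold) their requests.
+        if p.Status.Phase != corev1.PodRunning && p.Status.Phase != corev1.PodPending {
+            continue
+        }
+        for _, c := range p.Spec.Containers {
+            requested += c.Resources.Requests.Cpu().MilliValue()
+        }
+    }
+    if allocatable == 0 {
+        log.Fatal("no allocatable CPU found")
+    }
+
+    unallocated := float64(allocatable-requested) / float64(allocatable) * 100
+    fmt.Printf("cluster CPU: %dm allocatable, %dm requested, %.1f%% unallocated\n",
+        allocatable, requested, unallocated)
+}
+```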
+
+**Vasil Kaftandzhiev:** This is a really good touch on salaries, and things... I really see monitoring Kubernetes costs currently as the ROI of an engineering team towards their CTO. So effectively, teams can now just say "Hey, we've cut 10% or 15% off our Kubernetes costs, and now we're super-performers and stars."
+
+\[57:59\] One question again to you, Deo. We have talked a lot about namespaces... But can you tell me your stance on resource limits, and automated recommendations at the container level, for example? I know that everyone is talking about namespaces, but these little \[unintelligible 00:58:13.26\] of ours, the containers, don't get so much love. What's your stance on that? How do you do container observability?
+
+**Deo:** Alright, so Containerd? Like, compare it to both? Or what do you mean?
+
+**Vasil Kaftandzhiev:** Yeah, exactly.
+
+**Deo:** So we're in a state where, other than the service mesh, everything else is like one container equals one pod... Which means -- well, it's difficult to get it right. So what we advise people is to just set some average CPU and memory, \[unintelligible 00:58:44.09\] and then keep it there for a week. And then by the end of the first week, change the requests and limits based on usage. We just need to be a bit pushy, try to ping them again and again, because people tend to forget... And they always tend to over-provision; they're always afraid that something will break, or something will happen... And as I've said before, most of the time people just abuse infrastructure - they just add more memory, add more CPU to fix memory leaks, and stuff... So you need to be a bit strict, and educate people again and again about what it means in terms of cost, in terms of billing, and stuff like that. But yeah, what we say most of the time is: set some average values, whatever you think makes sense, and then adjust at the end of the first week.
+
+Now, we don't have a lot of containers in our pods, so this makes our life a bit easier. If that wasn't the case, I'm not sure. I think though that this is something that maybe will change in the future, but you will -- I don't remember where I was reading about this, or if it's just like from \[unintelligible 00:59:54.02\] mind, but I think in the newest version of Kubernetes, requests and limits will be able to support containers as well. But again, I'm not sure if I just read about it or just \[unintelligible 01:00:05.26\] I'm not sure.
+
+**Vasil Kaftandzhiev:** I think that's already available.
+
+**Tom Wilkie:** Yeah, I'd add a couple of things there, sorry. Firstly, it's worth getting technical for a minute. The piece of software you need to get that data, that telemetry, out of Kubernetes is called cAdvisor. Most systems just have this baked in. But it's worth -- especially if you want to look up the references on what all of these different metrics mean, go and look at cAdvisor. That's going to tell you per-pod CPU usage, memory usage, these kinds of things. It's actually got nothing to do with pods or containers; it's based on cgroups. But effectively, cAdvisor's the thing you need.
+
+At Grafana Labs, we're moving towards a world where actually we mandate that limits equal requests, and everything's basically thickly provisioned. And part of this is to avoid a lot of the problems that Deo talked about at the beginning of the podcast, where if people -- there's no incentive in a traditional Kubernetes cluster to actually set your requests reasonably.
Because if you set your requests low and just get billed for a tiny little amount, and then actually set your limit really high and optimistically use a load of resources, you've really misaligned incentives. So we're moving to a world where they have to be the same, and we enforce that with a pre-submission hook.
+
+And then the final thing I'll say here is, I'm actually not sure how much this matters. Again, controversial topic, but we measure how much you use versus how much you ask for. So we measure utilization, and not just allocation. And we bill you for the higher of the two. We either bill you for how much you ask for, or how much you use. When I say bill, obviously, I mean internally, like back to the teams.
+
+\[01:01:49.13\] And because of that approach, the teams inside Grafana Labs, they all have KPIs around unit costs for their service. So they're not penalized, I guess, if their service costs more, as long as there's also more traffic and therefore more revenue to the service as well. But we measure -- like a hawk, we monitor these unit costs. And if a team wants to optimize their unit costs by tweaking and tuning their requests and limits on their Kubernetes jobs, and bringing unit costs down like that, or if they want to optimize a unit cost by using Pyroscope, to do CPU profiling, and bringing down the usage, or rearchitecting, or any number of ways -- I actually don't mind how they do it. All I mind is that they keep an eye on this unit cost, like a hawk, and make sure it doesn't change significantly over time. So I'm not sure -- I think this is like down in the details of "How do I actually maintain a good unit cost and maintain that kind of cost economics over the long term?" And I think these are all just various techniques.
+
+**Deo:** So Tom, is this always the case? Are teams always supposed to have the same requests and limits? Is this always a best practice internally, at Grafana?
+
+**Tom Wilkie:** It's not at the moment. It's something I think we're moving towards, at least as a default. And again, there's a big difference between -- again, our Mimir team literally spends millions of dollars a year on cloud resources. And they do have different limits and requests on their pods, and their team is very sophisticated and knows how to do this, and has been doing this for a while. But that new service that we just spun up with a new team, that hasn't spent the time and effort to learn about this - those kinds of teams are \[unintelligible 01:03:30.16\] to have limits and requests the same, and therefore it's a massive simplification for the entire kind of reasoning about this. And again, these new teams barely use any resources, so therefore we're not really losing or gaining anything.
+
+And I will say, there's that long-standing issue in Kubernetes, in the Linux kernel, in the scheduler, where if you don't set them the same, you can actually find your job is frequently paused, and your tail latencies go up... And that's just an artifact of how the Linux scheduler works.
+
+**Deo:** This is very interesting. It actually solves many of the things we have talked about earlier. So you don't have to worry about cost allocation, because most -- like, GCP at least, they will tell you how much it costs based on requests. But if your requests and limits are the same, you have an actual number.
+
+**Tom Wilkie:** Exactly. Yeah.
+
+**Deo:** I think Grafana is a bit of a different company, because everyone \[unintelligible 01:04:21.24\] so they know their stuff.
I think for most of the engineering teams, at least in our case, having requests and limits the same - even though it would be amazing - would escalate cost... Because people, they always --
+
+**Tom Wilkie:** \[01:04:40.13\] Yeah, so the downside. The downside of this approach 100% is you lose the ability to kind of burst, and you're basically setting your costs to be your peak cost. Right? But I'd also argue -- it wouldn't be a podcast with me if I didn't slip in the term statistical multiplexing. A lot of random signals, when they're multiplexed together, become actually very, very predictable. And that's a philosophy we take to heart in how we architect all of our systems. And at small scale, this stuff really matters. At very large scale, what's really interesting is it matters less, because statistical multiplexing makes things like resource usage, unit costs, scaling - it makes all of these things much more predictable. And it's kind of interesting, some things actually get easier at scale.
+
+**Deo:** Yeah, it's very interesting. So are teams internally responsible? Do they own their cost as well? Or no?
+
+**Tom Wilkie:** Yeah, 100%. And you mentioned earlier you have like Slack bots, and alerts, and various things... We've moved away from doing this kind of \[unintelligible 01:05:45.16\] We don't like to wake someone up in the middle of the night because their costs went up by a cent. We think that's generally a bad pattern. So we've moved to... We use -- there's a feature in Grafana Enterprise where you can schedule a PDF report to be sent to whomever you want, and it renders a dashboard. And so we send a PDF report of our cost dashboard, which has it broken down by team, with unit costs, and growth rates, and everything... That gets sent to everyone in the company, every Monday morning. And that really promotes transparency. Everyone can see how everyone else is doing. It's a very simple dashboard, it's very easy to understand. We regularly review it at the kind of senior leadership level. Once a month we will pull up that dashboard and we'll talk about trends, and what different projects we've done to improve, and what's gone wrong, and what's caused increases.
+
+And this is, again, the benefit of Grafana and observability in our Big Tent strategy - everyone's using the same data. No one's going "Well, my dashboard says the service costs this", while someone else's dashboard says it costs that. Everyone is singing off that same hymn sheet. It gets rid of a lot of arguments, it drives accountability... Yeah, having that kind of one place to look, and proactively rendering that dashboard and emailing it to people... Like, it literally -- I get that dashboard every morning at 9am, and it is almost without fail the first thing I look at every day.
+
+**Mat Ryer:** There's also that thing where we have a bot that replies to your PR, and says like \[unintelligible 01:07:15.06\]
+
+**Tom Wilkie:** Oh yeah. Cost. Yeah.
+
+**Mat Ryer:** Yeah. So that's amazing. It's like, "Yeah, this is a great feature, but it's gonna cost you. You're gonna add this much to your bill." And yeah, you really then -- it is that transparency, so you can make those decisions. It's good, because when you're abstracted from it, you're kind of blind to it. You're just off in your own world, doing it, and you build up a problem for later... But it's nice to have it as you go, for sure.
+
+Thank you so much. I think that's all the time we have today.
I liked it, by the way, Tom, earlier when you said "I think it's worth getting technical for a minute", like the rest of the episode was about Marvel movies, or something... \[laughter\] But you are my Marvel heroes, because I learned a lot. I hope you, our listener, did as well. Thank you so much. Tom, Vasil, Deo, this was great. I learned loads. If you did, please tell your friends. I'll be telling all of mine. Gap for banter from Tom...
+
+**Tom Wilkie:** Oh, you know, insert banter here... Like, no, thank you, Vasil, thank you, Deo. Great to chat with you. I really enjoyed that.
+
+**Mat Ryer:** Yup. Thank you so much. Join us next time, won't you, on Grafana's Big Tent. Thank you!