Telemetry Now  |  Season 2 - Episode 14  |  September 26, 2024

From Monitoring to Observability in the Cloud Era with Paige Cruz

Hosts Philip Gervasi and Leon Adato sit down with observability expert Paige Cruz to discuss the evolution from traditional monitoring to full-scale observability in the cloud era. Paige shares her insights from years of experience in the field, talking about critical topics like data overload, meaningful telemetry, and how to ask the right questions to get the most out of observability systems. Whether you're an SRE, developer, or just interested in cutting-edge monitoring practices, this episode is full of valuable insights.

Transcript

For folks running cloud environments, who maybe used to run networks and today throw in some container services and SaaS as well, things have gotten a little complex.

Now one interesting thing that's happened in the last few years because of all of that is this new resurgence in interest in more advanced monitoring methods including observability.

And it's not quite so easy as buying an observability platform and then leaving all the settings on default.

We need to think about what kind of data we wanna ingest, what we're gonna do with all that data, what do we really care about. We need to think about our budget, what do we do with anomalies, and a whole host of other things as well.

So with us today is Paige Cruz, an experienced engineer working in observability and trying to answer these questions for herself, and today, for us as well. My name is Philip Gervasi. This is Telemetry Now.

Paige, thanks so much for joining today.

It's a real pleasure to have met you recently through Leon, and then to talk to you about a variety of things, not just telemetry, but apparently your passionate love for Robert Frost poetry. A deep and abiding love.

Yeah. Yeah. Deep and abiding.

But in any case, it is really a pleasure to have you today to talk about telemetry, why it matters to us, and why it might not be as trustworthy as we might initially believe. But before we dive into that, and I do wanna dive eyeball deep, you know, I really wanna get into the nuts and bolts and get as technical as you'd like to get today, I'd like to hear a little bit more about you professionally, personally, whatever you'd like to share, just to get an idea of who Paige is.

Yeah. Oh, I'm delighted to be here. Who am I? What am I doing here? Well, I currently work at Chronosphere, which is, an observability company that is built on the lovely open source standards like OpenTelemetry and Prometheus.

I actually got my start in tech almost a decade ago, over at New Relic working in the people ops department.

I had a very long and winding path to get to a three month stint as a developer before I said, no. No. I care about what hardware and infrastructure my service is running on. I want to know more.

I need more details, and I followed that passion over to the infrastructure side. Some companies call it DevOps engineer, I don't agree with that, but money is money, or SRE. And so I spent my time working either for a monitoring vendor or for a customer of one of the big vendors.

And so, over time in my career, I came to see the tension between how observability is failing our customers today, and our end users, and the poor SREs who are woken up all the time for false alerts, but also what it takes to deliver a highly available and reliable observability system. And so while I personally had my journey as an engineer closed due to an unfortunate case of burnout, I now spend my time advocating for sustainable reliability practices and helping people discover more about observability and what it means for them in their context and their company. So I'm hoping to help my friends' on-call lives get better, while I have given up that pager life.

And I love alpacas. You'll see me talking about that on the web.

Yeah. That that's funny that you mentioned that because I have never heard of engineers in our business, you know, experiencing burnout. That's a really unusual thing. Pretty rare.

Right? Yeah. Right? What is this burnout that you speak of?

I have never... You can cut the verbal irony and sarcasm with a knife.

With a spoon and scoop it up. It's so sweet and delicious.

Now, Paige, you mentioned a couple of different companies, one of which I'm very familiar with. Those are more application performance monitoring companies, I think. But you also mentioned that you were very focused, or have been very focused, on the infrastructure side. So though on one hand you're very much focused on the application, I get it, because that's what we really care about. Like, nobody cares about, you know, switches and routers. We care about the data and the applications.

You are still very much focused on the telemetry that's derived and ingested from the actual infrastructure that delivers those applications. Right?

Yes. Yeah. What really was the kind of fork in the road for me is when I was an application, you know, a software developer, and I had these services that I was on call for, that I cared for. And I said, well, what happens if the container orchestrator fails, or how can I see those metrics about the node that my app is running on?

I care about that. And one of the teams was like, that's our job. You worry about the app layer up. And I thought, that is so strange.

Both my software and your hardware and the orchestration software in between all need to work together. And while you'll be on the hook for fixing a problem at your layer, I still care if anything fails in my little layer of the stack that I'm responsible for. Because, ultimately, if one thing goes wrong, how is that affecting the user experience? If my container can't start and I can't deploy, there's no availability, and users can't do what they need to do.

So I kept digging, digging, digging until I got the level of access and job responsibility where I said, okay. I'll take platform as a service. You give me a hosted Kubernetes. I'll let Google SREs handle, the on call for the Kubernetes, you know, control plane.

Mhmm.

But I really cared about getting that full picture, and it used to be a lot harder. Ten years ago when I started, it was a lot harder to get that full picture. Today, we have an embarrassment of riches when it comes to telemetry.

Right.

And I would almost say we've got a data hoarding problem.

I'm rolling my eyes in failed DevOps dreams. And just, like, the absurdity of MTTI, mean time to innocence. Like, I understand that people have tools and, you know, whatever your behaviors are, to identify whether it's your problem to fix or not. But the number of times I've seen monitoring used specifically and only for someone to basically, you know, brush their hands off and say, well, not me, have a good night, and, you know, walk away, is maddening.

And, honestly, it is the thing that DevOps, and then SRE, and then PlatformOps, then DevSec-bug-hamster-ops, like, whatever ops the thing is today, that's what it's supposed to fix. But, really, I'm unsurprised but disappointed to hear one of your coworkers express that mindset.

Like, yeah. You're right. It's my job to fix the network, let's say, for example. Like, I have to jump on the router and fix it, but that doesn't mean that other people can't be and shouldn't be interested, involved, even possibly first aware, the first person to notice and then to get me involved.

This is, again, and I recognize I might be Pollyanna with the whole DevOps thing, let's all hold hands, and we'll sing Kumbaya as we do a sev one call.

Well, Leon, you use the term monitoring, whereas Paige used the term observability several times. Oh, here we go. So let me throw this over to you, Paige. Number one, is there a difference? Yes. And talk to... yes. Okay.

Moving on. No. I'm kidding. I'd like to hear more. So if we could dig into that a little bit, number one, I'd like to hear what your definition of observability is, because you are speaking in larger terms than segments of the network, you know, the person who runs your Kubernetes environment or a cloud engineer or a network engineer; you are looking at this holistically. I can hear it in the subtext of your words already.

And how does that differ from monitoring, or traditional monitoring? You also mentioned this whole shared responsibility. Does that play into it? Maybe. Is that the key there? I don't know.

Sharing, collaboration, synergy.

I do think these are great questions, and I think these are great questions for everybody to ask themselves and their colleagues as well, because, unfortunately, both the terms monitoring and observability are very overloaded, very diluted at this point. And as I have learned more about both spaces, my own personal definition has changed over time. So is there a difference between monitoring and observability?

Yes. Monitoring is the act of, well, I found a metric query that indicates some aspect of the service that I'm delivering, whether that is availability (can someone reach this endpoint?) or whether that is latency.

Can somebody quickly get a response, if we are up? Or, you know, things like correctness: is the functionality performing as expected? So you already have monitoring set up, hopefully, in your organization over components you care about. It is the act of setting up a monitor that will continuously run that metric query over some aspect of your service level and compare it against a threshold that should, and "should" is a horrible word, but it should be meaningful enough to relay that the user experience is impacted, such that either it's not a big deal or an emergency, so let's file a daytime ticket, or, oh my god.

We need human eyes and intervention on this, stat, so we're gonna page somebody. So to me, that loop of I have a metric that measures something I care about, and there's a threshold at which the service level is unacceptable or damaging to the user experience.

Or in indirect terms, you know, like, if the queue's filling up and we don't have enough capacity and it's filling up quicker, it will be a problem soon, that sort of stuff. That whole loop to me is monitoring, and that never goes away, ever, ever, ever even if you are practicing observability.
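To make that loop a little more concrete, here is a minimal sketch in Python of the monitor-evaluate-decide cycle Paige describes. The query function, thresholds, and notification actions are all hypothetical placeholders, not any particular vendor's API.

```python
# A minimal sketch of the monitoring loop: run a metric query on a schedule,
# compare it to a threshold, and decide between a daytime ticket and a page.
# The query helper, thresholds, and notifiers are made up for illustration.
import time

WARN_MS = 300    # worth a ticket during business hours
PAGE_MS = 1000   # user experience is clearly impacted, page a human

def query_login_latency_p99_ms() -> float:
    """Placeholder for the metric query a monitor would run continuously."""
    return 250.0  # pretend result

def evaluate_once() -> None:
    latency = query_login_latency_p99_ms()
    if latency >= PAGE_MS:
        print(f"PAGE: p99 login latency {latency} ms breaches {PAGE_MS} ms")
    elif latency >= WARN_MS:
        print(f"TICKET: p99 login latency {latency} ms breaches {WARN_MS} ms")
    else:
        print(f"OK: p99 login latency {latency} ms")

if __name__ == "__main__":
    for _ in range(3):      # a real monitor would loop forever
        evaluate_once()
        time.sleep(60)      # evaluation interval
```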

Leon, how does that definition of monitoring square up with your your definition?

So my definition is a little bit broader and, honestly, a little bit less specific. It boils down to, for monitoring or observability, how it thinks about the known unknowns versus the unknown unknowns, how it thinks about cardinality, how it thinks about correlation and domain, and whether information has to be really domain specific or can be very general in terms of the quality of the telemetry. So I tend to define it in a really different kind of way, mostly because for the last ten, eleven years, I've worked for vendors that, you know, are helping people solve their problems. And I don't wanna get into a dogmatic holy war of whether or not what we're doing is the one true observability, or it's dad's old monitoring, or... like, that's just not helpful.

Yeah. I like that. So it's more about the overall approach and the practices that are involved.

It's so hard to not use the terms monitor or observe when defining these to understand system behavior. It's about more than just the act of setting up alerts.

That's the initial idea of observability in the classic sense, the idea of control theory, which is, gathering telemetry, right, through monitoring mechanisms.

Mhmm.

Ingesting that to observe, to see what the overall state or health of the system is. It could have nothing to do with applications or networking; it could be a factory. And so you're observing the outputs of the system in order to determine the system's health.

Is it doing what it's supposed to do? And so I actually look at it as, they're not divergent things, monitoring and observability. Observability is a higher level practice that takes in all of the monitoring and also what you said, context, what is meaningful to an engineer. As someone I spoke to a few years ago told me, there is a difference, and he was talking about how, you know, we throw alerts using, like, some ML model, perhaps some kind of linear regression model, to find out what's an outlier.

Right? Mhmm. Well, what if it's an outlier, but it's not bad? And I'm like, oh, what do you mean?

He goes, well, there's a difference between weird and bad. Something might be weird, but it doesn't affect user experience, which is actually something that you mentioned a couple times, Paige. Is that the point here? So I just said higher level stuff, you know, bigger context.

Is the higher level, bigger context just the user experience with an application?

Yeah. I think in the most direct terms, yes. I think where it gets tricky is maybe I am an internal platform provider for, like, data infrastructure.

My customers may be my engineering peers who are using my platform. And so there's kind of one layer between me and that end user. Then I think about infrastructure teams, who are the underlying, like, the foundation which everything rests upon; they are maybe even two hops away from what you would consider, like, monitoring the user experience. And so I do think we need to shift our focus from I only care about the services I'm on call for, that I write code for, my little corner of the overall massive microservices estate, to what is that full stack, another overloaded term, what is the full slice of everything needed to accomplish something like streaming a video, sending an email, those higher level services for users? Yeah. And the kind of academic definition of observability is always the place I tell people to start.

Mhmm.

It's important to know where this originated from, which other industry we copied it from.

And my personal, kind of editorialized definition has evolved: to me, observability is a measure of how deep or how complex a question you can ask about your system and still get a meaningful response.

And it's using all of the existing telemetry that's powering monitors and monitoring.

But what I think the intention was by introducing this new term was to say, hey, the way we've been doing it, metrics and logs, there's another way.

You also could be using your metrics and logs more efficiently. Are we linking things together? How can you go from a high level systems view, where I'm looking at all the nodes in my cluster and I care about the health at a macro level, to, oh my gosh.

Somebody had issues signing in. We pinpointed it down to this one instance of an application running in this pod, and it's this one server that's weird.

So, in a nutshell, my definition of observability is: what types of questions can you ask? What types of detailed questions can you ask of your system and actually get meaningful answers from your telemetry? How do you take that mass of data that you've collected and turn it into information?

Right. Okay. And that brings up an interesting sort of next point, which is the way you get the data, you know, which for a lot of folks was the dividing line between monitoring and observability. You know, monitoring is only when you get, you know, metrics, and then they have all sorts of dogmatic rules, again, I guess: logs

aren't monitoring.

Traces aren't monitoring. You know, your monitoring is not... you know, it's no-true-Scotsman's monitoring is monitoring, or whatever. So it's not about the method of collection, or even the repository of the data, or how it gets from point a to point b, but what you do with it. And that brings up the point of when you talked about a data hoarding problem.

I've been known to say a lot of times, like, you know, monitor, collect all the data you want. Don't alert on all of it, but monitor everything. But it sounds like you and I might disagree on that monitor, you know, collect everything kind of methodology. And more to the point, once it's collected, what you do with it and how you present it is almost more important than what you collected. Is that what I hear you getting at, or am I misunderstanding?

No. I think that is a great summation, and I would say there is, I think, a middle ground between the philosophies.

I think you should have the ability to collect everything. And my dream, my vision for observability, is that you have dynamic ways to tune: hey, I need a higher level of logging just for these ten minutes while I do a risky deploy.

Or, I have a customer who's gonna recreate an error scenario. Let's turn on debug logs for that user, for that feature, in that cluster, or whatever. What I have seen go wrong with collect everything is then you gotta pay to store everything. And then sometimes when you're collecting everything, you put a lot of extra load on your query system.

You've gotta query a lot more data. Sometimes it takes longer to return, which could be meaningful in a very critical incident, depending on your SLAs and time levels. So I like to think about not necessarily collecting everything, but understanding all the things you could collect, looking at that, and figuring out what's most important on a day to day basis and kind of adjusting for that. But, like, on Off Call, my podcast, I interviewed somebody from CockroachDB, and he said, you know, we've got, like, fifteen hundred metrics that we emit.

I use, like, ten to fifteen on an everyday basis, and I only reach for those, I can't do math, eight hundred odd other metrics when there's a really specific reason, because they're measuring stuff that's not in the critical path. So collecting everything is great if we had infinite storage and infinite money, but I've seen so many orgs run into problems where, instead of focusing on the quality of their data, and training and enabling engineers to become better at monitoring and observability, everybody's focused on cutting costs, which doesn't teach you that much about your system or how your users are using it, unfortunately.
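As a rough sketch of the "tune verbosity dynamically" idea Paige describes, here is one way to temporarily raise a logger to DEBUG for a bounded window using only Python's standard logging module. The logger name and the scoping mechanism are illustrative assumptions, not a feature of any particular product.

```python
# A minimal sketch: raise one logger to DEBUG for a bounded window (say, a
# risky deploy or one user's error repro), then drop back to the normal level.
# Pure standard library; names and scoping are made up for illustration.
import contextlib
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

@contextlib.contextmanager
def temporarily_debug(logger: logging.Logger):
    previous = logger.level          # remember the prior level
    logger.setLevel(logging.DEBUG)   # turn the detail up
    try:
        yield
    finally:
        logger.setLevel(previous)    # and back down when the window closes

def handle_request(user: str) -> None:
    log.debug("full request detail for %s", user)  # only visible at DEBUG
    log.info("handled request for %s", user)

handle_request("alice")              # DEBUG line suppressed
with temporarily_debug(log):         # e.g. during a ten-minute risky deploy
    handle_request("alice")          # DEBUG line now emitted
```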

I'd like to pivot now over to this conversation around data.

You you you made a couple comments about, dealing with data, especially very large amounts, enormous amounts of data.

But there is an argument to be made that if you aren't collecting everything and it is not all available to be analyzed, perhaps you have lower fidelity monitoring available at your fingertips and, therefore, perhaps a skewed perspective of the health of your system. Now some, and I said "some" in air quotes, might argue that.

Yeah. Sure.

So what are some of the ways data can mislead us if it's not ingested properly, if the dataset is not sufficiently large or is too small, whatever, or if it's not processed properly? What are some of the challenges that we could face?

Yeah. Off the top of my head, I think a big challenge comes from, Leon, you mentioned cardinality, one of my favorite terms. I believe it's a measure of the number of unique labels within a dataset, the metadata. The metadata is, ironically, almost as important as the actual numerical measurements you're gathering. Because what if I just gave you a number: hey, response times for GET requests to the login endpoint took five hundred milliseconds.

Okay.

What if we expect them to take one hundred? Where can I go to account for that extra latency? Where am I looking?

It might help to know what cluster this is in, what region this is in, maybe the type of user device. Maybe this is only cropping up on mobile versus web requests. And so the challenge with cardinality is that more is better; more is going to help you figure out problems, up to a point.

Where it goes wrong is folks flipping on auto instrumentation or kind of just dropping in, you know, some default Prometheus library and saying, great. I've got my metrics. I got my Grafana dashboard. I'm monitoring things.

We're good to go. Well, maybe you don't need the labels that they have put on by default. Maybe there are extra ones; maybe you don't care about pod ID for certain types of telemetry, or maybe pod ID is super important to your SREs. There's a lot of waste in metadata, specifically just for labels on metrics.

So that alone is a whole class of problems that I think folks would do well to look at, and kind of examine what their high cardinality labels are. Are those useful? If they are useful, great; high cardinality doesn't mean bad. It just means a lot of uniqueness.

So if that uniqueness is helpful for you and your investigation, you should make the case to advocate for keeping that label, and maybe cutting out some that aren't as important.
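To put a number on what label cardinality means, here is a small Python sketch: each unique combination of label values is a separate time series, so one high-cardinality label can multiply the series count dramatically. The label names and values are invented for illustration.

```python
# A rough sketch of label cardinality: unique label-value combinations become
# unique time series, so one "innocent" high-cardinality label multiplies the
# total. Label names and values are made up for illustration.
from itertools import product

labels = {
    "endpoint": ["/login", "/checkout", "/search"],
    "region":   ["us-east", "us-west", "eu-central"],
    "status":   ["2xx", "4xx", "5xx"],
    "pod_id":   [f"pod-{i}" for i in range(50)],   # high-cardinality label
}

with_pod = list(product(*labels.values()))
print(f"unique series with pod_id: {len(with_pod)}")        # 3*3*3*50 = 1350

without_pod = {k: v for k, v in labels.items() if k != "pod_id"}
print(f"unique series without pod_id: "
      f"{len(list(product(*without_pod.values())))}")        # 3*3*3 = 27
```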

Okay. That makes sense. Yeah.

So that's, yeah, I would say, like, the label cardinality, and then just kind of accepting everything that comes from default auto instrumentation; that is what gets people into a bad state almost right away. Just like, from the get go, we haven't narrowed down what we actually care about monitoring.

Right. Okay. I know in our system at Kentik, we have, one of the columns that shows up in our portal is a percentile, p ninety five, p ninety nine, things like that. And, I never understood exactly why. Can you explain, number one, why that's the right way to do it as opposed to other mechanisms like averages and things like that?

Yeah. Oh, I love a percentile. I love a I love it. So, yeah, as I was kind of coming up in my career, I, yeah, I kept hearing p ninety five, p ninety nine, p fifty.

I go, okay. You know, I go find a statistics textbook. I'm like, what is this thing? Like, better figure it out.

It seems important. Right? And, really, what I came to understand is that percentiles are a way of getting actual measurements from real user experiences, say, latency for an HTTP request, as opposed to the other method, which would be calculating averages.

Why do averages kind of come up in conversations?

They're easy to calculate. They're aggregates. So, to store a set of averages versus all the raw data points, the averaged dataset is gonna be a lot smaller, because it's combining all those raw measurements into smaller buckets of numbers.

What percentiles do instead is say, hey.

We would rather look at the actual user experience, and so we will make a bunch of different buckets that represent different ranges. So zero to point one milliseconds, point one one to two milliseconds, and so on. And for every request that comes in, we'll grab the latency, and we'll put a little tally in whatever bucket it falls in. And what you end up with is this beautiful, or sometimes horrifying, depending on what the data shows you, distribution of the actual latency measurements. So when you look at p fifty in the percentiles, you're saying fifty percent of requests. So the total amount in the dataset, like the throughput, also matters.

Fifty percent of requests were responded to in five milliseconds or less. It's fifty percent of users had this latency experience or better. Faster is typically better.

And if you took the average of the same data, you would not necessarily get a number that's even reflective of any one user's real experience, because it will come up with, like, the aggregate.
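Here is a small Python sketch of that point: a latency distribution where most requests are fast and a few are very slow, so the mean lands on a number almost no real request actually saw, while the percentiles reflect real experiences. The latency values are invented.

```python
# A small sketch of why an average can misrepresent user experience: 90 fast
# requests and 10 very slow ones produce a mean that matches nobody. The
# latencies and the simple percentile helper are for illustration only.
latencies_ms = [5] * 90 + [2000] * 10   # 90 fast requests, 10 very slow ones

average = sum(latencies_ms) / len(latencies_ms)

def percentile(values, pct):
    ordered = sorted(values)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

print(f"average: {average:.1f} ms")                     # 204.5 ms, matches nobody
print(f"p50:     {percentile(latencies_ms, 50)} ms")    # 5 ms
print(f"p95:     {percentile(latencies_ms, 95)} ms")    # 2000 ms
print(f"p99:     {percentile(latencies_ms, 99)} ms")    # 2000 ms
```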

Mhmm. Yeah.

I've got a couple examples that might put a little bit finer of a point on it. But, really, it's about we care about the actual user experience. That is the only thing that's important to alert on and monitor.

If we use measurements that are not reflective of actual experience, we risk false alarms, getting wrong information, getting the wrong perspective, missing weird things or outliers.

So doing a pure mathematical average with telemetry data can produce an incorrect conclusion.

From what you're saying.

Yeah. Yeah.

So, in effect, the telemetry can lie to us in that sense.

It can. There are so many ways for it to lie.

Yeah. Yeah.

Yeah. What really drove that point home for me, there's kind of two sides. One, on a conceptual level, there's a story about how, in the forties, the US Air Force was ramping up pilots on jet powered planes, which just handled totally differently than the other class of planes they'd been flying, to the point that it was a danger to their pilots. Pilots are, you know, you gotta train them.

You wanna keep them healthy and alive. And so their equipment was risking their valuable pilots' lives. So they said, we've gotta fix this.

We gotta give the pilots more control over the plane. Let's redesign the cockpit where they're, you know, mission control.

Unfortunately, they did not use percentiles.

Instead, the researchers chose to take averages. So they took a set of four thousand pilots. They took a lot of measurements, I'll give them that: a hundred and forty different dimensions.

Thumb length, distance from the pilot's eye to ear, they really got in there. They probably had this great spreadsheet with all this data, but they calculated the average for each of these dimensions across those four thousand pilots.

There is one man, our hero, Lieutenant Gilbert Daniels. While he was not able to stop the redesign, he was skeptical, and he just said, how many pilots are really average?

Oh, yeah.

And what he found was out of their four thousand pilots, not a single one fit within the average range of the cockpit that they had designed to the perfect average pilot. Not one actual person could functionally use the design that they built based on these averages.

Yeah. That's worth, sorry, it's just, it's worth repeating for people who are, you know, washing dishes while they listen to the podcast or whatever it is: you took four thousand people, you took all their measurements, you averaged them, and not a single person actually met the average. And for a lot of us, and I put myself in this bucket because I am not a mathy person, that is counterintuitive.

You would think that if you're doing an average, it would capture some people. And in this case, not one person out of four thousand was the average, which is why the word doesn't mean what I think our brains are conditioned to think it means. I just had to reemphasize it because it's so critical that people understand that when you're getting an average, because everyone's software does this. You know?

I want a graph. Show me the average CPU usage in fifteen minute increments or whatever. You're basically, effectively, I should say probably, not even possibly: if you get the average, you're probably not getting any actual data that matches reality.

You're getting an average, which is not an average.

Another, maybe closer-to-technical example is, back when I was paying off my student loans, I would log in, and on their dashboard the one graph that they thought was important was my average interest rate. Now, I had probably eight or nine different loans. And depending on the year that the loans were issued, the interest rate changed. And there was a variance of probably four to five percent between the loans; I took loans out my first year at five percent.

And by the time I left, those things were, like, eight, nine percent. So as somebody who has to pay this stuff back, and as somebody who's responsible for seeing how the interest rate affects the overall amount that's coming out of my pocket at the end of the day, the average was meaningless, because I had specific interest rates attached to specific loans that I was on the hook for. And the average just took, you know, something between the five percent and the nine percent. Like, oh, your average is, you know, six point five percent.

That's great.

To who?

Not for me. Not for my wallet.

So I think what you hit on, Leon, is right. In our day to day, the average person will have some conception of what average means: that it means the middle, that it means this is a representative sample of everything it's taking into account. And that's just not quite true, or it is not quite what you need when you're trying to measure things with a higher degree of accuracy or precision.

Like Yeah.

And I'm gonna jump in, and, just for those people again who aren't mathy like me, when you hear the word average, you cannot think the majority.

When people say the average user, the average consumer, the average whatever, what they mean is actually the typical, the majority of people interacting with that sort of service.

Sure. But average doesn't mean that. So I gotta ask, now that we've completely shot a hole in the concept of averages, what's another option? What's a better option, especially for those of us who make a career out of monitoring and observability?

Yeah. And that is where the percentiles come in. Now, how do you get a percentile? Because with an average, you know, we take the sum of everything, and then we divide it by how many things there were.

Percentiles are measured a different way. The most common way that I've seen is with histograms, where there is that range of buckets, and we say, okay, Leon's request took one hundred milliseconds; tally one in this bucket.

What that means for us as users of monitoring and observability tools is, one, if you've seen a heat map, heat maps are powered by histograms. It's kind of a fun way. Or there's the kind of traditional bar chart. Maybe you've seen the bell curve of, you know, grades for a college class. And, you know, most people get c's, and this is where the bell curve is.

That is actually a really great accessible explanation that that's a little bit closer to something that we've experienced.

And so what used to be really frustrating for ops folks is we knew percentiles were a better way. We knew histograms were the way to go. But with Prometheus until, like, version two point four oh, when you were setting up a histogram, you had to already know what buckets you wanted and what ranges you wanted, and you had to predetermine that. You had to already have a really good understanding of the distribution: what the longest request would be, what the shortest would be, and the best way to bucket those. That's a lot of complex math that you would have to do on a continuous basis as your system changed, as the way your users used the system changed, and so on. And so averages were an appealing shorter path that people thought could get to the same answer.

The good news is, as of version two point four oh of Prometheus, there is experimental support for what they're calling native histograms, which are just ones that dynamically calculate the buckets and ranges for you. So you don't have to do that manual labor. It is easy today to get started with histograms. If you use OTel, they thought about that from the get go. They watched and learned, and so they offer both the classic histogram and then their version of native histograms, which they call exponential histograms. I am sorry that all of these terms and naming is hard. I wish we just took our cues from, like, the security hackers, because all those exploits have cool names that are easy to remember, and we're stuck with exponential histograms and native histograms.

It is what it is though.
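For a concrete flavor of the "classic" histogram where the buckets must be declared up front, here is a minimal sketch using the prometheus_client Python library. The metric name, the bucket boundaries, and the simulated work are arbitrary choices for illustration, not a recommendation.

```python
# A minimal sketch of a classic Prometheus histogram: bucket edges are
# predeclared, which is exactly the pain point described above. Values and
# names here are illustrative only.
import random
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "login_request_latency_seconds",
    "Latency of login requests",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

def handle_login() -> None:
    with REQUEST_LATENCY.time():                 # records elapsed time into a bucket
        time.sleep(random.uniform(0.005, 0.2))   # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)                      # exposes /metrics for scraping
    while True:
        handle_login()
```

At query time, a percentile such as p99 would then typically be estimated from those bucket counts (for example, with PromQL's histogram_quantile), which is why the choice of bucket edges matters so much for classic histograms.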

So the histograms are used for, data analysis in terms of data distribution.

Yes.

Is there any problem, any challenge, with being able to identify outliers from the distribution? Or is it actually beneficial to identify the outliers and not obfuscate them, since obfuscating them can ultimately hide a performance degradation in application delivery? Because even if everything looks great and everything is really, really good, one outlier in latency can break, you know, a TCP session or something.

Absolutely. And I always think about, yeah, okay, what's worse: a one minute blip of an outage, or a total, you know, system being down for five minutes? Well, if that one interaction was a very important thing that an administrative user needed to run and it was timeboxed, that actually might be worse than a total outage for everybody. Yeah.

Like, that is why, to your point, it matters.

What are those outliers? What are those weird behaviors? What are those weird requests that came back way faster than we expected and probably didn't return results that matter to the user? Or what is that stuff that took so long, and why did it take so long?

The investigative practice, the workflow that I got used to, is working with traces. So assuming you have a system that has end to end tracing instrumentation, not necessarily over everything, but over one key workflow, like a login, like a checkout, like an add-to-cart sort of thing, end to end tracing of all the components involved in there, I would look at that distribution, and we would say, okay, we have an SLO set for the p ninety nine response time. Ninety nine percent of users had the experience that matched their expectations.

It was good, fast, fast, correct, available.

What is that one percent? And when you have that distribution, you can go, okay. Show me that one percent of requests. Now tell me what is different about these compared to the ninety nine percent? Let's look at the metadata, the tags, the labels.

Is there a weird error that only pops up here? That is where the power of your observability platform comes into play. How quickly can you run correlations across tons and tons of data to understand what makes that one percent, what makes those weird outliers weird? And is it bad weird, or is it good weird, or just weird weird?
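As a rough sketch of that "what is different about the slow one percent" step, here is a small Python comparison of tag frequencies between outlier requests and the rest. The trace records, tag names, and the threshold standing in for the p99 boundary are all invented for illustration.

```python
# A rough sketch of outlier correlation: split requests at a latency threshold
# (standing in for the p99 boundary) and compare how often each tag value
# appears on the slow side versus the fast side. All data here is made up.
from collections import Counter

traces = [
    {"latency_ms": 40,   "region": "us-east", "version": "v1.4"},
    {"latency_ms": 55,   "region": "us-west", "version": "v1.4"},
    {"latency_ms": 38,   "region": "us-east", "version": "v1.4"},
    {"latency_ms": 4200, "region": "us-east", "version": "v1.5"},  # outlier
]

threshold_ms = 1000
slow = [t for t in traces if t["latency_ms"] >= threshold_ms]
fast = [t for t in traces if t["latency_ms"] < threshold_ms]

def tag_counts(group, tag):
    return Counter(t[tag] for t in group)

for tag in ("region", "version"):
    print(tag,
          "slow:", dict(tag_counts(slow, tag)),
          "fast:", dict(tag_counts(fast, tag)))
# Every slow trace carries version=v1.5, which no fast trace has: a useful hint.
```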

No. We're just weird weird.

Or just weird. This, you know, stuff happens.

But you did mention tons and tons of data, which is probably an understatement when we're thinking about application delivery in two thousand twenty four and systems that are integrated globally: front end in AWS, back end in Azure, stuff on prem, networking, containers, like, a million different things, plus, you know, the end users' metrics, whether they be qualitative or quantitative in nature. There's a tremendous, tremendous amount of data there. So I know one of the arguments I've had from time to time is about the idea of downsampling. Right? And I see here in the notes that you do mention it as a concern, as something that we have to take into consideration, as far as trade offs. Right?

Yes. And I think that's something, if you haven't either operated a self hosted observability or monitoring stack or worked for a vendor, downsampling can sometimes just be invisible to users logging in. And so I did wanna talk about that, because it's one of those things I don't see talked about enough. So downsampling, in a nutshell: because we do not have infinite money, infinite compute, infinite storage, we have to make choices about what data to keep and when.

And monitoring data, application data, is typically very useful right when we collect it, for kind of real time system behavior analysis.

It's useful a little bit historically. So it's helpful to know how my performance is this month compared to last month, last quarter compared to this quarter. And then on occasion, it might be helpful to zoom out into a year view, or compare year over year. Here is the problem.

We collect so many metrics.

Whether or not they're useful, most systems are emitting just a crap ton of metrics, and you have to store those. And if you wanna store them durably and make sure that they're gonna be available and resilient and that the data is gonna be there when you need it, a lot of engineering effort has to go into that. And so what companies and the industry have decided is that there's a time horizon after which that metric data is not as valuable to store in super fine grained detail, like what you would need for real time analysis: five second resolution, data points collected every five seconds. When you're looking at a month, would you need that level? Are you looking at broader trends over time?

You probably need less. And so what every vendor does, what your systems do, is they make choices about when to aggregate that data from high resolution five second increments to bigger buckets of one minute, one hour, one day.

What that means for you, the user, is when you go to say, hey, time window: one year. Time window: six months. Show me my latency for this endpoint over the last six months.

Depending on how your product, vendor, company, whatever, chooses, you may not get the fidelity you're thinking of.

And why that matters is, as we talked about with averages, they can both hide outliers and skew what you see as, sort of, average peaks. You may be missing things when you're looking at a larger window of time. What might be a very subtle thing will get lost in a big aggregate curve or a best fit or whatever.

And so downsampling is a trade off that everybody has committed to. So what you need to do is understand at what point your data goes from fine grained to medium grained, and then from medium grained to coarse, so you can keep that in mind as you're looking at these charts. Okay.

Gotta be a little more skeptical. If I wanna dive into the details, I should look at this data over that.

And the other piece of it, aside from, like, just the metric storage piece, is we literally have a fixed block for charts. I mean, I'm sure some of them, there's responsive design now, but we have a set of pixels that are gonna display a given time window.

And if that x axis needs to represent five years versus one year versus ten minutes, it changes the fidelity of the data, what each individual pixel or data point represents.

Yeah.

I'm not front end fluent yet, but, like, that individual data point is representing a lot more samples the larger your time window is. And so keep that in mind as you're accessing and querying your data. What is our retention policy? What does downsampling look like? Where is that line of what I can trust for the highest screen fidelity, and is that the right choice?
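To show the downsampling trade off in miniature, here is a small Python sketch: rolling one minute of five-second samples up with an average hides a short spike that a max rollup (or the raw data) would have preserved. The sample values are invented.

```python
# A small sketch of the downsampling trade-off: averaging 5-second samples
# into a 1-minute bucket smooths away a short spike that a max rollup keeps.
# Sample values are made up for illustration.
samples = [20.0] * 12            # one minute of 5-second CPU% samples
samples[7] = 95.0                # one 5-second spike

def rollup(values, size, reducer):
    """Reduce consecutive chunks of `size` samples with the given reducer."""
    return [reducer(values[i:i + size]) for i in range(0, len(values), size)]

avg_per_minute = rollup(samples, 12, lambda chunk: sum(chunk) / len(chunk))
max_per_minute = rollup(samples, 12, max)

print(f"avg rollup: {avg_per_minute}")   # [26.25] -> spike is smoothed away
print(f"max rollup: {max_per_minute}")   # [95.0]  -> spike survives
```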

And that brings up another point that I've talked about before and will probably talk about again, which is that the flip side of wanting to keep all the data is the awareness you have to have that in your systems, especially your monitoring and observability systems, every time you have an alert, not run an alert, not trigger an alert, but have an alert, that alert is a query, and it runs against the entire database.

So the larger your dataset, the harder the system has to work to simply check the conditions of an alert, and I will include dashboards, reports, anything that interacts with the data; the harder those things are gonna have to work because there's a larger dataset. So if you have twenty queries that run every minute, because they check every minute to see if the conditions have been satisfied, then you have twenty queries banging against the data every minute, day in and day out, and there's a cost to that, and there's a drag on the system from it. So again, I'm not team throw-out-all-the-data.

I'm not team aggregate or average or summarize, but I am team be-realistic-about-it. And the question you can ask yourself is, what would I do with this specific data point x amount of time later? What would I do with this one thing, this piece of latency on this interface out of, you know, this network? What would I do with that piece of data six months from now?

What would it help me identify?

Nothing?

No. Okay. Now we're talking summarization. Right. I just wanted to throw that in.
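As a back-of-the-envelope illustration of the query load Leon describes, here is a tiny Python sketch. The rule counts, evaluation interval, and dashboard numbers are made-up examples, not measurements from any real system.

```python
# A back-of-the-envelope sketch: every alert rule is a query that runs on a
# schedule whether or not it fires, and dashboards add their own queries on
# top. All of the numbers below are invented for illustration.
alert_rules = 20
evaluation_interval_s = 60
dashboards = 10
panels_per_dashboard = 8                   # each panel is a query when viewed

evaluations_per_day = alert_rules * (24 * 3600 // evaluation_interval_s)
print(f"alert evaluations per day: {evaluations_per_day}")   # 28800

print(f"panel queries per dashboard refresh: {dashboards * panels_per_dashboard}")
```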

Alright. No. That's so, so important. The load that your monitor queries are putting on your system is not to be taken lightly at all.

And if anything, I do think most people could get rid of a good percentage of their alerts and even dashboards, as it's sort of the season of observability migrations. It feels like everyone, or most companies, did that big digital revolution: we're going to the cloud. And now that we're in the cloud, oh my god.

Things are different. Our monitoring and approach to understanding the behavior needs to change. Uh-huh. We need to move to observability, better monitoring.

So, I mean, I have to imagine, though, that the level of granularity and how much data one is going to store and then ultimately query is somewhat contingent on the role that they play in the company or the organization. So maybe you're a developer, you're a network engineer, you're a security person who wants all the packets all the time forever, obviously, which is true. I mean, that's why you can buy, you know, tap networks for a hundred billion dollars, and it's basically just a second network, and you keep all the packets forever. And that's a security person's dream. But let me go back to the data and the histograms.

Aren't histograms and averages and percentiles not necessarily groundbreaking technology? Like, those are things that, like, people learn in high school and college and stuff. So it's not really new.

Why why does this matter?

Okay. Before you even answer, Paige, I just wanna point out, I have a theater degree. This is new stuff for me. I don't want anyone listening to this to think that, like Yeah.

You know, people come to tech from a lot of different things. For some of us, it is new. However, I will emphasize, or I will reinforce, the idea that a lot of folks who've been on the ops side and have been in tech for a while probably have heard these words thrown around, even if they couldn't give a complete or dictionary definition of them. So there we go.

I've disclaimered the heck out of it. Paige, go ahead.

Oh, and I agree. I came, well, I technically got an engineering management degree, which was MechE plus business.

As a developer, I had to learn a lot of things. Hey. I had to go from being a user of technology to being an administrator and creator of at least software, and it is a whole learning journey. But it is true.

Like, we did not just instantly discover histograms or percentiles two years ago. These are statistical concepts that have been with us for a while. And I think there is a greater share of expertise when it comes to these monitoring concepts of, oh my gosh, we're collecting too many metrics.

We're gonna have to drop them or aggregate them. How do we make those decisions? Ops folks, whatever title you go by or they pay you for, have been tasked with solving those problems. And in the DevOps revolution, where we gave the pager to developers and invited them on call, I think we missed the part about explaining some of the nuances of interacting with, understanding, and analyzing all of this telemetry.

We didn't say, hey, when you're looking at a metric chart over three years, be a little skeptical of what you see. Have your antennae up, because what you see on that screen may be misleading. Or, by the way, we have to aggregate and downsample over time.

So if you have data you care about, tell me, and maybe we can preserve that. I think where tools have maybe done a disservice is treating the gathering of this data, the collection and the storage of it, as really set it and forget it. Really static. Right.

We care about this. We collect this. We store this. Well, I was talking with Stephen Townsend of the Slight Reliability podcast.

We were talking about downsampling, and he said, as a performance engineer, it really bothered me to downsample because I care about year over year performance for events like a Black Friday sale, for a Singles' Day, you know, for Boxing Day. There are some really key times where it would help. It's one day. Let's store all the raw measurements for that one day for that dataset so we can have really accurate comparisons year over year.

And I thought, oh my god. Why are we not offering that as a feature? Why are vendors treating it still as the static config? You care about the same stuff at all times.

So Right.

So, just to kinda wrap things up, what should engineers and operations folks, SREs, all these people that are dealing with telemetry as part of their job, what should they be thinking to themselves now that they have a better understanding of averages and percentiles and histograms, and how data is both critical and possibly misleading?

Yeah. I think the best place to start is to make a friend on the on the other side. So if you're on the dev side, find an ops person. Ask them to walk through the data they use to determine if the infrastructure your services are running on is healthy.

And now you'll know a little bit more about how to investigate that or gauge that on your own. And similarly, if you're on the ops side, find a developer, maybe on a team that, you know, has great operational hygiene or something, you know, make it easy. But ask them how they measure whether or not a user is getting good service. And between both perspectives, you'll arrive at a better conclusion or a more accurate measurement than staying siloed.

So I think that's my first: always go make a friend. Go make a friend that doesn't do the same job as you. And then bring a really critical eye when you are interacting with dashboards and alerts. I verify, then trust. Just my baggage from working at companies with not-great monitoring hygiene is that I don't ever trust an alert, even if it paged me. I always go look at the monitor query to ask, does the intent of this query match the intent of this monitor? And if not, is this really an issue?

That's my personal baggage.

Generally, the advice would be when you're presented with data, ask yourself what's going on in this graph. What do you notice?

If you notice something and you're gonna kinda make a claim, where is the evidence to back that up? Oh, then it's the network. Really? Show me.

Not in a challenging, prove-it way, but sort of, hey, I'd like to understand. It will help me and you to look at the same data to validate this. Think about what you wonder.

What what other questions come up for you, and how could you find those answers in your telemetry?

And then finally, of course, what could be misleading? What is my time window? What is my downsampling policy if it's for metrics? If it's for traces, it's your regular head or tail sampling. If it's for logs, it's whatever you're doing to filter and drop, you know, in your pipelines.

But really be skeptical and and don't just trust the data as it's displayed.

That is that is my best advice. Keep a curious mind, and you will not go wrong as an engineer.

That's excellent. Thank you, Paige. So I would like to wrap it up here and thank you again for joining today. A lot of stuff to chew on for sure, and a lot of stuff to think about with regard to how we handle data, and how this idea of observability really is this higher level concept that includes monitoring for sure. Monitoring is not a bad word, but perhaps it requires that we look at things in a broader, more holistic manner. And, again, like you've said many times, keeping the end user and their experience in mind.

So, Paige, how can folks reach out to you online or find you online if they have a question, have a comment, if they'd like to check out your podcast as well?

Yes. So on the web, on socials, you can find me.

I am most active on Mastodon and LinkedIn.

Mastodon is paigerduty, p a i g e r d u t y, and LinkedIn, Paige Cruz. You should be able to find me.

My podcast, Off Call, you can find at offcall.simplecast.com, and by the time this releases, we will have Leon's three-parter out for you to listen to. If you cannot get enough of Leon's wonderful wit and his great perspectives, I've got an hour of content for you.

Absolutely. And we love having Leon here on Telemetry Now and on Telemetry News Now, which is our tech news podcast coming out this fall. So thanks so much, Paige. Now, if you have an idea for an episode or if you'd like to be a guest on Telemetry Now, I'd love to hear from you. Please give us a... I almost said give us a call, but please email us at telemetrynow@kentik.com. So for now, thanks so much for listening. Bye bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.