Telemetry Now  |  Season 1 - Episode 30  |  January 9, 2024

SNMP is Dead. Long Live SNMP!


The networking industry has declared that SNMP is dead, but is it? Is streaming telemetry all we should focus on today, or is SNMP still useful even in modern networking? In this episode, Chris O'Brien, Product Manager at Kentik, joins us to talk about the reality of SNMP and streaming telemetry and his thoughts on where the industry is headed.

Transcript

SNMP is dead, long live SNMP.

Now, I remember a blog post with that title or something like that, maybe seven or eight years ago. And, you know, that makes sense to me because that was right around the time, if you remember, that there were some engineers from Google at NANOG 73, back in 2018, who had a presentation called "SNMP is Dead." And, of course, there was a lot of buzz in the network industry around that time that we were all gonna replace SNMP, like, the whole industry altogether.

And the solution was streaming telemetry. And for that specific presentation, it was a gRPC framework and then the associated gNMI management interface.

But I have to say, it's been a few years now, and I have not seen everyone in the entire industry ditch SNMP in favor of streaming telemetry. Now, it is true, I have seen streaming more and more. And I understand that we're always gonna have those use cases and corner cases out there where there's a network operations team that chooses to stick with SNMP or maybe stick with streaming for whatever reason.

But what really happened over the last few years? Why hasn't SNMP really been totally abandoned and then replaced with streaming telemetry by the entire industry, like we all predicted? So maybe it's that it's too heavy of a lift to make the change, or maybe it's that the benefits of streaming telemetry over SNMP were a little overhyped. And then when engineers got into the weeds, they decided, you know what?

This isn't worth the effort. Now we all know the reality. We all work in IT. Right?

We all know the reality is that the answer is, it depends. It's more complicated than that. This is networking, and it's probably one of the reasons I really love this field. So in today's Telemetry Now, Chris O'Brien, product manager at Kentik and a subject matter expert on network monitoring and telemetry, is with us to explore the answer to this question of whether SNMP is really dead or not.

My name is Philip Gervasi, and this is Telemetry Now.

So, Chris, it's good to have you on. You and I have been chatting with regard to SNMP and streaming telemetry and networking in general for a little while. So it's really cool to have you on the show. But before we get into it, I would like to get a little bit more of your background. I know you are a subject matter expert in monitoring and telemetry. I mean, I know what you do for a living, but give me a little bit more about your technical background in engineering and in networking.

Yeah. So I was a network engineer for about a decade, at maybe six different places of varying sizes and complexity. I spent most of my career on the enterprise side, but I did start in service provider. So about a decade as a network engineer, and after that, about a decade working as a product manager, which basically means building software tools for network engineers.

As my career suggests, even when I was a network engineer, I was pretty obsessed with monitoring. So I've been spending, I suppose, twenty years, which is the majority of my life, thinking about monitoring at this point, thinking about observability.

Okay. So when you say network engineer, are you talking about, like, what I would call a traditional field engineer, turning a physical and a virtual wrench, like, at the command line, racking and stacking, designing networks, troubleshooting why this thing isn't working? Like, that kind of an engineer, right?

Yeah. Maybe to add a little bit of precision. So all of that, yes. And I started at a wireless ISP.

Oh, nice.

The largest in the country. At the time, CIDR got acquired by AT&T. I worked in their call center and then in their network operations center on the phone, and then I became a network engineer and then a senior network engineer and a lead network engineer. And so, definitely, leaving service provider, it was more running the gamut of running enterprise data centers, back when that was much more popular and you would have one at your office or at a local colo. So both the physical side and configuration, and then obviously, later in my career, more of the architecture.

Right. Okay. And, you know, I asked you that question on purpose. And it's because I wanted to know if you have that in-the-weeds, field network engineer, network operations, day-to-day, keeping-the-lights-on background, as opposed to, like, a purely academic background or, like, theoretical or, quote unquote, thought leader.

Oh my goodness. I just said those words. But you know what I mean. Right? The difference between somebody who has designed, fixed, and operated networks and knows the reality, as opposed to just a purely academic perspective.

And that's what I wanted to know. And so I'm glad to hear your answer. So is SNMP dead? That's kind of the topic here. And is it being replaced with streaming telemetry?

What's happened over the last few years? But we've got to level set, really super short, because our audience is mostly networking professionals.

They're gonna be technically minded, so they're probably gonna know what SNMP is. From your perspective, in your definition, in your words, what is SNMP? And why has it been dumped on so much over the last five, seven, eight, nine years by thought leaders and maybe even vendors in the networking community?

Yeah. So, Simple Network Management Protocol. One of the biggest books about SNMP says it's amazing: with SNMP, every single word in the acronym is wrong. It's not simple. It's not for the network only.

Management isn't how it turned out to be. And so SNMP was invented a long time ago, and we've all been working with SNMP for a long time, and that means two things. Number one, it's aged a lot as networks around it have changed a lot. And so it's becoming a less and less good fit for our current networks.

And then it's also true that all of the problems that SNMP has as network engineers are working at scale, like SNMP is ubiquitous. Right? So we have all experienced a ton of challenges and limitations and problems from SNMP as we apply it to our network and we're all too familiar with those. So I think that's a big part of it too.

I'm glad you put it that way, where you said that it has aged and therefore maybe isn't appropriate for where networks are today or how we do networking today, because there's a subtle point that you made that I kinda disagree with. I don't believe that, from the age of a technology alone, you can conclude logically that it's therefore bad. Right? Like, BGP is therefore bad because it was invented in the late sixties? No. Or TCP/IP.

Sixties. I didn't know that about BGP.

You may give us opportunities.

I don't know. Whenever that napkin was I don't know. Right. Either way, it's old. It's decades old.

TCP/IP is decades old. You know, spanning tree is decades old. Maybe that is a bad technology. But my point is that the age of a technology doesn't by itself imply that it's ineffective or not good anymore in our year, 2023, 2024, and beyond.

However, the way we approach networking today can differ, and therefore its usefulness can change as a result. So I don't think it's its age because, let's say, for example, that we still did networking the same way, then it would be fine. You see, I think that it's the changes in the networking industry over the past forty years, especially in the last ten years, that have resulted in us looking at SNMP differently.

And then looking at possibly, are there other monitoring solutions, data types, formats, protocols that we should be considering? Is it that ineffective? I mean, why has the community been dumping on it in your opinion?

I mean, age is a proxy for understanding the degree of change that the network has gone through around it. Right?

Alright. That's fair.

Yeah. SNMP was originally invented, like, in the eighties, and we've had a lot of changes to our networks since that time. And, I mean, I would almost flip this on its head and say SNMP was a wild success, with nearly every network in the world depending on it for visibility, and really no serious competitors in terms of percentage of use. Right? SNMP has been ninety percent of it since the eighties. That's incredible. If you were designing a protocol, you could hardly hope for a better result.

As far as why people hate it, you know, vendors do implement SNMP in different ways. And so there's sort of a constant hassle of figuring out where the data is and making sure the data is collected. And this isn't always the case, but theoretically, the data that's available from SNMP could change, you know, with every vendor, with every make, with every model, with features that you deploy on that system, with every software change, and with every software version. So because we're always building more equipment and more software, you know, we're kind of forever trying to catch up with SNMP. And it's not one tool trying to do this. It's a bunch of different management tools. And so dealing with missing data is a common frustration.

So is there anything about SNMP, as far as what it gives us in terms of information and metrics and the data that we wanna collect from our network, is there anything that's missing that we could say SNMP is lacking? Or is it that we're not really implementing it to its fullest potential? And I ask that mostly based on the conversation I had with a friend and colleague a few months back on flow data and how, you know, a lot of people look down on flow as well, when the reality is they're just not using it to its fullest potential. And there's just so much more we can do with it. Is that the same case with SNMP?

I think it is a little bit different. Right? And I think some of the challenges or limitations people run into come from the architecture of SNMP. The nature of the protocol is to request a data point. So the manager requests a piece of data from the network infrastructure, like, let's call it a router for simplicity's sake.

And then the router returns that data point. Well, we're all using this data to build graphs, which means we're all saying our management station should ask that question of the router once every x minutes, x seconds. And so you get this kind of ridiculous constant asking the same question and then answering with a new value that's happening forever for all of our devices, which is not a super intelligent way to get a stream of data, the same data. Right?

Okay. And that would lend itself into scalability problems, constantly talking to the device, asking it for information that you probably don't even need. Oh, your interface is still up. Great.

Oh, your interface is still up. Great. So there's redundant data. Therefore, there's a performance penalty, possibly, both on the network side and then on the device side itself.

And then you're telling me that there are no, like, industry standards device by device, where individual vendors are gonna implement SNMP differently on their own platforms. So that's a limitation.

Then what is streaming telemetry in comparison? If I'm trying to get that same kind of information, is it really providing me any benefit, or is it just solving those few problems I listed?

Yeah. Well, maybe just to close the loop on what you just said. The other limitation with SNMP is, again, it wasn't made to build a graph of data, but that's kind of what we're all doing with those data points. You end up with challenges around your intervals, essentially.

You might have some delay. You know, so say you're trying to draw a graph with your rate of something for every five minutes over the last year. Right? Then you need one data point for every five minutes, so you often will poll every five minutes.

But if the return of that value gets near a boundary between those intervals, you may find that, you know, your first interval has no readout and your second interval has two. So then in your first interval, you're saying, hey, there's no data being sent across this interface, as an example. And then in your second interval, you're saying there's twice as much data as usual. So in this way, because of this interval problem, SNMP can create the sense that there are spikes and lulls that aren't actually true spikes and lulls.

It's also true that because you poll so infrequently, whatever time there is between your polling intervals, you're averaging. In reality, what's happening is you're saying how much traffic was sent outbound on this interface in total. Let me get that.

That's a counter. You do that again in five minutes, and then you calculate the difference, and you say that over that five-minute period was the rate. Well, that's the average rate over that five-minute period. And more and more frequently, networks are dealing with microbursts and other spikes that are a whole lot less than five minutes.
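As a rough illustration of the counter arithmetic described here, the following Python sketch derives an average rate from two octet-counter readings and shows how a one-second burst disappears into a five-minute average. The function name, the 64-bit counter width, and the traffic numbers are assumptions made up for the example, not values from the episode.

```python
def average_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Average bits per second between two readings of an octet counter."""
    # Counters wrap at 2**counter_bits, so take the difference modulo that.
    delta = (curr_octets - prev_octets) % (2 ** counter_bits)
    return delta * 8 / interval_s


# Hypothetical link: 299 s at 10 Mbit/s, then a 1 s burst at 10 Gbit/s.
idle_octets = 10_000_000 // 8 * 299
burst_octets = 10_000_000_000 // 8 * 1

first_reading = 1_000_000
second_reading = first_reading + idle_octets + burst_octets

five_minute_avg = average_bps(first_reading, second_reading, 300)
print(f"5-minute average: {five_minute_avg / 1e6:.1f} Mbit/s")  # 43.3 Mbit/s
```

Under these assumed numbers the link peaked at 10,000 Mbit/s, yet the only value the graph receives for the whole interval is 43.3 Mbit/s.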

So that five minutes tends to really smooth those. One-minute polling, which I think is the right sort of default for today's SNMP implementations, still does a lot of smoothing or averaging out of those spikes. And so this is a common problem with SNMP, more indirectly because of its architecture, because it doesn't scale.

Right? You can't poll your devices once a second or once every hundred milliseconds because, you know, you're asking that same question over and over, and the nature of that is requesting that router, whose primary job is to move traffic, to sort of interrupt its CPU and ever so briefly spend its CPU time instead on preparing an answer to this question. So it's very inefficient, and the faster you go, the worse it gets in terms of efficiency. And the net result of that is, you know, no one's doing SNMP at one-second intervals.

So you get this thing where you're averaging out the spikes and the lulls, and you get this thing where you're actually introducing spikes and lulls that don't exist. So that's, like, a lot of loss of accuracy in your monitoring, where the core purpose of what a lot of us are doing with SNMP is drawing these darn graphs. Right? These graphs need to be accurate.
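The interval-boundary effect described above can be sketched in a few lines of Python. This is a minimal simulation under assumed values: the receive times, the per-poll byte deltas, and the helper name `bucket_totals` are all hypothetical. Traffic is steady, but one poll response lands just past a five-minute boundary.

```python
from collections import defaultdict

BUCKET_S = 300  # graph resolution: one point per five minutes


def bucket_totals(samples, bucket_s=BUCKET_S):
    """Sum per-poll byte deltas into fixed time buckets keyed by bucket index."""
    totals = defaultdict(int)
    for received_at, delta_bytes in samples:
        totals[int(received_at // bucket_s)] += delta_bytes
    return totals


# Hypothetical steady traffic: each poll reports the same delta, but the
# second response arrives at 601 s, one second past the bucket boundary.
samples = [(295, 1000), (601, 1000), (899, 1000), (1195, 1000)]

totals = bucket_totals(samples)
for bucket in range(4):
    print(bucket, totals.get(bucket, 0))
```

With these inputs, bucket 1 receives nothing and bucket 2 receives two readings, so the graph would show a lull followed by a spike even though the simulated traffic never changed.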

Yeah. That's the whole point, obviously, especially in a very high transaction environment, in an environment where bandwidth is scarce.

And when you can't drop packets, and therefore no congestion is tolerated, or very little congestion, which I think is more and more common today, especially in networks that operate AI workloads and things like that. So does streaming telemetry, and we can get into the different types, where it came from, how it developed, does it solve those problems?

The biggest difference with streaming telemetry is you are subscribing to a data source. Right? And that's a huge difference. That means you're saying, hey, I have this question. Just send me updates to that question at this interval.

Right? So that is a much more efficient way to do it. The router can schedule in the processing. It doesn't always have to be so interrupt driven. It can batch that.

And it can also timestamp these things at the router. And so it can batch a set of these things and send them over. And so this is a much better fit if your goal is to draw these graphs and send the same data point over and over for years.

Much, much better way to go about it.
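To make the subscription idea concrete, here is a toy Python sketch. It is only a simulation under stated assumptions: `FakeRouter`, its `subscribe` method, the path string, and the traffic pattern are all invented for illustration and are not the API of gNMI or of any real device. It shows the three properties mentioned above: one request up front, timestamps applied on the device, and samples delivered in batches.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    device_time_s: int  # timestamp applied on the device, not by the collector
    path: str
    value: int


class FakeRouter:
    """Stand-in for a device; everything here is hypothetical."""

    def __init__(self):
        self.out_octets = 0

    def tick(self, now_s):
        # Pretend traffic: the counter grows by 500 octets every second.
        self.out_octets += 500
        return Sample(now_s, "interface/eth0/out-octets", self.out_octets)

    def subscribe(self, sample_interval_s, batch_size, duration_s):
        """Yield batches of device-timestamped samples at a fixed interval."""
        batch = []
        for now_s in range(1, duration_s + 1):
            sample = self.tick(now_s)
            if now_s % sample_interval_s == 0:
                batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch  # flush whatever is left at the end


router = FakeRouter()
# One subscription request up front, then the collector only receives.
for batch in router.subscribe(sample_interval_s=10, batch_size=3, duration_s=60):
    print([(s.device_time_s, s.value) for s in batch])
```

The collector never asks a second question; the device decides when to do the work and ships several timestamped samples at once.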

And so that would speak to this concept of a push versus pull driven methodology.

And then you mentioned a scheduled versus a purely event-driven delivery methodology. Right?

Yeah. Now, it's scheduled in the time span of seconds here. Right? It's not necessarily that you're gonna schedule it once every thirty minutes, but imagine you're the CPU.

There's a big difference between a question came in, and I need to respond before that SNMP timer is up, and ideally the request is to do it immediately. So I'm looking at a processor interrupt versus, you know, a scheduler like cron, or even something where you're trying to put that task a few seconds out. That's a lot of new flexibility for the CPU of that router.

And then it knows what it needs to query, like, knowing ahead of time what the query is and that you're gonna have to repeat it a thousand times. You can imagine in computer science that's way easier than I have to facilitate any question that comes in in all of SNMP at any moment within a second.

Alright. So you're gonna get the benefit of a higher degree of granularity or higher degree of definition, in the data without necessarily the hit on CPU utilization and then ultimately performance of that device. Right?

That's right. Like, people are running large chassis with hundreds or thousands of interfaces, and in 2023, when we're recording this, you know, I think polling the state every five or ten minutes is crazy, crazy slow. And so you wanna move to a minute. I would suggest, as a minimum, you wanna be at a minute, but you can't poll that fast when one of these chassis maybe has a thousand interfaces.

A lot of times a single interface will have fifteen, twenty, twenty-five metric series. So think the state of the interface, whether it's up or down. Well, you need that for administrative state as well as operational state. Then think about traffic alone. You've got in bits per second, and a lot of people will do multicast packet count, regular packet count or all packet count, and broadcast packet counts. And then both of those things you do in both directions.

Right? So you multiply all of that out, and typically people are monitoring fifteen to twenty five series of metrics per interface.

So you've got a thousand interfaces. You're doing twenty thousand data points per interval. So if you do that at a one-minute interval, that's not an insignificant load on that router, right, or that chassis. That's a significant load.

And so as soon as people get into faster polling and larger devices in terms of interface count, or any other sort of trigger causing them to monitor a whole bunch of metric series, they start running into scale problems. They start running into challenges. And typically, really, the only solution is to slow down.

And, you know, I've talked to a lot of folks who are running at five minutes and they're having to slow down to ten minutes. This is not the right direction. Yeah. This is not gonna work.

So then clearly, streaming is much more scalable than SNMP. I think you've mentioned that just now. That's a major concern, especially as we get into larger networks like hyperscalers, web-scale companies, service providers, things like that. And then what about the reliability of the data? Because I know that SNMP is gonna send this information over UDP. And so there is a potential to drop some data when it's transiting the network itself. Is that how SNMP, or rather streaming, works as well, or is it a different method?

Yeah. Streaming uses TCP, amongst other things, to ensure reliability, so it's definitely more reliable. You know, SNMP uses UDP. So theoretically, you could drop packets, and that does happen. I mean, if you imagine, like, I just talked about twenty-five thousand requests per minute or per five minutes, whatever your polling interval is. Given that, it's really amazing how reliable SNMP is.

It's amazing that you're not constantly having gaps in your data from SNMP. You know, it took the industry some time to figure that out. But I would say, you know, little gaps in your charts have been a problem in SNMP for decades, and the problem is not gone.

And one of the contributors is certainly the fact that the data is delivered unreliably. So streaming telemetry takes a big step forward by simply moving to TCP. This is interesting when it comes to metrics. But it's way more interesting when it comes to events.

Right? Because events are sent once. They're typically urgent. They're really important information.

And if you miss it, you miss it. That's a real problem. Right? So stuff like SNMP traps. We haven't even talked yet about the different use cases with SNMP. I would say the most common one is that, you know, asking the same question over and over and drawing a graph with it. SNMP traps were designed to facilitate that sort of event-based stream of information, which is super valuable and sort of a different tool in your toolbox to monitor and troubleshoot a network. And then, you know, referencing back to that Simple Network Management Protocol where none of these words are true: the management section, all of the put stuff in SNMP. A lot of the protocol is designed around that, quote, management, and what I mean really is writing changes to our network devices.

But practically speaking, we don't use that. That's not how we use SNMP. I mean, because you don't have change control versioning. You don't even have a decent facility to manage when you write the change versus when the change is just in memory.

So you have to be extremely careful, and you get a mediocre result when you try and use SNMP to push changes, and there are just better tools to do that. So it's almost like half of the protocol, the management half, didn't really function how we wanted it to. It's just vestigial. Right? It's there, but we don't use it.

And then the specific framework that I remember hearing about in the NANOG presentation a few years ago was gRPC. What is it, the remote procedure call framework? Specifically, gNMI, which is, you know, the actual network management interface. And they're gonna communicate over things like NetCOM Fresh, which a lot of folks are familiar with today.

I mean, you're seeing vendors support this. And so does that mean that with streaming telemetry, therefore, there is less deviation vendor to vendor in how streaming is supported? Because you did mention that one of the weaknesses of SNMP is that it's gonna be implemented differently.

MIBs are gonna be different by device. Right? See, the audience can't see that you're smiling. This is not a video podcast, but you haven't even been worrying.

Right now, Chris is smirking in the camera.

Yeah. I think, unfortunately, the opposite is true. So despite the fact that SNMP still varies so much, in SNMP you have this facility of the MIB that defines a lot of the data that's available. That doesn't exist to the same degree and with the same availability for end users with streaming telemetry.

Further, maybe even the more impactful point is that streaming telemetry is much newer. And where it came from was the Googles and the others of the world that can white box and apply overwhelming pressure to their vendors for the changes they would like to see done.

But we do have things like the OpenConfig initiative. I'm sorry if it's sacrilegious to say that word. But we do have those initiatives to make that implementation more ubiquitous and industry standard. Right? I mean, isn't that where we're going with this?

Yeah. That's true. So I think that's where we're going, but I think the fact of the matter is, sadly, despite being years into it, we're still very early on in that journey. And because the initial drive for streaming telemetry did not come from your average enterprise, but your average enterprise is looking to now use it, that has caused additional fragmentation.

So, you know, we've got some teams working on streaming telemetry, and we see a lot more variation in what's available, how it's available, you know, and how we interact with the box, a lot more variation on the streaming telemetry side than we see on the SNMP side.

Yeah. Where I was going with my question earlier is that maybe, you know, that's a problem that streaming is solving, that it is, again, more consistent platform to platform, and you completely thwarted my argument. But that's why we're talking. I wanted to understand this. And I do know that there are some vendors out there.

They haven't made any public announcements whatsoever about support for streaming telemetry in general, let alone anything in particular. So then that's a negative. But based on our conversation so far, it's been twenty minutes or thirty minutes of discussing the weaknesses of SNMP. And now we're talking about some of these weaknesses of streaming telemetry, but the weaknesses aren't technical in nature. You're just telling me that we're not there yet.

Okay.

But should we be getting there? Should streaming replace SNMP? Is SNMP dead, or at least, you know, dying, while we're still making streaming telemetry more commonly adopted in the industry?

Yeah.

Or is SNMP still very much alive and well and should be part of our overall you know, strategy of monitoring.

Yeah. And if we step one back from that, it's like, if streaming telemetry is so great, then why are we all still using SNMP? I think there's something fundamentally wrong there with how we're approaching the situation.

And I think the first step we need to take to solve that is to recognize that SNMP isn't dead or close to dead. SNMP is running in maybe ninety-nine point nine percent of networks today and is the primary method of monitoring in maybe a full ninety-nine percent of networks. So if we're starting from the position of SNMP is dead, we don't understand even where we are. It's hard to figure out how to navigate to where you wanna go without understanding where we are. Where we are today is SNMP is absolutely ubiquitous and vital in almost every network in the world.

So starting from that position, the next step is, we've got this new tool, streaming telemetry, and we wanna use it more. You know, the first thing that has to happen is manufacturers have to build support. They are and they have. Right?

But the number of devices most people have that support streaming telemetry is relatively few. So maybe it's ten percent, maybe it's five percent, maybe it's less than five percent. So the bad news is, you know, you've got all this gear that doesn't support streaming telemetry. And in some networks, it makes perfect sense to run that gear into the ground. You're gonna have that gear seven years, ten years, longer than that.

So replacing your gear to support streaming telemetry is, like, a silly proposition for many or most companies. And so a lot of the approach so far has been, how do we replace SNMP with streaming telemetry? And I think that is what is causing a lot of the delay and basically making it impossible for folks to adopt streaming telemetry.

What the world is, is a place where maybe today it's ninety-nine percent SNMP and point one percent streaming telemetry. And maybe a year from now, if everything was fantastic and we had our druthers, and we got all the benefits of streaming telemetry as fast as we could, maybe the world would be ninety-five percent SNMP and four point nine percent streaming telemetry.

And so my view is that's the ideal world. We have to design a way to interact with our networks that has SNMP not relegated as archaic technology from the past that, you know, you can get nothing from, but that really treats SNMP like a first-class citizen. Like, this is a large part of the lifeblood of the network. And then on this, you know, five percent, ten percent, whatever portion of our gear that does support streaming telemetry.

Like, the good news here is this tends to be your more expensive gear and your more critical gear. So that five percent can be an outsized portion of the value and the import of your network. So what you wanna do is be able to collect data using streaming telemetry on that gear and really put these things side by side and make it so that, you know, if you're looking at a dashboard, if you're writing an alert or receiving alerts, if you're building some sort of query over your data, all of these things return data regardless of whether the data source is SNMP or streaming telemetry. It's just that with streaming telemetry, the data is much fresher. It's more frequent. Maybe there are other sources of data. We need to be able to use both of these things like first-class citizens.

Mhmm.

And so, ultimately, what you're describing is more of a device life cycle, IT operations problem, or at least a hindrance to mass adoption, or quicker mass adoption, of streaming telemetry. But the reality is we have devices that we're trying to depreciate over time. At least that's what the finance team wants to do. And then also just the IT operations, the operational aspect of lifting, or rather ripping and replacing, all of my gear with other gear.

I mean, the reality is, you know, you have your hardware refreshes every five, seven years, maybe ten years on core devices. And if you're in the middle of it, you're not touching it right now. And if it doesn't support it, it doesn't support it. So it doesn't sound like streaming is inherently fraught with weaknesses and inaccurate information, or that it's just not a good telemetry tool or monitoring tool. It's just that it's much more difficult to move to it, I think, at least based on what you're saying, than what the industry described back in 2018 and in the past seven years since. Seven years?

Not that many, six years, five years.

Yeah. Like, a wholesale cutover. This term, cutover, is completely inapplicable. We won't be able to get there. It's a crazy idea. Again, I think if we start from, man, is SNMP valuable, which seems like a foregone conclusion when you consider how many of us are using it and to what degree we're using it, then it's much easier to get to the point where we have a rational plan where we still use SNMP and then are starting to use streaming telemetry as a new tool to solve some of these problems we're seeing, where applicable, not as a full replacement.

That's the path forward for sure.

Yeah. And in the interim, and for probably quite a few years, I mean, you're going to have to have systems in place that are able to ingest all of these types of telemetry, including SNMP. Despite it having that moniker of legacy and archaic, like you used before, it is still very much relevant and therefore has to be taken into consideration even among the most modern and forward-thinking monitoring platforms. Right? And that's alongside what we're doing now moving forward with ingesting streaming telemetry information. Just agree with me, Chris. Just agree with that. Yeah.

I agree. Yeah. I agree with what you're saying. Yeah.

Right. Right. You know, Chris, this has been a great discussion. I would like to get into the weeds more specifically on how streaming works, on how SNMP works, how we can utilize it today better than we have in the past, possibly, maybe just flesh it out a little bit more in the weeds than I think we had time for today.

But I did want to address this idea of SNMP is dead. I remember hearing once at a conference, in a small group I was in, somebody made the comment, SNMP needs to be taken out back and shot. And I was just like, really? Because I know a lot of people that are still using it and operating their networks with it.

So it struck me as strange. So this has been a great conversation. Chris, I'm gonna assume that somebody who has heard this podcast today wants to yell at you. So how can they reach out to you online with a question or comment, angry or positive?

Yeah. You can contact me at kentik.com, cobrien@kentik.com. I love talking about this stuff, so shoot me an email.

Yeah. Great. And then, of course, to our audience, look out for future podcasts where Chris and I get into this more, because I have a lot more questions, and we have a lot more stuff in the show notes that we didn't cover. And you can find me online on Twitter, still, at network_phil.

You can search for me on LinkedIn, of course, and you can email me at pgervasi@kentik.com. Now, if you have an idea for an episode or if you'd like to be a guest on Telemetry Now, please reach out to us at telemetrynow@kentik.com. And, of course, keep an eye out for future episodes coming your way. And for now, thanks for listening.

Bye bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.