No, You Haven’t Missed the Streaming Telemetry Bandwagon - Part 1
Summary
Streaming telemetry holds the promise of radically improving the reliability and performance of today’s complex network infrastructures, but it does come with caveats. In the first of a new series, Kentik CEO Avi Freedman covers streaming telemetry’s history and original development.
“One of the wonderful things about standards is that there are so many to choose from.”
— Inscription on the Tomb of the Unknown Network Engineer
One recurring hot topic in network observability is streaming telemetry. The application stack and many of the layers of abstraction, orchestration, and automation saw huge innovation in telemetry over the last 15 years, and networking leaders wanted to take a similar approach that would support both “known” telemetry and more modern observability approaches. To hear vendors and practitioners tell it, streaming telemetry is the promised land of network telemetry and observability. But what is it, and how does it relate to SNMP and other more traditional types of network metrics telemetry?
This is the first of a two-part blog series on streaming telemetry.
In it, I’ll give background about the history and original development of streaming telemetry, and in the second part I will cover enterprise and service provider adoption, as well as our take on likely next steps of evolution in streaming telemetry.
The basics of streaming telemetry
Let’s start with terminology to level set the conversation. Streaming telemetry is one of five methods for collecting performance data from the elements of a network (e.g., routers, switches, servers, interfaces, and links).
The other four are:
- Command-line interface (CLI) for devices like routers
- Syslog messages
- APIs to the config and control plane of routers (most prominently Juniper), used by large customers who create their own network management systems
- Simple Network Management Protocol (SNMP)
Of these methods, SNMP is by far the most widely used.
Streaming telemetry is fundamentally different from the other four in one crucial way: it was designed as a “push” method that sends all of its telemetry data unprompted, while the others are mostly “pull” for metrics. I’ll get into the significance of that difference a little later. For now, though, I need to make an important observation about taxonomy.
There are several classes of network telemetry: device, traffic, synthetic, config, and metadata. Some network professionals believe streaming telemetry covers all of these classes. But it doesn’t. In my view, the focus of streaming telemetry has been mainly on device telemetry.
Telemetry (of any kind) is critical to network health. Without understanding what’s happening on a device at a distance, infrastructure and operations (I&O) teams can’t see when something goes wrong, troubleshoot the problem, predict future failures, or implement fixes. Device telemetry has always been an essential element in networking.
Enter SNMP.
SNMP: A love/hate story
SNMP is the de facto standard for internetworking management. But it’s like the U.S. tax code: there’s something in it for everyone to hate (for some folks, a lot to hate), and everyone still uses it.
Before diving into what’s wrong with SNMP, let’s acknowledge one of its strengths: the Management Information Base (MIB). The MIB defines and organizes, as a hierarchy of managed objects, the telemetry data available from devices on the network. While the MIB has its faults, it does one crucial thing: it provides a common way to examine the activity and health of equipment from multiple vendors. In today’s complex multi-vendor, multi-layer hybrid cloud environment, that is invaluable.
The bottom line is that SNMP is functional and ubiquitous, and no alternative has achieved significant momentum.
What is it about SNMP that people hate? Simply put, it’s the wrong paradigm for the third decade of the 21st century.
To start with, it demands a lot of administrative attention, and the arcane interface is at odds with today’s expectations of easy-to-use GUIs.
But the main problem with SNMP is that it’s pull-based. The devices being monitored send data only when requested by the network management system. I&O teams have to select the devices to poll, set the timing for polling, etc. In highly complex networks, like those of service providers or global enterprises, the polling interval may be anywhere from 30 seconds to five minutes – an unacceptable length of time when you’re talking about events potentially occurring in the millions per second. This leaves the network vulnerable to catastrophic interruptions in service if it takes too long to spot an anomaly, an error, or (even worse) a DDoS attack.
(There are such things as SNMP traps, which can send a message about the device without being polled. But traps are limited in functionality and require significant operator intervention. They are intended to be emergency messages, but in reality, not all traps signal a true emergency, and even some emergencies don’t generate a trap.)
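To make the polling-interval problem concrete, here is a minimal Python sketch. It is a simulation, not an SNMP client: the function utilization_at and the 20-second burst it models are assumptions invented for illustration, and the "poller" simply reads an instantaneous gauge at fixed intervals. (Polling a cumulative counter instead would capture the burst's volume, but only as an average smeared across the whole interval.)

```python
def utilization_at(t: int) -> float:
    """Hypothetical link utilization (percent) at second t.

    Assumed for illustration: a 20-second burst (seconds 130-149)
    on an otherwise quiet link.
    """
    return 95.0 if 130 <= t < 150 else 10.0


def poll(interval_s: int, duration_s: int = 600) -> list[tuple[int, float]]:
    """Read the gauge once every interval_s seconds, as a poller would."""
    return [(t, utilization_at(t)) for t in range(0, duration_s, interval_s)]


def saw_burst(samples: list[tuple[int, float]], threshold: float = 90.0) -> bool:
    """True if any sample crossed the alert threshold."""
    return any(value >= threshold for _, value in samples)


for interval in (300, 30, 1):
    samples = poll(interval)
    print(f"{interval}s polling: {len(samples)} samples, burst seen: {saw_burst(samples)}")
```

In this toy run, both the five-minute and the 30-second pollers miss the burst entirely because it falls between two reads; only the one-second sampler sees it, at the cost of 600 reads in ten minutes.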
SNMP polling involves so much back-and-forth between management systems and devices that device CPUs can be overwhelmed, particularly if multiple network management tools are employed simultaneously (as is common) by I&O teams.
Network operators look at all this traffic and ask: why can’t each device send this information once, by itself, so everyone can absorb it when and how they want?
It’s an important question, especially when today’s highly complex networks are expected to be always-on and highly reliable. Any network performance monitoring and diagnostics (NPMD) system must constantly and thoroughly check the network’s health. The goal is a consistent flow of real-time and comprehensive data to deliver meaningful, actionable network intelligence.
Enter streaming telemetry.
Streaming telemetry: A river of useful data
Streaming telemetry is a push-based methodology in which data from network devices is streamed continually and automatically, a “set it and forget it” approach that lifts much of the burden imposed on I&O teams by SNMP.
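The difference between the two models can be sketched in a few lines of Python. This is a toy in-process model, not a real telemetry protocol: the Device class, its method names, and the counter values are all hypothetical. It only counts how often a device has to assemble a sample when three tools each poll it, versus when it publishes once per interval to three subscribers.

```python
from typing import Callable

Sample = dict[str, int]
Subscriber = Callable[[str, Sample], None]


class Device:
    """Toy device that counts how often it has to assemble a sample."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.reads = 0
        self.subscribers: list[Subscriber] = []

    def _read_counters(self) -> Sample:
        self.reads += 1
        return {"in_octets": 1000 * self.reads}  # made-up counter value

    def handle_poll(self) -> Sample:
        """Pull model: every request from every tool costs one read."""
        return self._read_counters()

    def subscribe(self, callback: Subscriber) -> None:
        self.subscribers.append(callback)

    def publish(self) -> None:
        """Push model: one read per interval, fanned out to all subscribers."""
        sample = self._read_counters()
        for callback in self.subscribers:
            callback(self.name, sample)


TOOLS = 3
INTERVALS = 10

# Pull: each of the three tools polls the device in every interval.
polled = Device("router-pull")
for _ in range(INTERVALS):
    for _ in range(TOOLS):
        polled.handle_poll()

# Push: the device publishes once per interval to three subscribers.
pushed = Device("router-push")
inboxes: list[list[Sample]] = [[] for _ in range(TOOLS)]
for inbox in inboxes:
    pushed.subscribe(lambda _name, sample, inbox=inbox: inbox.append(sample))
for _ in range(INTERVALS):
    pushed.publish()

print("reads under polling:", polled.reads)
print("reads under push:", pushed.reads)
print("samples per tool under push:", [len(inbox) for inbox in inboxes])
```

Under these assumptions the polled device does three times the work for the same ten intervals, while every subscriber in the push model still receives all ten samples. Real deployments add transport, encoding, and back-pressure concerns that this sketch leaves out.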
Streaming telemetry taps into three important trends in today’s IT environment:
- Big data – The more information you get and the more frequently you get it, the better informed you will be to take actions that optimize network performance and avoid problems before they occur.
- Automation – The fewer repetitive tasks I&O teams have to perform, the more attention they can pay to higher-value, more strategic priorities.
- Proactivity – Network management must be a constant and conscious effort to optimize network performance and reliability that applies directly to the goals of the organization and the quality of the end-user experience.
Streaming telemetry holds the promise of radically improving the reliability and performance of today’s complex network infrastructures, as well as assisting with prediction, capacity planning, cost analysis, performance, and even security. But there’s a flaw in streaming telemetry that goes back to its roots more than ten years ago. The creators of streaming telemetry abandoned the idea of multi-vendor normalization and dispensed with the MIB.
Why?
Back then, the mega-scale web companies were trying to tackle one of their biggest problems: providing reliable service to businesses and consumers amid the hyper-scale growth of web traffic. All they wanted, they told the network equipment makers, was a constant stream of data from all their devices. No need for a MIB, they said; we can do our own debugging, troubleshooting, and optimization. Those web giants got what they wanted, and it has worked for them for the most part.
When streaming telemetry technology was presented to large enterprises and service providers, the reaction was initially enthusiastic. But when they actually saw what streaming telemetry delivered to them — and required of them — they concluded: “We can’t handle all this by ourselves.” Streaming telemetry was looking like it would become another in a long line of hipster tools and technologies that generate more buzz than adoption.
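To give a feel for the work that lands on the consumer when there is no common schema, here is a small Python sketch. Everything in it is hypothetical: the two payload shapes, the field names, and the "vendor_a" and "vendor_b" labels are invented for illustration and do not describe any real vendor's output. The point is only that the same interface counters can arrive in different shapes, and someone has to write and maintain a mapping for each one.

```python
# Hypothetical payloads: the same interface counters as two imaginary
# vendors might stream them. Field names are invented for illustration.
VENDOR_A_MSG = {"ifName": "eth0", "rxBytes": 1200, "txBytes": 800}
VENDOR_B_MSG = {
    "interface": {"id": "eth0"},
    "counters": {"bytes_in": 1200, "bytes_out": 800},
}


def normalize_a(msg: dict) -> dict:
    """Map the flat vendor_a shape onto a common record."""
    return {
        "interface": msg["ifName"],
        "in_octets": msg["rxBytes"],
        "out_octets": msg["txBytes"],
    }


def normalize_b(msg: dict) -> dict:
    """Map the nested vendor_b shape onto the same common record."""
    return {
        "interface": msg["interface"]["id"],
        "in_octets": msg["counters"]["bytes_in"],
        "out_octets": msg["counters"]["bytes_out"],
    }


# One mapping per vendor (and, in practice, per model and software version).
NORMALIZERS = {"vendor_a": normalize_a, "vendor_b": normalize_b}


def normalize(vendor: str, msg: dict) -> dict:
    """Dispatch to the vendor-specific mapping; raises KeyError if unknown."""
    return NORMALIZERS[vendor](msg)


print(normalize("vendor_a", VENDOR_A_MSG))
print(normalize("vendor_b", VENDOR_B_MSG))
```

Both calls yield the same common record here, but only because someone wrote a mapping for each shape. Multiply that by every metric, vendor, and platform in a network and the "We can’t handle all this by ourselves" reaction becomes easy to understand.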
So, if you’re worried that you’re late in adopting streaming telemetry, rest assured the train has not left the station.
In the second part of this blog, I’ll explore what’s happened to streaming telemetry since those early days and where it might be headed.
I’ll also discuss a key question on the minds of many — since some devices don’t (and may never) support streaming telemetry, how do you unify SNMP and streaming telemetry to get a coherent view of device state?