20 Years of Flying Blind
It was 1994 and I was running one of the first multi-city ISPs in the country. We were one of SprintLink’s first BGP customers, and we had dozens of Cisco and Wellfleet routers. What we didn’t have was a way to answer any of the many urgent network utilization questions we faced every day:
- Where was our users’ traffic coming from? Which types of customers had traffic traversing expensive links?
- Where should we order our next T1? Should it be a DS3? Where should it plug into the network?
- Who was transferring data to or from our servers that shouldn’t have been?
- Were our servers talking to each other on weird ports?
- Were we under attack? Were we compromised?
- Should we interconnect with a local competing ISP?
I was young, I was new to the networking world, and everyone was new to the Internet. I just needed a “top -b -d 5 -i > networktraffic.txt.” As it was, to get info for a network link that was 200 miles away, I literally had to drive those 200 miles, enable a span port, risk crashing our 3Com switch, capture packets, and manually decipher IP addresses from a text dump. Universities and big enterprise companies had been running networks for years, I reasoned, so there must be something easier that would give us remote visibility. But as much as I searched in Gopher, I couldn’t find the solution.
Five years later, just before the end of the world (Y2K), I had moved up to running one of the first high-speed data cable networks in the country. We had DS3s, Gigabit Ethernet, and a company-owned, multi-state dark fiber network of thousands of miles. And we had hundreds of “big” Cisco routers that were capable of generating Cisco’s new NetFlow network data. But NetFlow implementation in routers was so unstable that I still had basically no idea where my traffic was coming from or going to. It was like piloting a ship by holding my finger up to the wind. I spent endless nights changing routing and enabling/disabling links to try to determine traffic patterns. All I needed was a little window on the side of the fiber to show me where the bits were going.
Early in the new millennium NetFlow was finally somewhat stable. If I ran a brand-new version of router software with 276 serious bugs and caveats, I’d be able to get a flow of information out of the router, the equivalent of telephone “call records.” I thought: this is exactly what I need. I tried it, and while it didn’t work well, it did get me some data. But I had nowhere to send or collect it, and no way to work with it. Damn. So close.
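To make the “call records” analogy concrete: a NetFlow v5 record is just a small, fixed set of fields describing one conversation — source and destination IPs and ports, protocol, packet and byte counts, and so on. Below is a rough sketch (in Python, and emphatically not anything we were running back then) of a minimal collector that listens on UDP port 2055 — an assumed, commonly used export port — and unpacks just the fields that answer those 1994 questions: who is talking to whom, on which ports, and how much.

```python
# Minimal sketch of a NetFlow v5 collector (illustrative only).
# Assumes the router exports v5 flows to UDP port 2055.
import socket
import struct

# 24-byte v5 header: version, count, sys_uptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval
HEADER_FMT = "!HHIIIIBBH"
# 48-byte v5 flow record: src/dst/nexthop IPs, in/out ifaces, packets, octets,
# first/last timestamps, src/dst ports, pad, tcp_flags, protocol, tos,
# src/dst AS, src/dst masks, pad
RECORD_FMT = "!4s4s4sHHIIIIHHxBBBHHBBxx"
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 24
RECORD_LEN = struct.calcsize(RECORD_FMT)  # 48

def parse_v5(datagram: bytes):
    """Yield (src, sport, dst, dport, proto, pkts, bytes) for each flow record."""
    version, count, *_ = struct.unpack_from(HEADER_FMT, datagram, 0)
    if version != 5:
        return
    for i in range(count):
        offset = HEADER_LEN + i * RECORD_LEN
        (src, dst, _nexthop, _in_if, _out_if, pkts, octets,
         _first, _last, sport, dport, _tcp_flags, proto, _tos,
         _src_as, _dst_as, _smask, _dmask) = struct.unpack_from(RECORD_FMT, datagram, offset)
        yield (socket.inet_ntoa(src), sport, socket.inet_ntoa(dst), dport, proto, pkts, octets)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 2055))
while True:
    data, _addr = sock.recvfrom(65535)
    for flow in parse_v5(data):
        print(flow)  # in practice you'd aggregate, store, and visualize these
```

Even this toy version hints at where the real pain was: parsing a record is trivial, but a busy router emits thousands of them per second, and doing anything useful with that firehose is another matter entirely.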
Between 2001 and 2011, NetFlow export from routers matured. But the ability to collect it and visualize it did not. There was some open-source software that worked for low volumes of traffic, and some over-priced, feature-poor commercial systems, but nothing that worked well. The rate of growth in traffic far outpaced any improvement in NetFlow tools. I kept thinking that there must be some viable collection/visualization solution out there, but I couldn’t find it.
Meanwhile, starting in about 2008, I began having regular conversations with my good friend and fellow ex-ISP operator, Avi Freedman, about the state of traffic visibility. We agreed that it was a critical requirement to run a network, and that there wasn’t a good solution. The conversations went kind of like this:
Dan: “NetFlow still stinks.”
Avi: “Did you buy the $500k solution that doesn’t do what you need yet?”
Dan: “No, did you build me one yet?”
Avi: “Give me a list of what you need it to do.”
Dan: “OK.”
And that went on for years, every 6 months. And, yes, I sent him the list every time.
By 2012, everything was in the cloud and big data had prevailed (whatever that means). I was running Netflix’s CDN, and guess what? Same tools, same vendor evaluations, same problem: still no solid NetFlow solution.
I ran into Avi in 2013 and he informed me that he was going to build a data platform that could consume massive amounts of NetFlow, make it available in real-time, and provide both a UI and an API. When he invited me to test it in his lab, I said that what I really wanted was to see it running in the cloud. Was the whole thing for real? Frankly, I was doubtful.
By mid-2014 Avi was working with Kentik co-founders Ian Applegate and Ian Pye. He said to me, “Look, we’ve got a prototype. It’s a SaaS platform capable of consuming 1M flows per second with no pre-aggregation, and of returning a query answer in under a second.” So I asked the engine the same questions I’d been wanting answers to back in 1994, and in 1999 and 2004 and 2012… and it worked! I could quickly visualize where traffic was coming from and going to. I could drill into a “bump” on the graph and within a minute figure out whether it was legit. I could see all the unexpected traffic a server was generating, and where it was going. I could see who the heavy users were, who they were talking to, AND tell whether their traffic was expensive bits or cheap bits. In short, it was real.
That early version was just a prototype, bugs and all. But looking under the hood I could see that it had a solid, scalable, open foundation — something that had been missing from every previous NetFlow tool I’d ever evaluated.
A few months later I joined Kentik.
So that’s how I got involved in what I think will be a major advance in the way that network operators are able to understand what’s happening over their infrastructure. Over the next several months I’ll be posting about how to use NetFlow and the Kentik products to get that understanding, which is a key part of operating a successful large-scale network. So we’ll look next time at the top 5 ways to use Kentik Detect to get a 360-degree view of your network.