In today's episode of Telemetry Now, we are back at it again talking about AI, except this time, instead of algorithms and models, we're laser focused on networking, specifically data center networking for AI training workloads, and how it's really an entirely new beast when you compare it to traditional DC networking and even high performance computing networking, if you have experience with that.
So with me today is returning guest, Chris Kane, who is an engineering leader at Arista. And we're gonna be diving into what's near and dear to my heart, moving packets super fast, reliably, and efficiently to make the magic of AI happen. My name is Philip Gervasi, and this is Telemetry Now.
Hey, Chris. Thanks so much for joining the podcast today. It's great to have you on. Actually, it's great to have you on again because we had you on, along with Jason Ginter, just a few weeks ago or a month ago, something like that, to talk about the USNUA.
So, this is a great class reunion for sure.
Absolutely. Thanks. I appreciate the opportunity, Phil. Yeah, we could probably have a get together every other week at the rate of technology evolution right now. We could get together at a regular cadence just in an attempt to keep up.
Actually, if you weren't at GTC last week, one of the NVIDIA VPs of infrastructure was giving a presentation in one of the sessions, and she had up there the Red Queen effect from Lewis Carroll's Through the Looking Glass, where the concept is you have to run at full tilt. Right? You have to run at full speed just in an attempt to stay in the same place. And I think that was very accurate as to the workload we're all under right now.
Yeah. That is very true. It's something that I experience day in and day out right now, trying to keep up with, you know, the latest models, the latest changes on the networking side with regard to training models, what's happening with all of the adjacent technology in the AI space, trying to find the niche of use cases and the applicability of artificial intelligence to the NetOps space, and really working through that, which I still think is in progress. So there is just so much to take in.
And then not only that, Chris, I am very much involved with studying traditional data engineering right now, because that is, like, the vast majority of the activity that happens with an AI initiative. So I have been studying data pipelines and transformation and storage and how to do all that stuff, just because I find it critical in understanding how to then apply these kinds of technologies to network operations, and how to not fail and have the never ending POC. Right? You know, that POC that just doesn't go anywhere because it's just too much of a lift to actually go into production.
So I can't keep up. But at the same time, I love it. I love it because it feels like it's something new I can finally latch on to. Like, do we really have to talk about BGP more?
Maybe I shouldn't say that on air. Right?
No, no. You're absolutely right. I've talked to people about this over the last couple years. I've been very much a network engineer through and through.
But in these environments, you've got to pull back. You've gotta recognize what's happening with the compute, with the storage, with the power. It's really the entire data hall. Right? You really do need to have an appreciation for everything that's changing there. And I've come to start calling the half life of this about six months.
You've gotta stay on top of it, because within about a year or so, some of the approaches and strategies do roll over. So it's, you know, kind of back to the Red Queen effect and taking it all in.
So let's focus in on the networking piece. I mean, you mentioned power and cooling, and there's the actual overall architecture and how we're going to design racks moving forward and things like that. I get all that. And I was at DCD Connect in New York, where those were very top of mind for a lot of people.
Really interesting stuff. But so was networking, because, as you know very well, but for our audience's edification, there are a couple of things to consider here. Number one, I don't know if this is still true, but for a while the statistic was that thirty to almost fifty percent of an AI training job was time spent in networking, packets in the network.
Maybe it's a little bit different now, but that was the number for a while. And because we are talking about distributed compute to make this magic happen, the network is more critical than ever to artificial intelligence model training, to job completion, and to doing it efficiently in a way that makes sense. And not just training, but also inference, of course. So there's that, which we'll talk about.
So I wanna get into that first, Chris. Why is data center networking for artificial intelligence unique and different from traditional networking? And if I can put a finer point on it, maybe even different from HPC, high performance computing networking?
Yeah. That's a good call out. It is different from HPC. And you'll hear people talk about how it's very compute intensive.
It's very data intensive. But I'm not sure that those descriptions land. So if we take a step back for a minute and try to think about what these data scientists are trying to accomplish and what they would like to have, I like to paint a picture.
Imagine you're walking into an empty data center, just completely empty. It's like a fifty thousand square foot data center. You know, the walls are like a hundred and twenty five feet by a hundred and twenty five feet, completely open. And then picture, kind of at eye level, this huge processor, just one big ginormous processor that's so big it touches all four walls.
Right? It just reaches clear across the entire room. And then above that, picture a similar shelf that's just one big huge chunk of memory. Right?
Laid out at equal size and equal density above that ginormous processor. Right? And then you picture connectivity paths, or interconnects, between that ginormous processor and that huge chunk of memory. Right?
That's what they'd like to have. That would be the ideal situation for them. And obviously, even with companies like Cerebras, with the wafer scale engine, we're not capable of building something simply that big for them. So the next best thing is to build these clusters.
And then by the very definition of that term, cluster, we have to consider that, hey, I've got one of these things, and somehow I've gotta wire it up to ninety nine thousand nine hundred ninety nine other ones, right, for the hundred thousand size clusters that we have today. But anyone that followed GTC last week saw how they intend to build: we're gonna go from servers to whole racks. Right? These are entire systems where you're gonna have a rack, or now possibly even two racks, where the concept is the servers are on line cards, or on shelves. And the networking's similar, where it's on a line card, and you're gonna slide those line cards in and out of these racks, if you will, that are just gonna be ginormous.
That one with the intent of coming out in, what is it, the second half of twenty twenty seven with Kyber, you know, six hundred kilowatts of power required for that ginormous system. So you start off with this idea of, hey, we want to cluster these things together, but understand that it's not just the average data center where a system might talk to another system occasionally. This is more like synchronized swimming, right, where everybody's gotta work in unison, everybody's gotta pull their weight, and it all has to happen in real time.
Right? And they've talked a lot lately about the increased compute capacity that we're gonna need as we move into training, and as we move into things like agentic AI, and the intents they have with things like robotic AI, where it's gotta be real time. And so they envision that we're gonna need about a hundred times the compute capacity that we even have right now. So hopefully that paints this picture of, hey,
I've got a data center, all these things have to be cabled together, but the expectation is that they're not talking to each other incidentally or infrequently.
They're synchronized. Right? And they have to work in unison to complete those jobs.
So synchronous communication for GPUs talking to each other. And then, as I understand it, there are going to be model training clusters or pods of GPUs that will send the result of that computational activity to another pod to do something with. So there are these large flows as well. So there's a couple of things this suggests to me. One, the synchronous activity, because you said the words real time, and I know in networking it's always quote unquote real time.
Right? It's as near real time as you can get, because there's always a problem with loss, latency, jitter. So I have to assume that, and that's always been a thing in networking. I'm sorry, in data center networking.
That's always been a thing in high end data center networking: lossless connectivity and all these things. But Ethernet is inherently lossy. And, you know, there are problems with load balancing and oversubscription of links and things like that. So what are the specific problems that we need to solve?
I mentioned a few, and I know we're gonna talk about those. Yeah.
In this kind of data center networking, how are we solving them? Why don't we go into that?
Yeah.
I'd say over the last two years, a big focus has been on the load balancing aspect. So, again, stepping back and thinking about how we connect these together, yes, there's still a bit of a discussion back and forth about Ethernet versus IB.
InfiniBand is not gonna go away. It's still gonna be here. I think, though, as we build these clusters bigger and bigger and ever bigger, we go from one data hall to another. Right? We're getting to the point where some of these clusters, and it's not applicable to everybody, but some of these clusters are so big they can't fit in a data center. We've gotta have a campus, and we've gotta connect them together.
Right? And you'll see Ethernet leveraged more and more there. So the industry prediction from Dell'Oro and the 650 Group and others has been, hey, Ethernet port shipments are gonna go way up. Right? It's not that IB is gonna stop. It's just that Ethernet will eventually exceed the number of InfiniBand ports deployed in these environments.
Maybe more for greenfield deployments, things like that.
So moving forward, it'll be dominant for sure.
Yeah. Absolutely. And when you roll these out, you also have to think in terms of symmetry. We have to have symmetry everywhere. And that actually came up again, you know, another topic discussed at GTC last week was a focus on symmetry even in memory.
And because I need to have symmetry in memory, I'm unlikely to mix and match generations of accelerators inside of a single cluster.
And so for the network, what that symmetry means is we need to have the exact same number of links and the exact same bandwidth from the leafs to the spines. And in the case of the bigger builds, if you're in a three tier Clos network, you're gonna aim for similar symmetry from the spines to the super spines. And at least from the leaf to the spines, we're doing something a little different than you and I grew up with, which was oversubscription.
You know, that's beyond trying to deal with load balancing, attacking it with symmetry, attacking it with some enhancements to Ethernet, and then a couple vendors are gonna have their own little twist on that and how they do packet spraying.
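To make that load balancing point concrete, here's a rough sketch, with invented numbers and no vendor's actual implementation, of why classic flow-based ECMP struggles with AI traffic and why per-packet spraying helps: a training job produces a handful of long-lived elephant flows, so a static hash can pile several of them onto one uplink while others sit nearly idle.

```python
from collections import Counter

def ecmp_uplink(flow_tuple, n_uplinks):
    # Classic ECMP: hash the flow tuple once and pin the whole flow
    # to a single uplink for its entire lifetime.
    return hash(flow_tuple) % n_uplinks

n_uplinks = 8
# AI training traffic: a few long-lived elephant flows, not many small ones.
flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 4791) for i in range(8)]

pinned = Counter(ecmp_uplink(f, n_uplinks) for f in flows)
print("flows per uplink (ECMP):", dict(pinned))
# With only 8 flows across 8 uplinks, hash collisions are likely:
# some uplinks carry 2-3 flows while others carry none, so the
# fabric runs unbalanced even though total capacity is sufficient.

# Per-packet spraying instead distributes every packet round-robin,
# so load evens out regardless of how few flows there are.
packets_per_flow = 1000
sprayed = Counter()
for i in range(len(flows) * packets_per_flow):
    sprayed[i % n_uplinks] += 1
print("packets per uplink (spraying):", dict(sprayed))
```

Real implementations also have to handle the packet reordering that spraying introduces, typically in the NIC or the transport layer.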
But the next thing to deal with is that, you know, you really need no oversubscription whatsoever between the leafs and the spines.
So you're gonna match every bit of bandwidth that you connect down to those accelerators, one for one, between the leaf and the spine. So now we're into this realm of: I have to have a ton of connections, a ton of cables, a ton of optics.
Right? And so beyond load balancing, which I think we're pretty close to solving, I think everybody's got something in flight now or already available to address the load balancing challenge, the next one is: with the sheer number of interconnects that we have, what do you do to help minimize disruption of service when it comes to link flaps? Link flaps have been a huge problem.
And there's a couple ways to address that. One is to overprovision. Just throw more of them out there to help with that.
Another approach is, you know, what do we do with lasers and the reliability of the actual optics themselves. And then another part of it kind of gets into your realm of things, which is the importance of visibility.
Right? How am I monitoring this? How exactly do we detect a problem? What do I do?
Do I take immediate remediation action? In some cases, the loss of a link or loss of an accelerator will equate to, hey, pull the whole host. Pull the whole node out of the job.
Right? Because, by the way the job is configured, it's better just to remove that entire host than it is to keep it limping along with one less GPU as an available resource. So links, and focusing on links, is really where I feel most of the industry is focused right now.
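As a hypothetical illustration of that remediation decision, here's a small sketch. The function name, thresholds, and action strings are invented for the example; they're not any vendor's actual API or behavior.

```python
# Hypothetical sketch of the remediation decision described above: when a
# link keeps flapping, or a GPU drops out, it is often better to drain the
# whole node from the job than to keep it limping with one less GPU.
# All names and thresholds here are illustrative.

FLAP_THRESHOLD = 3        # flaps inside the window before we act
WINDOW_SECONDS = 300

def remediation_action(flap_events, gpus_healthy, gpus_total):
    """Decide what to do with a node given recent link flaps and GPU health.

    flap_events is a list of flap timestamps in seconds, oldest first.
    """
    recent = [t for t in flap_events
              if t >= flap_events[-1] - WINDOW_SECONDS] if flap_events else []
    if gpus_healthy < gpus_total:
        # A dead GPU stalls the collective for the whole job: drain the node.
        return "drain-node"
    if len(recent) >= FLAP_THRESHOLD:
        # Repeated flaps: take the link out before it disrupts another job.
        return "disable-link"
    return "keep-in-service"

print(remediation_action([10, 120, 130, 140], gpus_healthy=8, gpus_total=8))  # disable-link
print(remediation_action([10], gpus_healthy=7, gpus_total=8))                 # drain-node
print(remediation_action([], gpus_healthy=8, gpus_total=8))                   # keep-in-service
```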
There's a lot of discussion around observability and visibility, because even if we take smaller numbers, let's forget the hundred thousand, that just doesn't apply to all of us. Let's say you were to build a cluster of two thousand GPUs. You're talking about two thousand NICs that need two thousand cables or optics. Right?
And then on the leaf side, the other end of that, I need two thousand optics, or the other end of those cables, at the leaf. And then from leaf to spine, rinse and repeat. I need to match that one for one, so I'm gonna have two thousand interfaces, optics, transceivers, cables between the leaf and the spine. So it quickly adds up when you say, hey,
I'm aiming to do a cluster of two thousand. You know right away, okay, I've got four to six thousand optics that I could be dealing with in that environment, just for that cluster. And so when someone comes up to you and says, hey,
I'm struggling with this job. It's running slower than it ran the last time. Could you troubleshoot it? It could be a needle in a haystack.
I mean, it could be really hard to understand. Well, we have two thousand resources here; which of the two thousand are you using for this job? That tells me which interfaces to look at, or which leafs or spines to begin troubleshooting. So there's that visibility piece, you know, near real time, based on streaming telemetry, and how we react to those problems so that we minimize the job disruption.
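That back-of-the-envelope optic math can be sketched like this. Whether a two-thousand-GPU cluster lands nearer four thousand or eight thousand optics depends on assumptions such as whether the short host-to-leaf runs use copper DACs instead of optics; the function below is illustrative only.

```python
# Back-of-the-envelope optic counts for a two-tier, non-oversubscribed
# fabric, following the numbers from the conversation. Illustrative only:
# real counts depend on port speeds, breakout cabling, and which hops
# are copper versus optical.

def fabric_optics(gpus, host_links_optical=True):
    # Host side: one transceiver at the NIC and one at the leaf port,
    # unless those short runs are copper DACs (then zero optics there).
    host_side = 2 * gpus if host_links_optical else 0
    # Leaf to spine at 1:1 (no oversubscription): as much uplink bandwidth
    # as downlink, with an optic at each end of every link.
    leaf_spine = 2 * gpus
    return host_side + leaf_spine

print(fabric_optics(2000, host_links_optical=False))  # 4000: copper to the hosts
print(fabric_optics(2000, host_links_optical=True))   # 8000: optics on every hop
```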
Yeah. Yeah. Absolutely. And, you know, speed is always top of mind for a lot of engineers, talking about four hundred, eight hundred gig, and then moving into one point six terabits per second, which is what I'm hearing coming down the pike not too long from now, and reading some stuff from the Ultra Ethernet Consortium. Really neat. But it really isn't just about speeds and feeds. It really is about reducing that job completion time and addressing those dropped packets, which can compound exponentially, really hosing a training run. It really is about lossless connectivity. It really is about making the most of every single link available, as you said, with load balancing and being as efficient as possible. And you mentioned link flapping, you know, having this many optics, and not only the optics, but the associated cables and NICs and all that.
There are just so many points of failure. So we're seeing things like low power optics, and then options that remove optics from the equation altogether.
But going back many years, I mean, we had InfiniBand, which was very high bandwidth and inherently lossless. Right? And inherently, you know, had no issues with latency.
And it was centralized. You had a subnet manager, software you installed on one of the switches or whatever, and it was great for high performance computing. So what's the problem with InfiniBand moving forward? Now, I kind of already know the answer. I just did a debate with somebody on this, and we ended up both agreeing that Ethernet's moving forward, or rather, we're moving forward with Ethernet, for these reasons. But I'd like to hear from you, since you are in the weeds with customers doing this every day.
Yeah. I think there are two. There's one that's directly technical and maybe one not so much. But again, huge respect for InfiniBand in its place. And, you know, there is a difference in the use of InfiniBand when we talk about the difference between an HPC shop and a true AI factory, as we're wanting to call these things now.
And, you know, you saw the focus on co packaged optics, the CPO made available right away on the Quantum line from NVIDIA as an example, with something similar coming to the Spectrum line, the Ethernet line, I think later next year.
Other vendors are doing LPO or LRO or LDO. You'll see that represented a couple different ways. And we could talk in a little bit about specifically what those gains are. But I think part of the problem is just the number of engineers out there that understand InfiniBand at all.
Right? The human resources we have that can design and build and operate these. You know, it's one thing to design and build it; in the scheme of the life of a network, that's about five minutes worth of work. Right?
I mean, the real meat of this is in the operations, the day two, as we say.
And so there's the lack of engineering familiarity, the lack of tooling out there that's familiar with IB, and then there are the more technical aspects, which are around scaling. As we get bigger and bigger, just leaving the question out there: how well does InfiniBand, as it is today, scale to the idea of supporting a hundred thousand, two hundred thousand, three hundred thousand endpoints? And there's no let up with this idea.
There's a goal of a million endpoints in these clusters. They started talking about that last year, and they haven't let it go. There still seems to be this march towards getting to a million. So I think what we're leveraging a lot is, you know, living vicariously through the CSPs, the CoreWeaves, the Googles, the Metas, and others that are building infrastructures that big, and the lessons they're learning.
And for them to be able to say, hey, if I build a cluster of twenty six thousand GPUs together with InfiniBand, and then I put that side by side with twenty six thousand on Ethernet, what does that look like? Can I survive on Ethernet? Is it as good, or is it good enough?
And in production, right, truly out there moving production packets, we have some of those cloud service providers that have been very transparent that, yeah, Ethernet's just fine. We know its weak spots. We know where it's at today. We know how to tweak that and manage that.
And then we expect it to become more plug and play as folks like the UEC, right, the Ultra Ethernet Consortium, help us with transport protocols, putting some more smarts into the NICs that are involved in these environments.
So I think that'll close the gap on some of the performance differences between Ethernet and IB. And then, going back to the original piece, there's the network engineer awareness and experience with the technology. The network engineers who are gonna inherit these are gonna say, hey, I'd much rather have Ethernet.
I understand it. I know it. My tools already know it. My team's staffed up for it.
In a world of automation, we already understand it, right, as we try to increase our network automation. So I think, ultimately, it's just the scale that catches up to it, whether that's the scale of one cluster or the scale of how many of these clusters in total get deployed on the planet. Right.
I mean, I agree with the day two operations aspect, and with the overall, long term total cost of ownership of operations for a data center network that's doing this. And engineers that know InfiniBand that well are far fewer, like you said. But not only that, who's running an InfiniBand front end network? You know, nobody.
It's almost always... so you're now deploying gateways. You're now managing different networks, with the way that things are going. And I do know that it is very difficult to do these kinds of bake offs, where you can say, let's do a really detailed, scientific performance test and see what's better. And there aren't that many to look at.
But I do know, for example, Meta did that with Llama, Llama two, I believe. Right? They did a bake off, and the performance differences between Ethernet and InfiniBand at that time were negligible, statistically zero. Though that's not officially zero, because as you scale up to very large sizes, those very, very small differences actually can be highly impactful.
But I'm gonna throw a number out there: that's like two percent of AI workloads. For the vast, vast majority, the performance of Ethernet is one hundred percent fine. Not only that, it's really comparable, and probably outperforms InfiniBand in some ways. I get that there's some complexity there; if you're doing just a pure, no configuration bake off, you're gonna have some trouble, because Ethernet is inherently lossy.
I get that. But, you know, we have things like PFC and ECN, since we don't have adaptive routing like we have in InfiniBand. So there are things there, these QoS components we have to add.
But we know that stuff. It's well understood technology, and it solves that problem. And I feel like the cost of adding that little bit of complexity is well worth the benefit, considering that we have eight hundred gig Ethernet and, you know, terabit Ethernet coming. It's still keeping pace with what we need for bandwidth.
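As a rough illustration of one of those knobs, here's a sketch of WRED-style ECN marking of the kind typically paired with DCQCN-type congestion control: below a low queue threshold nothing is marked, above a high threshold everything is, and in between the marking probability ramps linearly. The thresholds here are invented for the example, not any vendor's defaults.

```python
# WRED-style ECN marking curve (illustrative thresholds, not vendor defaults):
# no marking below K_MIN, mark everything at or above K_MAX, and ramp the
# probability linearly up to P_MAX in between.

K_MIN_KB, K_MAX_KB, P_MAX = 100, 400, 0.2

def ecn_mark_probability(queue_depth_kb):
    if queue_depth_kb <= K_MIN_KB:
        return 0.0
    if queue_depth_kb >= K_MAX_KB:
        return 1.0
    return P_MAX * (queue_depth_kb - K_MIN_KB) / (K_MAX_KB - K_MIN_KB)

for depth in (50, 250, 500):
    print(depth, "KB ->", ecn_mark_probability(depth))
# 50 KB -> 0.0, 250 KB -> 0.1, 500 KB -> 1.0
```

The sender's congestion control then reacts to the marked packets by slowing down before the queue overflows, which is how you approximate lossless behavior on inherently lossy Ethernet.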
And because it's a ubiquitous technology, all those reasons. And here's one that we didn't mention: you aren't locked into one vendor.
I know, you could end up with a single vendor anyway.
I personally don't think that's the end of the world, though, because, I don't know about you, Chris, but when I was designing and building networks, including data centers, nobody cared about being multi vendor. Like, being completely vendor agnostic wasn't a thing.
It was always in a blog post or a podcast: we want to be vendor agnostic. And then I talked to customers, big customers, and they said, no, no.
We don't want that. We have, you know, operations and TAC and documentation. We want one vendor. What I found was that a lot of my customers wanted a single vendor for, like, the data center, and perhaps a different vendor for security, a different vendor for campus, a different vendor for wireless.
They were agnostic in, like, these network blocks, not switch by switch in their data center. I never saw any appetite for that, ever.
Right.
I mean, that was just my experience.
No, you're not far off. I mean, some people might call those, you know, infrastructure silos.
Silos for those different areas of the network. But with InfiniBand, you are locked into a single vendor, and if that's a problem for you, then, well, that's a problem for you.
Well, and I think some of that might be a hangover from, you know, the supply chain issues in the not too distant past. Hey, I'd like to be able to get equipment from more than one organization. And then, you know, Gartner and others, I think there were philosophies at one point about leveraging one vendor against another, right, to drive down your cost as you negotiate with your account team.
But, yeah, you don't wanna pick on InfiniBand just for being proprietary, just for being from one vendor. I think, in general, betting against Ethernet has been proven to not be a good bet.
If you look at it, we have precedent for what Ethernet has been able to absorb. Think about IP telephony. Right? We're talking to each other, we're doing video right now, it's IP packets, and it's predominantly over Ethernet.
Video, you know, if you watched the summer Olympics last year, the majority of that broadcast stuff... broadcast has been moving to Ethernet for some time now.
This is just another one that you can roll in. Right? And each time, Ethernet's had to adapt. There've been little changes. Hey, adjust the transport protocol. A little more work with QoS, or change our QoS algorithms, perhaps, to deal with those, give strict priority to the voice traffic.
So it's adjusted. Right? It hasn't necessarily been ready to absorb it right away. But with all those laid out as precedent: we've seen this before. Something comes along, Ethernet just kind of expands itself and its capabilities, and it's able to swallow that next technology up.
Yeah. And I really do believe that the little bit of complexity added in configuration, in making Ethernet work the way we want it to work for this particular use case, is more than outweighed by the benefits.
I really believe that. And that's one of the reasons, not the only one, but one of the reasons, that I think Ethernet is the way moving forward. And it's not just my opinion. That's kind of the consensus of the industry.
Yeah, it is. And I really appreciate the work by Dell'Oro. I recall they put out a report last fall where they predicted a crossover in the number of ports shipped. Right?
Where Ethernet would exceed IB. And I think, originally, that was expected to be, like, maybe twenty twenty eight. And then just four months later, in February this year, a new report comes out, and they're like, we need to pull that in. This is accelerating.
And so they pulled the prediction of when that crossover occurs in by, like, a year. So I think it's gonna continue to be pulled in. There's no end in sight here. We're, again, Red Queen effect.
We're running as fast as we can. Things are coming out very quickly.
So I think that crossover will happen sooner than, you know, maybe even the industry experts who pay attention to those things will have predicted. Yep.
You know, with all the talk about QoS and PFC and ECN, I think that's where you'll see a lot of vendors coming out with enhancements. And this, to me, comes back to why I want every network engineer to be excited about this. I mean, I'm jazzed because I'm lucky enough to be around or adjacent to a lot of these projects.
But with these kinds of pressure cooker situations, where you're solving problems never seen before, you're building networks of sizes we've never built before, ever, nobody has this all figured out. Okay, we know what it takes to get it to work at thirty thousand. Now we know what it takes to get it to work at fifty thousand. Hey, here's some adjustments to get a hundred thousand GPUs working together.
Out of that are gonna come enhancements that every average network engineer is gonna benefit from. So you mentioned having to tweak QoS, PFC, or ECN, or so many other things. You already see, from multiple vendors, fabric solutions that have come out now. Right?
Like, in our case at Arista, right, we've done that. Juniper's done that, where there's more of a plug and play approach to it, where, hey,
Okay. Now that we know that these are the things you need, let's put some defaults in. Let's make some of this easier. Let's see how much of the network can configure itself as you get started.
Right? So we're trying to work on minimizing those touch points that you need to hit when you build it. And I think even more important, again, back to day two lasting forever: we don't want you to have to tweak over and over again down the line. Right?
As these jobs change, as the workloads change, as we see differences in communication libraries, and we're already hearing about those. You know, NVIDIA talks now not only about NCCL, but about NIXL, right, where in the inference world they're probably gonna be using different communication libraries than what they've used in the training world. So you'll see vendors helping with that. As these lessons are learned, features will be developed.
And ideally, hopefully, your networking vendor just makes that available to you. Right? Hopefully, it's not some special one off piece of hardware or software that's aimed just at the cloud service providers or the neoclouds, but benefits that hit everybody. And I think that'll happen in your space, too.
Right? The monitoring and visibility stuff. It's like, hey, as we get better at understanding that we need to watch for queue drops, that we need to pay attention to how long we're queuing a packet before it actually makes it to that egress port.
That added visibility ought to ultimately end up driving some automation, too, some AIOps type stuff in the network that we can take advantage of.
It really shines a spotlight on the visibility and operations side of things.
Sure. Sure. I actually think that's one of the easiest, and that's not the right word, but I think that's a great use case for AI in network operations, at least as a starting point. Obviously, there's management. There are a lot of things you can do.
I also think it's ironic that we're talking about training AI models and how the network is so important, yet the network doesn't even take advantage of the model it just trained. That's true.
Not enough yet, no. Right. Shame on us.
We're so busy trying to build these things and focus on the solutions that we... But applying AI to network operations is tough, because, like we've discussed, we deal with near real time telemetry, such a variety of telemetry, and visibility and observability.
But I do think that's a great use case to start with, simply because, you know, it is a practice in data engineering and data science and data analytics. And so, you know, ingesting data and then using AI models, or even just statistical analysis, algorithms and ML models, applying that to get some insight, to understand things at a scale that we really can't manage manually.
I think that makes a lot of sense as a first step. And so you don't need to get scared of, like, oh, I'm gonna hand the actual management of my network over to some, you know, team of AI agents and then lose all control, all these kinds of things. Which is cool if that happens down the road. Fine. But right now, today, I think starting with things like just getting information and then being able to analyze it properly, at a greater scale and a greater depth and insight than we had before, is a great use case for AI in NetOps.
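As a hedged illustration of that "analysis before agents" idea, even a simple z-score over recent counter history catches a gross outlier. The numbers here are synthetic; real inputs would be streamed telemetry.

```python
# Flag a counter sample that deviates strongly from its recent history.
import statistics

history = [1000, 1020, 980, 1010, 990, 1005]  # e.g. recent packets-per-second samples
latest = 4000                                 # new sample to evaluate

mean = statistics.mean(history)
stdev = statistics.stdev(history)
z = (latest - mean) / stdev   # standard deviations from the recent mean
anomalous = abs(z) > 3        # a common, if crude, anomaly threshold
print(anomalous)
```

This is deliberately the simplest possible technique; the point is that useful insight starts with plain statistics over well-ingested data, not with handing over control.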
But what do you think, then, is the near future of data center networking for these kinds of workloads? What do you think that looks like? We could talk about the Ultra Ethernet Consortium, of course.
And you talked about how every six months is pretty much a revolution of technology. So let's stay in the near term. What are you seeing happening in, you know, conversations with customers and in your own research and reading?
Yeah. So there you're gonna hear if you haven't already, you're gonna hear more about scale up versus scale out.
That was emphasized at GTC last week, and that's been brought up, I'd say, since about q three, q four last year; people have been talking about that a little bit more. And in order to build these systems to respond in the way that we want them to, a lot of times that's gonna have to be a focus on scale up. So in the case of scale up, you're gonna hear a lot about proprietary solutions and then industry, you know, vendor agnostic solutions. You're gonna hear about UALink and UXL and CXL and a lot of things that are happening inside of what's moving from a system to a rack. So instead of just having, hey, here's a server with eight GPUs in it that needs eight four hundred gig connections. Or later this year, we're anticipating the eight hundred gig NICs to be available.
And then you mentioned one point six terabit? I think we'll see that, maybe slightly available at the end of this year, but probably really more of the focus of next year for wide availability of one point six terabit interfaces.
So, yeah, the UEC is gonna help on the scale out side. So scale up, what's happening in these racks, or multiple racks that are clustered together, and then scale out, you know, how I get across the entire data center, and then subsequently, for the ones that are just doing the crazy builds, how do I get multiple data centers on the same campus to be working together?
I think other areas to keep an eye on are what's happening with the interconnects inside of these systems. Because as a network engineer, you know, even when we're not directly responsible for something, we sure do spend a lot of time helping people troubleshoot and pinpoint where a problem is. Right? I mean, if something's written on a wire, it's incredibly important to us, we have that visibility, you know, Wireshark is your friend. And I think we need to apply that even to the interconnects between, you know, the GPU to GPU communication, the GPU to memory, the GPU to CPU, and those little other networks that are running inside of these systems.
Like in the case of NVIDIA, you've got your NVLink you should be aware of. In the case of AMD, Infinity Fabric. You have your PCIe. You know, they're working to increase speeds for PCIe for us. Right?
And then you get outside the host. Right? And then you get from leaf to spine, or from NIC to leaf. Right.
Yeah. And so I think, you know, focusing on those. Another area that is right around the corner, we saw the CPO announcement, the co packaged optics announcement from NVIDIA for their Quantum line, their IB line. You're gonna see other vendors coming out with co packaged optics.
You're also gonna see an alternative to that, which you'll see referred to as either LDO or LPO, linear drive or linear pluggable optics, where it's a similar concept, but believed to be more serviceable.
Ultimately, what's happening there is the brains of the optics are being moved inside, to the switch ASIC itself. And that gains us a couple things. We've been talking about power. Right? Power is killing us in these environments. A lot of data centers built yesterday do not have the power, to say nothing of getting ready for liquid cooling. And so if you move the brains of that transceiver into the silicon, then I don't need as much power for that component.
And then, also, without the brains being there, it helps on that latency discussion. Like, you could go from maybe a ballpark of a hundred nanoseconds per transceiver in a path down to about one nanosecond for that transceiver. So we have a huge gain in both speed and power, which is, you know, bigger, better, faster, more. That's what everybody's after.
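Using the ballpark figures just quoted, roughly a hundred nanoseconds for a conventional DSP-based transceiver versus about one for a linear or co-packaged optic, the per-path arithmetic looks like this. The hop count is an assumed example, not a measured topology.

```python
# Illustrative arithmetic only; the per-optic latencies are the rough
# figures quoted in the conversation, not vendor-measured numbers.
NS_PER_DSP_OPTIC = 100   # conventional retimed/DSP transceiver, ~100 ns
NS_PER_LINEAR_OPTIC = 1  # co-packaged or linear-drive optic, ~1 ns

def optics_latency_ns(optic_count: int, ns_per_optic: int) -> int:
    """Latency contributed by the optics alone along a path."""
    return optic_count * ns_per_optic

# Assume four optics on a NIC-to-leaf-to-spine path, for illustration.
print(optics_latency_ns(4, NS_PER_DSP_OPTIC))     # 400 ns with DSP optics
print(optics_latency_ns(4, NS_PER_LINEAR_OPTIC))  # 4 ns with linear optics
```

Two orders of magnitude on the optics alone, which is why the power and latency argument for CPO and LPO keeps coming up.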
Right. So I think keeping an eye out, maybe you don't have a project right now, but doing the homework on CPO and LPO and just what's available from your preferred networking vendors and what that means to you as you build these. You've gotta keep going back to: this is not your average data center. I'm not plugging this into my existing data center connectivity that's twenty five gig or a hundred gig.
Right? We've already talked about how eight GPUs in one of these nodes means I've got likely either four or eight four hundred gig connections I'm being asked for. Then you add the CPU at twenty five to a hundred gig, then you add storage at two hundred to four hundred gig. Quickly, you realize, holy cow, we're at like twelve connections or more for every one of these servers, and they show up at incredibly high speeds.
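That back-of-the-envelope tally can be written out. The counts and speeds below are assumptions picked from within the ranges just mentioned; real designs vary by platform.

```python
# Hypothetical per-server link tally: name -> (link count, Gb/s per link).
connections = {
    "gpu_backend": (8, 400),   # one 400G NIC per GPU
    "cpu_frontend": (2, 100),  # front-end / management links
    "storage": (2, 400),       # storage network links
}

total_links = sum(count for count, _ in connections.values())
total_gbps = sum(count * speed for count, speed in connections.values())
print(total_links, total_gbps)  # 12 links, 4200 Gb/s per server
```

Twelve cabled connections and over four terabits per second for a single server is the scale mismatch with a legacy twenty five or hundred gig fabric.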
Right? And then you get into, you know, if you haven't studied or learned about the front end versus the back end, there's some things here from the industry you could really be doing due diligence on right now to get an understanding, so that you don't feel overwhelmed when these projects actually do land on your desk. Right? The network engineer is only one step better than security, I think, in a lot of projects, in that, you know, we're next to last to hear about these projects.
We're not often engaged early. And so, to minimize that shock and the homework you'll have to do, I think paying attention to general industry updates like CPO and LPO and the four hundred gig to eight hundred gig, what does that mean? You know, doing that homework now in preparation for their arrival will help. And just one more thing in that area of doing the homework, I would mention, I feel like this is similar to the automation story that we see playing out, Phil.
I I don't know how it is with your customers, but with mine, let's say I have a network team of seven folks. Right?
Not all seven become automation gurus, at least not that I've seen. Typically, what happens is there's one or two people on that team that gravitate towards the automation. It just makes sense to them. They fall in love with it.
They start playing around with it. And then they're doing the automation work for the entire team. Right? And then you find the other five folks bringing ideas to them, and the two automation specialists roll it out.
I think this is similar. This is so different than the average network. There's just so much to be aware of, the compute, the storage, the memory, the jobs, that I think on these teams, you're probably gonna find one or two people that gravitate towards it. Let them do this homework now.
Right? Because you'll be better served doing that due diligence now than in a scramble at the last minute, just hoping your vendor can design something, you know, appropriate for you. So in that case, I feel a little comfort that automation has kind of shown us how that works. Right?
That we kinda understand, hey, when there's something entirely new that we're trying to merge in here, that maybe the entire team doesn't have to take that on. We can just designate a couple specialists and let them run with it.
Yeah. Yeah. And you bring up a couple points. I mean, one thing you mentioned is that there is a difference between the networking happening between two GPUs, and then between switches, and then between entire pods of GPUs and another pod or cluster of GPUs. Right? And then between a front end and a back end, between the storage network and, you know, the actual computational activity that's happening. So those are all different, like, network blocks there within the data center now that we have to contend with and address.
And you mentioned the changing nature of optics and protocols.
You know, we didn't discuss, like, RDMA, but that was like a solved problem. We had InfiniBand RDMA, memory to memory. All our problems were solved. And we realized, well, not exactly.
So we put that on Ethernet and we got RoCE, RDMA over converged Ethernet. Then we had a version two because we wanted routing and some other things. And so every time we solve a problem, we're like, alright, we're good. And then, you know, a few months later, we're like, we're not good.
We need more. And so it's amazing to me how, even in the networking space, it's being driven by the applications and by the artificial intelligence itself, the results and the use cases for it. It's not networking for networking's sake.
We are supporting that, but it's right in line. The development in our niche of tech has been keeping pace right alongside the latest model, DeepSeek r one, and then ten minutes later, DeepSeek, you know, r three. And it needs this now. What about, like, new standards? You know, I know the UEC has been talking about the Ultra Ethernet transport protocol.
Are you familiar with that at all?
Yeah. A little bit. Yeah. I think we're all anxious to see that roll out, get stamped, and get implemented.
I think a lot of the enhancements will probably end up in the mix. Right? It's not just the network switches and the network vendors that have to adjust. I think they're pretty ready for it.
A little bit of software change, I think, is all they're gonna need. So I think networking vendors are gonna be able to absorb those recommendations and newer transport formats and some updated protocols for us. The heavier lift, I think, is aimed towards the NICs themselves. But, yeah, again, these are enhancements.
These are things that are gonna make Ethernet better across the board. It's not like Ethernet's just, you know, gonna be solely aware that this is RDMA traffic versus, you know, web or CIFS traffic, or whatever you normally see in your environment.
But there's just this, you know, crucible of demand on Ethernet, and it's not just the UEC, right, that'll be a part of enhancing Ethernet for us. The individual networking vendors are gonna come up with, you know, hopefully interoperable, not proprietary, solutions there as well. And, hey, how great to see NVIDIA join the UEC and NVIDIA join the Ethernet Alliance.
I mean, that's another group. Like, that's a group I had never followed before. I attended a conference last fall, the Ethernet Alliance Forum. It was great to hear the SerDes engineers talk about how they're going from two hundred and twenty four gig to four hundred and forty eight gig SerDes lane speeds, right, to get us our eight hundred gig and one point six, ultimately three point two terabit interfaces.
So it's gonna come from a bunch of angles, not just the UEC.
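As a rough sketch of why those SerDes rates matter: an Ethernet interface speed is essentially lanes times per-lane payload rate. The lane pairings below are illustrative, not a spec, and they assume a 224 gig class SerDes delivers roughly two hundred gig of usable payload per lane after encoding overhead.

```python
# Rough sketch: Ethernet interface speed as lanes x per-lane rate.
def interface_gbps(lanes: int, lane_gbps: int) -> int:
    """Aggregate port speed from electrical lane count and per-lane rate."""
    return lanes * lane_gbps

print(interface_gbps(4, 200))  # an 800G port from four 200G-payload lanes
print(interface_gbps(8, 200))  # 1.6T from eight 200G-payload lanes
print(interface_gbps(8, 400))  # 3.2T implies ~400G-class lanes, per the talk
```

That's the connection between the SerDes engineers' work and the port speeds network engineers will actually cable up.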
Absolutely. And integrating with things like power and cooling and the build of the racks and everything like we said. So it's amazing. It's an amazing time of these things coming together.
And, you know, so I'm looking forward to the speeds more than anything else, because as a traditional turner of a wrench, virtual and physical, configuring routers and data centers in the cold and hot aisles, I love seeing those speeds. Yes. I get it. We wanna eliminate out of order packets and dropped packets and latency and all these things.
I get it. But it is so cool to hear speeds like one point six terabits in my data center. You know, that's pretty amazing. I remember in data centers jumping from one to ten gig, and then ten to twenty five, and then forty and one hundred, and just being enamored with this, thinking to myself as I'm designing the project, they're never gonna use this.
And now, the network really, really is a greater potential bottleneck than ever before. And I know, you know, there are times when compute was the bottleneck, or storage, or things like that, but it really does come down to networking because of the way these workloads happen today. So really interesting stuff.
You're spot on. Yeah. Even with the NVL five seventy six. Right? The announcement, it's a couple years out for those racks, and the focus on scale up, you know, but that's still, if I could say this, that's still only five hundred and seventy six.
You know, you're gonna end up with a lot of those cabinets still trying to talk to each other. So you're right. Scale out is still gonna matter. Mhmm. And, you know, the excitement of being able to solve those problems and being involved around those projects and seeing that technology unfold. And again, even if you're not directly involved with them, the benefits will hit everybody, whether that's in the performance of your Ethernet switches, the configurations, the features that are available, but also the visibility. Right? And the other benefits that'll be applicable all over my infrastructure.
I think we're blessed. I appreciate that this has happened. I mean, I really haven't seen this kind of energy and excitement around networking in what feels like a couple decades.
So I think, you know, it's a boon to the Ethernet side of things as well.
Yeah. Yeah. And to a lot of industries and sub industries, you know, obviously to switching manufacturers and all the adjacent manufacturers in the Ethernet world. But also, like you said, you mentioned visibility and observability several times. I mean, this is a new challenge and also critical to managing the operations of a data center today, these kinds of data centers.
So, as we know, the visibility and observability space is all about data. It's all about data pipelines and analytics and ingesting and figuring that stuff out without adversely affecting the data center itself, without any performance penalty. And figuring that out is critical in making sure we reduce things like job completion time, and we address those link flaps, and we understand where the elephant flows are and all that kind of stuff.
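As a toy sketch of the elephant flow piece, here's the shape of that analysis on a batch of made-up flow records. A production pipeline would work from sampled flow telemetry such as sFlow or IPFIX, over time windows, not a one-shot byte fraction.

```python
# Made-up flow records: (src, dst, bytes). An "elephant" here is any
# flow carrying more than half the bytes in the batch.
flows = [
    ("10.0.0.1", "10.0.1.1", 120),
    ("10.0.0.2", "10.0.1.2", 95_000),  # e.g. an RDMA bulk transfer
    ("10.0.0.3", "10.0.1.3", 340),
]

total_bytes = sum(b for _, _, b in flows)
elephants = [(s, d, b) for s, d, b in flows if b / total_bytes > 0.5]
print(elephants)  # only the 95,000-byte flow qualifies
```

The hard part in practice is the ingestion at scale without a performance penalty, which is exactly the pipeline problem described above.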
And we can, you know, continue to empower the latest and greatest model training to happen. So, Chris, this has been a really interesting episode. I love this stuff, and I agree with you. It is really cool to be able to talk about cool new things happening in networking, specifically with moving packets, old school networking, but, like, you know, in the new age.
Really fun. So thanks so much for joining me today.
Yeah. Thanks, Phil. Appreciate it. And, look forward to the next chat.
Absolutely.
So if you have an idea for an episode or you'd like to be a guest on Telemetry Now, I would love to hear from you. You can reach out to us at telemetrynow@kentik.com. So for now, thanks so much for listening today. Bye bye.