Telemetry Now  |  Season 1 - Episode 25  |  October 10, 2023

How data center networking is changing in the age of AI


The advanced computing hardware that handles artificial intelligence workloads has special requirements more traditional data center networking can't meet. In this episode, Justin Ryburn, Field CTO at Kentik and a veteran network engineer in both the enterprise and data center spaces, joins us to discuss changes in data center networking to accommodate the latest AI workloads.

Transcript

Buzzword alert, artificial intelligence.

It's probably the buzzword of the year, and that's both in tech and in popular media, especially if you're thinking about things like large language models, or LLMs, and platforms like ChatGPT.

The thing is that AI, it's really not just a buzzword anymore. We really are getting into a new world of very advanced computing and data analysis, with more powerful computers running more complex workloads and huge datasets.

The thing is that these powerful computers, they're not just standalone mainframe boxes, these giant monolith computers that sit in data centers or in university labs.

What we're seeing today is that these are actually a collection of distributed computers working together to distribute an AI workload across many GPUs, and that's usually connected over a pretty traditional network in some ways.

So if you think about it, the network that connects all of these GPUs is absolutely critical to how we perform AI tasks, how those AI jobs get done today. And what we're finding is that traditional data center networking isn't cutting it anymore.

So what we're gonna talk about today is how networking has adapted or is adapting to accommodate these new types of distributed AI workloads.

Now, I'm being joined by Justin Ryburn, the Field CTO at Kentik, and he's also a veteran network engineer in both the service provider and enterprise spaces. And I'm Philip Gervasi.

This is Telemetry Now.

Hey, Justin. It is good to have you on again. How are you?

I'm doing well, Phil. How are you?

I'm also doing well, very well, actually. I'm recording on the road today. I'm at my mom's house, actually, because my brother is getting married tomorrow. So we're all here packed in the house, a lot of commotion, a lot of fun. And looking at your calendar, I see that you have a lot of PTO lined up the next few days. Actually, more than a few days. Are you doing anything exciting?

I am. Well, first of all, congratulations to your brother.

I'm going to be taking a vacation in the Northeast. My wife and I are taking my parents on a little vacation for their anniversary. We're flying into Boston and taking a cruise ship that goes up the Northeast coast and makes some stops in Maine and in Canada. So this time of year, as we're recording this in late September, it should be absolutely beautiful in that part of the country. So we're really excited and looking forward to that.

Yeah. Upstate New York and then into New England really is special this time of year. The colors of the leaves are changing, the weather tends to be drier, and it's in the, you know, the sixties and sunny. So it's very comfortable.

It is interesting to me how it seems like somebody flips a switch on September fifteenth to take us from summer to fall, that quickly and dramatically. It's neat how that happens every year. But we've got pumpkin spice everything. We've got apple scented candles everywhere.

It really is nice. Like a Norman Rockwell painting that's all kind of unfolding before your eyes every morning. So we really like this time of the year. And then, of course, when we hit the end of November, it goes into the brutal winter of upstate New York and New England for a few months, till about April.

Actually, one thing that you should consider checking out, if you're into that kind of thing, is that since you're heading into that part of the country in late September, early October, a lot of the microbreweries around upstate New York and New England are having their Oktoberfest celebrations this time of year.

Yep. We have some of those on our list, actually.

So Well, I hope you guys really enjoy it.

You and your whole family, it's gonna be great. So the last time that you and I spoke, though, was maybe about a year ago, as far as on this podcast, I mean, you and I speak all the time, but the last time you were on Telemetry Now was when you shared the opinion that network engineers out there in the world are underutilizing flow data, NetFlow, sFlow, J-Flow, all that stuff, and how there's just so much more we can do with it. But today, I don't wanna talk about flow necessarily.

We can if we need to, but what I wanna get your opinion on is what's going on with data center networking, how it's changing or how it needs to change to accommodate new AI workloads. And I know that's not in all data centers, but it's in those data centers that are kind of purpose-built to run these types of workloads. But there are new networking requirements. There are very specific requirements for how traffic moves and how these artificial intelligence workloads function. So, Justin, let's start with that. What is different about networking for AI workloads?

What what is special about it?

You know, I think that the workloads themselves are very different from what we're used to when it comes to how we built, architected, and scaled data centers for traditional web applications. I mean, most of the applications that are served up by, you know, your normal spine-leaf or three-tier switching architectures in a data center were really designed for web applications of one form or fashion. And AI workloads are much different. If you read a little bit about how they're solving these large data problems in these data centers for AI, it's really a huge distributed computer. Like, the entire data center becomes like one big computer.

You can no longer put enough GPUs and CPUs into a single piece of sheet metal in a rack to be able to, you know, for Moore's law to keep up and be able to process enough of these datasets in one machine. So what you wind up doing is you distribute those GPUs all across the data center, and then you interconnect them across the network, and, like I said earlier, the entire data center becomes one huge computer for crunching all these datasets.
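To make that concrete, here's a minimal Python sketch of the idea (my own illustration, not code from the episode): a large dataset is split into shards, each stand-in "GPU" processes its shard, and the partial results are combined, which in a real cluster happens over the network.

# Hypothetical illustration: the data center as one big distributed computer.
def split(dataset, num_gpus):
    """Divide the dataset into roughly equal shards, one per GPU."""
    return [dataset[i::num_gpus] for i in range(num_gpus)]

def gpu_work(shard):
    """Stand-in for the heavy per-GPU computation on one shard."""
    return sum(shard)

dataset = list(range(1_000_000))
shards = split(dataset, num_gpus=8)
partials = [gpu_work(s) for s in shards]   # in a real cluster these run in parallel on separate GPUs
print(sum(partials) == sum(dataset))       # True: the combined result matches the single-machine answer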

Yeah. Yeah.

And like Sun Microsystems said back in the early-to-mid eighties, the network is the computer. And I think you'd probably agree, based on what you just said, that that's never been more true than it is today. I mean, if you think about it, yeah, we're not standing up one giant monolith mainframe, right, that's doing all our computational analysis and database pulls and all that stuff. We can't physically do that. The nature of where we are with wafers and chip technology is down to, like, five nanometers.

The physics won't allow us to get much smaller and add more resources on those wafers without extraordinary increases in cost that don't result in that much of an increase in the ability to do these calculations and process these workloads. So the answer, like you said, is to distribute them among many, many, many nodes. And some data centers that are doing this kind of activity are up to thousands or tens of thousands, like thirty or thirty-two thousand GPUs in a particular AI interconnect. And that's a term I wanna throw out there.

We're talking about networking for artificial intelligence workloads. So we have all these GPUs connecting, and we're gonna call that an AI interconnect. It's the networking that connects all these GPUs that are doing their work together, not necessarily the network that's connecting that entire group of GPUs to the rest of your traditional network, especially your web servers that you mentioned earlier. So the AI interconnect is the thing that we're talking about today.

Now is this just IP networking, or are we talking about some kind of fancy, you know, proprietary, vendor-specific thing?

It's, you know, there are some competing standards. I mean, it's IP, but underneath that, at layer two in the OSI model.

You know, there are still some competing standards between InfiniBand and Ethernet. Some people land on one side of that, some people land on the other side of that. I think, you know, if I were gonna make a bet on this, I'm gonna bet on Ethernet just because it's very ubiquitous.

You know, we've used it in a lot of other applications. I mean, even your modern cars now are using Ethernet for connectivity between the chip that drives the vehicle and all the various components in it. It's just become so easy to cable out, the protocol is so well understood, and so many of your staff already understand it. You know, if I were a betting man, I'm gonna bet on the side of Ethernet on that. But, you know, there are arguments, I think, to be made for InfiniBand because, again, we're talking about a high-bandwidth, low-latency type of network for these GPUs.

And InfiniBand has worked very well in storage area networks, in a very similar type of environment where you have similar engineering requirements for what you're trying to accomplish.

Yeah. It sounds like the problem then with InfiniBand and similar technologies, right, is that, one, they're proprietary. Right? So they're not, like you said, ubiquitous across the entire industry.

And so Ethernet is plug and play. It's everywhere. It's very straightforward and simple. Everybody's familiar with it.

But also, think about where people are putting their research time and money. Moving forward, since the industry has rallied around Ethernet, it makes sense for that reason too; in five years, we're probably not putting a tremendous amount of effort into InfiniBand. And also, I think for scaling purposes, you know, we have the technology in Ethernet, especially as we get into very high bandwidth. And, yeah, I get it, there are costs with optics and things, but we can run entire data centers off of Ethernet, and then run that interconnect between that AI workload and our traditional data center with Ethernet, and so on and so forth.

So I would agree with you there.

But we're still just talking about the fact that these GPUs are talking over a network. Okay. Who cares? Like, what's the actual problem here? Why can't we run it over a traditional data center network? I mean, can't the GPUs talk to each other that way?

Well, it's interesting because the coordination between these GPUs is super high. Right?

In a traditional data center, if you had multiple web applications, they're kind of operating independently. Like, your traffic patterns are very north-south in nature. What I mean by that is you may have a user out on the internet that comes and accesses your web application, and then your web application responds to them with the page. And so you have a very north-south-heavy traffic flow.

Sure, you have some east-west where you may have to pull information from a database or something along those lines, but the majority of your traffic is north-south. In the data center environments, the AI interconnect environments, that we're talking about, there's heavy coordination between the various GPUs because they rely on one another to process this data. So as just an example, a tangible example here, you may have one GPU that's processing one dataset.

The results of that are an input into the next run of, you know, a language model that needs to be processed. So until one GPU completes and passes its data along to another, things are kind of hung up waiting for that GPU to complete. And the entire job has to be coordinated as one job, so you need a high amount of bandwidth between these devices, and you need low latency.

Ideally, no loss or very low loss. You need a non-blocking architecture. There are a lot of requirements on the interconnect network that you don't really have in a traditional data center fabric that's serving up web pages.

Yeah. Yeah. You're talking about the job completion time, which is ultimately how long it takes for these GPUs, working in a synchronous fashion, not autonomously and asynchronously, working together in a partial mesh or a full mesh, to complete this AI job. And the job completion time is going to be determined by the slowest link, which could be a particular GPU that is stalled because of some congestion on the network that that one single GPU among thirty thousand is experiencing.

And when you have one GPU that sits idle for a moment, it stalls all the GPUs, because everything is now waiting and sitting in that idle state for maybe a millisecond or two. But keep in mind that, you know, one and a half milliseconds, two milliseconds, which may seem like nothing, like a throwaway number, compounds over time, over many seconds, minutes, hours, months even. That's how long some of these large AI workloads take to complete, literally weeks and months. You're talking about huge delays in the completion of the job, the job completion time. So that means, yeah, we end up with this movement of individual flows that are very, very large, because we're transferring entire datasets from maybe one pod of GPUs to another pod of GPUs.
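As a rough, back-of-the-envelope illustration of that compounding effect (my own assumed numbers, not figures from the episode), here's a quick Python calculation of what a small stall on every synchronization step costs over a month-long job:

# Hypothetical numbers for illustration only.
stall_per_step_s = 0.0015      # assume a 1.5 ms stall while the other GPUs wait
steps_per_second = 10          # assumed synchronization steps per second
job_duration_days = 30         # assume a month-long training run

total_steps = steps_per_second * 60 * 60 * 24 * job_duration_days
wasted_seconds = total_steps * stall_per_step_s

print(f"Total sync steps: {total_steps:,}")
print(f"Time lost to stalls: {wasted_seconds / 3600:.1f} hours ({wasted_seconds / 86400:.2f} days)")

Even with these modest assumptions, the stalls add up to roughly half a day of idle GPU time across the whole cluster.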

And, yeah, I think you're right. That's in contrast to the way a lot of traditional data center networking works, where it's a lot of north-south. You have a lot of hits on web servers and other services, and, yeah, that web server or cluster of web servers might in turn be making calls to back-end databases. There's some east-west, but it's not nearly at the scale we're seeing with AI workloads.

So yeah, we're looking at, I guess, what you could call elephant flows. That's what we called them back in the day. Right? When you have these large individual flows, as opposed to many, many, many individual lightweight flows, which is typical for web traffic. Right? Most web traffic is on the lighter side.

And asynchronous, also.

So it sounds like one of the goals of this new type of networking in the data center for AI workloads is to find ways to reduce that job completion time, because of the danger of the network itself being the bottleneck.

And so how do we do that? You know, you talked about having higher bandwidth, and I know that some of these high-end nodes can connect into the network at two hundred gig. That's pretty serious bandwidth. And I know there are standards coming down the pike for double and quadruple that.

You mentioned something about being non-blocking. Maybe you can define that for us. And I think you talked about oversubscription, but we should definitely talk about that as well. So, yeah, how do we solve this problem, and in what specific ways is this different from my traditional data center?

Yeah. I mean, there are some really interesting articles out there on the concept of job completion time. And like you said, you know, a small millisecond delay may not seem like much, but once you multiply it over a long run time on a job, as well as over thirty-two thousand GPUs like you mentioned earlier, it becomes exponential, and it's really fascinating to read about how they calculate those job completion times. But, you know, to your point, I mean, I'll show my age here a little bit, but I can remember when, like, one gig on an interface was a lot of bandwidth.

And now we're talking about, you know, two hundred gig as kind of the standard bare minimum in these AI interconnects, and standards are being ratified for four hundred and eight hundred gig. I think we'll talk a little later about the Ultra Ethernet Consortium that's coming up and working on trying to drive these speeds and standards even faster, because they just need so much bandwidth on any given link. But the only way you can really service the type of elephant flows and the type of bandwidth demands that we're talking about here is to have multiple links. Right?

And so you have to take that traffic and spray it, which is the term they use, across all the links that are available to you. And, you know, I can remember early on in my career that was how we did traffic across multiple links, because we actually did spraying. The downside of that was that you then had to have some way to reorder those packets on the other end, because if you put one packet that's part of a larger flow on one link and another part on another link, and they arrive at the opposite end of that path in a different order due to whatever delays in the network, then you have to buffer and reorder those packets.

And so for a lot of your TCP-type connections and a lot of your web applications, people don't wanna do that, because it just causes more headache than it's worth to have to reorder and reassemble those packets on the far end.

With the AI workloads, that's actually the state of how they're doing things. They're actually doing this concept of packet spraying, and they schedule jobs. We'll talk a little bit more about that, but they can handle that reordering.
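Just to illustrate the receiver-side cost Justin is describing, here's a toy reorder buffer in Python (a simplified sketch of my own, not how any particular NIC actually implements it): packets sprayed across links can arrive out of order, so the receiver holds them until it can release an in-order run.

def reorder(packets):
    """packets: iterable of (sequence_number, payload) in arrival order."""
    expected = 0
    buffer = {}
    delivered = []
    for seq, payload in packets:
        buffer[seq] = payload
        while expected in buffer:            # release any in-order run we now have
            delivered.append(buffer.pop(expected))
            expected += 1
    return delivered

arrivals = [(1, "B"), (0, "A"), (3, "D"), (2, "C")]   # out-of-order arrival across sprayed links
print(reorder(arrivals))                              # ['A', 'B', 'C', 'D']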

We're kind of back to an earlier technology that we had in the early days, like some of the ATM stuff, where we're intentionally just spraying, doing per-packet load balancing across all of the links that are available to us. I think the one other thing you mentioned there, Phil, that I'll talk about is oversubscription. You know, we used to, or at least I used to when I was involved in data center designs, a lot of times we would just intentionally design oversubscription between, like, the spine and the leaf layer.

Right? So, you know, we may have a two-by-ten-gig link down from a top-of-rack switch to a server, and then we'd have maybe half as much capacity going up to the spine layer, or maybe three-to-one, four-to-one, five-to-one oversubscription, and that was considered to be okay in the design. It's a trade-off.

Like anything in engineering, it's a trade-off. It lowers your cost, but the presumption was that not all of your web applications would be filling up your links at any given time, so that oversubscription was a good engineering trade-off to get to a reasonable cost and still provide a good experience to the applications.
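For anyone who hasn't run these numbers before, here's a minimal sketch (my own illustration, with made-up port counts) of how a leaf-layer oversubscription ratio is typically computed: total downlink bandwidth toward the servers divided by total uplink bandwidth toward the spine. The AI interconnect goal Justin describes next is a 1:1, non-blocking ratio.

def oversubscription_ratio(downlink_gbps, downlinks, uplink_gbps, uplinks):
    """Return the leaf-layer oversubscription ratio (downlink capacity vs. uplink capacity)."""
    return (downlink_gbps * downlinks) / (uplink_gbps * uplinks)

# Classic web-era leaf: 48 x 10G toward servers, 4 x 40G toward the spine -> 3:1 oversubscribed
print(oversubscription_ratio(10, 48, 40, 4))     # 3.0

# AI interconnect goal: 32 x 200G down, 16 x 400G up -> 1:1, non-blocking
print(oversubscription_ratio(200, 32, 400, 16))  # 1.0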

When you start talking about these huge elephant flows that these AI workloads are creating in an AI interconnect, you can't really do that. You have to have one to one. You can't have that oversubscription in your design that we're used to.

Right. Right. And the reason we don't want that oversubscription, the reason we want that one to one, is because we wanna make use of all available paths and links.

Remember that we can't even tolerate, like, a one-second delay on a particular GPU's traffic send, its data transfer. Right? And so we want this packet spraying, right, where that's how we're gonna load balance. That means we can't drop a packet.

We can't drop one of those packets on any of the many links that we're choosing. And so we need everything available, and we need high bandwidth and ultra-low latency with no packet loss, no blocking, none of that stuff.

And that's different from the way we used to do it with, like, a LAG, right, or ECMP, which is flow-based. You have some kind of a hashing algorithm and you're pinning an entire flow to a link or to a path. And then from the very first packet of that flow to the very last packet of that flow, you're taking that path. It's deterministic. Whereas packet spraying, or multipathing, rather, I've also seen it called that,

makes the most use of all your physical links in your fabric. So it gets us away from flow-based path decision making.
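Here's a small Python sketch contrasting the two behaviors Phil just described (my own toy example, using MD5 purely as a stand-in for a switch's hardware hash): flow-based ECMP pins every packet of a flow to one link, while per-packet spraying cycles the same flow's packets across every available link.

import hashlib
import itertools

LINKS = ["link0", "link1", "link2", "link3"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto):
    """Flow-based ECMP: hash the 5-tuple so every packet of the flow takes the same path."""
    key = f"{src_ip}{dst_ip}{src_port}{dst_port}{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return LINKS[digest % len(LINKS)]

# Per-packet spraying: simple round-robin over every available link.
spray = itertools.cycle(LINKS)

flow = ("10.0.0.1", "10.0.0.2", 40000, 4791, "UDP")
print([ecmp_pick(*flow) for _ in range(4)])   # the same link four times: the flow is pinned
print([next(spray) for _ in range(4)])        # all four links used, one packet each

The catch, as Justin noted, is that spraying hands the reordering problem to the receiving end.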

And like you said, it's per packet. But, you know, to do per-packet load balancing when it's not just a hash, right, that has to mean that there is some serious intelligence going on making those decisions in a runtime environment.

So how does the control plane work for a packet-spraying environment like we've been talking about?

Yeah. So that's another area that's very interesting as far as innovation in the industry.

You know, there are a lot of competing standards. It'll be interesting to see where we land, whether one standard wins out over another, but a lot of your hyperscalers that are building these AI interconnects, they each sort of have their own approach to how they're doing it. But generically, they have some sort of scheduler, right, that's figuring out the path through the network that a particular AI workload is going to take, figuring out what bandwidth is available, and then spraying the traffic across those links to maximize the utilization of every single link that is available to it. Now,

that's typically a centralized controller that's making those pathing decisions, but then pushing that intelligence, that knowledge, down to the individual switches, because at the end of the day the individual switch has to know, what do I do with a packet that arrives on a particular inbound interface? Like, what interface do I shove it out of on the outbound side? Right? So the forwarding plane, the forwarding decisions, still have to be made by the individual switches in your AI interconnect fabric, but the scheduling of that and figuring out how to program those ASICs is typically done by a centralized controller, which is, you know, I don't know.

Maybe the term's SDN. That sounds like a term we've heard before here, Phil. Yeah. It's an industry term.

Right?

Sure. Absolutely. And these days we're always looking at how SDN, which was kind of an architecture from five, seven, eight, ten years ago, is now starting to manifest itself in these various ways: SD-WAN, programmatic infrastructure, and now an entire, you know, fabric that is purpose-built for AI workloads.

But I do understand what you're saying. We're gonna have a controller of some sort with policy, whether that be dynamic thresholds or hard thresholds for the quality of each link, but that decision-making process, again in a real-time, runtime environment where we're looking at individual packets, that has to happen on the NIC or on the switch.

Because, number one, we can't be waiting for the control plane to send its decision, you know, that bidirectional conversation, that's out. And we also don't wanna dump control plane traffic onto the network, thereby possibly causing additional issues with congestion and latency and things like that. Now, you could probably have an out-of-band network to do that for you. But, yeah, with chip manufacturers, I'm thinking like Broadcom, NVIDIA, things like that, we're seeing that this decision-making process is happening locally on the box.

In order to eliminate any of those problems, and to move traffic as fast as possible and make those decisions as efficiently as possible. And the decisions are both which link is best to use right now based on whatever metrics, so it's not just a path, we're now gauging the quality of a link, and that's really interesting.

But also, we're looking at what we're gonna do with the next packet. So there is sort of a predictive component here, where this is how this link has been behaving in the past two seconds, I don't know exactly how these things work, but in these short amounts of time, and then we're gonna queue up the next packet or the next series of packets, depending on what you're doing, to use this same link or this other link. So there is both a current decision on where we're gonna forward this packet in the immediate, and then this queuing activity that's going on as well.

And I believe with what we're doing now, we're starting to see things like RDMA, for example, which is a method to offload this traffic directly to the NIC. Right? And so you're doing direct memory-to-memory communication, as opposed to, I guess, traditional TCP/IP, right, which uses the kernel to compute that and then process and send the data and all that kind of stuff. So, yeah.

Yeah. So what are the things that we're looking for then? I mean, obviously, we're talking about, you know, the quality of the path and stuff like that. We're talking about how much bandwidth we have, and you mentioned a scheduler, but we're connecting all this stuff with, like, good old-fashioned cables.

Aren't we? I mean, we're talking about a physical data center. So how is that different? Or are we just using traditional copper and fiber?

Well, I mean, it can be done that way, but that brings another interesting engineering challenge, which is power density. Right? We're talking about thirty-two thousand GPUs in a given data center, combined with all of the AI interconnect fabric equipment we've been talking about. That's a lot of switches.

It's a lot of ports on those switches, and that draws a lot of power. Right? So then you come up with a power and heat density type of problem. That's another engineering challenge you have to deal with.

And so what we're seeing is a lot of companies trying to figure out ways that they can reduce points of failure, reduce power draw, and get this high bandwidth. And so there are a couple of interesting trends or innovations that we're seeing in this area. One is going back to using DAC cables. I mean, they've been around for a while, and at least in the last data center I designed, they were really popular between the NIC and the top-of-rack switch. But typically from the top-of-rack or leaf switch up to the spine, you went ahead and did fiber optic, because you likely already had it run between your various racks, and especially if you're having to go from one cage to another in the data center, from one room to another, most of your cabling between those was already fiber optic.

So you were using fiber optic, but most of these AI interconnects are completely DAC, because now you can reduce some power draw by not having all of those lasers, and you remove a point of failure. One of the number one failure scenarios in an optic is the light itself, the laser itself. And so by doing DAC cables, you don't have active electronics there, and so you have less power draw.

You have one less failure domain, or one less place where a failure can occur.

Another interesting thing is this concept of linear pluggable optics.

And, you know, I won't pretend to be the the expert on all things optics, but I encourage the people listening to to go check this out.

They're removing the DSP module from the pluggable optic. Right? It makes it maybe a little bit less flexible in what you can use it for and what kinds of things you can plug it into, but it essentially allows your optic, your fiber, to communicate directly with the SerDes and removes some of the intelligence of the optic and some of the things that it does. And the trade-off there, again, engineering's all about trade-offs.

Right? Is that it's lower cost and lower power draw. So there's less power that we have to provide to a switch full of these optics if you use these LPOs, linear pluggable optics, if I can get the term correct. So that's a really fascinating innovation to me as well.

Yeah, and a reduction in the amount of heat that it throws off as well. So that's absolutely right. So, yeah, I mean, I remember using DAC cables. And saying DAC cable is like saying ATM machine, right, automatic teller machine machine. It's a direct attach cable cable.

But I do remember using DAC cables all the time in data centers. Yep. A lot of the time it was when I was setting up storage networks and things like that. That was the most common, and then running single-mode or multimode to, you know, the rest of the network from there.

Whatever was necessary. But we're seeing that again because, like you said, low latency, low power draw, fewer failures. And I do see a lot of activity happening on the Ethernet side, whether it be the development of optics, and you mentioned LPOs, and just a lot of time and research and effort being put into improving Ethernet.

So that it does become the de facto standard for connecting everything. In fact, there's something called the, excuse me, Ultra Ethernet Consortium, which, if folks aren't familiar with it, go look it up, but it is a collection of mostly network vendors. There's, I think, one hyperscaler that's involved, Meta.

Like, maybe today it's two, I don't know, but last time I checked it's mostly network vendors trying to solve these problems with AI interconnection, and primarily with Ethernet. And so they put out literature. They do research.

They have conferences. They do all that kind of stuff. And from what I understand, they're gonna have their first UEC standards coming out next year, in twenty twenty-four. So we're gonna see some movement there.

It's really interesting that it's not just that we're all relying more on artificial intelligence and this more advanced data analysis to kind of empower the way we're, you know, living our lives these days, but the entire industry has to change to accommodate that as well. It's really interesting.

Now, as you were talking about scheduled fabrics, right, and you were talking about offloading intelligence and doing all these things, if we are gauging the quality of our connections, and we're doing it at that sub-second level, and it is also vital, like mission critical, then all of these things coming together suggest to me that visibility, network telemetry, whether it be traditional or maybe some new form that I'm not aware of, is incredibly important. I mean, I have to assume you agree, Justin.

Yeah. It's amazing we've gotten this far into the podcast and are only just bringing up telemetry. Right? But, yeah, I mean, it's absolutely critical.

If you're gonna be able to schedule your traffic on particular links, you need to know the quality of those links. Right? You need it in real time.

And that's probably a lot more frequent than, you know, a five-minute SNMP poll, for sure, to be able to figure out, okay, how loaded is this link?

Do I have loss on that link? What's the latency across that link? You've gotta be able to answer questions like that to figure out which links to schedule particular traffic on. I mean, it's a full loop, right? In order to do that type of automation, that type of full scheduling, you have to have that telemetry data. You have to have that information in order to make those decisions.

You can't make those decisions without that knowledge. Right. Yep. Yeah. We're seeing some interesting innovations from, you know, Broadcom and some of the other chipset makers.

Right on the chip, they're going to have the ability to export some amount of telemetry data. You know, that could be things like, what's the queue depth of traffic? What's the utilization on the link?

What's my packet loss? What's my latency? So, yeah, I think it'll be really fascinating to see what kind of telemetry data we're able to get out of these new chipsets that are coming out. Yeah.
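To sketch the feedback loop Justin describes, here's a small, hypothetical Python example (the metric names, weights, and numbers are my own illustration, not any vendor's API): per-link telemetry such as utilization, queue depth, loss, and latency feeds a simple scoring function, and the scheduler sends the next packets down the healthiest link.

# Hypothetical per-link telemetry snapshot; field names are illustrative only.
telemetry = {
    "link0": {"util_pct": 92.0, "queue_pkts": 300, "loss_pct": 0.00, "latency_us": 8.0},
    "link1": {"util_pct": 41.0, "queue_pkts": 50,  "loss_pct": 0.00, "latency_us": 7.5},
    "link2": {"util_pct": 67.0, "queue_pkts": 120, "loss_pct": 0.02, "latency_us": 9.1},
}

def link_score(stats):
    """Lower is better: penalize load and queueing, heavily penalize loss, add latency."""
    return (stats["util_pct"]
            + stats["queue_pkts"] * 0.1
            + stats["loss_pct"] * 1000
            + stats["latency_us"])

best = min(telemetry, key=lambda name: link_score(telemetry[name]))
print(f"Schedule the next packets onto {best}")   # link1 in this example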

And a lot of the telemetry that we're gonna need, and that we do need today, is really the same information that we've been gathering for a long time. But now we're talking about connecting to the network, right, at the NIC, at two hundred gig, or even four hundred or eight hundred, which we're gonna see in time. So if that's the line rate that we're operating at, and we need to know what's going on at a packet-per-packet, sub-millisecond level, then, yeah, there are gonna be some interesting advancements in how we do telemetry moving forward. So, you know, whether it's just advancements and changes in flow sampling, in how we monitor the state table of a particular switch, and things like that, I'm not exactly sure.

But that's a really fascinating one we hadn't talked about.

Yeah. The state tables.

Yeah. Absolutely. And, like, where are my individual flows? How do we know, you know, as far as packet reordering on the other side, and how do we put all that together?

In such a fast way, without error, that we don't affect job completion time. A lot of those things, I'm not exactly sure how we're gonna do that yet, but I am interested to see all that. I mean, we are looking at hyperscalers leading a lot of this charge as well. We've been talking about the chip makers themselves, but there are hyperscalers out there that are building their own SmartNICs so they can design their own protocols, whether that be for scheduled fabrics or for avoiding congestion, you know, congestion control protocols in general.

So everybody has a very vested interest in this. And sure, there's the for-profit component, and then there's, of course, the R&D component from the large universities. But really, you know, all of this stuff is so we can reduce the job completion time and the network isn't a bottleneck to the completion of this AI workload, because when it comes down to it, some of these data centers, and these are purpose-built data centers to do artificial intelligence tasks, right?

Yeah.

For sure. They can be hundreds of millions of dollars, if not over a billion. So if I can find a way to reduce power consumption, reduce cooling needs, reduce the number of optics, and make use of every single path I have available so that nothing is idle, I can use it to its maximum, right, and therefore be efficient. Just like we do with compute as well. Right? You know, we run our GPUs at ninety-nine percent.

And so that's really the key here. And that's why I think a lot of these hyperscalers, you know, as well as the vendors and academia, are so interested in this, because there is money to be made here.

And then also, of course, there's money to be made when you save money. Right? You make your operations more efficient. Really interesting stuff. Yeah.

And I mean, you know, that's one of the reasons I love this industry, and I think you'd probably echo this, Phil: there's never an end to interesting challenges to solve. Right? It seems like every day there's something new, different, and unique. Sometimes you get to reapply solutions that you've had, like we were talking about earlier. I mean, packet spraying or multipathing has been around for decades, but being able to apply that solution to a new problem statement, to a new problem domain, is really fascinating to me.

So, yeah, it's interesting that a lot of the technology that we've talked about isn't new.

It's actually, like, old technology that's being, you know, maybe applied in a new way or being updated. Applied anyway, yeah, absolutely, to solve a problem.

And that's how I've always defined engineering too. It's like, okay, here's the problem. How can we solve the problem?

Not what new box can I buy necessarily?

Although sometimes it means, you know, I need a new box so I can have greater port density or more bandwidth. But it's really, what tools do we have available, fancy or not? I remember hearing somebody say that BGP is old and dumb, and I'm just like, what? That makes no sense. You know, the idea of a technology that works well being not good simply because of its age makes no sense to me. So it is interesting to see that we are resurrecting things like InfiniBand and DAC cables.

And that kind of stuff moving forward. So in any case, Justin, really interesting conversation. Great to have you on again. Looking forward to doing something like this again soon. Yeah.

Thanks for having me. Absolutely.

So if folks wanna reach out to you, if they have a question about artificial intelligence, about how we do networking for artificial intelligence workloads, how can they find you online?

Yeah. Sure. So probably LinkedIn is where I'm spending the most amount of time these days. I'm Justin Ryburn. The last name is spelled R-y-b-u-r-n on LinkedIn.

Same handle on Twitter, I guess we call it X these days. I don't spend nearly as much time there. Or you can always feel free to drop me an email, jryburn at kentik dot com.

And you can find me online at network underscore phil on Twitter. I am also Phil Gervasi on LinkedIn. My blog is networkphil dot com, which I have really neglected in the past year. I'm gonna try to get busy with that again.

Now, Justin is Kentik's Field CTO. So if he doesn't get back to you right away, keep emailing him, keep sending him notes on X and on LinkedIn, because I know he loves that. Persistence is key. Yeah.

And, Justin, I know that you love the engagement and being a part of the networking community. I know that that's important to you.

For sure.

Yeah. We're definitely all about the networking community at Kentik. We love the community and being a part of it, and that's what Telemetry Now is all about. And on that note, if you have an idea for an episode, or if you'd like to be a guest on Telemetry Now, I'd love to hear from you.

Our email address is telemetry now at kentik dot com. Just shoot us a note, and we can start from there. So until next time, thanks so much for listening. Bye-bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.