Telemetry Now  |  Season 2 - Episode 16  |  October 11, 2024

Optimizing Cloud Costs to Balance Performance, Complexity, and Budget

Unexpectedly high cloud bills are a common challenge in the tech industry. Host Phil Gervasi sits down with cloud experts Carl Fugate and Tejay Cardon to discuss the factors driving cloud costs—especially on the network side. They explore how inefficiencies, misconfigurations, and architectural decisions can lead to unnecessary expenses. They cover strategies for optimizing your cloud network environment, understanding traffic flows, and balancing performance with budget constraints. Tune in to learn actionable steps for managing and reducing your cloud expenses without compromising network performance.

Transcript

Many of us have been surprised by an unexpectedly high cloud bill, to the extent that it's sort of become a joke in the industry at this point. But, of course, running a network, running a cloud environment, is indeed serious business. And we have to be mindful of cloud costs where there are inefficiencies, possibly misconfigurations, or maybe poor architectural decisions that led to these unnecessary costs.

Well, with me today are Carl Fugate and Tejay Cardon, both experienced, seasoned cloud engineers with plenty of experience and knowledge and insight into cloud cost optimization, a topic that's become just as important as learning cloud technology itself.

And today, we'll be talking about what drives cloud costs, specifically on the network side, and what steps we can take to optimize our cloud network environment. My name is Philip Gervasi, and this is Telemetry Now.

Carl and Tejay, thanks so much for joining today. It's great to meet you for the first time, Tejay. And, Carl, it's great to see you again. We got to see each other recently at the Kansas City NUG just last month, so that was really cool. Before we dive into today's episode, we need some introductions here, because, Tejay, I'm just getting to know you for the first time. So why don't you give us a little background about yourself, personal if you like, but professional too: how you came into this cloud networking business and what you're doing these days.

Yeah. Thanks, Phil. So, my story is actually kinda interesting. I started in the world of networking when I was in high school. I wanted to play Doom with my friends and my family, and that meant setting up a network.

My dad was a network engineer and system admin, but just didn't have the time. So he brought home cables and terminators and the whole nine yards, and I literally was pulling cable through vents just so I could play Doom and throw LAN parties.

So that's where I started. I ended up getting into software engineering out of college, and then running some OpenStack clusters that required massive automation to get some big data stuff going while I was at Lockheed Martin. That's what really got me into the DevOps world, which led to the cloud world, and I've spent the last decade or so very much as an AWS cloud guy.

Networking has definitely fallen into my background, so this won't be my forte. But nowadays, I actually run a consultancy that's really focused on cloud cost engineering.

Mhmm.

And so from a cost perspective and kinda where do we look for savings and how do we engineer for savings, that's very much right there in my sweet spot.

Yeah. Yeah. It's amazing how so many different personas in technology have sort of converged in the cloud world. You know? There are folks that have never had to think about networking that are sort of having to think about networking when they're managing their workload in cloud. And then folks on the network side that never had to think about other things, you know, are all of a sudden cloud engineers. It's interesting stuff.

And, Carl, it's good to see you again. How about a little bit of your background for our audience?

Yeah. Thanks, Phil. You know, kinda like Tejay, network gaming was definitely what caught my interest in networking way back in college.

But that led me to work for a telecommunications service provider, doing mobile wireless for almost a decade, before I moved more into network consulting in various spaces.

But, most recently, I've made the pivot into cloud and cloud architecture for a health care software company. So, working to deliver SaaS solutions to clients. Definitely, having the networking background has been incredibly valuable.

As one of my former mentors said, networking is the cause of and the solution to all problems.

Oh, nice.

And that's true no matter where you're hosting your workloads.

So do you work predominantly with AWS or multiple different cloud providers?

Yeah. So definitely predominantly AWS, but I'm definitely working a lot more in Azure these days as well, and a little bit of GCP. And, obviously, still some private clouds. So, kind of, all the clouds.

All the clouds. So, Tejay mentioned Doom. What was the video game that you were interested in that kinda got you into the space, into technology?

Yeah. Yeah. Don't say Leisure Suit Larry.

No. No. No. No.

So we were doing some Mac gaming back in the day.

So, this is way back, but Marathon was kind of the first-person shooter on the Mac platform.

That, or a game called Bolo, which was like a 2D tank strategy game. But, yeah, it was more Mac gaming for me back then. That's just what our school had.

That's what the school had. Okay. Fair enough. Yeah. Yeah. For me, I remember very vividly playing Wolfenstein 3D and being just blown away at that whole concept of exploring an entire level.

And the whole first-person shooter thing too was neat, but just that open concept of exploring a level was so cool, instead of, like, a side scroller. Really neat. I really enjoyed that. So let's get into today's topic.

I did just receive an AWS bill personally for, I think, twelve dollars and fourteen cents. So I'm not really hard up for cash in the sense that twelve bucks was, like, a hit to my portfolio. I'm still on track to retire, sort of, kind of. It depends if you ask my wife or not.

But that's not the case for a lot of organizations today that are, you know, sort of restrategizing how they approach cloud, how they approach where they put workloads and how those workloads communicate with each other, perhaps among different clouds, perhaps between their cloud resources and back on premises, all that kind of stuff.

So I'd like to start by asking this question, just kinda putting it out there.

What are the main drivers of cloud costs? I assume that we're gonna be sticking a lot with AWS today, and that's completely fine since they have the vast majority of the market share. What are the main drivers behind that bill that an organization is gonna get periodically?

Sure. I'll take a shot. So it's all over the map. It really just depends on what your workloads are.

But I think the key when it comes to cloud cost is you're charged by the drink. And so it's a very different mindset than a lot of finance people and a lot of traditional engineers are used to. It's one of the greatest strengths of the cloud. I can have access to a massive amount of compute for an afternoon and then turn it off and not be paying for it anymore.

But it also can lead to really, really big surprises. So you mentioned your twelve dollar bill, but, you know, one silly afternoon you spin something up that you didn't realize sits there for three or four days and, you know, suddenly you're getting a multi-thousand-dollar bill. And the corporate equivalent of that is, you know, you've gotta really understand the architecture of what you're bringing to the cloud.

Because if you don't, you can get really, really surprised by where those costs do show up. Mhmm.

Yeah. And, you know, looking at how networking inside of cloud is consumed, right? Pay by the drink is really pay by the byte.

And if you ask any networking team that's operating in a private cloud today, like, how much data are you moving around? They really can't tell you. They could give you speeds and feeds, but they have no idea how much an application in particular is actually transferring and which ones are communicating with what.

And when you get into public cloud, that becomes a really important thing that you need to understand, because how you actually lay out and design things is going to be directly influenced by that. And, if it's not done properly, you can definitely end up in a situation where you're paying significantly more to enable communications between things than you would expect.

Okay. So there is a strong case for traffic engineering for cloud engineers who may not have been network engineers and all of a sudden kinda sorta are, at least in a cloud context.

And I know that a lot of the constructs, the networking constructs in AWS and other clouds, I mean, fundamentally, they are very similar, if not the same, as traditional networking, but perhaps with a different name or perhaps with another level of abstraction. So I can't tweak it quite as much, or I have to do it in a different way. But, ultimately, there is still that element of traffic engineering in order to manipulate where traffic is going, and, I'm gonna assume, also understand where it's going, both in volume and in its format and type, because that is gonna be a main driver of that AWS bill that I get.

How much traffic? What kind of traffic? Where is it going? Obviously, there are caveats there. I know there are, like, formulas.

I've been on the AWS website looking at that, and there are, you know, formulas there, and trying to figure that out can be a little bit difficult, at least for me.

But basing it on, you know, the difference between inter-cloud and intra-cloud traffic, traffic leaving the cloud and going down to my branch office or my on-prem data center, and that egress traffic: those are all components driving that overall bill. But, I mean, this kind of assumes that we know how the traffic is flowing in our network, where those hot spots are that are incurring significant cost and therefore are possibly inefficient, or due to a misconfiguration or a bad architectural design. Whatever the reason, we're gonna have to understand, both in the planning and in the testing and then in the ongoing day-two support, what the traffic looks like. So how do we actually do this in practice? What are we doing to determine where the problems are and how to fix them?

Yeah. I think, you know, just to kinda start out, you mentioned abstractions, and I think a really important thing for people to do is not get lost in the abstraction.

So, fundamentally, the saying that the cloud is just somebody else's computer, or in this case just somebody else's network, is really important.

When we're looking at consuming networking resources inside of the cloud, we really just need to think back to, hey, how would I have designed this within my own data center if I were architecting for connecting these solutions together? And understanding, hey, where am I going to be at a point where I'm consuming really expensive resources?

Right? So if I can keep things local to a particular switch or within a particular spine, that traffic is essentially free. Right? It's a sunk cost, and there's really no additional cost for me to go from one thing to another.

But if all of a sudden I wanna go from one data center to another data center, now I'm going to have to go over much more expensive WAN links. And those same concepts apply within the public cloud providers, you know, and those could be called availability zones or something else.

And so you start to look at, hey, where are those finite resource constraints? Because that's where those charges are going to start to pop up: where those additional resources are that the cloud provider needs to have to be able to enable that service connectivity.

So if I have to buy really expensive WAN links because I'm going between two different locations, you can be assured that those are the pieces where they're going to charge for that service.

Mhmm.

Yeah. And you can expect the costs to scale in a similar fashion. So, you know, this has been a topic that's been a little bit hot in the news, not so much really recently, but about twelve months ago, because there are definitely different approaches from the different vendors as well. So, traditionally, bringing data into the cloud was free. We want your data.

But getting data back out, they're gonna charge you for. And so it makes things like hybrid cloud, potentially, if you're a data-heavy company, much more expensive, because every time that data is transiting, you're paying somebody. And it's even reached the point where it's becoming almost a marketing tool for some of the smaller players.

So Oracle, for example, has much lower networking fees than the bigger players do. And they're using that as a way to try to pull customers, and especially for them, data-heavy customers, away from some of the bigger players. They're saying, hey, look.

We're best at data anyway. Our egress charges are better. We're the best player in town, and we play nice. And then, you know, they try to paint everybody else as the bad guy who's just trying to lock you in with these fees.

So you have some of that. And then, you know, kind of the different types of fees as Carl said. So speaking specifically AWS terminology, but they're all very similar.

They have what they call availability zones, which are, generally speaking, a single data center, but depending on the region, maybe two or three different buildings that are physically within thousands of feet of one another. So you're running, you know, fiber optic lines between those: very, very fast, very low latency.

And as long as I'm within that availability zone, traffic is free.

Within what they call a region, which could be a single physical location with a couple different buildings, or it could be like US East that's probably at least a dozen separate locations within, you know, dozens of miles of each other. So still very close, but now you're going over significantly more expensive connections. And so they charge you there, and it's not super transparent.

They tell you it's a penny per gig, but it's actually a penny per gig on the way out and a penny per gig on the way in. So it's really two pennies per gig. Right? And then if I wanna leave the region, that's a different charge. And if I wanna leave AWS as a whole, that's a different charge.
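
To make that arithmetic concrete, here is a minimal sketch in Python. The one-cent rate is the figure quoted above; the function name and the 5 TB monthly volume are illustrative assumptions, and actual rates vary by region and change over time.

```python
from decimal import Decimal

# Rate quoted in the conversation: one cent per GB, billed in each direction.
# Treat this as a placeholder; check the current pricing page for real rates.
RATE_PER_GB_EACH_DIRECTION = Decimal("0.01")


def cross_az_cost(gb_transferred: Decimal) -> Decimal:
    """Estimate the cost of moving data between two zones in one region.

    The traffic is billed once on the way out of the source zone and once on
    the way into the destination zone, so the effective rate is doubled.
    """
    outbound = gb_transferred * RATE_PER_GB_EACH_DIRECTION
    inbound = gb_transferred * RATE_PER_GB_EACH_DIRECTION
    return outbound + inbound


if __name__ == "__main__":
    monthly_gb = Decimal("5000")  # hypothetical: 5 TB per month between zones
    cost = cross_az_cost(monthly_gb)
    print(f"{monthly_gb} GB between zones costs about ${cost:.2f} per month")
```

The point of the sketch is only that the headline rate understates the bill by half when both directions are charged.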

And that's just your data. That's before we start talking about all the different abstractions and all the different resources that you can use for, you know, VPNs and Direct Connects and, you know, other ways to really do global type networking within the cloud ecosystem.

Right. Yeah. And, you know, I do understand that the network piece of a cloud bill, an AWS bill, is proportionately smaller than all of the other components that are gonna drive the overall cost that an organization sees month to month. But we can control the network piece, in that we can make these decisions, albeit there are trade-offs there, about whether we're using a Direct Connect or a VPN over the public WAN. And also, as Carl alluded to, the way we architect our overall solution and the decisions that we make about where we put resources, where those egress points are gonna be, and what kind of egress points we use.

And so I guess my next question, and it's not really a question, but I'd like to discuss the trade-offs now. I mean, just as an example: is it always better to go with a Direct Connect just because of the quality of the connection? Is it always better to go with a VPN over the public WAN if you're cost conscious? Maybe a little bit of both?

Yeah. I mean, connectivity is really important. And, to that point, there is no one size fits all.

So I've worked with just about every possible way to connect to public cloud providers, from direct public WAN Internet connectivity, to VPN, to MPLS and direct connections.

And all of them have their pros and cons. And I would say that, just looking at the last two projects that I've done with public cloud providers, almost always you end up with some hybrid approach. It's not even just one.

So we will have certain things where it makes absolute sense. Because of reliability, because of latency characteristics, we would absolutely do some sort of direct connect, where I'm going to have a dedicated private WAN connection right onto the cloud on-ramp backbone.

But in other cases, where I need to make sure that this service is available to lots of people with high bandwidth, and maybe I don't know exactly where things are gonna come from, the public WAN can be a good solution for that if I don't need those kind of guaranteed SLAs.

So, you know, that's one of the beauties of cloud, and it wasn't like this originally. Right? Cloud has evolved to service a lot of different needs.

It started with Internet only, but now you can connect in a lot of different ways depending on your use case.

Yeah. I think that's fair. And, to your point, the options are there. And that's an exciting place to be, where you can make those trade-off decisions.

For me, I think a lot of it when I'm working with clients is helping them recognize total cost of ownership.

So, you know, I worked at a company where, when we first started networking with AWS, we were pretty small. We didn't have a lot of budget. We were looking at Direct Connects, and it was gonna be, you know, like, twelve hundred dollars a month. And at the time, that was substantial for us. And so we stood up a VPN that was, like, forty-five dollars a month, and it did the job just fine. Well, fast forward some time later, and, you know, we've continued to grow that pattern across multiple VPCs, across multiple accounts.

And we've reached a point where I think we had eighty-two VPNs and, like, seventy-five of what Amazon calls VPC peering connections, which is sort of an internal concept.

And the total cost was still super manageable.

But when you actually think TCO and you look at the cognitive load that puts on your networking team and the downtime when something breaks and you're trying to troubleshoot which of these seventy five VPN connections is causing problems, you know, it very quickly became very, very expensive.

And so we moved from that towards an MPLS solution. We already had to connect our sites, and we said, hey. How do we just connect all of our cloud networks into that MPLS as well so we just have one beautiful giant network?

And our networking team was not doing a lot of cloud. All our cloud networking was our cloud team, and so it put the corporate network back into the hands of the network experts. They had to learn a little bit more about the cloud, but it got them responsible for the overall schematic of our network, and they were able to dramatically and significantly reduce that total cost, even though the AWS bill may have actually gone up a little bit.

But the amount of labor we were spending keeping the network happy and healthy was much, much, much lower.
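
A rough way to put numbers on that total-cost-of-ownership view is sketched below in Python. Only the eighty-two VPNs and the forty-five dollars a month come from the story above; the labor hours, hourly rate, downtime cost, and the bill for the consolidated network are hypothetical placeholders to be replaced with your own estimates.

```python
from dataclasses import dataclass


@dataclass
class NetworkOption:
    name: str
    monthly_bill: float           # what shows up on the provider invoice
    monthly_labor_hours: float    # time spent operating and troubleshooting
    monthly_downtime_cost: float  # estimated business cost of outages


def total_cost_of_ownership(option: NetworkOption, hourly_rate: float) -> float:
    """Invoice plus labor plus downtime, all per month."""
    labor_cost = option.monthly_labor_hours * hourly_rate
    return option.monthly_bill + labor_cost + option.monthly_downtime_cost


if __name__ == "__main__":
    hourly_rate = 100.0  # hypothetical loaded engineering rate

    many_vpns = NetworkOption(
        name="82 separate VPNs",
        monthly_bill=82 * 45.0,        # figures quoted in the conversation
        monthly_labor_hours=120.0,     # hypothetical
        monthly_downtime_cost=5000.0,  # hypothetical
    )
    consolidated = NetworkOption(
        name="one consolidated private network",
        monthly_bill=6000.0,           # hypothetical, higher than the VPN bill
        monthly_labor_hours=20.0,      # hypothetical
        monthly_downtime_cost=500.0,   # hypothetical
    )

    for option in (many_vpns, consolidated):
        tco = total_cost_of_ownership(option, hourly_rate)
        print(f"{option.name}: bill ${option.monthly_bill:,.0f}, TCO ${tco:,.0f}")
```

With these made-up inputs the option with the larger invoice still has the lower total, which is the shape of the trade-off being described; real numbers could just as easily point the other way.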

So that's something that, especially when you're working across finance and executive team and engineers and network engineers and software engineers, like, you've got all these players that all view things very myopically.

Mhmm. And so when I'm trying to work those situations, we really have to get everybody thinking in that total cost of ownership perspective to be able to start really seeing through that complexity and say what makes the most sense here.

Yeah. And that's what you mean by cognitive cost. Right? The total cost of ownership in terms of the management of this entire web of connections and systems and all of that.

Yeah. I don't know what you mean by myopic, though. I thought we broke down all the silos in technology a long time ago, that we were all one big happy family now. Am I right?

Absolutely. Yes. Right. Right. Well, that's that's what the software engineers tell me. They say just give me keys to AWS, and now I can build my own network and my own infrastructure and my own code, and there's no more silos.

Yeah.

We won't talk about what happens when they do that.

It works really well right up until each of them deploys VPCs with, you know, 10.0.0.0/16, and then they say, hey, I need these to communicate with each other.
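
The overlapping-address problem described here is easy to check for ahead of time. The following is a small sketch using Python's standard `ipaddress` module; the team names and CIDR blocks are made up for illustration.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(vpc_cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of VPC names whose address ranges overlap."""
    networks = {name: ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


if __name__ == "__main__":
    # Hypothetical inventory: two teams both picked the same default range.
    vpcs = {
        "team-a": "10.0.0.0/16",
        "team-b": "10.0.0.0/16",
        "team-c": "10.1.0.0/16",
    }
    for a, b in find_overlaps(vpcs):
        print(f"{a} and {b} overlap and cannot be routed to each other as-is")
```

Running a check like this before VPCs are created, rather than after someone asks to connect them, is the cheaper place to catch the conflict.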

That didn't happen, Carl.

It never happens. No. But those are definitely really important concepts, and, you know, Tejay kinda nailed it. Right? So when looking at cloud, it's not all about the bill.

A lot of times, especially depending on where a company is in their cloud native journey, you have to start to look at: where do I want to have expertise? What are my core competencies as a business? And having a super strong networking team that can troubleshoot the most complicated cloud design known to man may not be the greatest use of resources.

Maybe I'm gonna go ahead and pay more for some of these network abstractions that a lot of the cloud providers have. And so even though that cost is more, if it allows my developers to move faster because they're not having to work with the networking team to build some complicated design, that value is probably worth it in the long run.

Now, in terms of analyzing and understanding this overall architecture, this cloud environment, and then the nature of the traffic that's traversing transit gateways and NAT gateways, and leaving one availability zone and going from one region to another region or whatever it happens to be, those things that incur costs. In this exercise of understanding architecture, understanding where our resources live, but then really digging into where the traffic is really going, what tools do you two use to do that? And this is not a leading question where you have to answer with Kentik. I wanna know if maybe you're predominantly using those cloud native tools that the providers give you, or are you using third-party tools, or maybe some combination of the two?

You know, I'll say the vendor tools have improved dramatically in the last few years.

They used to really not be very helpful, and they're making big strides. But I think we're still at a place where the partner ecosystem gives you a lot more visibility into some of those items. And even, it's a little tangential, but a lot of the network-level resources, when you think about, like, firewalls, you know, especially if you're at a really high security company where you wanna be doing traffic analysis and things like that. I think the vendors have come a really long way, but they're still just not even close to what you can accomplish when you bring in a partner.

So that's probably the big thing.

I'd also say it's a little challenging, because you don't have as much control.

You're kinda limited to what gets exposed. And because of the security model of a big shared world, things like packet sniffing, for example, become much more complicated than they are in your home network. But, Carl, I know you guys, I'm sure, have a lot of this going on over at your place. What are you doing?

Yeah. No. And I would agree. So, you know, we try to use first-party tools whenever possible. And we've got a saying that the cloud tools are about seventy-four percent. They get you about three quarters of the way there from a feature and functionality perspective. And for a lot of people, that's all you need.

But, certainly, we've seen that that doesn't get us far enough in a lot of areas, and we've definitely had to go out and bring in third-party tools to do some of these things, to bring in additional features.

So we've done some demos and brought some of our previous partners into the cloud ecosystem.

Mhmm.

You know, I will name-drop one. I think ExtraHop has been something that we've used a lot on prem to be able to understand network traffic. And that solution works really well within the AWS ecosystem, at least.

But, again, the first-party tools now are getting so much better, to a point where there's a lot less need to go outside of them.

Or, you know, one of the other things, we really hadn't talked about network function virtualization, but replacing the first-party network services with third-party services, whether those are firewalls, routers, load balancers, etcetera.

So bringing those things in, the great thing is all of the tooling that I already have, or the management platforms that I have for those tools, can give you a lot of visibility into your network traffic as well.

Okay. So walk me through how you would approach, or at least how you would think about, analyzing your cloud environment with this idea of cost optimization: analyzing maybe the architecture and then, of course, analyzing network traffic, focusing on the network side, to reduce a cloud bill or perhaps to weigh those trade-offs between cost and performance and all that sort of thing?

I would say, for me, the very first thing is make sure you're reading the pricing guides. The very first step for me when looking at a public cloud service is to go look at the cost guide. How are you going to charge for the service? That really helps me understand how I am going to design for it.

And I'll give you a really good example of this. And, again, I think Tejay kinda mentioned it earlier. The pricing model for a lot of network services is changing over time. A lot of things are starting to become table stakes.

Things like egress fees, the providers are starting to do away with some of them. But one of the changes that AWS made last year, or just the year before, was around how they handled traffic between availability zones. So within an availability zone, if you're staying within your VPC, that traffic was free. There's no cost for that.

So one of the things that we needed to do was deploy a backup solution, because we needed something that could work in a very particular way. And so we needed a multi-availability-zone backup solution. Well, the problem is, if I'm deploying this and I have my EC2 instances in one availability zone, but they're backing up to an EC2 instance in yet another availability zone, I was paying for all of that traffic.

Mhmm.

And even though they could be in the same VPC, I was still paying for this traffic.

Now one of the things that I could do is say, hey, how can I make sure to align those traffic flows? And so in our particular case, we were actually able to deploy instances within each availability zone. And so now, no matter which VPC it is, I can actually map my instances to back up within that same availability zone.

Right. Okay. And so by aligning that traffic, now I don't have to pay these cross-availability-zone fees. And so, really, it's digging in and just kind of understanding, hey, where am I going to get charged? And then understanding, hey, how can I go back and look at the actual workflows? And I wouldn't do everything.

Right? In this particular case, backups are a huge volume driver. So I know that that's gonna be a significant amount of volume and thus cost.

Am I going to worry about something that transfers a hundred megabytes a month? I'm not. I'm not going to worry about those things.

Yeah.

Of course.

But really just focusing on your heavy hitters, those big traffic data syncs, that's where the value is.
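
The zone-alignment idea from the backup example can be sketched in a few lines of Python. Everything here is hypothetical: the zone names, the backup target names, the instance list, and the volumes are made up, and the function names are only for illustration.

```python
def pick_backup_target(instance_az: str, targets_by_az: dict[str, str]) -> tuple[str, bool]:
    """Return (target, crosses_zone) for an instance in the given zone.

    Prefer a backup target in the same zone. If none exists, fall back to the
    first target by zone name, which means the backup traffic crosses zones.
    """
    if instance_az in targets_by_az:
        return targets_by_az[instance_az], False
    fallback_az = sorted(targets_by_az)[0]
    return targets_by_az[fallback_az], True


def cross_zone_backup_gb(instances: list[dict], targets_by_az: dict[str, str]) -> float:
    """Sum the monthly backup volume that would cross zone boundaries."""
    total = 0.0
    for instance in instances:
        _, crosses_zone = pick_backup_target(instance["az"], targets_by_az)
        if crosses_zone:
            total += instance["backup_gb"]
    return total


if __name__ == "__main__":
    instances = [
        {"id": "app-1", "az": "us-east-1a", "backup_gb": 500.0},
        {"id": "app-2", "az": "us-east-1b", "backup_gb": 300.0},
        {"id": "app-3", "az": "us-east-1c", "backup_gb": 200.0},
    ]
    single_target = {"us-east-1a": "backup-a"}
    per_zone_targets = {"us-east-1a": "backup-a", "us-east-1b": "backup-b", "us-east-1c": "backup-c"}

    print("One target:", cross_zone_backup_gb(instances, single_target), "GB cross-zone")
    print("Per-zone targets:", cross_zone_backup_gb(instances, per_zone_targets), "GB cross-zone")
```

Sorting the instance list by `backup_gb` first is one simple way to apply the heavy-hitters advice: align the largest flows and ignore the ones that move a hundred megabytes a month.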

Yeah. And some of this depends on whether you're talking about a greenfield or a brownfield environment. So, you know, one of the first things I'll do with a client is we'll just look at their bill. We'll just say, like, what sticks out?

And for me, you know, networking rarely does, partly because the networking costs are very reasonable as long as you're not doing really unusual things. But we'll look at just what is the thing that looks weird, and then we'll dig in. We'll say, well, why is it happening? I just read an article a couple weeks ago about a company that was using Kafka, which is like a message broker service.

And one of the really central concepts of Kafka is that it's highly resilient.

And so when you write a message, it then spreads that message out to multiple nodes. And, inherently, the whole point there is, well, I want nodes in each availability zone so that if an availability zone goes down, my cluster is still happy and healthy.

But every message I write then scatters across zones, and then every time I read, Kafka tells me, oh, you don't actually wanna read from me. You wanna go read from over there. And so what they found was they were spending almost four times as much on network transit as they were on storage for this Kafka cluster.

And it's just because it was engineered for an on-prem world, back when we did all our big data stuff on prem.
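
As a back-of-the-envelope illustration of why replication across zones multiplies transit volume, here is a simplified Python sketch. It is not a model of the company in the article: it assumes one replica per zone, producers and consumers spread evenly across zones, and every read served by the partition leader. The volumes and the two-cents-per-GB effective rate are placeholders.

```python
def cross_zone_gb(
    produced_gb: float,
    zones: int = 3,
    replication_factor: int = 3,
    consumer_groups: int = 1,
) -> float:
    """Estimate cross-zone traffic for a replicated message cluster.

    Simplifying assumptions (all hypothetical):
      * each replica of a partition lives in a different zone,
      * clients are spread evenly, so a client is in a different zone from
        the partition leader (zones - 1) / zones of the time,
      * every consumer group reads each message once, from the leader.
    """
    if replication_factor > zones:
        raise ValueError("this sketch assumes at most one replica per zone")

    producer_traffic = produced_gb * (zones - 1) / zones
    replication_traffic = produced_gb * (replication_factor - 1)
    consumer_traffic = produced_gb * (zones - 1) / zones * consumer_groups
    return producer_traffic + replication_traffic + consumer_traffic


if __name__ == "__main__":
    produced = 3000.0          # hypothetical GB written per month
    effective_rate = 0.02      # placeholder: one cent out plus one cent in
    volume = cross_zone_gb(produced)
    print(f"{produced:.0f} GB produced -> {volume:.0f} GB crossing zones")
    print(f"Estimated transit charge: ${volume * effective_rate:.2f} per month")
```

Under these assumptions each gigabyte written generates more than three gigabytes of billable cross-zone traffic, which is the kind of multiplier that makes transit outweigh storage.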

And and so it just it doesn't fit the model well. And so understanding that leads to a couple things. One, there's actually a new project out, that uses the whole Kafka API, but behind the scenes, it's using S3 for storage and and for communication.

Because and and this is an important part of a more greenfield view, is as I'm designing, there are certain services that don't have transit fees. They're considered regional services. So S3, dynamo db, there's a couple others where all the cross zone stuff is happening on the backplane and and behind the scenes, and you don't get billed for it. And so they've actually built an implementation now of Kafka that all of the cross talking is happening in S3, so there's no transit fees.

And, you know, there are performance implications and other things there, but it's about going into it and understanding, like, how is this gonna impact my model? When we're deploying something new with companies I work with, we actually build our own cost model in Excel of what we think this thing's gonna look like. And then we start tweaking and saying, well, what if this? You know, does that change what service we're gonna use because of the way that they're priced? To Carl's point, you have to go understand the pricing model of each service.

Yeah. And you bring up one thing there. Again, you know, there's always some gotcha when you go to look at these things. And so, you know, one of the things Tejay just mentioned is about communicating with these regional services like S3 or others. Well, in order to make that so that you're not paying for that cost, you actually have to deploy a service endpoint into your VPC, and then you get that. But if you don't, if you forget, which we have (and now it's part of our infrastructure-as-code model when we deploy VPCs to make sure that we're deploying these endpoints), you pay for it. We were actually getting charged where people were hitting an S3 API, but they were going out to the Internet in order to hit that instead of using, you know, that direct connection right within the AWS VPC, and that cost, you know, significantly.
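One way to catch the forgotten endpoint is to audit an inventory of VPCs for a missing S3 gateway endpoint. The following is a minimal sketch, not the team's actual infrastructure-as-code check. The function name is hypothetical, and the input is assumed to be a list of plain dictionaries whose keys (`VpcId`, `ServiceName`, `VpcEndpointType`, `State`) loosely mirror what an EC2 endpoint listing returns. How that inventory is fetched is left out.

```python
def vpcs_missing_s3_gateway(vpc_ids, endpoints, region):
    """Return the VPC IDs that have no available S3 gateway endpoint.

    vpc_ids:   iterable of VPC ID strings to audit.
    endpoints: list of dicts describing existing endpoints (assumed shape).
    region:    region name used to build the S3 service name.
    """
    s3_service = f"com.amazonaws.{region}.s3"
    covered = {
        ep["VpcId"]
        for ep in endpoints
        if ep.get("ServiceName") == s3_service
        # Compare case-insensitively, since casing of these values is an assumption.
        and ep.get("VpcEndpointType", "").lower() == "gateway"
        and ep.get("State", "").lower() == "available"
    }
    return [vpc for vpc in vpc_ids if vpc not in covered]


if __name__ == "__main__":
    inventory = [
        {"VpcId": "vpc-aaa", "ServiceName": "com.amazonaws.us-east-1.s3",
         "VpcEndpointType": "Gateway", "State": "available"},
        {"VpcId": "vpc-bbb", "ServiceName": "com.amazonaws.us-east-1.dynamodb",
         "VpcEndpointType": "Gateway", "State": "available"},
    ]
    print(vpcs_missing_s3_gateway(["vpc-aaa", "vpc-bbb"], inventory, "us-east-1"))
```

Running a check like this on every newly vended VPC is one way to turn "if you forget" into something a pipeline notices.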

So now instead of free, you get the most expensive kind of egress.

Exactly.

So you're tromboning out and then back into AWS without realizing it.

I'm leaving, yeah. I'm going out of AWS to come back into AWS.

In an effort to... You probably never actually leave their hardware.

But I went through the toll booth.

Right. Yep.

So I see something on the outline that I don't know how to interpret. One of the points here says, "the hidden costs of NAT gateways and idle regions." I know what an idle region is. I get that part. But you gotta help me a little bit, maybe educate me on the hidden costs of NAT gateways. I know what a NAT gateway is, but what does it mean that there are hidden costs there?

Yeah. So there's a lot of different strategies for how to manage your AWS account structure.

And the most common modern approach is to have lots of accounts. So team X gets their own AWS account where we can isolate all the things they're doing, and there's a lot of benefits, from security to cost allocation.

But what often happens, especially in orgs that are, like, mostly mature but not quite mature yet, is you start to get this template of, well, an account gets three availability zones, and each availability zone gets a public subnet, and each public subnet gets a NAT gateway, and then you get some private subnets. And so what you end up with is, like, you just build this template infrastructure.

And I specifically bring up the NAT gateways. They're not bad. They're fifty, sixty bucks a month, something like that. Like, they're not expensive.

But if my template's not built right and every account gets three regions with three availability zones that each have a NAT gateway, you know, suddenly my baseline cost for an account that's doing absolutely nothing is seven, eight, nine hundred dollars a month. And so, like, you're not bleeding, depending on how big an organization you are, but it's just complete and total waste. Nobody's even looking at US West. They're just doing a quick little POC on US East, and, you know, they're pumping a hundred megabytes a month through this NAT gateway, but I'm paying for it. Versus a more holistic strategy where we say, actually, all of our Internet transit's gonna go through our networking account, through a single cluster of NAT gateways that we may even add, you know, a number of our own tools to, in which case we're deploying a third-party NAT instead of an AWS NAT.

And now all these little vended accounts that we build have to transit back through there instead.

And all of that little onesie-twosie, you know, death-by-a-million-cuts cost folds back towards this really powerful appliance that actually has a bunch of benefits from a security and manageability perspective and is actually less expensive than what I would have ended up with otherwise.
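The template arithmetic is easy to sketch. The example below is illustrative only: the function names are hypothetical, the $55 figure is just the midpoint of the "fifty, sixty bucks a month" mentioned above and not a published price, and it counts only the fixed NAT gateway charge. It leaves out per-GB processing and whatever the centralized design adds in transit or peering costs, so it shows the NAT share of an idle baseline and not the whole bill.

```python
def idle_nat_baseline(regions, zones_per_region, monthly_cost_per_gateway):
    """Fixed monthly NAT gateway cost for one templated account, traffic ignored."""
    return regions * zones_per_region * monthly_cost_per_gateway


def per_account_vs_central(accounts, regions, zones_per_region,
                           monthly_cost_per_gateway, central_gateways):
    """Compare NAT gateways in every account against one shared cluster.

    Returns (per_account_total, central_total) in dollars per month.
    Routing the traffic back to the shared cluster is not priced here.
    """
    per_account_total = accounts * idle_nat_baseline(
        regions, zones_per_region, monthly_cost_per_gateway
    )
    central_total = central_gateways * monthly_cost_per_gateway
    return per_account_total, central_total


if __name__ == "__main__":
    one_account = idle_nat_baseline(3, 3, 55.0)
    spread, shared = per_account_vs_central(20, 3, 3, 55.0, central_gateways=3)
    print(f"one idle account: ${one_account:.2f} per month")
    print(f"20 accounts: ${spread:.2f} templated vs ${shared:.2f} centralized")
```

The point of a model like this is less the exact dollar figure and more seeing how the fixed cost scales with the number of vended accounts.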

I was just having this conversation yesterday. So, like, we've been pondering, you know, the benefits of a centralized NAT gateway deployment for quite some time. And, you know, this also comes into a discussion around first-party versus third-party network appliances. So, you know, we've used third-party firewalls within public cloud to get certain levels of visibility that the first-party native tools just don't provide.

But recently, you know, those platforms are starting to have more feature parity with the third-party tools. And one of the benefits, at least with the AWS firewall service, is you do not have to pay for NAT gateways. So if you deploy this in kind of a centralized manner, you no longer need NAT gateways because it's provided as part of that service.

But, you know, going back to one of the earlier things, it's free as in I don't have to pay for NAT gateways, but it's not free when we look at the network complexity.

Right? Now I have to do things like either stand up VPC peerings or stand up a transit gateway, and I have to route all of this traffic back, and I have to have all of these routing policies to be able to get traffic from one place to another. And so that operational complexity increases as well. And, you know, that that might be a great thing if you're looking at a greenfield network deployment. But if you're in a brownfield, you know, deployment, going back and ripping out NAT gateways and updating all of your routes.

Sure.

That's like doing open heart surgery, you know, on the network.

Right. Yep. Yeah. We don't wiggle wires during the day either, so it's definitely a a project.

See, I've never heard that that analogy before, but that's that's a perfect analogy for pretty much any network change. We I come from a DevOps world where you do everything in the development environment and then the test environment.

And when I started talking to our network engineers about, like, how do you do cloud right, it quickly became apparent that, well, there's not a development environment for the network.

And yeah. So, I mean, you gotta keep the data flowing while you're making changes, and that's kinda terrifying.

So I've spent many an evening and a night, middle of the night, in a cold sweat, I should say, you know, while I'm doing some sort of cutover or some sort of upgrade.

And, you know, what we're doing there is we're messing with the substrate, the mechanism of application delivery, which is what everybody cares about. Nobody cares about the network, but they care about this thing they're logging into.

And, like I said, if you wiggle the wrong wire, Chicago goes offline, and all you did was bump into an SFP. You didn't do anything either. So networking folks, I believe, and I was one of them, are very risk averse for that reason. You know, our test environment is production. So now we've danced around some of these ideas of strategizing around cost.

You know, we've talked about understanding traffic flows, some tools that you brought up. Carl, you talked about just being more, mindful of how you architect your solution.

Tejay, you mentioned the total cost, not total cost of ownership. What was the phrase that you used? Total cost of cloud. TCC. Let's go with that.

What are some specific strategies that folks can walk away from this episode with, to really understand how they can either minimize their cost, if that's a goal, or understand it better so they can do some of those things that you said, where they make those decisions and trade-offs in their own design?

Yeah. I would say the very first thing is, if you have a cloud networking team and your FinOps people are not looking at the bill with them, you're missing out.

In a lot of organizations, you know, somebody else gets the bill and it just gets paid, you know, and as long as it was within budget, all is well and we just move on.

But, you know, I know Tejay has pointed out some really good examples of this, and this is actually how we found some of ours: we're looking at cost, and it's just going up and up and up every month. And that's how we realized, hey, something's not right here, and we need to go actually dig into why this cost is getting, you know, so high. And, you know, that's what led us to find those opportunities to go back and actually optimize some of our deployments. So start by looking at your bills. Just because you're a network engineer doesn't mean that you don't need to understand how the bill is structured each month.

Mhmm.

Yeah. So Carl used the term, you know, FinOps, which is maybe not familiar to a lot of people yet. It's rapidly growing.

But the notion behind FinOps is really, how do we bring finance and engineering together to understand the cloud? Because, you know, as Carl said, like, look at your bill.

It's remarkable, because of that pay-by-the-drink model, how quickly it's too late.

Just a matter of days, and you can rack up months worth of spending.

So, you know, my biggest thing would be, one, set up anomaly detection, whether that's through a third party app or through the first party tools.

Most have options now where, like, as soon as things look weird, it's gonna notify somebody, and you can start digging in. And then the second, from a planning perspective, is just recognize how much you don't know, because we are still siloed. I think cloud has really changed the way that a lot of engineering happens, and it's certainly led to software developers broadening their skill set.
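As a rough idea of what "as soon as things look weird" can mean, the sketch below flags any day whose spend sits far above the trailing week. It is a toy illustration and not how any first-party or third-party tool works. The function name, the 7-day window, and the threshold of 3 standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_spend_anomalies(daily_spend, window=7, threshold=3.0):
    """Return (day_index, amount) pairs that spike above the trailing window.

    A day is flagged when it exceeds the trailing mean by more than
    `threshold` standard deviations. A small floor on the deviation keeps
    perfectly flat history from flagging every tiny change.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a deviation")
    anomalies = []
    for i in range(window, len(daily_spend)):
        history = daily_spend[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), 0.01 * mu)  # floor at 1% of the mean
        if daily_spend[i] > mu + threshold * sigma:
            anomalies.append((i, daily_spend[i]))
    return anomalies


if __name__ == "__main__":
    spend = [100, 102, 98, 101, 99, 100, 103, 100, 450, 101]
    for day, amount in find_spend_anomalies(spend):
        print(f"day {day}: ${amount} looks unusual")
```

Whatever the detection method, the useful part is that the alert reaches a person within a day or so, since a pay-as-you-go bill can grow quickly.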

But especially, I would say the networking side of things has still lagged way, way, way behind to where most people that are building cloud assets really don't understand networks at all. And in a lot of cases, they're not getting a networking expert involved.

And so, kinda getting that seat at the table. If you're in the networking world, but you're not really in the cloud world, you know, I mentioned my last company: our network engineers dealt with on prem, and our cloud team dealt with the cloud.

Like, we gotta talk more, and then we gotta get that engineering team in as we're planning that new thing and and get those different perspectives during, like, an architecture review process so that the questions are being asked. Because engineers generally aren't gonna be asking, oh, you know, how much network transit am I gonna see in this? They need that perspective coming in and saying, hey, guys. You know, there's a lot of cost with this.

Like, how is the way you're designing your app going to impact my network? And when you ask the question, they can speak to it. It's just it's not something that's on their mind. And then likewise, when you have networking projects that you're planning, right, you wanna be having those conversations that involve the finance team, that involve the engineering team, and saying, like, this is what we're thinking because they're gonna have perspectives and thoughts that you may not capitalize on that are gonna help you avoid both cost surprises, but but also a whole world of other surprises that can happen.

So we just I think we've gotta be more connected than we ever have been with one another.

Right. Wasn't DevOps supposed to do that for us?

Yeah. That's the goal. And it helps. It really helps.

I think you mentioned FinOps. We still got a lot more to do.

As another mechanism to bring, you know, the financial folks and the engineers together.

And now we have folks that are, maybe coming up for the first time in technology as cloud engineers. So they weren't previously network engineers due to their age and things like that.

So, yeah, it does seem like all of these things are converging to require us to be more proactive and more mindful about how we strategize, how we architect solutions, which, as you both have discussed at some length here, requires collaboration, and it really does require thinking about things differently. I mean, you mentioned in your previous employment, network team focused on prem, your cloud team focused on the cloud. I assume at some point those two networks connected. So there has to be some collaboration.

The more that happens, which is inevitable, especially with some repatriation of workloads now, when people are strategizing and saying we're better off putting this in data center B over on the other side of town. That's just gonna happen more and more. So, gentlemen, I really do appreciate your level of expertise, your insight, and your experience, and really just talking specifically about this area of cost in the cloud space. Really interesting stuff to me.

And so I appreciate you both coming on and spending some time talking about it with me today. If anybody would like to reach out to you with a question or comment or concern, how can folks do that? Tejay, we'll start with you.

Yeah. LinkedIn is probably the easiest bet, or, you can get me at tejay@crucial-clarity.com.

I should say, Tejay is not just "TJ". Sorry. I forget that sometimes. T-e-j-a-y at crucial clarity dot com, or just LinkedIn.

Perfect. And, Carl, how about you?

Same thing. Just find me on LinkedIn. I'm always available.

Great. And you can still find me on Twitter, network_phil. Search my name Philip Gervasi on LinkedIn. And, if you have an idea for an episode or would like to be a guest on Telemetry Now, I'd love to hear from you. Reach out to us at telemetrynow@kentik.com. So for now, thanks so much for listening. Bye bye.

About Telemetry Now

Do you dread forgetting to use the “add” command on a trunk port? Do you grit your teeth when the coffee maker isn't working, and everyone says, “It’s the network’s fault?” Do you like to blame DNS for everything because you know deep down, in the bottom of your heart, it probably is DNS? Well, you're in the right place! Telemetry Now is the podcast for you! Tune in and let the packets wash over you as host Phil Gervasi and his expert guests talk networking, network engineering and related careers, emerging technologies, and more.