Troubleshooting Cloud Traffic Inefficiencies with Kentik AI
Summary
Balancing cost efficiency and high performance in cloud networks is a constant challenge, especially when misconfigurations or inefficient routing lead to inflated costs or degraded performance. Learn how Kentik Journeys simplifies traffic analysis, helping cloud engineers identify inefficiencies like unnecessary Transit Gateway routing.
Introduction
Cloud engineers often face the challenge of balancing cost efficiency with high performance in their networks. Misconfigurations or inefficiencies in traffic routing, particularly in complex and dynamic cloud environments, can lead to inflated costs, degraded performance, or both. Diagnosing these issues requires deep visibility into network traffic flows and patterns.
However, traditional workflows often mean crafting queries by hand, applying filters, and iterating several times just to understand the data before you reach the insight you need. That work is slow. Kentik Journeys eases it by combining AI and natural language with rich network and cloud traffic data, so you can understand traffic patterns faster.
In this post, we’ll look at how Kentik Journeys can help you optimize cloud traffic and avoid costs by identifying traffic unnecessarily routed through a Transit Gateway (TGW) due to asymmetric routing. We’ll also take a look at how Connectivity Checker now seamlessly works with Journeys along with its new AI Summaries.
Step 1: Identifying traffic over Transit Gateways
In this scenario, we want to find traffic that crosses an AWS Transit Gateway when it does not need to. Traffic between VPCs in the same Availability Zone can typically be routed over VPC Peering instead, which costs less than Transit Gateway.
Let’s start a new Journey to investigate this further. First, let’s look at traffic that stays within a single Availability Zone but still goes through a Transit Gateway by asking, “Show me AWS intra-zone traffic going over the transit gateway in the last week.”
Kentik Journeys uses AI to generate a tailored Data Explorer query and filter.
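The generated query itself is not reproduced here, but the logic behind the question can be sketched in a few lines. The following is a minimal illustration, not Kentik's implementation: the FlowRecord fields, the gateway_type values, and the byte counts are all hypothetical and do not reflect Kentik's actual schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    # Hypothetical fields for illustration only; not Kentik's actual schema.
    src_ip: str
    dst_ip: str
    src_zone: str
    dst_zone: str
    gateway_type: str  # assumed values: "transit-gateway", "vpc-peering"
    bytes: int


def intra_zone_tgw_flows(flows):
    """Keep flows whose endpoints share a zone and that crossed a Transit Gateway."""
    return [
        f for f in flows
        if f.src_zone == f.dst_zone and f.gateway_type == "transit-gateway"
    ]


# Made-up sample records (byte counts are invented).
sample = [
    FlowRecord("10.67.200.223", "10.68.200.128", "us-east-2a", "us-east-2a", "transit-gateway", 5_000),
    FlowRecord("10.68.200.128", "10.67.200.223", "us-east-2a", "us-east-2a", "vpc-peering", 4_000),
    FlowRecord("10.67.1.10", "10.69.1.20", "us-east-2a", "us-east-2b", "transit-gateway", 1_000),
]

matches = intra_zone_tgw_flows(sample)
for f in matches:
    print(f.src_ip, "->", f.dst_ip, f.bytes)
```

Only the first record survives the filter: the second uses peering, and the third crosses zones, where a Transit Gateway hop may be legitimate.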
Step 2: Analyzing traffic spikes
Next, we can see that traffic occurs in two zones, with a significant spike in us-east-2a on November 17. To understand this spike better, let’s refine the view to include IP addresses and instances involved in the traffic.
This gives us granular visibility into the endpoints driving the spike, setting the stage for deeper analysis.
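To make the idea of spotting a spike concrete, here is a small sketch that totals bytes per zone and day and picks the largest bucket. The numbers are invented, the second zone name is an assumption (the post names only us-east-2a), and days are written without a year because the post does not give one.

```python
from collections import defaultdict

# Made-up (zone, day, bytes) samples; "us-east-2b" is an assumed second zone.
samples = [
    ("us-east-2a", "11-16", 120),
    ("us-east-2a", "11-17", 900),
    ("us-east-2a", "11-17", 300),
    ("us-east-2a", "11-18", 110),
    ("us-east-2b", "11-16", 80),
    ("us-east-2b", "11-17", 90),
]


def totals_by_zone_and_day(rows):
    """Sum bytes for each (zone, day) pair."""
    totals = defaultdict(int)
    for zone, day, nbytes in rows:
        totals[(zone, day)] += nbytes
    return dict(totals)


def peak(totals):
    """Return the (zone, day) key with the largest byte total."""
    return max(totals, key=totals.get)


totals = totals_by_zone_and_day(samples)
peak_zone, peak_day = peak(totals)
print(peak_zone, peak_day, totals[(peak_zone, peak_day)])
```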
Step 3: Focusing on specific traffic
The traffic spike reveals a pattern: The majority of traffic flows between IP addresses 10.67.200.223 and 10.68.200.128. To isolate this flow, we’ll ask, “Filter this for traffic between 10.67.200.223 and 10.68.200.128 in both directions.”
The filtered view shows that most of the traffic flows in only one direction. This asymmetry is unusual and suggests a potential misconfiguration or routing issue that needs further investigation.
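One way to quantify this kind of asymmetry is to total the bytes in each direction and check how lopsided the split is. The sketch below assumes flows are simple (src_ip, dst_ip, bytes) tuples; the byte counts and the 90% threshold are arbitrary choices for illustration, not values from Kentik.

```python
def direction_split(flows, ip_a, ip_b):
    """Sum bytes for a->b and b->a; flows are (src_ip, dst_ip, bytes) tuples."""
    a_to_b = sum(n for s, d, n in flows if (s, d) == (ip_a, ip_b))
    b_to_a = sum(n for s, d, n in flows if (s, d) == (ip_b, ip_a))
    return a_to_b, b_to_a


def is_one_sided(a_to_b, b_to_a, threshold=0.9):
    """True when one direction carries at least `threshold` of the total bytes."""
    total = a_to_b + b_to_a
    if total == 0:
        return False
    return max(a_to_b, b_to_a) / total >= threshold


# Made-up byte counts as they might appear in a TGW-filtered view.
tgw_flows = [
    ("10.67.200.223", "10.68.200.128", 6_000),
    ("10.67.200.223", "10.68.200.128", 3_500),
    ("10.68.200.128", "10.67.200.223", 100),
    ("10.67.1.10", "10.69.1.20", 2_000),  # unrelated pair, ignored
]

fwd, rev = direction_split(tgw_flows, "10.67.200.223", "10.68.200.128")
print(fwd, rev, is_one_sided(fwd, rev))
```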
Step 4: Investigating applications
Understanding the applications involved can provide additional context. Let’s adjust the query to remove zones and add application details and destination ports. This reveals the specific services or protocols driving the traffic between the two instances.
This step helps pinpoint application-level dependencies that might contribute to the routing inefficiencies.
Step 5: Verifying connectivity
The one-sided traffic pattern suggests potential connectivity issues. To verify this, we’ll use Kentik’s Cloud Connectivity Checker to test bidirectional connectivity between 10.67.200.223 and 10.68.200.128 on TCP port 3020.
The Connectivity Checker examines metadata from AWS, including route tables, security groups, and network access control lists (network ACLs). The results show that traffic from 10.67.200.223 to 10.68.200.128 is routed through the TGW, while the return path uses VPC peering. This asymmetric routing explains why the Transit Gateway view showed traffic mostly in one direction: the return traffic never crosses the TGW.
Note the new Kentik AI Summary that is now available with Connectivity Checker. It provides additional context and information about the connections between these two addresses.
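The asymmetric path found by the Connectivity Checker can be reproduced with a small route lookup. AWS route tables select the most specific matching route (longest prefix match), so a VPC that lacks a specific route to its peer can fall back to a broader route that points at the TGW. In the sketch below, the VPC CIDRs, the route entries, and the target names (tgw-example, pcx-example) are assumptions for illustration; the actual route tables in this scenario are not shown in the post.

```python
import ipaddress


def next_hop(route_table, dst_ip):
    """Pick the target of the most specific (longest-prefix) matching route."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical route tables; CIDRs and targets are assumed, not from AWS.
vpc_a_routes = [
    ("10.67.0.0/16", "local"),
    ("10.0.0.0/8", "tgw-example"),  # no specific route to the peer VPC
]
vpc_b_routes = [
    ("10.68.0.0/16", "local"),
    ("10.67.0.0/16", "pcx-example"),  # specific route over the peering connection
    ("10.0.0.0/8", "tgw-example"),
]

forward = next_hop(vpc_a_routes, "10.68.200.128")
reverse = next_hop(vpc_b_routes, "10.67.200.223")
print("forward via", forward, "| return via", reverse)
```

With these assumed tables, the forward direction resolves to the Transit Gateway and the return direction to the peering connection, which matches the pattern described above.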
Step 6: Confirming bidirectional traffic
We can confirm bidirectional traffic between the two IPs by asking Journeys to remove the Transit Gateway filter. Additionally, eliminating application and port dimensions simplifies the view, making the routing paths easier to analyze.
This confirms that traffic flows correctly in both directions but takes a different path each way. Routing the forward direction over VPC Peering as well would take the Transit Gateway out of the path and minimize cost.
Conclusion
Inefficient routing and traffic planning can create performance issues anywhere in a network. In the cloud, though, identifying and isolating such inefficiencies quickly matters even more, because they also generate additional metered costs.
This example demonstrates how Kentik Journeys speeds up troubleshooting of cloud network inefficiencies by putting the data you already collect with Kentik to work in a new way. In a handful of questions, we arrived at the specific insight needed to optimize this traffic (or to tell the cloud team what they need to do). All that is left is to create an appropriate route in the AWS account so that this traffic uses the existing VPC peering connection, reducing costs and improving performance.
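As a rough sketch of that last step, the route can be expressed as the parameters of the EC2 CreateRoute API call. The route table ID, peering connection ID, and destination CIDR below are placeholders, not values from this environment; substitute the real ones and review the change before applying it.

```python
def peering_route_params(route_table_id, destination_cidr, peering_connection_id):
    """Build keyword arguments for the EC2 CreateRoute API call."""
    return {
        "RouteTableId": route_table_id,
        "DestinationCidrBlock": destination_cidr,
        "VpcPeeringConnectionId": peering_connection_id,
    }


# Placeholder identifiers and an assumed CIDR for the destination VPC.
params = peering_route_params(
    route_table_id="rtb-EXAMPLE",
    destination_cidr="10.68.0.0/16",
    peering_connection_id="pcx-EXAMPLE",
)
print(params)

# With boto3 installed and AWS credentials configured, the route could then
# be created with:
#   import boto3
#   boto3.client("ec2").create_route(**params)
```

Because the new route is more specific than a broad route pointing at the TGW, it would take precedence for traffic to the peer VPC.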
With AI-driven insights and natural language queries, you can quickly isolate problems, optimize traffic paths, and enhance your cloud environment’s cost efficiency and performance.