Kentipedia

NetFlow Troubleshooting

Overview of NetFlow Troubleshooting

NetFlow was originally developed to help network administrators gain a better end-to-end understanding of their network traffic. Once NetFlow is enabled on a router or other network device, it tracks unidirectional packet flow statistics related to TCP/IP, UDP/IP or ICMP sessions, without storing any of the payload data carried in that session. By tracking only the metadata about the flows, NetFlow offers a way to preserve highly useful traffic analysis and troubleshooting details without needing to perform full packet capture, which is very I/O- and storage-intensive.

When combined and correlated with SNMP device and interface data, BGP, and performance metrics, NetFlow can be used to monitor and diagnose a variety of network issues.

NetFlow Troubleshooting Use Cases

NetFlow can be used for many different network troubleshooting issues including:

  1. Congestion
  2. Application performance issues
  3. DDoS attacks
  4. Network security anomalies

1. Troubleshoot network congestion problems:

  • Identify traffic bottlenecks at source/destination ports, interfaces and IP addresses by comparing traffic levels to interface capacity/bandwidth.
  • Drill into flow details to find out who or what is generating the top contributing flows.
  • Are they anomalous or valid flows?
  • Compare top contributing flows to other timeframes for context.
  • Perform ad-hoc grouping of BGP routing, interface, port, IP, geolocation and other fields to find commonalities that will shed light on other aspects of the root causes, such as which applications, servers, user groups, or locations are factors.
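
The first steps above (comparing traffic to interface capacity, then ranking the top contributing flows) can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular tool: the record layout, interface names, capacities, and the 60-second window are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical flow records observed in one 60-second window:
# (interface, src_ip, dst_ip, dst_port, bytes). All values are illustrative.
FLOWS = [
    ("ge-0/0/1", "10.0.0.5", "192.0.2.10", 443, 4_500_000_000),
    ("ge-0/0/1", "10.0.0.7", "192.0.2.20", 443, 1_500_000_000),
    ("ge-0/0/1", "10.0.0.9", "198.51.100.3", 53, 75_000_000),
    ("ge-0/0/2", "10.0.1.4", "203.0.113.8", 80, 300_000_000),
]

# Assumed interface capacities in bits per second (in practice, taken from SNMP).
CAPACITY_BPS = {"ge-0/0/1": 1_000_000_000, "ge-0/0/2": 1_000_000_000}
WINDOW_SECONDS = 60


def interface_utilization(flows, capacity_bps, window_seconds):
    """Return {interface: fraction of capacity used} over the window."""
    total_bytes = defaultdict(int)
    for iface, _src, _dst, _port, nbytes in flows:
        total_bytes[iface] += nbytes
    return {
        iface: (nbytes * 8 / window_seconds) / capacity_bps[iface]
        for iface, nbytes in total_bytes.items()
    }


def top_contributors(flows, interface, n=3):
    """Return the n largest flows on an interface with their share of its bytes."""
    on_iface = [f for f in flows if f[0] == interface]
    total = sum(f[4] for f in on_iface)
    ranked = sorted(on_iface, key=lambda f: f[4], reverse=True)[:n]
    return [(src, dst, port, nbytes / total) for _i, src, dst, port, nbytes in ranked]


if __name__ == "__main__":
    for iface, util in interface_utilization(FLOWS, CAPACITY_BPS, WINDOW_SECONDS).items():
        print(f"{iface}: {util:.0%} utilized")
    for src, dst, port, share in top_contributors(FLOWS, "ge-0/0/1"):
        print(f"{src} -> {dst}:{port} carries {share:.1%} of ge-0/0/1")
```

With sampled flow data, the byte counts would first need to be scaled by the sampling rate before being compared to capacity.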

2. Troubleshoot application performance issues:

  • Identify which applications and protocols are consuming your network bandwidth by analyzing the source and destination IPs, ports and protocols.
  • Track the cumulative usage of a given application in an aggregate manner, down to a specific region or country for example.
  • Analyze network performance by using metrics exported from packet capture exporters like nProbe™. Compare client versus server latency and TCP retransmits to spot anomalies that indicate a network issue rather than a server/application issue.
  • If using a per-server agent such as nProbe, look at destination IP + exporting server to see if there is any correlation of performance issues to a particular server.
  • Look for correlation to destination networks, geography, ASN, interfaces, etc. to find potential root causes.
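
As a rough illustration of attributing bandwidth to applications from ports and protocols, and then tracking usage per region, consider the sketch below. The port-to-application table, the region labels, and the record layout are assumptions for the example. Port-based labeling is only a heuristic, since applications can run on non-standard ports.

```python
from collections import Counter

# Assumed (protocol, server port) -> application label; a small illustrative subset.
APP_BY_PORT = {("tcp", 443): "https", ("tcp", 22): "ssh", ("udp", 53): "dns"}

# Hypothetical flow records: (protocol, src_port, dst_port, region, bytes).
# Flows are unidirectional, so the server port may appear on either side.
FLOWS = [
    ("tcp", 51512, 443, "EU", 5_000),    # client -> server
    ("tcp", 443, 51512, "EU", 90_000),   # server -> client
    ("udp", 40000, 53, "US", 300),
    ("tcp", 50000, 8081, "US", 1_000),
]


def classify(protocol, src_port, dst_port):
    """Label a flow by whichever of its ports matches a known application."""
    for port in (dst_port, src_port):
        app = APP_BY_PORT.get((protocol, port))
        if app:
            return app
    return "unknown"


def bytes_by_app_and_region(flows):
    """Sum bytes per (application, region) pair."""
    usage = Counter()
    for protocol, src_port, dst_port, region, nbytes in flows:
        usage[(classify(protocol, src_port, dst_port), region)] += nbytes
    return usage


if __name__ == "__main__":
    for (app, region), nbytes in bytes_by_app_and_region(FLOWS).most_common():
        print(f"{app:8s} {region}: {nbytes} bytes")
```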

3. Defend against DDoS attacks:

  • Detect a sudden overall rise in network traffic that departs from baseline behavior.
  • Which resource is being hit?
  • Look for unusual numbers of sending source IPs for evidence of botnets.
  • Look for unusual or known-bad source geographies or ASNs.
  • Trigger automated mitigation if available (such as via RTBH, mitigation appliances or cloud services).
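
The detection steps above can be sketched as a simple baseline check plus a count of distinct sources per target. This is a minimal example under stated assumptions: the three-standard-deviation threshold, the sample values, and the record layout are illustrative choices, not a recommended detection policy.

```python
from statistics import mean, stdev

# Hypothetical per-minute traffic samples in Mbps for one protected resource.
BASELINE_MBPS = [100, 110, 95, 105, 90]

# Hypothetical (src_ip, dst_ip) pairs taken from flow records in the current minute.
CURRENT_FLOWS = [(f"198.51.100.{i}", "203.0.113.10") for i in range(1, 51)]
CURRENT_FLOWS.append(("192.0.2.7", "203.0.113.20"))


def is_traffic_spike(baseline, current, sigmas=3.0):
    """True if `current` exceeds the baseline mean by more than `sigmas` standard deviations.

    Needs at least two baseline samples, because stdev() is undefined for one.
    """
    return current > mean(baseline) + sigmas * stdev(baseline)


def sources_per_target(flows):
    """Count the distinct source IPs seen for each destination IP."""
    sources = {}
    for src, dst in flows:
        sources.setdefault(dst, set()).add(src)
    return {dst: len(srcs) for dst, srcs in sources.items()}


def most_targeted(flows):
    """Return (dst_ip, distinct source count) for the destination with the most sources."""
    counts = sources_per_target(flows)
    return max(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    if is_traffic_spike(BASELINE_MBPS, current=400):
        target, n_sources = most_targeted(CURRENT_FLOWS)
        print(f"Spike detected; {target} is receiving traffic from {n_sources} sources")
```

A real deployment would typically keep a separate baseline per resource and time of day. A single global threshold is used here only to keep the example short.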

4. Analyze network security anomalies:

  • Baseline traffic volume between top subnet pairs and alert on new pairs.
  • Is there traffic to/from known bad IPs based on threat feeds from AlienVault, etc.?
  • Identify unusual traffic peaks for unknown IPs, unusual ports, known bad ports, IANA reserved IPs.
  • Identify unusual numbers of flows from one host to many on the same port.
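
Two of the checks above, alerting on new subnet pairs and spotting one host contacting many hosts on the same port, can be sketched as follows. The /24 grouping, the ten-destination threshold, and the sample addresses are assumptions chosen for the example.

```python
import ipaddress
from collections import defaultdict


def subnet_pair(src_ip, dst_ip, prefix_len=24):
    """Map a (source, destination) IP pair to its enclosing subnets."""
    src_net = ipaddress.ip_network(f"{src_ip}/{prefix_len}", strict=False)
    dst_net = ipaddress.ip_network(f"{dst_ip}/{prefix_len}", strict=False)
    return (str(src_net), str(dst_net))


def new_subnet_pairs(baseline_flows, current_flows, prefix_len=24):
    """Return subnet pairs present in current_flows but absent from the baseline."""
    known = {subnet_pair(s, d, prefix_len) for s, d, _port in baseline_flows}
    seen = {subnet_pair(s, d, prefix_len) for s, d, _port in current_flows}
    return seen - known


def fan_out_hosts(flows, min_destinations=10):
    """Return {(src_ip, dst_port): distinct destination count} for sources that
    contact at least min_destinations hosts on the same port."""
    targets = defaultdict(set)
    for src, dst, port in flows:
        targets[(src, port)].add(dst)
    return {key: len(dsts) for key, dsts in targets.items() if len(dsts) >= min_destinations}


# Hypothetical flow records: (src_ip, dst_ip, dst_port).
BASELINE = [("10.0.0.5", "10.0.1.9", 443)]
CURRENT = BASELINE + [("10.0.0.5", f"10.0.2.{i}", 22) for i in range(1, 21)]

if __name__ == "__main__":
    print("New subnet pairs:", new_subnet_pairs(BASELINE, CURRENT))
    print("Possible scanning:", fan_out_hosts(CURRENT))
```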


NetFlow Sampling

NetFlow and other flow-based analysis solutions generate flow records in proportion to the volume of traffic flows. Generating a UDP-based flow record for every flow can create a lot of telemetry data, typically around 1% of operational traffic, which is significant overhead. Not all troubleshooting use cases require 1:1 flow export.

Fortunately, NetFlow, J-Flow and IPFIX provide for flow sampling, whereby exporting devices can be configured to sample 1:N flows to reduce telemetry traffic volume. For most network operations and DDoS detection/defense purposes, sampling anywhere from 1:1000 to 1:8000+ flows (depending on overall traffic volume) provides sufficient insight. For network security purposes, if NetFlow is required to provide detailed forensics or a full audit of traffic, then 1:1 flow export is required.

Since different portions of the network handle different volumes and types of traffic, it is possible to sample at different rates on different exporters. For example, internet-facing edge routers or datacenter core routers that handle huge volumes of traffic can be configured for more aggressive sampling (a larger N). Routers and switches at the aggregation layer, where security anomalies become apparent, don’t handle nearly as much traffic, and can be configured for 1:1 sampling to provide full granularity of insight.
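
To make the effect of per-exporter sampling concrete, the sketch below scales sampled counters back up by each exporter's 1:N rate. The exporter names, rates, and record layout are hypothetical. Multiplying by N yields an estimate that is reasonable for large aggregates but unreliable for small or short-lived flows, which is why forensic use cases call for 1:1 export.

```python
# Hypothetical per-exporter sampling rates (1:N); names and values are illustrative.
SAMPLING_RATE = {"edge-rtr-1": 4000, "agg-sw-1": 1}

# Hypothetical sampled flow records: (exporter, bytes, packets).
SAMPLED = [
    ("edge-rtr-1", 1_200, 2),
    ("edge-rtr-1", 64_000, 50),
    ("agg-sw-1", 9_000, 12),
]


def estimate_totals(records, sampling_rate):
    """Scale sampled byte and packet counts by each exporter's sampling rate N.

    Returns {exporter: (estimated_bytes, estimated_packets)}.
    """
    totals = {}
    for exporter, nbytes, npackets in records:
        n = sampling_rate[exporter]
        est_bytes, est_packets = totals.get(exporter, (0, 0))
        totals[exporter] = (est_bytes + nbytes * n, est_packets + npackets * n)
    return totals


if __name__ == "__main__":
    for exporter, (est_bytes, est_packets) in estimate_totals(SAMPLED, SAMPLING_RATE).items():
        print(f"{exporter}: ~{est_bytes} bytes, ~{est_packets} packets")
```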

Factors Limiting NetFlow Troubleshooting Effectiveness

NetFlow troubleshooting is most effective when sufficient detail is available and can be compared with other data points such as performance metrics, routing and location. Unfortunately, until recently the state of the art in NetFlow analysis tools presented a significant challenge to troubleshooting effectiveness because of data reduction. Even with sampling, flow records can add up to a lot of data. Since most NetFlow collectors and analysis tools are based on scale-up software architectures hosted on single servers or appliances, they have extremely limited storage, compute and memory capacity. As a result, the common practice is to roll up the details into a series of summary reports and to discard the raw flow record details. The problem with this, of course, is that most of the detail needed for operationally useful troubleshooting is lost.

Cloud-scale computing and big data techniques have opened up a great opportunity to improve both the cost and functionality of NetFlow analysis and troubleshooting. The market has long embraced SaaS as a delivery model for advanced products and capabilities, and it's now possible to apply this cost-effective approach to network traffic visibility and analytics solutions.

Big data storage allows for the storage of huge volumes of augmented raw flow records instead of needing to roll up the data to predefined aggregates that severely restrict analytical options. SaaS options save network managers from incurring CAPEX and OPEX costs related to dedicated, on-premises appliances. Scale-out NetFlow analysis can deliver faster response times to operational analysis queries on larger data sets than traditional appliances.

More Reading

To learn more about troubleshooting with NetFlow, read about Kentik’s network troubleshooting features, and our approach to big data NetFlow analysis. Also, check out this blog post on maximizing the value of network metadata, or view the demos of Kentik in our Networking Intelligence Resource Center.

FAQs about NetFlow Troubleshooting

What’s the best practice for monitoring encrypted traffic performance?

Monitor encrypted traffic by analyzing flow metadata (who/where/how much), adding TLS and DNS context where available, and validating performance with synthetic tests. Kentik supports TLS dimensions such as TLS Server Name (SNI) when provided in flow records, correlates DNS with flows via True Origin, and combines these signals with traffic analytics to track encrypted traffic behavior and regressions without decrypting payloads.

How do I troubleshoot network congestion with NetFlow?

Troubleshoot congestion by comparing traffic volume to interface capacity, then drilling into the biggest contributors. Start by grouping traffic by interfaces, ports, and IPs to find bottlenecks, then identify top flows and compare them to other time windows for context. Add routing and location context (ASN, geo, site) to spot common patterns that point to root cause. See: Data Explorer

How can I identify which applications are consuming bandwidth using NetFlow?

Use NetFlow metadata (source/destination IPs, ports, protocols) to attribute traffic to applications and quantify usage over time. Then break it down by region, site, user group, or destination network to understand where consumption is coming from and how it is changing. This is especially useful for finding bandwidth hogs and validating capacity planning assumptions. See: NetFlow Analysis

How do I troubleshoot application performance issues with NetFlow?

Use flows to identify which applications and endpoints are involved, then correlate with performance indicators such as client vs server latency and TCP retransmits (when available from exporters). Look for correlations to destination networks, geography, ASN, or interfaces to separate network-path issues from server or application problems. See: NetFlow Guide: Types of Network Flow Analysis

How can I detect DDoS attacks using NetFlow?

Detect DDoS by alerting on a sudden traffic increase that departs from baseline behavior, then identify which resource is targeted and whether the traffic pattern suggests botnets (many sources, unusual ASNs or geographies). Flow analytics also supports rapid breakdown by protocol, ports, and top sources, and can feed automated mitigation workflows where available. See: DDoS Detection

How can I detect network security anomalies using NetFlow?

Baseline “normal” traffic relationships (for example, between top subnet pairs) and alert when new pairs appear or unusual ports spike. Watch for traffic involving known-bad IPs (via threat intelligence), suspicious scanning patterns (one host to many), and unusual flow peaks to unexpected destinations. See: Network Anomaly Detection

What NetFlow sampling rate should I use for troubleshooting and security?

Sampling reduces telemetry overhead by exporting 1 in N flows. For many network operations and DDoS detection use cases, sampling can still provide enough visibility, while detailed security forensics and audits may require 1:1 export. Many teams also use different sampling rates in different parts of the network based on traffic volume and risk. See: Understanding the Advantages of Flow Sampling

What data should I correlate with NetFlow to improve troubleshooting accuracy?

NetFlow is strongest when combined with other signals. Correlate flow metadata with SNMP interface and device metrics (utilization, errors), BGP routing context, and performance measurements so you can connect “what moved” with “where it went” and “what the network experienced.” See: Core Overview

Why do some NetFlow tools miss root cause details, and how do I avoid losing visibility?

Many legacy flow collectors roll up data into fixed summaries and discard raw flow detail because of storage and compute limits. That can hide the dimensions you need during an incident. A better approach is retaining high-fidelity flow records (with enrichment) so you can run ad-hoc queries during troubleshooting without being constrained by predefined reports. See: Kentik Data Engine

Can NetFlow troubleshooting work without packet capture?

Yes. NetFlow records metadata about traffic flows (who talked to whom, over what, and how much) without storing payloads. That preserves useful troubleshooting and traffic analysis detail while avoiding the storage and I/O costs of full packet capture. You can then enrich and correlate flows with other telemetry for deeper diagnosis. See: What is NetFlow? An Overview
