Kentik Detect Alerting: Configuring Alert Policies
Summary
Operating a network means staying on top of constant changes in traffic patterns. With legacy network monitoring tools, you often can’t see these changes as they happen. Instead you need a comprehensive visibility solution that includes real-time anomaly detection. Kentik Detect fits the bill with a policy-based alerting system that continuously evaluates incoming flow data. This post provides an overview of system features and configuration.
Deep, Powerful Anomaly Detection to Protect Your Network
One of the most challenging aspects of operating a network is that traffic patterns are constantly changing. If you’re using legacy network monitoring tools, you often won’t realize that something has changed until a customer or user complains. To stay on top of these changes, you need a comprehensive network visibility solution that includes real-time anomaly detection. Kentik Detect fits the bill with a policy-based alerting system that lets you define multiple sets of conditions — from broad and simple to narrow and/or complex — that are used to continuously evaluate incoming flow data for matches that trigger notifications. The power of Kentik Detect’s alerting system — offering a field-proven 30 percent improvement in catching attacks — comes from the processing capabilities of Kentik’s scale-out big data backend combined with an exceptionally feature-rich user interface. In this blog post we’ll look at the basics of configuring this system to keep your team up to date on traffic that deviates from baselines or absolute norms. In future posts we’ll explore particular capabilities and use cases of the system, and how to harness its power for better network operations.
Kentik Alert Library
The first thing to know is that you don’t have to master the configuration of alert policies from scratch. Kentik support is available to walk you through the configuration process, and we’ve also provided an extensive library of policy templates that cover common alerting scenarios and get you most of the way toward creating an alert policy that meets your specific needs. You’ll find these templates on the Kentik Alert Library tab of the Alerting section of the Kentik Detect portal (accessed via the Alerts menu in the navbar).
The library includes templates for alerting on DDoS attacks, changes in top talker IP addresses, changes in geography, interfaces approaching capacity, devices that stop sending flow, and much more. Kentik is continually adding to the library as we add new features and uncover new use cases. To configure a policy based on any of these templates, click the copy button at the right of the template’s row in the template list on the library tab. A copy of the template will be added to the list of policies on the Alert Policies tab, which is where you go to configure it for your system.
Alert Policies
The Alert Policies tab is the access point for all of the policies currently configured for your organization, and it’s also where you add new policies. To configure a template copied from the library, or to edit an existing policy, click the edit button at the right of that policy’s row in the alert policies list. To start a fresh policy, click the Create Alert Policy button at the top right of the tab. Either action will take you to an Alert Policy Settings page, which is where you’ll get down to the configuration process. The policy settings page is made up of three main sections: the title bar (with a few basic controls, e.g. save, cancel, etc.), the sidebar, and the main settings area, which is itself divided into general, historical, and threshold settings. With all of these sections, policy settings can seem pretty complicated at first, but by breaking it down we’ll be able to see the logic that gives the system its power and flexibility.
Let’s start with the sidebar, which is where you specify the metrics that a given alert will be looking at and focus the alert on a specific subset of total traffic. This gives you a lot of flexibility to define the scope of your policies so you can include or exclude only certain matches. The top pane in the sidebar is for setting the Devices whose traffic will be monitored by this policy. This pane is nearly identical to the Devices pane in Kentik Detect’s Data Explorer (see our Knowledge Base topic Explorer Device Selector), but without the Router and nProbe buttons.
Dimensions, metrics, and filters
Next comes the Query pane, where you determine which aspects of ingested flow data will be evaluated for a match with the alert policy’s specified conditions. The first order of business is to define the dimensions (derived from the columns in Kentik Detect’s database) to query for data. You can choose up to 8 dimensions simultaneously, including the IP 5-tuple, geography, BGP, QoS, Site, and many other types of information. This lets you get very granular with the types of network traffic or conditions you want to trigger an alert on, something that’s not possible with the limited controls of legacy tools.
To complement that long list of dimensions you can also choose from a wide variety of metrics. While most legacy tools monitor only Bits/s and maybe Packets/s, Kentik Detect can also monitor Flows/s, Retransmits/s, and a count of source or destination IPs, ASNs, Countries, Regions (States), and Cities. With all those metric options, you can build a really granular and powerful policy, such as monitoring for a change in the top 10 countries that are sending traffic to your network.
Also noteworthy is the concept of primary and secondary metrics. The primary metric is the main item that is evaluated. If you are building a “Top N” type of policy, for example, the primary metric is the unit (e.g. bits/s, packets/s, flows/s, etc.) by which ingested flow data is ranked. Adding one or more secondary metrics lets you set additional static thresholds that must also be met in order for the policy to trigger. To expand on our “top countries” example above, we could add secondary metrics of Bits/s and Packets/s to make sure that the amount of traffic is above a set level before the alert is triggered. The result would be a policy that notifies us when there’s a change in the top countries, but only if traffic exceeds the specified secondary metric values.
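To make the primary/secondary distinction concrete, here is a minimal Python sketch of the “top countries” example. It is an illustration only, not Kentik’s implementation: the `Row` type, the function names, and the threshold parameters are all hypothetical, and bits/s is assumed to be the primary metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One aggregated result row (hypothetical shape, for illustration)."""
    country: str
    bits_per_s: float
    packets_per_s: float


def top_n(rows, n):
    """Rank rows by the primary metric (bits/s here) and return the top n keys."""
    ranked = sorted(rows, key=lambda r: r.bits_per_s, reverse=True)
    return [r.country for r in ranked[:n]]


def new_entrants(current, baseline_top, n, min_bits, min_packets):
    """Countries in the current top n that are absent from the baseline top n,
    kept only if they also meet both secondary (static) thresholds."""
    by_key = {r.country: r for r in current}
    result = []
    for country in top_n(current, n):
        if country in baseline_top:
            continue  # already a usual top-n member, so no change to report
        row = by_key[country]
        if row.bits_per_s >= min_bits and row.packets_per_s >= min_packets:
            result.append(country)
    return result
```

In this sketch a country that newly enters the ranking but carries only a trickle of packets is ignored, which is the effect the secondary metrics are described as having.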
We can get even more specific about the traffic we want to track by defining one or more filters in the Filters pane. Either saved filters or locally defined filters can be applied, and you can filter on any of the dozens of dimensions that we saw earlier in the Query pane. This lets you get really granular in what you are tracking. For example, you might use filters to build an alert for private IP (RFC 1918 address space) traffic coming into your network while excluding the internal interfaces that are allowed to carry private IP traffic.
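The logic of that private-IP filter can be sketched in a few lines of Python using the standard `ipaddress` module. The flow record shape (`src_ip`, `input_interface`) and the function names are assumptions made for this example; in Kentik Detect the equivalent is configured in the Filters pane rather than written as code.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address falls inside any RFC 1918 block."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def matches_filter(flow, allowed_interfaces):
    """Match flows with a private source IP, except on interfaces that are
    explicitly allowed to carry private traffic (hypothetical record shape)."""
    return (
        is_rfc1918(flow["src_ip"])
        and flow["input_interface"] not in allowed_interfaces
    )
```

The explicit network list is used instead of `ip.is_private` because that attribute also covers ranges outside RFC 1918, such as loopback and link-local addresses.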
General Policy Settings
Once the sidebar panes are defined, the action moves to the panes of the main settings area. The General Settings pane defines the overall settings for a policy, including the name, description, how many items to track (e.g. 10 for top-10 policies), and whether the alert is enabled or disabled. One noteworthy aspect of the track items setting is that you don’t have to specify which items you want to track; instead the system can automatically learn what the top-N items are and then notify you of changes to that list. For example, maybe you want to track the top 25 interfaces in your network receiving web (80/TCP) traffic and be notified if that list changes for any reason. With Kentik Detect, you can create a policy for that without having to know in advance which interfaces are your top 25. Another interesting feature is Learning Mode, which allows you to set a period during which the policy’s traffic will be tracked, with notifications muted, in order to establish a baseline. You can also associate a dashboard with a policy so that when you click through to that dashboard from an alarm generated by the policy, the panels of the dashboard show the traffic subset that is defined by the policy.
Historical Baseline Settings
The Historical Baseline settings allow you to configure what the current traffic patterns are compared against to see if there is a change (these settings are ignored for Static mode; see policy threshold settings below). The Look Back settings set the depth and granularity of the historical data that will be used for the baseline. The aggregation settings define how the historical data points will be calculated for comparison with current traffic. The Look Around setting is useful for traffic patterns that are hard to capture in a single 1-hour window. One of the coolest features of the historical baseline is the ability to make it Weekend vs. Weekday Aware. If your network has different traffic patterns on a weekend than it does on a weekday, this will help make the historical baseline more accurate.
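As a rough illustration of why weekend/weekday awareness matters, the Python sketch below builds a baseline from past samples taken at the same hour of day. It is a simplified model rather than a description of Kentik’s actual calculation: the function names, the 7-day look back, and the use of a median as the aggregation are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta
from statistics import median


def is_weekend(ts):
    """Saturday and Sunday are weekday() values 5 and 6."""
    return ts.weekday() >= 5


def baseline(history, now, look_back_days=7, weekend_aware=True):
    """Aggregate past samples from the same hour of day into one baseline value.

    history is a list of (timestamp, value) pairs. With weekend_aware set,
    only samples from the same kind of day (weekend or weekday) as `now`
    are used. Returns None when no samples qualify.
    """
    cutoff = now - timedelta(days=look_back_days)
    samples = [
        value
        for ts, value in history
        if cutoff <= ts < now
        and ts.hour == now.hour
        and (not weekend_aware or is_weekend(ts) == is_weekend(now))
    ]
    return median(samples) if samples else None
```

With low weekend traffic mixed into the history, the unaware baseline for a Monday is pulled downward, while the aware baseline reflects weekdays only.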
Policy Threshold Settings
The policy threshold settings define the conditions that must be met in order to trigger an alarm (which results in notifications and, optionally, mitigation). Multiple thresholds can be configured per alert policy. Each threshold is assigned a criticality (critical, major2, major, minor2, or minor) to keep them distinct based on their importance to the user. A key setting for each threshold is the Comparison Mode, which determines what the primary traffic is compared to when it’s evaluated to determine whether there is a match with the conditions defined in the threshold. The options are Static, Baseline, or If-Exists. Unless the Comparison Mode is Static you’ll see some additional Comparison Option settings. One important comparison option is Direction, which determines which key (current or historical) is the primary and which key will be the comparison. In other words:
- If the direction is Current to History (default), the current dataset is primary and the comparison dataset is historical data which is derived based on the alert’s Historical Baseline settings.
- If the direction is History to Current, the historical dataset is primary and the comparison dataset is the current.
In most use cases, the default direction is the one to use. However, the History to Current setting is useful for Top-N type policies where you want to be notified if something drops out of the top-N group. For example, we might want to know if any of the countries that are normally in our Top 10 no longer are. Meanwhile the Activate If and Activate When settings allow the user to configure the conditions that must be met, and how long they must be present, in order for the policy to trigger an alarm.
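One way to picture the effect of Direction on a Top-N policy is as a one-sided comparison of two key lists. The Python sketch below is an interpretation for illustration only; the function name and the sample country lists are hypothetical.

```python
def unmatched_keys(primary, comparison):
    """Return the keys of the primary dataset that are missing from the
    comparison dataset, preserving the primary dataset's order."""
    comparison_set = set(comparison)
    return [key for key in primary if key not in comparison_set]


current_top = ["US", "DE", "BR"]   # top countries right now
history_top = ["US", "DE", "FR"]   # top countries in the historical baseline

# Current to History: the current dataset is primary, so this surfaces
# keys that have newly entered the top-N group.
entered = unmatched_keys(current_top, history_top)

# History to Current: the historical dataset is primary, so this surfaces
# keys that have dropped out of the top-N group.
dropped = unmatched_keys(history_top, current_top)
```

Swapping the arguments is all that changes between the two directions, which is why the same threshold can catch either arrivals or departures.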
Notification Channels
Kentik Detect uses notification channels to determine how users are notified when an alarm is triggered. There are currently five types of channels (email, JSON POST Webhook, Syslog, Slack, and PagerDuty), but we can easily and quickly add new notifications and integrations as customer demand warrants. This is another aspect that makes the SaaS approach of Kentik Detect so powerful; as our users know, our features are continually expanding over time. Multiple notification channels can be defined and attached to one or more thresholds as needed. This flexibility allows different notifications to be sent depending on the threshold that is triggered. For example, you may want an email sent to the team for minor issues but a PagerDuty page sent for a critical alert. In addition you can modify the “when” in the threshold to send different notifications at different times. For example, you can send an email immediately but only send a PagerDuty notification if the alert is not acknowledged within an hour.
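The routing described above (email for minor issues, a PagerDuty page for critical ones, and escalation when an alert goes unacknowledged) can be sketched in Python as follows. The `ChannelRule` type, its fields, and the `channels_due` function are hypothetical names invented for this example; they do not correspond to Kentik Detect settings or APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelRule:
    """Attaches one notification channel to one threshold criticality."""
    criticality: str               # e.g. "minor" or "critical"
    channel: str                   # e.g. "email" or "pagerduty"
    after_minutes: int = 0         # how long the alarm must be active first
    only_if_unacked: bool = False  # skip once the alarm is acknowledged


def channels_due(rules, criticality, minutes_active, acknowledged):
    """Return the channels that should have been notified for an alarm of the
    given criticality that has been active for minutes_active minutes."""
    due = []
    for rule in rules:
        if rule.criticality != criticality:
            continue
        if minutes_active < rule.after_minutes:
            continue
        if rule.only_if_unacked and acknowledged:
            continue
        due.append(rule.channel)
    return due
```

A rule with `after_minutes=60` and `only_if_unacked=True` models the “page only if nobody has acknowledged within an hour” case.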
Mitigations
Kentik Detect offers both built-in and third-party mitigation options that are configured on the Alerting system’s Mitigation tab and then attached to a policy in the Alert Policy Settings page. Organizations whose Kentik Detect plan includes BGP have an RTBH mitigation option, in which the system originates a Remotely Triggered Black Hole route into the user’s network to mitigate an attack. Kentik Detect also includes APIs that support third-party orchestration servers like Radware DefenseFlow as well as hardware scrubbing appliances like Radware DefensePro or A10 Thunder TPS. Mitigations can be triggered automatically, activated manually in response to an alarm, or activated if there’s been no user acknowledgement within a certain time.
Summary
With a system that’s as deep and powerful as Kentik Detect, it’s difficult for an overview like this to do more than skim the surface, so we’ve covered just a few of the many capabilities available with our alerting system. If you’re already a Kentik Detect user, head on over to our Knowledge Base for more detailed information on how to configure policy-based alerts. Or contact Kentik support; we’re here to help you get up to speed as painlessly as possible. If you’re not yet a user and would like to experience first hand what Kentik Detect has to offer, request a demo or start a free trial today.