Product Demo

Troubleshooting a SaaS App in Kentik

SaaS applications make provisioning new services very easy for IT operations, but what's not so easy is troubleshooting performance problems, considering that we don't own or manage the SaaS provider's network or the public Internet. Kentik's network observability platform can monitor SaaS providers such as Office 365, Salesforce, GitHub, Dropbox, and many more. Really, if it's on the Internet, we can monitor it. We primarily use a synthetic testing mechanism to continually interact with a particular SaaS provider and capture all sorts of metrics like packet loss, jitter, latency, DNS resolution, page load time, and other metrics that you can customize for a specific use case. So let's take a look at just one example of how you can use Kentik to investigate a poorly performing SaaS application.
We're starting in the State of the Internet, which is a service that we provide to all of our customers as a built-in function of the platform. What we've done is deploy test agents around the world in strategic locations to gather performance metrics for many of the most popular SaaS providers and services out there. So here at the top, we get the HTTP status code, the response size, domain lookup time, connection time, response time, HTTP latency, network latency, jitter, and packet loss. Kentik does the same thing with the major cloud providers such as AWS and Azure, as you can see here, and also for public DNS services, looking at connectivity but also tracking resolution times.
So as an example, let's say that we have complaints coming in from end users, specifically in our Upstate New York branch office, that Microsoft 365 is very slow for them to log in to, and that sometimes the login page fails altogether. What we could do is create our own custom tests to monitor Microsoft 365 specifically for that region. So let's switch over to our Synthetics Test Control Center.
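To make the metrics named above a little more concrete, here is a minimal Python sketch of how packet loss, average latency, and jitter could be derived from one round of probes. The transcript does not say how Kentik computes these values, so the function name `summarize_probes`, the input format, and the jitter definition (mean absolute difference between consecutive round-trip times) are all assumptions made for illustration.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize one round of probes (hypothetical helper, not a Kentik API).

    rtts_ms: list of round-trip times in milliseconds, with None for a lost probe.
    """
    if not rtts_ms:
        raise ValueError("need at least one probe result")

    received = [r for r in rtts_ms if r is not None]

    # Packet loss: share of probes that never got a reply.
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)

    # Average latency over the probes that did get a reply.
    avg_latency_ms = mean(received) if received else None

    # Jitter (one common definition, assumed here): mean absolute
    # difference between consecutive received round-trip times.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(diffs) if diffs else None

    return {
        "loss_pct": loss_pct,
        "avg_latency_ms": avg_latency_ms,
        "jitter_ms": jitter_ms,
    }


# Example round: four probes, one of them lost.
print(summarize_probes([10.0, 14.0, None, 12.0]))
```

A real synthetic agent would also time DNS resolution and the HTTP exchange itself; this sketch only covers the network-level numbers.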
I've set up a collection of synthetic tests to monitor the connection and performance of SaaS applications that are important to the folks in my Upstate New York branch office, specifically my Albany location in this case. I've also set up tests to monitor several on-prem devices like the gateway, the office router, and an on-prem wireless controller. Now for this scenario, we've received trouble tickets saying that our end users weren't able to log in to Microsoft 365 earlier in the day for about an hour or so. Sometimes the login was slow and would fail, and sometimes the login page itself would fail before users even got a chance to log in. So let's try to figure out what's going on.
I'm gonna start by filtering my tests for Microsoft 365, and I wanna start by looking at my login simulation test. This is a synthetic transaction monitor that uses a built-in script to interact with an application, and it captures metrics like the overall transaction time and also the time for each of the individual components. For this scenario, I'm using it to log in to Microsoft 365 and then track how long it takes to go through the process and get to all my apps like PowerPoint and Word and so on. Now I'm gonna expand my time range to six hours so we can look at earlier in the day. And indeed, as indicated by all the red we can see here, for an hour or so the test was failing continually. Also notice that before and after that one hour, the test was succeeding, and we can see that, of course, by the green in the health status bar and by the pass message here under transaction completion. When the test passes, look at the total transaction time. It's around seven to eight seconds, which the system has dynamically established over time as the normal time it takes for the script to complete successfully. But I wanna focus on when we were experiencing those issues.
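The idea of a "normal" transaction time that is learned over time can be sketched in a few lines of Python. This is only an illustration of baselining in general, not a description of how Kentik does it: the function `classify_transaction`, the use of a median as the baseline, the `tolerance` multiplier, and the `timeout_s` value are all hypothetical choices.

```python
from statistics import median


def classify_transaction(duration_s, history_s, timeout_s=20.0, tolerance=1.5):
    """Classify one scripted transaction run (hypothetical helper).

    duration_s: total transaction time in seconds, or None if the script
                never finished.
    history_s:  durations of earlier passing runs, used as the baseline.
    timeout_s:  assumed script timeout (illustrative value).
    tolerance:  assumed multiplier over the baseline before we call a run
                degraded (illustrative value).
    """
    if not history_s:
        raise ValueError("need at least one earlier run to form a baseline")

    # A run that never finished, or ran past the timeout, fails outright.
    if duration_s is None or duration_s >= timeout_s:
        return "timeout"

    # Baseline: the median of earlier passing runs.
    baseline = median(history_s)
    if duration_s > baseline * tolerance:
        return "degraded"
    return "healthy"


# Earlier passing runs in the seven-to-eight-second range.
history = [7.2, 7.8, 7.5, 8.0, 7.4]
for run in (7.9, 13.0, 21.0):
    print(run, classify_transaction(run, history))
```

A median is used here because a handful of slow runs will not drag it upward the way they would an average.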
And you can see that the total transaction time spikes to over twenty seconds, which we know is causing the test to fail. We can see that by looking at the indicator here that says transaction timeout. So that indicates that something is slowing everything down, causing a timeout of the actual script, and that's exactly what our end users were reporting.
So what I wanna do now is take a look at the page load test, because I wanna figure out why the page wasn't loading at all sometimes. So we'll open up that test. Notice that when things are working fine, we get a 200 status code, so that's good. The navigation time and domain lookup time are fine. Our average HTTP latency is about a second and a half, which is normal, and our average latency, which represents the connection time, so a network-centric latency, is around fifteen or sixteen milliseconds. Everything looks good here.
Then when we look at that one-hour period starting around 10:40, our navigation time and our domain lookup time are still okay, suggesting that this may not be a DNS problem. But over here to the right, notice that HTTP latency shot way up. And, very interesting to me, our average latency, which remember is network related, shot up as well. That suggests there is possibly a network issue at play here: a network latency issue that's causing HTTP latency, and then, of course, everything else, to slow down, sometimes to the point of failure.
Now if we wanna look at the individual components of the page load, we can choose a time slice, and then for the agent we can select Details. In the upper left, we can select Waterfall, and then we can scroll through the files and the elements involved with that page load. I can see that there's no one file or one element that's really the smoking gun. But I do see that many files are taking a very long time, several seconds, to queue and then ultimately send.
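The reasoning in the page load test (lookup time unchanged, HTTP and network latency both up) amounts to comparing each metric in the incident window against its normal value. A rough Python sketch of that comparison follows. The function `changed_metrics`, the metric names, and the `ratio` threshold are hypothetical, and apart from the normal HTTP latency (about 1.5 seconds) and normal network latency (about 15 ms) mentioned in the demo, the numbers are made up for illustration.

```python
def changed_metrics(baseline, incident, ratio=2.0):
    """Return the metrics whose incident value is at least `ratio` times
    the baseline value (hypothetical helper, illustrative threshold)."""
    return sorted(
        name
        for name, base in baseline.items()
        if name in incident and base > 0 and incident[name] / base >= ratio
    )


# Normal values; the domain lookup figure is invented for the example.
baseline = {
    "domain_lookup_ms": 20,
    "http_latency_ms": 1500,
    "network_latency_ms": 15,
}

# Invented incident-window values, chosen only to mirror the pattern
# described in the demo.
incident = {
    "domain_lookup_ms": 22,
    "http_latency_ms": 9000,
    "network_latency_ms": 180,
}

print(changed_metrics(baseline, incident))
```

With DNS lookup time flat and both latency metrics well above normal, the comparison points away from name resolution and toward the network path.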
So something is definitely slowing down the actual transmission of data, and it doesn't seem to be any one particular corrupt file or a DNS problem. Since we're looking at network latency, I'm gonna start by looking at our local network resources to see if there's anything causing or reporting latency for our locally connected devices. We're gonna take a look at our gateway device first, and it's reporting no problems at all during that time. Notice we're all green. So let's take a look at our office router as well. And again, the office router is reporting no latency among its local connections either. We're all green here.
So what we can do is expand our search to the network path out to Microsoft 365, looking at the public Internet, and we can do that using our connection monitor test. Looking back at that one hour earlier in the day, we can see right away that there was a very clear and dramatic increase in network latency during that time, which went away at the same time our end users reported the login problem went away. So there's certainly some sort of network latency happening at the exact same time that our end users reported they weren't able to log in.
So that's good, but I wanna know where that's happening. What we can do is look at the path view, which is generated using traceroute, specifically Paris traceroute. If we look at that time period, we can open that up, and we can see here, indicated in red, that our uplink, or rather our upstream provider, is experiencing some sort of latency. So it's not our locally connected devices, and it's not necessarily all the way on Microsoft's end, but there's something happening in the path, and here we can see that it's one of our upstream providers.
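Reading a path view like this one comes down to finding where along the path latency jumps. Here is a small Python sketch of that idea, assuming the per-hop round-trip times have already been collected by a traceroute-style tool. The function `find_latency_jump`, the hop names, the RTT values, and the `min_jump_ms` threshold are all hypothetical; this is not how Kentik's path view is implemented.

```python
def find_latency_jump(hops, min_jump_ms=50.0):
    """Find the hop with the largest latency increase over the previous
    responding hop (hypothetical helper).

    hops: list of (hop_name, rtt_ms) in path order; rtt_ms is None when a
          hop did not answer.
    Returns (hop_name, jump_ms), or None if no jump reaches min_jump_ms.
    """
    prev_rtt = None
    worst = None
    for name, rtt in hops:
        if rtt is None:
            continue  # skip silent hops, compare against the last answer
        if prev_rtt is not None:
            jump = rtt - prev_rtt
            if jump >= min_jump_ms and (worst is None or jump > worst[1]):
                worst = (name, jump)
        prev_rtt = rtt
    return worst


# Invented path: local devices are fast, latency appears upstream.
path = [
    ("gateway", 1.0),
    ("office-router", 2.0),
    ("upstream-isp-1", 150.0),
    ("upstream-isp-2", None),
    ("saas-edge", 155.0),
]
print(find_latency_jump(path))
```

One caveat: routers often deprioritize replies to traceroute probes, so a high RTT at a single hop only points to a real problem if the later hops stay high as well, as they do in this invented path.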
So by using the Kentik platform, we were able to investigate a slow SaaS application both from a global perspective, using the State of the Internet, and from a regional perspective, with custom performance monitoring for one of our regional branch offices. To learn more about Kentik and how we can help you monitor your SaaS applications, visit kentik.com/dem or reach out to your local Kentik representative.

Phil Gervasi explains how Kentik’s network observability platform helps IT professionals troubleshoot performance problems with SaaS applications. He demonstrates how Kentik’s network observability platform can monitor popular SaaS providers, such as Office 365, Salesforce, GitHub, Dropbox, and more, using synthetic testing mechanisms. By capturing metrics like packet loss, latency, DNS resolution, and page load time, Kentik provides valuable insights into SaaS performance. Phil takes you through a real-life example of investigating a poorly performing SaaS application and showcases how Kentik’s tools pinpoint network latency issues, both regionally and globally. Discover how Kentik empowers IT operations to proactively identify and resolve SaaS application performance problems.
