In the early days of our company, our engineers were faced with a challenge: how do we push the boundaries of access point (AP) technology to provide the best RF performance in the industry? The team came up with BeamFlex, a smart antenna technology that changed the Wi-Fi performance landscape.
Over the past decade, Wi-Fi has become ubiquitous. For instance, when we check into a hotel, we expect good Wi-Fi connectivity as a basic amenity, just like a good bed and a clean shower. From a network administration perspective, this translates into two things: keep the network optimized to provide the best possible connectivity, and ensure the network runs smoothly at all times to provide the best client experience.
Any network admin will tell you that keeping the network running smoothly is easier said than done. Wi-Fi deployments are complex, and the nature of Wi-Fi makes troubleshooting very challenging. Every problem, irrespective of where it originates (the AP, client or elsewhere), ultimately manifests as poor client experience in some form or fashion, which does not bode well for the business. Hence the network admins are under tremendous pressure to identify and troubleshoot problems quickly.
Today, their troubleshooting process is mostly reactive. When customers complain, the admin scrambles to find the root cause, which is a painstakingly slow process because several KPIs, metrics and events have to be analyzed across the various layers of the network to get to the bottom of the issue. And when the issue is resolved, all this hard work goes unrecognized because, in this reactive network management process, the customers are already impacted and unhappy, and the damage is done before the first troubleshooting step is even taken.
Through the years, as we worked with our customers to enable them to provide the best client experience possible, we realized that there surely must be a better way to manage a network. How cool would it be to have a network that manages itself? So, the smart folk on our engineering team got cracking.
We started with the “Why?” Network downtime costs approximately $5,600 per minute. That is huge. So obviously, a self-healing network would be of great value to network admins. But surely the problem could be solved using machine learning and artificial intelligence techniques, right? If so, then why hadn’t it been done yet? The answer was simple: it is an incredibly hard problem to solve given the nature of Wi-Fi. Why so? To explain, let’s look at the steps involved in troubleshooting any Wi-Fi problem:
Step 1: Know that there is an issue
As stated earlier, network administration is mostly reactive today. The admin needs deep knowledge to understand where the issues can be, and there isn’t a good way to get visibility into the network.
Step 2: Identify root cause
Seemingly unrelated events in the network need to be stitched together to tell the whole story. For example, if a client has poor connectivity, and the admin can see that the client’s RSSI is constantly lower than -80 dBm, there could be a multitude of reasons: the AP has hardware issues, there is too much interference in the network, there isn’t enough coverage in the network, or sometimes the client driver is out of date, which is not an AP problem at all!
Step 3: Resolve the issue
Based on the root cause, the recommended actions may be different. For example, for the low RSSI problem, if it is localized to an AP, the recommended action might be to check AP placement. If the problem is impacting specific clients, checking client configuration would be the recommended action.
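To make the low RSSI example concrete, here is a minimal Python sketch of that decision. It is only an illustration, not how any product implements it: the function name `recommend_actions`, the tuple layout of the samples, and the `min_clients` and `ap_fraction` cutoffs are all assumptions made for this example. Only the -80 dBm threshold and the two recommended actions come from the text above.

```python
from collections import defaultdict
from statistics import median

LOW_RSSI_DBM = -80  # threshold from the example in the text


def recommend_actions(samples, min_clients=3, ap_fraction=0.5):
    """Map low-RSSI symptoms to a recommended action.

    samples: iterable of (ap_id, client_id, rssi_dbm) tuples (assumed layout).
    min_clients, ap_fraction: hypothetical cutoffs for calling a problem
    "localized to an AP" rather than specific to a client.
    """
    # Group readings per (AP, client) so one bad sample does not count.
    readings = defaultdict(list)
    for ap_id, client_id, rssi_dbm in samples:
        readings[(ap_id, client_id)].append(rssi_dbm)

    clients_by_ap = defaultdict(set)
    low_by_ap = defaultdict(set)
    for (ap_id, client_id), values in readings.items():
        clients_by_ap[ap_id].add(client_id)
        # "Constantly low" is approximated by the median reading.
        if median(values) < LOW_RSSI_DBM:
            low_by_ap[ap_id].add(client_id)

    actions = {}
    for ap_id, clients in clients_by_ap.items():
        low = low_by_ap.get(ap_id, set())
        if not low:
            continue
        share = len(low) / len(clients)
        if len(clients) >= min_clients and share >= ap_fraction:
            # Most clients on this AP are affected: look at the AP.
            actions[("ap", ap_id)] = "check AP placement"
        else:
            # Only a few clients are affected: look at those clients.
            for client_id in low:
                actions[("client", client_id)] = "check client configuration"
    return actions
```

With three of four clients on one AP reading below -80 dBm, the sketch points at the AP; with a single weak client on an otherwise healthy AP, it points at that client. Real networks need far more context than this, which is exactly why the problem is hard.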
As you can see, when a problem manifests, it can be caused by one or more factors. The problem statement is quite similar to what doctors face on a daily basis: diagnosing the illness and prescribing treatment based on the symptoms that the patient is experiencing. Let’s say the patient has a fever. A fever with a runny nose could indicate a viral infection. If the fever is very high, it could indicate a bacterial infection. If, with the fever, there is a runny nose and an upset stomach, it could mean the stomach flu. If there is a fever with no runny nose, for a child, it could be an ear infection. For an adult, the same symptoms would warrant a battery of tests to diagnose the problem.
So the one symptom – fever – in conjunction with other symptoms, could lead to very different diagnoses from the doctor. If we want to automate the doctor’s diagnosis using analytics and treat body temperature as a KPI, detecting a fever, i.e. an anomalous temperature reading, is only the first step in the diagnosis.
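As a rough illustration of that first step, the sketch below flags anomalous readings in a series of KPI values using the modified z-score, a generic textbook technique based on the median and the median absolute deviation. It is not our approach, just a simple baseline; the function name `flag_anomalies` and the sample temperatures are made up for the example, and the 3.5 cutoff is a common rule of thumb rather than a tuned value.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return the indices of readings that look anomalous.

    Uses the modified z-score: 0.6745 * |x - median| / MAD, where MAD is
    the median absolute deviation. The threshold is a rule-of-thumb
    assumption, not a value tuned for any real KPI.
    """
    values = list(values)
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread in the data, so the score is undefined: flag nothing.
        return []
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical body temperature readings in degrees Celsius.
temperatures = [36.6, 36.7, 36.5, 36.8, 36.6, 39.4, 36.7]
fever_indices = flag_anomalies(temperatures)
```

Here the 39.4 reading is the only one flagged. Note what the sketch does not do: it says nothing about why the reading is high, which is where the real diagnostic work begins.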
Similarly, in networking, detecting an anomalous pattern for a KPI would only be the first step in diagnosing a problem. Defining what qualifies as an anomalous pattern for a KPI is a challenging problem in itself. Correlating various data to get to the root of the problem is another story. Every network is unique. Every AP is unique. So how do we find an analytical model that doesn’t require training for every network and for every AP? In the next blog post, we will dive deeper into our approach to solving this problem.