As mobile voice and data services have become a communication necessity for business and private purposes, network outages will continue to be a serious problem for telco operators. A Heavy Reading report from a few years ago suggested that network outages in the telco domain seemed to be getting both longer and more frequent, with a rise in the volume of outages that exceeded 48 hours.
Though network conditions, and the networks themselves, have changed since then (the report came out relatively early in the transition to IP for real-time services), the possibility of an outage is still top of mind for most operators. After all, outages can do serious damage to your corporate reputation and result in subscriber churn and revenue losses.
At SEGRON, we believe that the first step to preventing network outages is understanding their primary root causes. The second step… well, that depends on which cause was actually responsible for the outage in the first place.
1. Overloaded Networks
In the report we mentioned above, congestion and overloaded networks were listed as the top causes of outages. On one hand, the timing of this report means that part of what we were seeing several years ago was networks grappling with the shift away from mostly voice and SMS traffic to increasingly diverse network traffic that emphasized low-latency mobile data and VoLTE usage.
On the other hand, the industry is currently entering a new era of transition from 4G/LTE to 5G. You can now expect another sustained increase in network traffic, this time from the internet of things (IoT). As 5G powers an increasing volume of low-latency network connections going forward, load testing will be more crucial than ever for telco testers.
In order to ascertain whether your network can actually hold up during periods of congestion, you’ll have to find practicable ways to stress test the network itself. This will require automation frameworks that are capable of leveraging entire networks of connected devices at the push of a button.
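As a rough illustration of what "at the push of a button" could look like, here is a minimal sketch that fires test calls from many devices at once and reports the success rate. The `place_call` function is a hypothetical stand-in that only simulates an occasional failure; a real framework would drive physical or emulated handsets at that point.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def place_call(device_id: int) -> bool:
    """Hypothetical stand-in for one device placing a test call.

    This stub only simulates a roughly 5% failure rate; it does not
    touch a real network.
    """
    rng = random.Random(device_id)  # seeded per device so runs are repeatable
    return rng.random() > 0.05


def run_load_test(device_count: int, max_workers: int = 50) -> float:
    """Place calls from many devices concurrently and return the success rate."""
    if device_count <= 0:
        raise ValueError("device_count must be positive")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(place_call, range(device_count)))
    return sum(results) / len(results)


if __name__ == "__main__":
    rate = run_load_test(device_count=500)
    print(f"Call success rate under load: {rate:.1%}")
```

Raising `device_count` step by step and watching where the success rate starts to drop is one simple way to locate the point at which congestion begins to hurt.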
2. New Services
While network congestion was cited as the most frequent cause of outages, network issues were the leading culprits when it came to serious and severe outages. These issues can take any number of forms, but some of the most common were outages related to new service offerings.
Again, this is something that the 5G era is likely to see a lot of. For example:
You’re upgrading your network so that subscribers can access 5G data speeds when streaming video, but something goes awry with the implementation and the network crashes.
You’re updating your roaming partnerships and the new interconnections cause issues that you didn’t anticipate.
The traditional bulwark against this kind of issue is extensive regression testing, but that can be time-consuming and can delay time-to-market, a trade-off that companies are often reluctant to accept. The trick here is finding a way to speed up regression testing so that you can improve test coverage without impacting time-to-market.
3. New Devices
For network failures, new service offerings are more frequently cited than the introduction of new devices, but the latter still causes its share of outages. At the risk of sounding like a broken record, this is another area where the risk is only going to increase as mobile networks evolve and more devices enter into common usage.
To decrease the odds that a new device will cause issues for your network, you’ll need to implement a robust testing infrastructure for devices that includes both end-to-end tests (dialing and calling, sending SMS messages, using mobile data, etc.) and tests that go beyond end-to-end.
What does this mean, exactly? Essentially, it means gathering signalling traces and tracking EDRs/CDRs from the systems under test. This allows you to get protocol-level information about how each element of each test is being carried out. From there, you can identify potential issues that aren’t yet manifesting themselves at either endpoint.
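To make the idea concrete, here is a minimal sketch of a check that goes beyond end-to-end: it compares what the endpoints observed with what the charging records say. The record fields (`billed_duration_s`, `release_cause`), the "normal" cause value, and the two-second tolerance are all assumptions made for illustration, not a real CDR format.

```python
from dataclasses import dataclass


@dataclass
class CallTestResult:
    call_id: str
    connected: bool      # what the endpoints observed
    duration_s: float


@dataclass
class Cdr:
    call_id: str
    billed_duration_s: float
    release_cause: str   # illustrative values, e.g. "normal"


def find_hidden_issues(results, cdrs, tolerance_s=2.0):
    """Return (call_id, reason) pairs for calls that passed end-to-end
    but look wrong in the records."""
    cdr_by_id = {c.call_id: c for c in cdrs}
    suspicious = []
    for r in results:
        if not r.connected:
            continue  # already a visible failure at the endpoint
        cdr = cdr_by_id.get(r.call_id)
        if cdr is None:
            suspicious.append((r.call_id, "no CDR written"))
        elif abs(cdr.billed_duration_s - r.duration_s) > tolerance_s:
            suspicious.append((r.call_id, "duration mismatch"))
        elif cdr.release_cause != "normal":
            suspicious.append((r.call_id, f"release cause {cdr.release_cause}"))
    return suspicious
```

A call that connects fine on both handsets but produces no record, or a record with the wrong duration, would pass a pure end-to-end test and still show up in this list.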
The only trick is incorporating newly released flagship devices into your automation framework quickly enough to maintain test velocity, but AI-powered workflows can be a big help here.
4. Misconfiguration
Another frequently cited cause of network outages is misconfiguration, which can typically be translated to human error. In the telecommunications industry, there’s a lot of manual effort that goes into any new service offerings, changes to your network, or adjustments to the equipment you’re using. While there’s no way around the fact that “to err is human,” it is possible to decrease the amount of manual intervention that goes into any of these updates and changes.
You can start by ramping up automation in your test labs, creating test scripts for each use case that do the same thing, the same way, every time without deviation. Those scripts will also produce test reports the same way every time, leading to greater transparency between testing and other functions.
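One way to picture this is a small keyword-driven runner, sketched below. Every keyword name and function here (`Dial`, `Verify Ringing`, `Hang Up`, `run_test`) is hypothetical and only simulates the action; the point is that the same steps always run in the same order and always yield a report with the same shape.

```python
# Hypothetical keyword implementations. Real ones would drive devices;
# these only update a shared context dictionary.
def dial(ctx, number):
    ctx["dialed"] = number
    return True


def verify_ringing(ctx):
    return "dialed" in ctx


def hang_up(ctx):
    ctx.pop("dialed", None)
    return True


KEYWORDS = {"Dial": dial, "Verify Ringing": verify_ringing, "Hang Up": hang_up}


def run_test(name, steps):
    """Run each (keyword, *args) step in order and return a uniform report."""
    ctx = {}
    report = {"test": name, "steps": []}
    for keyword, *args in steps:
        passed = KEYWORDS[keyword](ctx, *args)
        report["steps"].append({"keyword": keyword, "args": args, "passed": passed})
        if not passed:
            break  # stop at the first failing step
    report["passed"] = all(step["passed"] for step in report["steps"])
    return report


if __name__ == "__main__":
    steps = [("Dial", "+10000000000"), ("Verify Ringing",), ("Hang Up",)]
    print(run_test("Basic voice call", steps))
```

Because every report has the same fields, results from different testers or different days can be compared line by line.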
Obviously there’s no way to root out human error completely, but you can detect it and reduce it through tactics like this.
5. Physical Failures
This is another term that works as something of a catch-all. Obviously a physical failure can be the result of anything from a hurricane taking down phone lines to rats gnawing through the wiring of your network equipment. Some of these may again be partly down to human error, but often they really will be situations that are out of your control.
That said, there’s plenty you can do to prepare your organization for things like this when they do happen. For instance, any infrastructure you can put in place to help you identify the issue more quickly will be a huge advantage. This can be a matter of testing, but it’s also important to retain documentation in a way that’s accessible for key stakeholders.
For example, if someone has identified a similar issue in the past, it’s of the utmost importance that whoever’s responding to the outage can locate, access, and understand that fix and other similar fixes. Reporting matters too: it’s crucial that your incident reports be readable and consistent, which is one of the many benefits of keyword-based testing. From there, you’ll also want an official operational plan for any possible outage scenario, outlining who’s in charge of doing what and when.
Of course, there’s no way to stop outages completely, but an ounce of prevention is worth a pound of cure. If you want to know more about how SEGRON can help you mitigate the costs of network outages, contact us.