Reflections on recent LTE outages - Urgent Comms

2022-10-01 12:51:00 By : Ms. Tracy Lei

Rogers Communications Canada experienced a national service outage on July 8, 2022, that halted LTE communications for millions of Canadians, including public-safety agencies counting on mobile broadband for operations. In light of this jarring failure, should public-safety leaders rethink plans to collapse all wireless communications onto a single wireless technology platform operated by a service provider?

The root cause of the Rogers outage was relatively mundane. It was not a hack. It was not a disaster. The failure came from a routine, well-planned, early-morning configuration change to core internet router reachability. The faulty configuration change flooded Rogers’ network-routing equipment with invalid route information, halting all incoming and outgoing traffic until the national routing fabric could be restarted—a laborious process that took many hours. Restoral efforts continued into the night.

While Rogers’ national outage was significant in scope and impact, it was not a unique event. Modern cellular networks operate with a core IP network serving as the network’s central nervous system. Equipment failure, configuration errors, and security attacks can affect the core network subsystems that govern LTE network operation. Subsystems implementing the Border Gateway Protocol (BGP), Domain Name System (DNS), IP Multimedia Subsystem (IMS), and Home Subscriber Server (HSS) each can trigger national outages.

Recent history shows how prevalent these failures are. Just days before Rogers’ outage, the KDDI LTE network in Japan experienced a national two-day shutdown caused by congestion in its IP Multimedia Subsystem hosting Voice over LTE. In Europe, an attempted June 2021 upgrade to Orange’s network in France disrupted voice services across the country. Christmas 2020 brought about a mass failure of AT&T cellular operations, including FirstNet, across the Southeast U.S. following the Christmas-day explosion outside a significant MPLS routing hub in Nashville. And in June 2020, T-Mobile’s Voice over LTE network experienced an outage across the Southeast U.S. following an IP-routing misconfiguration. The list can go on.

LTE’s foundational role in public-safety operations is well established in many countries today. Mobile broadband communications based on LTE serve as the essential enabler of vital tools, ranging from e-mail to incident dispatch. The future will expand this role. With all mobile network infrastructure vendors offering support for 3GPP Mission Critical LTE specifications, some nations are planning to retire legacy Land Mobile Radio (LMR) systems that are based on analog radio or digital trunked radio transmission technology.

The argument for a transition from narrowband to broadband is logical. As the term implies, narrowband radio technology lacks channel capacity for extensive data flows beyond basic messaging. In contrast, LTE was architected to provide an IP-based superset of functionality that can accommodate the needs of outdated technology silos. For cost-conscious governments, the argument is enticing. Rather than funding two network technologies – LMR for push-to-talk tactical communications and LTE for data-centric communications – they can fund a single converged mission-critical LTE network that handles all requirements.

Unfortunately, the recent massive network outages raise several issues that local and national public-safety authorities must consider before taking the final step of powering down the LMR networks. Chief among those is the ability of a dedicated public-safety core network to operate following a widespread IP or transport (e.g., MPLS, optical) failure that impacts the radio access network. Resilience tools such as MOCN architectures, roaming, multi-SIM user devices, and satellite communications can be essential elements of an authority’s strategy. Likewise, the continued operation of a parallel LMR network provides significant redundancy. Specific points to consider include:

Perhaps the cadence of mass LTE network outages will slow down or stop. Emerging AI-assisted network management tools, automation, and operational experience may lead to an elimination of mass outages. But the complexity of inter-connected IP networks dependent on fragile protocols and subsystems such as BGP, DNS, and IMS suggests that mass outage risk will remain a lingering factor for years to come.

Placing all communications requirements in a single technology basket is tempting. But public-safety communications planners must provide a robust PACE communications methodology for resilience that incorporates each PACE element: primary, alternate, contingency, and emergency communications functionality.
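As a rough illustration of the PACE idea, the sketch below treats a plan as an ordered list of channels and picks the first tier that is currently usable. The `Channel` type, the `select_channel` helper, the health probes, and the assignment of technologies to tiers are all hypothetical and chosen only for illustration; a real plan would be set by each agency’s own requirements.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# The four PACE tiers, in fallback order.
PACE_ORDER = ("primary", "alternate", "contingency", "emergency")


@dataclass
class Channel:
    tier: str                  # one of PACE_ORDER
    name: str                  # illustrative label only
    is_up: Callable[[], bool]  # health probe supplied by the caller (assumed)


def select_channel(channels: List[Channel]) -> Optional[Channel]:
    """Return the highest-priority channel whose probe reports it usable."""
    ranked = sorted(channels, key=lambda c: PACE_ORDER.index(c.tier))
    for channel in ranked:
        if channel.is_up():
            return channel
    return None  # every tier is down


# Hypothetical plan: the LTE core is down, so the plan falls back to LMR.
plan = [
    Channel("primary", "mission-critical LTE", lambda: False),
    Channel("alternate", "LMR", lambda: True),
    Channel("contingency", "roaming on a second carrier", lambda: True),
    Channel("emergency", "satellite", lambda: True),
]

chosen = select_channel(plan)
print(chosen.name if chosen else "no channel available")
```

The point of the sketch is that each tier only adds resilience if it can fail independently of the tiers above it; two tiers that share the same core network would likely go down together.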

The Christmas morning explosion in downtown Nashville is another prime example. The communications aftermath persisted regionally for days afterward.

This all points to why this national cry to move 9-1-1 to an all-IP infrastructure sounds pretty suspect to me. I am sure that the vendors love it, as that is what they are selling, but the public safety agency that buys it, and ends up with injured or deceased parties when it collapses like a house of cards, is gonna be the one who gets to play musical lawyers. Older technologies like T1s are decidedly lower tech, but also decidedly more resilient and more decoupled, so that an error at one place doesn't collapse the whole blasted network!
