Airline Outages are a Lesson for All Businesses When It Comes To IT Disaster Recovery Plans

Another week, another airline has a critical information system go down, stranding thousands of travelers and making a big splash in the news. If you think these incidents seem to be happening with increasing frequency, you are right.

Technology is no longer simply used to improve efficiency and increase performance in the airline industry; it is now fundamental to how these businesses operate. If that technology fails, an airline doesn’t just slow down; it comes to a full stop. This is not unique to the airline industry. The difference is that airline system outages are highly visible, which underscores the importance of investing in resilient and quickly recoverable technology and systems.

In addressing this, airlines need to build production resiliency and DR plans by application tier, focusing on the business and customer impact of each system. The most recent Delta outages impacted websites, mobile apps, and airport departure screens. Are these considered critical applications? Are the application dependencies understood and managed? I’m not much of an expert in airlines (though goodness knows I spend enough time on planes that I ought to be), but I do know something about fully recoverable and highly resilient production systems.

The more critical the system, the more important it is to have multiple plans in place in case Plan A or Plan B fails. In the case of Delta last August, a minor problem became a major disruption when a power outage hit and it turned out that 300 of Delta’s 7,000 servers weren’t connected to the backup power system, causing thousands of cancelled flights. What happens when the failover fails?

More and more, airlines need to look beyond just Plan A/Plan B. No doubt, Plan A starts with building production resiliency into the application itself, so that it is resilient to common infrastructure problems. Plan B might include failover to a standby system with minimal loss of data or delay. In this day of ransomware and data compromises/corruption, it’s becoming increasingly clear that companies also need a Plan C that enables them to recover from replicated backups at a remote location where that data is properly isolated.
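
The Plan A / Plan B / Plan C idea can be pictured as an ordered chain of fallbacks. What follows is only a minimal sketch under stated assumptions: the function names (`in_place_resiliency`, `failover_to_standby`, `restore_from_isolated_backup`) and their simulated outcomes are hypothetical stand-ins, not a description of how any airline or product actually implements recovery.

```python
from typing import Callable, List, Tuple

Plan = Tuple[str, Callable[[], bool]]


def recover(plans: List[Plan]) -> str:
    """Try each recovery plan in order and return the name of the first that succeeds."""
    for name, attempt in plans:
        try:
            if attempt():
                return name
        except Exception:
            # A plan that raises is treated the same as one that reports failure,
            # so a broken failover does not prevent the next plan from running.
            pass
    raise RuntimeError("all recovery plans failed")


# Hypothetical stand-ins for real recovery procedures (assumptions for illustration).
def in_place_resiliency() -> bool:
    return False  # Plan A: simulate the application failing to ride out the fault


def failover_to_standby() -> bool:
    raise ConnectionError("standby unreachable")  # Plan B: simulate the failover failing


def restore_from_isolated_backup() -> bool:
    return True  # Plan C: simulate a successful restore from an isolated remote copy


PLANS: List[Plan] = [
    ("Plan A", in_place_resiliency),
    ("Plan B", failover_to_standby),
    ("Plan C", restore_from_isolated_backup),
]

result = recover(PLANS)
print(result)
```

The point of the sketch is the ordering and the error handling: a failed failover is an expected case, not an exceptional one, so each plan is attempted independently of the one before it.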

Finally, I sometimes wonder how often airlines truly test their disaster recovery plans. A plan that isn’t tested regularly is likely to be out of date as the systems, people, and processes change over time. You need to build muscle memory in your organization for what to do and, more importantly, who does it.

Here is a quick checklist that airlines can use to assess their risk of having a “full stop” systems outage.

  1. When did you do your last application tiering and business impact analysis? How often do you do this?
  2. How do you categorize systems such as those that drive websites, mobile apps, and airport departure screens? Are they considered mission critical? Important?
  3. How do you manage the complexity of application interdependencies and update those with change?
  4. [Plan A] What applications are considered mission critical and have production resiliency – high availability / fault tolerance / load balancing – built in?
  5. [Plan B] What are considered Tier 2 / Tier 3? What are your application recovery time objectives (RTOs) and recovery point objectives (RPOs)?
  6. [Plan C] What are your DR plans in the event of a production site being reduced to rubble by a major disaster? What are the plans around replication of offsite data? How quickly could applications be brought up in that scenario? What about in the scenario of a malware attack or data compromise/corruption? Do you have isolated copies of your data to which the malware or corruption cannot propagate?
  7. How often do you update and test your DR plans? What success do you have with your DR testing? What percent of your apps do you test?

What percent of total revenue is your IT budget? What percent of your IT budget is spent on production resiliency, modernization, and DR? Are you underinvesting?


______________________________________________________________________________________________________

Joseph George, Vice President of Product Management at Sungard AS, joined the company in September 2013. He is responsible for product management for the Recovery as a Service (RaaS) portfolio as well as Managed Hosting. He is a highly experienced technology product management leader with a strong understanding of technology, extensive business management experience, and proven analytical skills.

Previously, George was with NetIQ, a global enterprise software company, for almost eight years in various business management roles. Most recently, he served as Director of Business Operations & Program Management for all products, prior to which he led the Application and Performance Management product management team. Before joining NetIQ, George held several product management and engineering positions with IBM/Tivoli over a seven-year period. Prior to that, George worked in various engineering roles at Nortel Networks, Texas Instruments, and Wiltel/Worldcom.

Joseph George holds a Master of Business Administration (MBA) degree from The University of Texas at Austin – The McCombs School of Business, a Master of Science (MS) degree from Mississippi State University and a Bachelor of Technology degree from the University of Kerala, India. He holds several patents and is a frequent author of technical articles and blogs for CIO and Forbes.com.