Extreme Availability Where It Really Matters: The Financial Power of Integration

Extreme availability may become a competitive differentiator. It must ensure applications can respond in under a second, whether the task is supporting a wireless trade on an exchange floor, moving multibillion-dollar transactions across global infrastructures, or protecting the integrity of integrated applications through a secure infrastructure that extends to third parties.

The financial industry's requirements for availability, reliability, scalability, and performance are well beyond what most Commercial Off-the-Shelf (COTS) software provides. Because of this, many financial institutions have developed applications in-house. However, when considering the scale, scope, and cost of changes needed to select, support, and integrate several applications, organizations have discovered the value of leveraging outside middleware to speed future development and decrease development and maintenance costs. With the development of extreme availability using commercially available middleware, organizations no longer have to choose between their availability requirements and the development and maintenance benefits of standard middleware; they can reap the benefits of both.

Extreme availability is the ultimate form of high availability, in which the goal is no outages at all because the perceived harm to the business is immeasurably large. How does this differ from high availability? While the difference might initially seem small, achieving extreme availability requires changes throughout the organization. Indeed, the scope of those changes is large enough that it makes sense to treat them as a culture change. One way to think of extreme availability is as an “availability culture” added to a standard high-availability operation.

Availability Culture

In even the most carefully designed fault-tolerant system, there’ll be unexpected events. While these will be rare with careful planning, proper preparation and response are vital if outages are completely unacceptable. Organizations pursuing extreme availability pay close attention to people, culture, fault tolerance, upgrade procedures, testing, security, mature software, monitoring, documentation, and simplicity.

Businesses that seek extreme availability have a strong culture of reliability that emphasizes taking extra precautions and being ready for surprises. Extreme availability requires fanatical attention to possible single points of failure. In power, for example, the building should have two power feeds, coming in on opposite sides of the building, from separate power substations. Each feed should support an independent electrical distribution network inside the building. The outside power should be backed up by batteries, which are backed up by generators with enough fuel storage to last many days. Good planning might even secure multiple fuel delivery contracts with independent providers who would be available a week or so before the fuel store empties.

For hardware and/or software upgrades, extreme availability organizations use a gradual rollout process that lets them observe the upgrade in actual operation. This gradual rollout is coupled with a way to back out the upgrade if problems occur. This may impose additional system redundancy requirements on the design.
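A gradual rollout with a back-out path can be sketched as a simple staged loop. This is an illustration only, not a description of any particular organization's procedure: the traffic fractions, the `observe_error_rate` hook, and the error threshold are all hypothetical.

```python
from typing import Callable, List, Tuple


def staged_rollout(
    stages: List[float],
    observe_error_rate: Callable[[float], float],
    max_error_rate: float,
) -> Tuple[bool, List[float]]:
    """Advance an upgrade through increasing shares of production traffic.

    stages: fractions of traffic routed to the upgrade, in rollout order.
    observe_error_rate: hypothetical hook that returns the error rate seen
        while the upgrade carries the given fraction of traffic.
    max_error_rate: highest error rate tolerated before backing out.

    Returns (succeeded, history). history records the fraction of traffic
    on the upgrade after each step; a back-out appends 0.0.
    """
    history: List[float] = []
    for fraction in stages:
        history.append(fraction)
        if observe_error_rate(fraction) > max_error_rate:
            # Problem observed in actual operation: return all traffic
            # to the old version and stop the rollout.
            history.append(0.0)
            return False, history
    return True, history
```

Backing out to 0.0 only works if the old version is still running and able to take the full load, which is where the extra redundancy requirement comes from.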

Redundancy and fault tolerance can protect against random errors, but not against systemic errors such as a programming mistake or improper hardware configuration. The only remedy for such problems is testing, which extreme availability organizations take seriously. Testing periods are never shortened. Frequently, there are multiple independent testing organizations, and each can stop deployment. The testing group normally doesn’t report to development. Frequently, testers are rewarded according to the number of defects they find, which is the exact opposite of how developers are rewarded.

Security precautions address both physical security (site, building, and machine room) and network security. When possible, the system should be isolated from the Internet. If Internet access is a key feature, then layered protection against penetration attacks is needed, as is a mechanism for handling denial-of-service attacks. The application needs to check data integrity after any failure recovery.

Extreme availability organizations typically implement a higher-than-normal degree of isolation. Development networks may be fully independent of production networks, and testing environments are fully independent of both development and production.

Software differs markedly from hardware in the shape of its failure curve over time. New software is far more likely to fail than software that has been in use for several years. Many organizations therefore avoid all x.0 releases; some also avoid any commercial software until it has a track record with other organizations.

High availability relies on fault tolerance plus repair. Good monitoring is fundamental to understanding when there’s an otherwise transparent failure and repair is needed. Knowing what’s happening in the system also lowers the likelihood of a system administrator taking an inappropriate action.
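One common way to catch a failure that redundancy has hidden from users is to track heartbeats from every component. The sketch below assumes each component reports a timestamp periodically; the component names, group layout, and timeout are hypothetical.

```python
from typing import Dict, List


def stale_components(
    last_heartbeat: Dict[str, float], now: float, timeout: float
) -> List[str]:
    """Return components whose last heartbeat is older than the timeout."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > timeout
    )


def groups_needing_repair(
    groups: Dict[str, List[str]], failed: List[str]
) -> Dict[str, List[str]]:
    """Map each redundant group to its failed members.

    Groups with no failed members are omitted. A group that appears here
    may still be serving users through its surviving members, which is
    exactly the case monitoring has to surface.
    """
    failed_set = set(failed)
    report: Dict[str, List[str]] = {}
    for group, members in groups.items():
        down = [m for m in members if m in failed_set]
        if down:
            report[group] = down
    return report
```

A report such as `{"database": ["db-b"]}` tells the administrator that the database tier is running without its backup, even though no user has noticed anything.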

Extreme availability requires complete documentation because people can be single points of failure, too. Even if you have several people cross-trained, you may find the backup is out sick while the primary is on vacation.

Simplicity, a watchword for most extreme availability organizations, means using the fewest pieces possible for the function needed, because every additional component lowers the mean time between failures of the system as a whole. It also means using simple administrative interfaces to lessen the chance of error in an emergency.
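The effect of component count on mean time between failures can be shown with a standard reliability calculation. The sketch assumes the components fail independently, have constant failure rates, and are all required for the function to work (a series arrangement); the MTBF figures used with it are invented for illustration.

```python
from typing import Sequence


def series_mtbf(component_mtbfs_hours: Sequence[float]) -> float:
    """MTBF of a system that fails when any one of its components fails.

    With constant, independent failure rates, the system failure rate is
    the sum of the component failure rates (each rate is 1 / MTBF).
    """
    if not component_mtbfs_hours:
        raise ValueError("at least one component is required")
    if any(m <= 0 for m in component_mtbfs_hours):
        raise ValueError("MTBF values must be positive")
    total_failure_rate = sum(1.0 / m for m in component_mtbfs_hours)
    return 1.0 / total_failure_rate
```

Under these assumptions, ten required components rated at 10,000 hours each give a system MTBF of about 1,000 hours, while five such components give about 2,000 hours.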

Extreme availability takes a high-availability system as a base and adds operations personnel who are skilled at handling emergencies and a strong culture of failure avoidance. While this may sound simple, it isn’t. Any culture change is difficult and the culture of high availability is unusual. It’s a difficult adjustment for many.

Achieving Extreme Availability

To achieve extreme availability, begin by developing guiding principles that set forth requirements in site and building architecture, network design and operation, system hardware and software, and procedures and guidelines for design, development, deployment, and operation.

There should be no single point of failure; this implies multiple levels of redundancy and replication, including full site duplication. Thorough testing is imperative. A key operational requirement is a deployment process characterized by a gradual, well-managed rollout of application upgrades and the ability to immediately roll back an upgrade if a problem emerges. Most extreme availability organizations anticipate that the new and old versions will run simultaneously in production for a while. Regular audits are required to ensure the data is equivalent in the old and new systems and that information isn’t lost if it’s necessary to roll back to the old system.

An application supporting a financial organization needs to be extremely reliable, fast, available, and scalable. Few, if any, bugs should exist in production systems. Response time should be about a second or less for users. Highly available means no outages, which has significant implications for the overall system design. There must be two geographically separate sites working as hot backups for each other, which has further implications for hardware and software, design, management, and maintenance. Synchronizing dual sites presents challenges.
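The arithmetic behind dual sites can be sketched in a few lines. This is an illustration only: it assumes the sites fail independently and that either site alone can carry the full load, neither of which should be taken for granted, and the 99.9 percent figure in the usage note is an invented input, not a measurement.

```python
from typing import Sequence

MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(site_availabilities: Sequence[float]) -> float:
    """Availability of a service that stays up while any one site is up.

    Assumes site failures are independent, which is an idealization.
    """
    all_sites_down = 1.0
    for availability in site_availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        all_sites_down *= 1.0 - availability
    return 1.0 - all_sites_down


def expected_downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of outage per 365-day year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR
```

With one site at 99.9 percent, the expected outage is roughly 526 minutes a year. Two such sites backing each other up bring that to roughly half a minute, provided the failover itself works, which is why failover gets so much testing attention.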

The problem is how to reliably upgrade an application given that the upgrade might fail and no downtime is acceptable. As part of the upgrade deliverable, developers provide the operations staff with a database verification routine that executes at the end of each day and ensures the current and new databases are equivalent.
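A verification routine of this kind might compare per-record digests from the two databases. The sketch below is one possible shape for it, not the routine any specific organization uses. It assumes both systems can export their rows in the same normalized form, as tuples of strings whose first field is a unique key; the function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, ...]


def table_digest(rows: Iterable[Row]) -> Dict[str, str]:
    """Map each row's key (its first field) to a SHA-256 digest of the row."""
    digests: Dict[str, str] = {}
    for row in rows:
        # Join fields with a separator unlikely to appear in the data.
        payload = "\x1f".join(row).encode("utf-8")
        digests[row[0]] = hashlib.sha256(payload).hexdigest()
    return digests


def verify_equivalent(
    current_rows: Iterable[Row], new_rows: Iterable[Row]
) -> List[str]:
    """Return a list of discrepancies; an empty list means equivalent."""
    current = table_digest(current_rows)
    new = table_digest(new_rows)
    problems: List[str] = []
    for key in sorted(current.keys() | new.keys()):
        if key not in new:
            problems.append(f"{key}: missing from new database")
        elif key not in current:
            problems.append(f"{key}: missing from current database")
        elif current[key] != new[key]:
            problems.append(f"{key}: contents differ")
    return problems
```

Comparing digests rather than full rows means the two sides only need to exchange a key and a short hash per record. Any non-empty result at the end of the day is a signal to investigate before the rollout proceeds.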

Because the changes involved in adopting vendor middleware can have such broad impact, financial organizations should take a gradual approach to deployment. Breaking the rollout into several stages supports strategic goals without compromising systems integrity. Hardware and software vary in reliability, so considerable care must be taken in the initial selection. Even the best components must be tested in the type of environment where they’re going to be used to validate the expected reliability and uncover compatibility issues. Testing failure recoveries during heavy loads is essential.

The testing methodology should involve causing individual components to fail in various ways and observing the recovery behavior. After the individual failures seem to be handled properly, multiple failover scenarios considered likely to occur are tested. In addition, various kinds of hardware failures must be tested. This includes failure of entire machines and key infrastructure components such as the disk subsystem and network interfaces.
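The progression from single failures to multiple failures can be organized as a small test driver. In this sketch, `service_survives` is a hypothetical hook: in a real test it would inject the listed failures into the test environment under load and report whether the service recovered. For simplicity the driver enumerates every combination up to a chosen size, whereas the approach described above tests only the multiple-failure scenarios judged likely.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterator, List, Sequence


def failure_scenarios(
    components: Sequence[str], max_simultaneous: int
) -> Iterator[FrozenSet[str]]:
    """Yield sets of failed components, single failures first."""
    for count in range(1, max_simultaneous + 1):
        for combo in combinations(components, count):
            yield frozenset(combo)


def run_failure_tests(
    components: Sequence[str],
    service_survives: Callable[[FrozenSet[str]], bool],
    max_simultaneous: int = 1,
) -> List[FrozenSet[str]]:
    """Return every tested scenario in which the service did not survive."""
    return [
        scenario
        for scenario in failure_scenarios(components, max_simultaneous)
        if not service_survives(scenario)
    ]
```

An empty result for single failures is the entry ticket to multiple-failure testing. Scenarios that come back from the larger runs show where redundancy runs out.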

Summary

Extreme availability, defined as ensuring as close to 100 percent uptime as possible, has become imperative for the global financial services industry as companies face increasing availability requirements and consider new alternatives to meet them. From a financial services viewpoint, extreme availability must ensure all applications can respond in under a second, whether the task is supporting a wireless trade, moving multibillion-dollar transactions, or protecting the integrity of the integrated applications through a secure infrastructure.