Extreme availability may become a competitive differentiator.
Extreme availability must ensure applications can respond in under a
second, whether it’s supporting a wireless trade on an exchange floor,
transferring multi-billion dollar transactions across global
infrastructures, or protecting the integrity of integrated applications
through a secure infrastructure that extends to third parties.

Financial industry requirements for availability, reliability,
scalability, and performance are well beyond what most Commercial
Off-the-Shelf (COTS) software provides. Because of this, many financial
institutions have developed applications in-house. However, when
considering the scale, scope, and cost of changes needed to select,
support, and integrate several applications, organizations have
discovered the value of leveraging outside middleware to speed future
development and decrease development and maintenance costs. With the
development of extreme availability using commercially available
middleware, organizations no longer have to choose between their
availability requirements and the development and maintenance benefits of
standard middleware; they can have both.

Extreme availability is the ultimate form of high availability, where the desired
goal is no outages at all because the perceived harm to the business is
immeasurably large. How does this differ from high availability? While
it might seem initially that the difference is relatively small,
achieving extreme availability requires changes throughout the
organization. Indeed, the changes are broad enough that they amount to
a culture change. One way to picture extreme availability is as an
“availability culture” added to a standard high-availability operation.

In even the most carefully designed fault-tolerant system, there’ll be
unexpected events. While these will be rare with careful planning,
proper preparation and response are vital if outages are completely
unacceptable. Organizations pursuing extreme availability pay close
attention to people, culture, fault tolerance, upgrade procedures,
testing, security, mature software, documentation, and simplicity.

Organizations that seek extreme availability have a strong culture of reliability
that emphasizes taking extra precautions and being ready for surprises.
Extreme availability requires fanatical attention to possible single
points of failure. In power, for example, the building should have two
power feeds, coming in on opposite sides of the building, from separate
power substations. Each feed should support an independent electrical
distribution network inside the building. The outside power should be
backed up by batteries, which are backed up by generators with enough
fuel storage to last many days. Good planning might even secure multiple
fuel delivery contracts with independent providers who would be
available a week or so before the fuel store empties.

For hardware and/or software upgrades, extreme availability organizations use a
gradual rollout process that lets them observe the upgrade in actual
operation. This gradual rollout is coupled with a way to back out the
upgrade if problems occur. This may impose additional system redundancy
requirements on the design.
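
The gradual rollout with a back-out path described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function `staged_rollout`, its callback parameters, and the traffic fractions are hypothetical names and values chosen for the example, not part of any particular middleware product.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    apply_stage: Callable[[float], None],
    is_healthy: Callable[[], bool],
    roll_back: Callable[[], None],
) -> bool:
    """Apply an upgrade in growing steps; back out on the first bad observation.

    All four parameters are hypothetical hooks supplied by the caller.
    """
    for fraction in stages:
        apply_stage(fraction)   # move this share of the workload to the new version
        if not is_healthy():    # observe the upgrade in actual operation
            roll_back()         # return everything to the old version
            return False
    return True


# Simulated run: in this made-up scenario, trouble appears at the 50% stage.
log: List[Tuple[str, float]] = []
ok = staged_rollout(
    stages=[0.01, 0.10, 0.50, 1.00],
    apply_stage=lambda f: log.append(("apply", f)),
    is_healthy=lambda: log[-1][1] < 0.50,
    roll_back=lambda: log.append(("rollback", 0.0)),
)
```

In this simulated run the rollout stops at the third stage and the back-out hook is called. Keeping the old version available to receive the workload again is the extra redundancy the design has to provide.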

Redundancy and fault tolerance can
protect against random errors, but not against systemic errors such as a
programming mistake or improper hardware configuration. The only remedy
for such problems is testing, which extreme availability organizations
take seriously. Testing periods are never shortened. Frequently, there
are multiple independent testing organizations, and each can stop
deployment. Testing normally does not report to development. Frequently,
testers are rewarded for the number of defects they find, which is exactly
the opposite of the incentive for developers.

Security precautions address both
physical security (site, building and machine room) and network
security. When possible, the system should be isolated from the
Internet. If Internet access is a key feature, then layered protection
against penetration attacks is needed, as is a mechanism for handling
denial of service attacks. Data integrity needs to be checked after
failure recovery in the application.

Extreme availability organizations typically implement a higher-than-normal degree of
isolation. Development networks may be fully independent of production
networks, and testing environments are fully independent of development.

Software is markedly different from hardware in
the shape of its failure curve over time. New software is markedly more
likely to fail than software used for several years. Many organizations
therefore avoid all x.0 releases; some also avoid any commercial
software until it has a track record with other organizations.

Extreme availability relies on fault tolerance plus repair. Good monitoring is
fundamental to understanding when there’s an otherwise transparent
failure and repair is needed. Knowing what’s happening in the system
also lowers the likelihood of a system administrator taking an
inappropriate action.

Extreme availability requires complete
documentation because people can be single points of failure, too. Even
if you have several people cross-trained, you may find the backup is out
sick while the primary is on vacation.

Simplicity, a watchword
for most extreme availability organizations, means using the fewest
pieces possible for the function needed, because each additional
component is one more thing that can fail and so lowers the mean time
between failures. It also means using
simple administrative interfaces to lessen the chance of error.

Extreme availability takes a high-availability system
as a base and adds operations personnel skilled at handling
emergencies and a strong culture of failure avoidance. While this may
sound simple, it isn’t. Any culture change is difficult and the culture
of high availability is unusual. It’s a difficult adjustment for many.

Achieving Extreme Availability

To achieve extreme availability, begin by developing guiding principles
that set forth requirements in site and building architecture, network
design and operation, system hardware and software, and procedures and
guidelines for design, development, deployment, and operation.

There should be no single point of failure; this implies multiple levels of
redundancy and replication, including full site duplication. Thorough
testing is imperative. A key operational requirement is a deployment
process characterized by a gradual, well-managed rollout of application
upgrades and the ability to immediately roll back an upgrade if a
problem emerges. Most extreme availability organizations anticipate the
new and old versions will run simultaneously in production for a while.
Regular audits are required to ensure the data is equivalent in the old
and new systems and that information isn’t lost if it’s necessary to
roll back to the old system.
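
One way to picture such an audit is a routine that compares the records held by the old and new systems and reports any differences. The sketch below is a minimal illustration, assuming each system can export its records as a dictionary keyed by a record identifier; the names `record_digest`, `audit`, `old_db`, and `new_db`, and the sample data, are hypothetical.

```python
import hashlib
from typing import Dict, List, Mapping

Record = Mapping[str, object]


def record_digest(record: Record) -> str:
    """Hash a record in a canonical form that ignores field order."""
    canonical = repr(sorted(record.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit(old: Mapping[str, Record], new: Mapping[str, Record]) -> Dict[str, List[str]]:
    """Report keys missing on either side and keys whose contents differ."""
    shared = old.keys() & new.keys()
    return {
        "missing_in_new": sorted(old.keys() - new.keys()),
        "missing_in_old": sorted(new.keys() - old.keys()),
        "mismatched": sorted(
            k for k in shared if record_digest(old[k]) != record_digest(new[k])
        ),
    }


# Made-up sample data: T1 matches (field order differs), T2 differs,
# T3 exists only in the old system, T4 only in the new one.
old_db = {
    "T1": {"amount": 100, "ccy": "USD"},
    "T2": {"amount": 250, "ccy": "EUR"},
    "T3": {"amount": 75, "ccy": "GBP"},
}
new_db = {
    "T1": {"ccy": "USD", "amount": 100},
    "T2": {"amount": 205, "ccy": "EUR"},
    "T4": {"amount": 10, "ccy": "JPY"},
}
report = audit(old_db, new_db)
```

An empty report would indicate the two systems agree. Any non-empty entry is something to resolve before relying on the new system or rolling back to the old one.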

An application supporting a financial
organization needs to be extremely reliable, fast, available, and
scalable. Few, if any, bugs should exist in production systems. Response
time should be about a second or less for users. Highly available means
no outages, which has significant implications for the overall system
design. There must be two geographically separate sites working as hot
backups for each other, which in turn has implications for hardware
and software, design, management, and maintenance. Synchronizing dual
sites presents challenges.

The problem is how to reliably upgrade
an application given that the upgrade might fail and no downtime is
acceptable. As part of the upgrade deliverable, developers provide the
operations staff with a database verification routine that executes at
the end of each day and verifies that the current and new databases are
equivalent.

Because the changes involved in using vendor middleware
potentially have such broad impact, financial organizations should take a
gradual approach to deployment. Breaking the rollout into several
stages supports strategic goals without compromising systems integrity.
Hardware and software vary in reliability, so considerable care must be
taken in the initial selection. Even the best components must be tested
in the type of environment where they’re going to be used to validate
the expected reliability and uncover compatibility issues. Testing
failure recoveries during heavy loads is essential.

The testing methodology should involve causing individual components to fail in
various ways and observing the recovery behavior. After the individual
failures seem to be handled properly, multiple failover scenarios
considered likely to occur are tested. In addition, various kinds of
hardware failures must be tested. This includes failure of entire
machines and key infrastructure components such as the disk subsystem
and network interfaces.
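
The progression from single failures to multiple simultaneous failures can be planned with a simple enumeration. The sketch below works against a toy model rather than real equipment: the `GROUPS` topology, the component names, and the rule that the service stays up while each group has at least one healthy member are all assumptions made for illustration. In a real test, `service_up` would be replaced by actually failing the components and observing recovery.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical topology: each function is provided by a redundant pair.
GROUPS: Dict[str, List[str]] = {
    "app_server": ["app-a", "app-b"],
    "disk": ["disk-a", "disk-b"],
    "network": ["nic-a", "nic-b"],
}


def service_up(failed: Set[str]) -> bool:
    """Model assumption: the service survives if every group keeps one healthy member."""
    return all(
        any(member not in failed for member in members)
        for members in GROUPS.values()
    )


def run_failure_tests(max_simultaneous: int) -> List[Tuple[str, ...]]:
    """Fail every combination of up to max_simultaneous components; list the outages."""
    components = sorted(m for members in GROUPS.values() for m in members)
    outages = []
    for n in range(1, max_simultaneous + 1):
        for combo in combinations(components, n):
            if not service_up(set(combo)):
                outages.append(combo)
    return outages


single_outages = run_failure_tests(1)   # individual component failures
double_outages = run_failure_tests(2)   # adds every pair of simultaneous failures
```

In this model no single failure causes an outage, while losing both members of any one pair does. A list like `double_outages` helps decide which multiple-failure scenarios are likely enough to test on real hardware.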

Extreme availability, defined as ensuring as close to 100 percent uptime as
possible, has become imperative for the global financial services
industry as companies face increasing availability requirements and
consider new alternatives to meet them. From a financial services
viewpoint, extreme availability must ensure all applications can respond
in under a second, whether it’s supporting a wireless trade,
transferring multi-billion dollar transactions, or protecting the
integrity of the integrated applications through a secure infrastructure.