
Ultra Server Availability in the Cyber Age

Umar Farooq

Availability

Availability, in the context of information technology, refers to the proportion of time a system is in a functioning state. Computer server availability refers to the proportion of time a data or web server is able to handle service requests. An increasing number of important real-world services depend on technological infrastructure, and there have been several recent examples of critical services going down. In 2009, for example, a faulty warning system caused an air traffic control router to go out of service, resulting in more than 800 delayed flights.[i] And a distributed denial of service attack late last year made the online merchant services of Visa and MasterCard unavailable for two hours.[ii]

Ultra-availability

Highly critical, life-or-death services require a high level of availability, called ultra-availability. Although standards differ, the industry tends to aim for 99.999% availability (referred to as five nines of availability), which allows roughly five minutes of downtime per year. Telephone switching, clearly critical to a number of life-saving emergency services, is an example of a service that has reached a level of ultra-availability: one switching server, for example, accumulated only two hours of downtime over forty years, an availability of 99.9994%.[iii] Not all industries, however, need to meet this standard of availability.

Relevant Industries

Telephone connections are clearly critical to emergency services and must be expected to meet a level of ultra-availability. Reliable air traffic control, for instance, is needed to keep aircraft from occupying the same flight path, which can result in crashes and the loss of life. Global positioning systems, now needed for everything from 9-1-1 emergency positioning to airline navigation to guidance for intercontinental ballistic missiles, are another example of systems that should be expected to be ultra-available. And as the electrical power grid, the system of power distribution and management, relies more and more on data servers and software for load balancing and prediction, one should expect a level of ultra-availability from this industry as well. Many hospitals, for example, have only short-term access to electrical generators and would not be able to care for their most critical patients without electrical service.

Architecture

The hardware side of availability involves various strategies for ensuring service uptime, all relying on some form of redundancy. Servers may share load, to reduce the likelihood of failure. Or they may be clustered together, although this strategy is vulnerable to a single point of failure, as illustrated by the vulnerability of the internet's Domain Name System to downtime at a small number of critical geographical locations. The master/slave strategy also provides redundancy, but not complete redundancy, because the slave, by definition, does not offer the same level of functionality as the master and is thus only a temporary replacement. A hot standby, or an online copy paired with a standby copy, refers to a system in which an additional server offers the same level of functionality as the online (or active) copy and can take over seamlessly whenever the running server goes offline.
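As a rough illustration of the hot-standby arrangement, the following minimal Python sketch models an active/standby pair behind a single request-handling interface; the class names, the health flag, and the failover trigger are illustrative assumptions rather than any particular vendor's design.

```python
# Minimal sketch of a hot-standby server pair (illustrative only).
# The active copy serves requests; the standby copy offers the same
# functionality and is promoted when the active copy goes offline.

class Replica:
    """One copy of the service; both copies are functionally identical."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is offline")
        return f"{self.name} handled {request!r}"


class HotStandbyPair:
    """Routes requests to the active copy, failing over to the standby."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def failover(self):
        # Promote the standby to active; the failed copy becomes the
        # (to-be-repaired) standby so redundancy can be restored later.
        self.active, self.standby = self.standby, self.active

    def handle(self, request):
        try:
            return self.active.handle(request)
        except RuntimeError:
            self.failover()
            return self.active.handle(request)


if __name__ == "__main__":
    pair = HotStandbyPair(Replica("server-a"), Replica("server-b"))
    print(pair.handle("GET /status"))   # served by server-a
    pair.active.healthy = False         # simulate the active copy failing
    print(pair.handle("GET /status"))   # taken over by server-b
```

The point of the sketch is that, unlike a slave, the standby is a full functional replica, so taking over is a role swap rather than a switch to a degraded substitute.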
Software aspects of availability include fault detection, usually accomplished by having a separate diagnostic server poll the server at set intervals; the diagnostic server can trigger an escalation if it cannot, for instance, get a satisfactory response after three tries. This is similar to a hardware rack of servers whose status LEDs are checked from time to time to ensure proper operation. Once a problem is detected, fault isolation software handles switching to another server, in most cases to redundant hardware; it may need to handle multiple simultaneous failures and must switch as quickly as possible. Finally, fault recovery software must confirm that the service is back up, and escalate the situation further if it is not.
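A minimal sketch of this detect-isolate-escalate loop is shown below, under the assumption that probing the server, switching to the standby, and paging on-call staff are supplied as simple callbacks; the polling interval and the three-probe threshold mirror the example in the text.

```python
import time

# Illustrative sketch of fault detection by polling: a separate diagnostic
# process probes the monitored server at a fixed interval and, after three
# consecutive failed probes, switches service to the standby (fault
# isolation); if probes keep failing after the switch, it escalates.

MAX_FAILED_PROBES = 3


def monitor(probe, switch_to_standby, escalate,
            poll_interval=5.0, max_cycles=None):
    """Poll `probe` every `poll_interval` seconds; isolate, then escalate,
    whenever MAX_FAILED_PROBES consecutive probes fail."""
    consecutive_failures = 0
    already_switched = False
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        cycle += 1
        if probe():
            consecutive_failures = 0        # a healthy response resets the count
        else:
            consecutive_failures += 1
            if consecutive_failures >= MAX_FAILED_PROBES:
                if not already_switched:
                    switch_to_standby()     # fault isolation: use redundant hardware
                    already_switched = True
                else:
                    escalate()              # recovery failed: involve a human
                consecutive_failures = 0
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in probe that pretends the primary stops responding after three
    # healthy polls; a real probe would issue an actual health-check request.
    state = {"polls": 0}

    def fake_probe():
        state["polls"] += 1
        return state["polls"] <= 3

    monitor(fake_probe,
            switch_to_standby=lambda: print("switching to standby server"),
            escalate=lambda: print("escalating to on-call staff"),
            poll_interval=0.1, max_cycles=12)
```

In practice the probe would be a real health-check request and escalation would page a human, but the control flow (poll, count consecutive failures, isolate, then escalate) is the same.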
The most important factor in ensuring reliability is the testing and upkeep of independent backup systems: one can usually plan a response once an initial fault is detected, but it is very difficult to do anything once even the backup systems fail. The best way to ensure that backup systems are always available is regular, thorough testing of all of them, while keeping them as independent as possible of the online systems. This often requires human checking, and the availability of staff on a 24/7 basis to handle any unforeseen eventualities in the backup systems.
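The kind of regular, thorough testing argued for here might, in a very simplified form, look like the following sketch of a scheduled backup drill; the individual checks and the paging hook are hypothetical placeholders, not a prescription for any specific facility.

```python
# Illustrative sketch of a scheduled drill for backup systems: periodically
# exercise the standby path end to end and alert humans if anything in the
# drill does not behave as expected. All checks here are stand-ins.

def run_backup_drill(checks, alert_on_call):
    """Run every named check; report failures to on-call staff rather than
    silently assuming the backups still work."""
    failures = []
    for name, check in checks:
        try:
            ok = check()
        except Exception as exc:        # a crashing check also counts as a failure
            ok = False
            print(f"{name}: raised {exc!r}")
        if not ok:
            failures.append(name)
    if failures:
        alert_on_call("backup drill failed: " + ", ".join(failures))
    return not failures


if __name__ == "__main__":
    # Hypothetical checks; a real drill might start the standby server,
    # replay recent traffic against it, and test the backup generator.
    checks = [
        ("standby serves requests", lambda: True),
        ("backup power available", lambda: False),   # simulate a broken backup
    ]
    run_backup_drill(checks, alert_on_call=lambda msg: print("PAGE:", msg))
```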
Too often, backup systems do not work as expected, and, due to safety concerns, as in the case of the cooling systems at a number of Japanese nuclear power plants, humans cannot simply go in and repair the facility once a failure has occurred. Thus, regular maintenance and independence of backup systems are critical for availability.

Testing

Ultra-availability can only truly be tested by waiting and recording the total uptime and downtime of a system. It may be possible, however, to test some backup systems manually by forcing the failure of the main systems, but not all scenarios can practically be taken into consideration during such a testing phase. Thus, in reality, such induced-failure testing can provide only an upper-bound measure of reliability.

Notes on New Challenges

Although hardware has become cheaper, allowing for more redundant systems, a number of software problems persist, especially in systems connected to the internet. The major threat to services that rely on the internet seems to be their interconnectedness itself. The underlying structure of the internet has vulnerabilities, for instance in its domain name system and its physical data cables, which, although distributed geographically, can still fail in ways that cause significant downtime or loss of bandwidth to entire continents. As recently as 2008, for example, the entire wired internet connectivity of North Africa and the Middle East, as well as half the bandwidth of India, was lost when one of the handful of underwater cable connections was accidentally cut by a ship.[iv] The expiration of VeriSign's master certificate in 2004 caused widespread issues, as it could no longer provide verification for other important websites, including some services critical for e-commerce on the internet.[v] And as more and more businesses come to rely on cloud-based servers (for example, Amazon EC2[vi]) to store their business-critical data, a wider range of markets becomes susceptible to accidental and malicious glitches. Finally, though not critical, social networking sites pushing Web 2.0 functionality can cause widespread problems. A 2.5-hour outage of Facebook last year, for example, caused problems for a number of other websites that used Facebook Connect, a service that allows Facebook users to track and share the third-party sites they visit.[vii] Similarly, a two-hour denial of service attack that brought down Twitter in 2009 caused glitches on other sites drawing content from the social networking service, including Facebook and LiveJournal.[viii] Thus, careful design considerations must be kept in mind as web-based services become more and more interdependent, especially in light of increasingly sophisticated security attacks.

References

[i] US Air Traffic Control System Outage Last Month Helped Along By Human Error, http://spectrum.ieee.org/riskfactor/computing/it/us-air-traffic-control-system-outage-last-month-helped-along-by-human-error
[ii] Wikileaks Supporters Tear Down Visa in DDOS Attack, http://www.digitaltrends.com/computing/wikileaks-supporters-tear-down-visa-in-ddos-attack/
[iii] Siewiorek, Daniel P. and Swarz, Robert S. Reliable Computer Systems: Design and Evaluation, 3rd ed., A K Peters, Ltd., 1998, p. 524.
[iv] Internet Failure Hits Two Continents and May Last Two Weeks, http://www.nationalreview.com/mediablog/34765/internet-failure-hits-two-continents-and-may-last-two-weeks/tom-gross
[v] VeriSign Dead Cert Causes Net Instability, http://www.theregister.co.uk/2004/01/10/verisign_dead_cert_causes_net/
[vi] Why Amazon's Cloud Titanic Went Down, http://money.cnn.com/2011/04/22/technology/amazon_ec2_cloud_outage/?section=money_latest
[vii] Facebook's Outage Was Its Biggest Ever, http://articles.cnn.com/2010-09-24/tech/facebook.outage_1_outage-half-billion-users-social-networking
[viii] Twitter Hit by Denial of Service Attack, http://articles.cnn.com/2009-08-06/tech/twitter.attack_1_twitter-co-founder-biz-stone-social-networking-site-twitter-denial-of-service-attack
