Disaster Recovery (DR) refers to the set of policies and procedures that ensure the continuity and recovery of mission-critical systems after a disruptive event such as a power outage, flood, or cyberattack. In other words, how quickly can you get your computers and systems up and running after a disaster?
High Availability (HA) is the goal of ensuring your critical systems are always functioning. In practice, this means being able to automatically fail over to a secondary system if the primary system goes down for any reason, as well as eliminating single points of failure from your infrastructure. Like disaster recovery, high availability is a strategy that requires careful planning and the use of tools. Achieving a network uptime of 99.999% (commonly referred to as “five nines”, which equates to about 5.26 minutes of downtime per year) should be your organization’s goal. Unlike with fault-tolerant systems, there will always be some amount of downtime with high availability, even if it is only a few seconds.
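To make the “five nines” figure concrete, the short Python sketch below converts an availability percentage into a yearly downtime budget. The function name and the use of an average 365.25-day year are assumptions made for illustration; a 365-day year changes the results only slightly.

```python
# Minutes in an average year (365.25 days accounts for leap years).
# This averaging is an assumption of the sketch, not a standard.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# Compare common availability targets, from "two nines" to "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why moving from 99.99% to 99.999% usually requires automated failover rather than manual intervention.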
Fault Tolerance describes a computer system or technology infrastructure that is designed in such a way that when one component fails (be it hardware or software), a backup component takes over operations immediately so that there is no loss of service. The concept of having backup components in place is called redundancy, and the more backup components you have in place, the more tolerant your network is of hardware and software failure.
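The effect of redundancy can be estimated with a simple calculation. The sketch below assumes that each redundant component fails independently of the others and that the service stays up as long as at least one copy is running; both are simplifying assumptions, and the function name is hypothetical.

```python
def combined_availability(component_availability: float, copies: int) -> float:
    """Estimate the availability of `copies` redundant components.

    Assumes failures are independent and that one working copy is enough
    to keep the service running. Availability is a fraction between 0 and 1.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    probability_all_down = (1 - component_availability) ** copies
    return 1 - probability_all_down


# Example: components that are each available 99% of the time.
for copies in (1, 2, 3):
    print(copies, "copies:", f"{combined_availability(0.99, copies):.6f}")
```

Under these assumptions, two 99% components together reach roughly 99.99%. Real systems tend to fall short of this estimate when the copies share a dependency, because their failures are then no longer independent.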
A typical fault-tolerant setup runs a single application on two servers at the same time. The servers essentially mirror each other so that when an instruction is executed on the primary server, it is also executed on the secondary server. If the primary server crashes or loses power, the secondary server takes over with zero downtime. There are two small drawbacks to fault tolerance, however: it is more costly because both servers are running all the time, and there is a risk of both servers going down if there is a problem with the operating system that the servers share.
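The mirrored-server arrangement described above can be sketched as a toy in-memory model. The class names (`Server`, `MirroredPair`) and the list used to represent server state are hypothetical and exist only for illustration; a real fault-tolerant system would also handle networking, synchronization, and failure detection, which this sketch leaves out.

```python
class Server:
    """A toy server that records every instruction it executes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self.state: list[str] = []  # instructions applied so far

    def execute(self, instruction: str) -> None:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.state.append(instruction)


class MirroredPair:
    """Runs each instruction on both servers so either one can serve alone."""

    def __init__(self, primary: Server, secondary: Server) -> None:
        self.primary = primary
        self.secondary = secondary

    def execute(self, instruction: str) -> None:
        # Mirror the instruction to every healthy server.
        applied = False
        for server in (self.primary, self.secondary):
            if server.healthy:
                server.execute(instruction)
                applied = True
        if not applied:
            raise RuntimeError("both servers are down")

    def active(self) -> Server:
        # Prefer the primary; fall back to the secondary if it has failed.
        for server in (self.primary, self.secondary):
            if server.healthy:
                return server
        raise RuntimeError("both servers are down")


pair = MirroredPair(Server("primary"), Server("secondary"))
pair.execute("write A")
pair.primary.healthy = False   # simulate a crash of the primary
pair.execute("write B")        # service continues on the secondary
print(pair.active().name, pair.active().state)
```

Because the secondary has already applied every instruction, it can take over without replaying anything. The model also shows the shared-risk drawback: if a single fault marks both servers unhealthy at once, `execute` fails no matter how the pair is mirrored.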
Resiliency refers to a system’s ability to stay operational during abnormal conditions, recovering from failures and continuing to function. It’s not about avoiding failures, but about responding to them in a way that avoids downtime or data loss. The goal of resiliency is to return the application to a fully functioning state following a failure. High availability and disaster recovery are two crucial components of resiliency.