Fault tolerance
Fault tolerance is the ability of a system to keep working even when parts of it fail. A fault could be a broken component, a bad data value, or a damaged connection. In a fault-tolerant system, errors are hidden from users and the overall operation continues without downtime. This is especially important for high‑availability, mission‑critical, or life‑critical systems. If a system can still function at full capacity despite faults, it is fault tolerant; if it can keep operating only with slower performance or reduced features, it is resilient.
Fault tolerance isn’t limited to computers. Buildings with backup power, braking systems in cars, and safety devices in planes and trains all use fault-tolerant ideas. The goal is to prevent a single problem from causing a total failure.
A quick look at history helps explain how it works. The first fault-tolerant computer, the Czechoslovak SAPO built in 1951, used multiple copies of memory and a voting method to decide the correct result. Early designs could detect a fault but relied on a human operator to repair it. Over time, engineers learned that truly fault-tolerant systems must be able to diagnose problems automatically, isolate faulty parts, and switch to backups without stopping the system. The common approach today is redundancy: having extra parts or copies ready to take over.
Redundancy comes in two broad forms. Space redundancy adds extra components, functions, or data that aren’t needed when everything works correctly. Time redundancy repeats a computation or transmission and compares the results against each other or against a known good copy. Space redundancy appears in hardware, software, and information (such as data backups). Time redundancy is often used in testing and in keeping data consistent.
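Time redundancy can be sketched in a few lines of Python. The helper name and retry scheme below are invented for illustration, not taken from any particular system:

```python
def run_with_time_redundancy(fn, *args, retries=1):
    """Time redundancy sketch: run the same computation twice and compare.

    A transient fault shows up as a disagreement between the two runs,
    in which case the whole check is simply repeated.
    """
    for _ in range(retries + 1):
        first = fn(*args)
        second = fn(*args)  # the repeat that provides the redundancy
        if first == second:
            return first
    raise RuntimeError("runs never agreed: the fault may be permanent")

print(run_with_time_redundancy(sum, [1, 2, 3]))  # -> 6
```

The cost is paid in time (the work is done twice) rather than in extra hardware, which is the defining trade-off between the two forms of redundancy.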
Two famous ideas in fault-tolerant hardware are dual modular redundancy (DMR) and triple modular redundancy (TMR). In DMR, two copies run in parallel and a comparison circuit flags any mismatch; this detects a fault but cannot tell which copy is wrong. In TMR, three copies run and a majority vote picks the result, masking the output of the faulty copy. Some systems use lockstep designs where all copies run in perfect sync, so they stay identical and faults can be detected quickly. A related approach is pair‑and‑spare, where two lockstep pairs run side by side: if the copies within the active pair disagree, that pair is declared faulty and the spare pair takes over.
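The TMR majority vote can be shown as a minimal Python sketch (the function name is invented for the example; real systems implement this in a hardware voter):

```python
def tmr_vote(a, b, c):
    """Majority vote over three redundant module outputs.

    A single faulty module is outvoted by the two healthy ones,
    so its wrong output never reaches the rest of the system.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    # All three disagree: the fault can no longer be masked.
    raise RuntimeError("no majority: more than one module faulty")

# One bad copy (here the third) is silently outvoted:
print(tmr_vote(42, 42, 41))  # -> 42
```

Note why three copies matter: with only two (DMR), a mismatch tells you a fault occurred but not which output to trust.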
In computing, there are also software techniques. Failure-oblivious computing lets programs keep running by returning manufactured values for invalid memory reads, though this can slow things down. Recovery shepherding is a lighter approach that fixes errors on the fly by instrumenting the program’s binary, without recompilation, usually with minimal overhead. Another common pattern is the circuit breaker: if a part of a distributed system starts to fail, requests to it are temporarily blocked to prevent a total collapse.
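The circuit-breaker pattern can be sketched as follows. The class name, thresholds, and timing scheme are illustrative assumptions rather than any particular library's API:

```python
import time

class CircuitBreaker:
    """Circuit-breaker sketch: after `max_failures` consecutive errors
    the breaker 'opens' and rejects calls for `reset_after` seconds,
    giving the failing service time to recover instead of piling on load.
    """

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the breaker again
        return result
```

Once open, the breaker fails fast on the caller's side, which converts a slow cascading failure into an immediate, cheap error.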
Fail‑safe, fail‑secure, and fail‑soft describe different ways systems respond to faults. Fail‑safe aims to protect people and property, often by reducing risk in a controlled way. Fail‑secure keeps the system protected when it fails, as with electric door locks that remain locked during a power outage. Fail‑soft, or graceful degradation, lets a system continue operating at a lower level when some components fail. Fail‑fast, on the other hand, reports a fault immediately so the problem can be diagnosed quickly and does not cause further damage.
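The fail-fast idea can be illustrated with a toy validator; the function name and the valid range are invented for the example:

```python
def set_speed(kmh):
    """Fail-fast sketch: reject a bad value at the point of entry
    instead of letting it propagate and cause damage downstream.
    """
    if not 0 <= kmh <= 300:  # hypothetical valid range for this system
        raise ValueError(f"speed out of range: {kmh}")
    return kmh

set_speed(120)   # accepted
# set_speed(999) would raise ValueError immediately
```

A fail-soft design would instead clamp or fall back to a safe default; fail-fast prefers a loud, early error to a quiet, late one.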
Redundancy and fault tolerance have costs. Extra components add weight, size, power use, and price, and they complicate design, testing, and maintenance. Designers must choose which parts to make fault-tolerant based on what’s most important for safety, reliability, and cost. For example, in transportation, the occupant restraint system and braking system are highly prioritized for redundancy because failures here can be life‑threatening, while some noncritical components may be kept simpler.
Availability is a key measure of fault tolerance. It is often expressed as a percentage of time the system is expected to be up and running. A system with five nines (99.999% availability) is online almost all the time, with only about five minutes of downtime per year.
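The downtime implied by an availability figure is a one-line calculation (assuming a 365.25-day year; the function name is invented for the example):

```python
def downtime_minutes_per_year(availability_pct):
    """Minutes of expected downtime per year for a given availability %."""
    minutes_per_year = 365.25 * 24 * 60  # about 525,960 minutes
    return (1 - availability_pct / 100) * minutes_per_year

print(round(downtime_minutes_per_year(99.999), 2))  # five nines: ~5.26 min
print(round(downtime_minutes_per_year(99.9)))       # three nines: ~526 min
```

Each extra nine cuts the downtime budget by a factor of ten, which is why the jump from three to five nines is so expensive in practice.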
In short, fault tolerance uses redundancy and smart design to prevent faults from causing failures. It’s about keeping systems running smoothly, even in the face of faults, while balancing cost, complexity, and performance.