What is reliability
Reliability is the confidence we have in a component or a system that it will not fail. Confidence is a stochastic, or probabilistic, figure; it expresses our expectations of the component, and so refers to the future.
Availability is about what has already happened. Reliability is about what is expected to happen.
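To make the distinction concrete, here is a minimal sketch of how availability could be measured from what has already happened. The function name and the outage figures are hypothetical and used only for illustration; reliability, by contrast, cannot be read from a log and has to be estimated.

```python
from typing import Sequence


def measured_availability(period_hours: float, outage_hours: Sequence[float]) -> float:
    """Fraction of a past period during which the system was actually up."""
    downtime = sum(outage_hours)
    if period_hours <= 0 or downtime > period_hours:
        raise ValueError("period must be positive and at least as long as the downtime")
    return (period_hours - downtime) / period_hours


# Hypothetical outage log for the past year: three outages, in hours.
last_year = measured_availability(365 * 24, [2.5, 0.5, 6.0])
print(f"{last_year:.4%}")
```

The result describes the past year only. Using it as a reliability figure for next year is an assumption and should be stated as such.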
How to improve reliability
This is what this blog is about. First, you have to understand the causes of downtime, and then gradually implement the non-stop infrastructure principles; both are described in this and other posts. In general, there are two directions to work on to improve infrastructure reliability:
- Working to improve the component reliability. The purpose is to increase MTTF (mean time to failure) and decrease MTTR (mean time to repair) for each component, that is, to make a component fail less often and, when it does fail, need less time to come back online. This is the domain of “proactiveness” and “automation”. Proactiveness means that you work preemptively with each component, analysing logs and historical metrics or running tests, to predict and remediate failures before they happen. When components do eventually fail, you should have automated processes in place to apply preset corrective actions quickly and without involving humans.
- Working to improve the system reliability, by arranging the components in such a way that their failure does not impact the system operation. This is the work of the system architect: to make the system “redundant” and “abstracted” in all tiers. By implementing these two principles, dependencies on individual physical components are removed, and the system becomes resilient to component failures and therefore far less dependent on component reliability. Infrastructure operation simply fails over to healthy components, in a way that hosted applications and services do not notice.
How to assess the reliability
To assess the infrastructure reliability is to calculate the probability that the given infrastructure will be operational in the near future. Such calculations always carry uncertainty, because they rest on estimates and assumptions. Once you have the assessment, you can use it to identify weak components in the system design. Or, if there is a commercial interest, you can commit to your assessment with SLAs, which imply penalties if your commitments are not met.
Step 1. Identify the Components
Identify which infrastructure components support the services you offer. Be as detailed as possible. Some components may be services provided by external suppliers; include these as well.
Example: we will consider an infrastructure consisting of a Load Balancer, a VMware cluster with 2 hosts, an HA SAN storage, 2 VMs with Windows clustering, a separate DB cluster with 2 physical nodes, a Firewall, and an internet connection. This example is used throughout this post.
Step 2. Draw the Topology
Create a diagram showing the topology of the components you have identified from a flow and dependency perspective (see example below). In the diagram, show which components are stand-alone and which are clustered, or redundant. Define as many “tiers” as you want to describe different parts of the infrastructure.
Example: For the infrastructure mentioned above we create the following diagram with 8 tiers. DC represents the datacenter facilities, power, networking, cabling, all presented as one component.
Step 3. Build a Table
Create a table, with one row for each component, to depict the diagram you created in the previous step. If a component is not critical for your operation, you can omit it. What we need for each component is its reliability. For some components this is easy to find, for some not. Here are some tips:
- you may need to consult various industry references
- express the reliability with a precision of 4 decimal digits to allow for calculations
- if you need to be creative, do capture that in your assumptions
- for services by external suppliers, you can use their SLA figure, e.g. internet connectivity 99.8% promised by the ISP
- if you have the MTBF and MTTR data of a component, you can calculate the reliability
- OS, services and software have their reliability too; yes, they do fail. But what is the reliability of Windows, RedHat, AIX etc.? Be creative
- for clusters, remember that apart from the nodes, there is a clustering mechanism, which may fail by itself, and is a separate component
- for Datacenter reliability you can use industry classifications, like the one by the Uptime Institute
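As an illustration of the MTBF/MTTR tip above, here is a minimal sketch. It assumes the commonly used steady-state formula MTBF / (MTBF + MTTR), with MTBF taken as the mean operating time between failures; the post itself does not state which formula it uses, and the function name and input figures are hypothetical.

```python
def reliability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate the fraction of time a component is up.

    Assumption: steady-state formula MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails every 10,000 hours on average, 8 hours to repair.
r = reliability_from_mtbf(mtbf_hours=10_000, mttr_hours=8)
print(f"{r:.4%}")  # prints 99.9201%, i.e. 4 decimal digits as suggested above
```

Record in your assumptions where the MTBF and MTTR figures came from, since vendor data and field data can differ considerably.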
Example: this is the table for our example. It has 16 components in 8 tiers.
Step 4. Calculate Each Tier’s Reliability
For each tier appearing in the diagram and table above we must now calculate the combined reliability. To do so, we will consider the topology from the diagram, in particular whether the components are in series or in parallel:
- for parallel redundant components, we use the following formula:
ReliabilityOfTwoRedundantComponents = 1 - (1 - RelOfComp1) x (1 - RelOfComp2)
where RelOfCompN is the reliability of component N
- for components in series, the combined reliability is the product of their individual reliabilities:
ReliabilityOfTwoComponentsInSeries = RelOfComp1 x RelOfComp2
Finally, we build a new table with one row for each tier, and its combined reliability, calculated in the previous step.
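The two formulas can be sketched as small helper functions. This is an illustration only: the function names are hypothetical, the generalisation from two to any number of components is an assumption that holds only if the components fail independently, and the figures are made up rather than taken from the example table.

```python
from math import prod
from typing import Iterable


def series(reliabilities: Iterable[float]) -> float:
    """All components must work: multiply the individual reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities: Iterable[float]) -> float:
    """At least one component must work: 1 minus the product of the failure probabilities."""
    return 1 - prod(1 - r for r in reliabilities)


# Hypothetical tier: two redundant nodes (0.99 each) that depend on a
# clustering mechanism (0.999), which is treated as a separate component in series.
nodes = parallel([0.99, 0.99])
tier = series([nodes, 0.999])
print(f"nodes: {nodes:.4%}, tier: {tier:.4%}")
```

Note how the clustering mechanism caps the tier: however many nodes are added, the tier can never be more reliable than the component in series with them.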
Example: the combined reliabilities for the 8 tiers of our example:
When we have filled out this table, the reliability of the whole infrastructure is simply the product of the combined reliabilities of all tiers.
Example: the infrastructure reliability of our example is 99.5066%, corresponding to a yearly downtime of 43h 13m 19s.
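The conversion from a reliability figure to yearly downtime can be checked with a short sketch. It assumes a 365-day year; the function names are hypothetical, and the tier values passed to `system_reliability` are placeholders, not the figures from the example table.

```python
from math import prod

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a non-leap year


def system_reliability(tier_reliabilities):
    """Tiers are in series, so the system reliability is their product."""
    return prod(tier_reliabilities)


def yearly_downtime(reliability):
    """Expected downtime per year, formatted as hours, minutes and seconds."""
    seconds = round((1 - reliability) * SECONDS_PER_YEAR)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes}m {secs}s"


# The 99.5066% figure from the example above.
print(yearly_downtime(0.995066))  # prints 43h 13m 19s

# Placeholder tier reliabilities, for illustration only.
print(yearly_downtime(system_reliability([0.9999, 0.9990, 0.9995])))
```

Small changes in the fourth decimal digit translate into many minutes per year, which is why the table should keep that precision.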
Step 5. Point Out the Assumptions
Capturing and highlighting the assumptions made in each step is important and should not be skipped. People tend to look only at the final number and not read what comes before it, so make sure the assumptions are clearly stated. No matter how obvious or how arbitrary the assumptions we were forced to make may be, they should appear together with the calculated reliability. Later on, we may be in a position to narrow down our assumptions with newer or more accurate data. Typical assumptions to state:
- specific components not considered
- no reliability data was found for specific components
- reliabilities taken from the SLAs of external suppliers