To build a non-stop IT infrastructure capable of supporting demanding IT services in a dynamic DevOps environment, 4 self-contained principles are needed. These principles are: Redundancy, Abstraction, Automation, and Proactiveness.
Principle 1. Redundancy
Redundancy dictates that entities in each tier be supported by multiple parallel instances (components), so that operation can continue on the remaining instances if one fails. The level of redundancy (that is, the specific number of parallel components) should be at least 2. The specific number for each implementation is determined by risk and cost factors, such as the intended availability, risk impact and risk probability, component reliability, and mean time to repair. The level of redundancy may also change over time to adapt to new requirements for the infrastructure.
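As a rough illustration of how intended availability and component reliability drive the level of redundancy, the sketch below computes the smallest number of parallel components that meets an availability target. It assumes that components fail independently and that one surviving component is enough to carry the load; real deployments rarely satisfy both assumptions fully, and cost is not modelled. The function names are illustrative only.

```python
def combined_availability(component_availability: float, n: int) -> float:
    """Availability of n parallel components, assuming independent failures.

    The group is unavailable only when all n components are down at once.
    """
    return 1.0 - (1.0 - component_availability) ** n


def min_redundancy_level(component_availability: float, target: float) -> int:
    """Smallest level of redundancy (never below 2) that meets the target."""
    if not 0.0 < component_availability < 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1")
    n = 2  # the minimum level of redundancy
    while combined_availability(component_availability, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Components that are each available 99% of the time, target of 99.999%.
    print(min_redundancy_level(0.99, 0.99999))
```

Under these assumptions, two 99% components give roughly 99.99% combined availability, so a 99.999% target needs a third component.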
Redundancy should be applied in all infrastructure tiers, for example:
- multiple power feeds
- server clusters
- network interface teaming
- RAID disk arrays; mirrored storage systems
- redundant WAN links; redundant routes
- redundant sites or datacenters
Of course, as the scale gets larger, the technicalities of achieving redundancy (such as synchronization between components) become more challenging.
Principle 2. Abstraction
Redundancy would have no benefit without a mechanism to switch between redundant components without downtime. This mechanism is Abstraction. Abstraction can be defined as a “generalization” of multiple entities (redundant components) in one layer into one logical entity in a higher layer. The purpose of abstraction in IT is to eliminate the dependency on specific underlying physical components.
A good example is clustering; to connect to a cluster you refer to it with its cluster name (the higher logical entity), instead of using the names of specific nodes (the lower physical components). The same principle should be applied in all infrastructure tiers. Examples include, but are not limited to:
- server hardware: server virtualization; virtual servers
- network: virtual front-end IP addresses (using load balancers) instead of real back-end IP addresses; DNS-resolved hostnames instead of IP addresses
- storage: storage virtualization; virtual disks; clustered filesystems
- middleware: database clusters (accessed through the listener name, instead of hostname/IP address), messaging (Enterprise Service Bus, ESB)
- geographical site: using global server load balancers (GSLB)
The highest form of abstraction in IT today is the “cloud”, in all the different meanings this term can take.
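The cluster-name example can be sketched in a few lines: clients hold one logical name, and the mapping to a physical node is decided at the moment of use. This is a minimal sketch, not how any particular cluster manager or load balancer works; `VirtualEndpoint`, its fields, and the health-check callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VirtualEndpoint:
    """One logical name in front of several redundant physical nodes."""

    name: str
    nodes: List[str]
    is_healthy: Callable[[str], bool]  # hypothetical health-check hook
    _next: int = field(default=0, init=False)

    def resolve(self) -> str:
        """Return the next healthy node in round-robin order."""
        for offset in range(len(self.nodes)):
            index = (self._next + offset) % len(self.nodes)
            node = self.nodes[index]
            if self.is_healthy(node):
                self._next = (index + 1) % len(self.nodes)
                return node
        raise RuntimeError(f"no healthy node behind {self.name}")


if __name__ == "__main__":
    failed = {"node-a"}  # pretend node-a has just gone down
    db = VirtualEndpoint("db-cluster", ["node-a", "node-b"],
                         lambda node: node not in failed)
    # The client only ever refers to "db-cluster"; the failure is invisible.
    print(db.name, "->", db.resolve())
```

The caller never names `node-a` or `node-b`, so a node can fail or be replaced without any client-side change.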
Principle 3. Automation
Automation has a simple goal: eliminate human involvement in as many activities as possible. The purpose of automation is to avoid human error and to perform the activities more efficiently than humans.
Automation refers to operations and processes. The essence of automation is not simply to facilitate human actions, or support their decisions, but to take over the logical decisions from humans. Automation relies heavily on two components:
- Monitoring of all infrastructure components, as automated actions are usually triggered by events in the infrastructure, and decisions depend on the current state of components.
- Methods and tools to apply the decisions to the infrastructure. Such tools are usually configuration management tools.
Automation should additionally control human actions, to block erroneous requests that could threaten infrastructure operation. For example, if an operator tries to shut down a server, automation should refuse the action if an application depends on the server, or if a cluster would be left with only one node (no redundancy).
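The shutdown example could look roughly like the guard below. It is a sketch under simplifying assumptions: the `Inventory` structure and `check_shutdown` function are hypothetical, and in practice the cluster membership and dependency data would come from monitoring and a configuration database rather than hard-coded dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Inventory:
    """Hypothetical snapshot of the infrastructure state."""

    clusters: Dict[str, Set[str]]      # cluster name -> running nodes
    dependencies: Dict[str, Set[str]]  # server -> applications depending on it


def check_shutdown(inventory: Inventory, server: str) -> List[str]:
    """Return the reasons to refuse a shutdown; an empty list means allowed."""
    reasons = []
    apps = inventory.dependencies.get(server, set())
    if apps:
        reasons.append(
            f"applications depend on {server}: {', '.join(sorted(apps))}")
    for cluster in sorted(inventory.clusters):
        nodes = inventory.clusters[cluster]
        remaining = len(nodes) - 1
        # Fewer than two remaining nodes means the cluster loses redundancy.
        if server in nodes and remaining < 2:
            reasons.append(
                f"cluster {cluster} would be left with {remaining} node(s)")
    return reasons


if __name__ == "__main__":
    inventory = Inventory(
        clusters={"web": {"web1", "web2", "web3"}, "db": {"db1", "db2"}},
        dependencies={"app1": {"billing"}},
    )
    for server in ("web1", "db1", "app1"):
        reasons = check_shutdown(inventory, server)
        print(server, "refused: " + "; ".join(reasons) if reasons else "allowed")
```

Returning the reasons, rather than a bare yes or no, lets the automation explain to the operator why a request was refused.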
Automation is elaborated in a separate category on this blog.
Principle 4. Proactiveness
Moving infrastructure operations and their related processes (how the people work) to a proactive model is an important part of the transformation to non-stop IT services. It is, in fact, impossible to achieve zero downtime with a reactive model, where operations fix failures after they happen.
Proactiveness complements the previous 3 principles. Automation, redundancy and abstraction deal with failures as they happen, without involving humans. Proactiveness deals with situations that are likely to happen in the future. It does so by identifying and addressing threats before they can actually impact live operation. Proactiveness also extracts utilization patterns and involves humans preemptively, so that resources are scaled in advance. Humans are thus engaged in preparing the infrastructure for the future.
Proactiveness requires a good level of monitoring. It involves analysing historical data, with a level of intelligence, to identify silent issues in the infrastructure before they surface as failures.
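One simple form of such analysis is trend extrapolation: fit a line to historical utilization samples and estimate when a resource will cross a critical threshold, so that people can act in advance. The sketch below assumes one sample per day and roughly linear growth; the function name and the 90% default threshold are illustrative choices, and real capacity planning would typically need seasonality and more robust statistics.

```python
from typing import Optional, Sequence


def days_until_threshold(daily_usage: Sequence[float],
                         threshold: float = 90.0) -> Optional[float]:
    """Estimate the days left until utilization reaches the threshold.

    Fits a least-squares line to one-sample-per-day utilization values
    (in percent) and extrapolates from the last day. Returns None when
    the fitted trend is flat or decreasing.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx  # percentage points per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_now = intercept + slope * (n - 1)
    return max(0.0, (threshold - fitted_now) / slope)


if __name__ == "__main__":
    disk_usage = [50.0, 52.0, 54.0, 56.0, 58.0]  # made-up sample data
    print(days_until_threshold(disk_usage))
```

With the made-up samples above, usage grows 2 points per day from 58%, so the estimate is 16 days until the 90% threshold.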
The Road to Non-Stop Infrastructure
The 4 principles presented above can also be used as a roadmap to reaching non-stop infrastructure. Each step requires tedious work.
- Step 1: Start with the basics by implementing Redundancy and Abstraction in all tiers throughout the infrastructure. This includes clusters, networks, databases and middleware. By the end of this step, we will have achieved infrastructure resilience.
- Step 2: Add Automation to the infrastructure, by automating tasks and orchestrating activities. By the end of this step, we will have achieved an automated infrastructure.
- Step 3: Transform operations and processes to be proactive. This is the end of the road, where we will finally achieve a non-stop infrastructure.