A lot has been written about how important monitoring is. Instead of adding to the frenzy of yet more tools with heuristic end-to-end capabilities for DevOps, or restating the importance of monitoring, known already since the 1980s, I would like to offer a disruptive view.
From the moment any tool presents its findings on a screen, the benefit is only evolutionary; it cannot contribute to a vision of a non-stop infrastructure for DevOps or any other operational model. We forget that monitoring is a process, not a set of technologies. And, as a process, it should be regarded in its entire flow: a weak link makes the whole process slow and inefficient. Unsurprisingly, the weakest link in any process is always the humans (see this post).
Image: Monitoring the WOPR Computer, War Games movie, 1983
Look at the image above. The WOPR engineer monitors the LEDs. He tries to interpret the LED pattern to make sense of what is happening in the system, and then decides whether and how he must intervene. The engineer is still the most important part of the monitoring process: he receives the information, analyzes it, and correlates it with what he knows from experience; he knows what to ignore and what is important. No matter how many LEDs you have, how many colors they come in, or how well they capture what is happening inside WOPR, monitoring is still manual, slow, and vulnerable to human mistakes and omissions.
Fewer alerts, more intelligence
Bringing the WOPR example to our era, we have replaced the LEDs with different tools offering fancy dashboards, performance metrics, and predictive analytics. The information, however intuitive it may be, is still validated and evaluated by human operators, based on their experience with the system and on a plethora of resources, such as documentation, other tools, and colleagues.
Humans decide whether alerts and warnings should be ignored or acted upon, and how. And once the decision is made, humans type commands on a keyboard, the mighty manual tool.
To improve the monitoring process is to remove the weakest link, and to do so we must automate the process and move decision-making from humans to computers. Good, intuitive monitoring tools are of course important; the focus, however, should be on the red parts at the right side of the above diagram.
Your last resort should be getting more tools and increasing the number of alerts you receive. It is a common and mistaken practice today to lower the thresholds and produce more alerts whenever there is a problem with the infrastructure. Continuously tuning static thresholds is a sisyphean struggle to make monitoring efficient; even if you manage to find the perfect balance, it is of no value in a dynamic environment such as DevOps, and you will soon have to start all over again.
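To make the problem concrete, a static threshold check is, at its core, just a comparison against a hand-picked constant. The metric names and the value of the constant below are illustrative, not from any particular tool:

```python
# Static threshold alerting: the constant must be hand-tuned, and
# re-tuned every time the workload's "normal" range shifts.
CPU_THRESHOLD = 80.0  # percent; chosen manually, illustrative value


def should_alert(cpu_percent: float) -> bool:
    """Fire an alert whenever the metric crosses the fixed threshold."""
    return cpu_percent > CPU_THRESHOLD


# A batch job that legitimately runs hot now floods the operator with
# alerts, even though nothing is wrong.
samples = [55.0, 92.0, 95.0, 93.0, 60.0]
alerts = [s for s in samples if should_alert(s)]
```

The only knob here is the constant itself, which is exactly the sisyphean tuning described above: every change in the environment invalidates the previous value.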
“We’re still learning how to monitor systems, how to analyze the data generated by modern monitoring tools, and how to build dashboards that let us see and use the results effectively. The amount of information we can capture is tremendous, and far beyond what humans can analyze without techniques like machine learning.” – Mike Loukides, What is DevOps, 2012
Monitoring integrated with automation
It is said that in DevOps monitoring should be embedded in applications, to make them capable of self-healing and of reacting to failures. This is true, but in more complex and dynamic environments, such as the cloud, the failures an application can react to are very few; and even where the application can react, it becomes difficult to know the best way to do so.
It is a mistake to regard self-healing only from the application side or only from the infrastructure side. We need a separate automation component, aware of both applications and infrastructure, able to interpret the monitoring information and extract sense by combining events and metrics coming from both. Automation and monitoring cannot be regarded independently of each other.
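What "extracting sense by combining events" might look like can be sketched as a small rule engine. All event names and actions below are hypothetical; the point is that the decision depends on signals from both sides, not on any single metric:

```python
# Sketch of an automation component that correlates events from both
# the application and the infrastructure before deciding on an action.
# Event names and actions are hypothetical placeholders.


def decide(events: set[str]) -> str:
    """Apply simple logic rules to the combined monitoring signals."""
    if "app_latency_high" in events and "db_node_down" in events:
        # Root cause is infrastructure: fail over, don't restart the app.
        return "promote_db_replica"
    if "app_latency_high" in events and "deploy_in_progress" in events:
        # Latency during a rollout is expected: hold and keep watching.
        return "wait_and_observe"
    if "app_latency_high" in events:
        return "restart_app_instance"
    return "no_action"
```

An application alone would only see its own latency and restart itself; the combined view makes a different, and better, decision when the database node is the real problem.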
Today there are tools with machine-learning algorithms that can derive what is normal for your infrastructure, and are thus in a position to recognize situations outside the normal. There are no tools, however, capable of extracting sense by combining information from various sources, applying logic rules, and making decisions beyond comparing against fixed thresholds or dynamic, machine-learned ones.
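Deriving "what is normal" can be illustrated, in its most minimal form, as a dynamic threshold computed from recent history rather than hand-picked; real tools use far richer models, so treat this purely as a sketch:

```python
import statistics

# Minimal sketch of a learned baseline: flag a value that falls more
# than k standard deviations away from the mean of recent history.


def is_anomalous(history: list[float], value: float, k: float = 3.0) -> bool:
    """Dynamic threshold derived from the data itself."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return abs(value - mean) > k * stdev


# Recent, stable measurements define the "normal" band.
history = [50.0, 52.0, 49.0, 51.0, 50.0]
```

No constant needs retuning here: when the workload shifts, the history shifts with it, and the band follows.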
Such an automation component should be integrated with monitoring, to know the state of the system at any time, and with a configuration management mechanism, to implement the decisions it takes. Apart from reacting to failures, automation should also actively challenge the infrastructure by periodically provoking simulated failures and assessing the results.
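One round of such an active challenge can be sketched as follows. The `kill_instance` and `system_healthy` callables are hypothetical stand-ins for hooks into your configuration-management and monitoring layers:

```python
import random

# Sketch of actively challenging the infrastructure: pick a random
# instance, provoke a failure, and check whether the system recovered.
# kill_instance and system_healthy are hypothetical integration points.


def chaos_round(instances, kill_instance, system_healthy) -> bool:
    """Inject one simulated failure and assess the result."""
    victim = random.choice(instances)
    kill_instance(victim)      # provoke the failure
    return system_healthy()    # did the automation recover the system?
```

Run periodically, this turns recovery from an assumption into something the automation component continuously verifies.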
This automation component will ultimately replace human decisions.