Total cost of ownership of servers continues to rise despite improvements in hardware and software. Effective manageability remains a problem for a number of reasons. First, the management infrastructure deployed in the enterprise relies on traditional client-server architectures. Second, high levels of human interaction result in reduced availability while servers wait for operators to diagnose and fix problems. Finally, the deployed management solutions are in-band, with software agents operating on servers communicating with centralized management platforms. This implies that server management is only possible when the operating system is functioning, which is often not the case when management is required. Clearly, change is necessary.
Delegation of responsibility is widely acknowledged as a way of getting things done in an industrial setting. Giving workers the authority to make decisions speeds things up and makes an enterprise more efficient. Applied to the server management problem, the solution is clear: empower management software to make decisions regarding change or reconfiguration. Software that is empowered in this way needs to have a number of characteristics.
First, the software must be capable of autonomous decision making. In other words, the software should be an intelligent agent. This implies that the software should separate its understanding (or knowledge) of what is to be managed from the ways in which problems are diagnosed. Second, the intelligent agent cannot be part of the managed system in terms of the resources that it consumes, such as CPU and disk. This requires some explanation. Imagine a scenario in which a runaway process is consuming almost all of the CPU. It is difficult to see how an agent would be able to control a server in these circumstances. Consider another scenario in which critically low levels of disk space are detected. An agent sharing resources on the host would be unable to save information that may be critical to the resolution of the problem. Finally, consider the scenario in which the operating system is hung; the agent can no longer communicate with external parties.
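The first requirement, keeping knowledge of the managed system apart from the way problems are diagnosed, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names (`cpu_percent`, `disk_free_percent`, `heartbeat_age_s`), the thresholds, and the suggested actions are hypothetical and are not taken from any particular management product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical metric names and thresholds, chosen only for illustration.
Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """One piece of knowledge: a symptom and the action it suggests."""
    name: str
    symptom: Callable[[Metrics], bool]
    action: str


# Knowledge base: data describing the managed system, kept apart from the engine.
KNOWLEDGE: List[Rule] = [
    Rule("runaway-process",
         lambda m: m.get("cpu_percent", 0.0) > 95.0,
         "throttle or stop the offending process"),
    Rule("low-disk",
         lambda m: m.get("disk_free_percent", 100.0) < 5.0,
         "free space or extend the volume"),
    Rule("os-hung",
         lambda m: m.get("heartbeat_age_s", 0.0) > 60.0,
         "reset the host"),
]


def diagnose(metrics: Metrics, knowledge: List[Rule]) -> List[str]:
    """Generic diagnosis engine: it has no built-in notion of CPU or disk."""
    return [f"{rule.name}: {rule.action}"
            for rule in knowledge if rule.symptom(metrics)]


if __name__ == "__main__":
    sample = {"cpu_percent": 99.0, "disk_free_percent": 40.0, "heartbeat_age_s": 2.0}
    for finding in diagnose(sample, KNOWLEDGE):
        print(finding)
```

Because the rules are plain data, the same `diagnose` function could in principle serve a different kind of managed element simply by swapping in a different rule list.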
The scenarios above lead to the conclusion that the agents tasked with delegated system management should reside on a separate management plane; that is, a platform with separate computing and disk resources. Furthermore, the design of the computing platform should support the principles of Autonomic Computing, an area of computing proposed by IBM. More recently, AMD has embraced autonomic computing principles in its efforts to improve the manageability of servers.
Autonomic Computing is a relatively recent field of study that focuses on the ability of computers to self-manage. Autonomic Computing is promoted as the means by which greater dependability will be achieved in systems. This incorporates self-diagnosis, self-healing, self-configuration and other independent behaviors, both reactive and proactive. Ideally, a system will adapt and learn normal levels of resource usage and predict likely points of failure in the system. Certain benefits of computers that are capable of adapting to their usage environments and recovering from failures without human interaction are relatively obvious: the total cost of ownership of a device is reduced and levels of system availability are increased. Repetitive work performed by human administrators is reduced, knowledge of the system's performance over time is retained (assuming that the machine records or publishes information about the problems it detects and the solutions it applies), and events of significance are detected and handled with more consistency and speed than a human could likely provide.
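As a rough illustration of what "learning normal levels of resource usage" might look like, the following Python sketch tracks a running baseline for a single metric and flags readings that fall far outside it. The choice of an exponentially weighted moving average, the class name `Baseline`, and the parameter values are assumptions made for this example; the Autonomic Computing literature does not prescribe this particular technique.

```python
import math


class Baseline:
    """Learns a running mean and variance of one metric.

    Uses an exponentially weighted moving average (an assumed technique,
    picked here only because it needs no stored history).
    """

    def __init__(self, alpha: float = 0.1, threshold: float = 3.0, warmup: int = 5):
        self.alpha = alpha          # weight given to each new sample
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # samples to observe before judging anything
        self.mean = 0.0
        self.var = 0.0
        self.count = 0

    def observe(self, value: float) -> bool:
        """Return True if value deviates from the learned baseline, then update it."""
        anomalous = False
        if self.count >= self.warmup:
            std = math.sqrt(self.var)
            # With a perfectly flat history std is 0; the small floor means
            # any real change from that flat level is flagged.
            anomalous = abs(value - self.mean) > self.threshold * max(std, 1e-9)
        if self.count == 0:
            self.mean = value
        else:
            delta = value - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        self.count += 1
        return anomalous


if __name__ == "__main__":
    cpu = Baseline()
    for reading in [40.0, 42.0, 41.0, 39.0, 40.0, 41.0, 40.0, 41.0, 95.0]:
        if cpu.observe(reading):
            print(f"CPU reading {reading} is outside the learned baseline")
```

A real autonomic element would need considerably more than this (seasonality, multiple correlated metrics, and a policy for what to do once a deviation is seen), but the sketch shows how a baseline can be learned without a human setting fixed thresholds.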
The remainder of this article describes the essential requirements of an autonomic element for servers and client computers.
Figure 1 provides a view of an autonomic element as proposed in the Autonomic Computing literature. In this figure, the Managed Element is a server or client workstation, including its hardware, operating system and hosted applications.