Model-driven Autonomic Management (MAM)

Department of Computer & Information Science,

Purdue University School of Science,

Indiana University, Purdue University,

Indianapolis, USA, 46202

E-mail: YDai@cs.iupui.edu

Tel:  (317) 274-3473; Fax: (317) 274-9742

Advisor: Dr. Yuanshun Dai


Project Description

The rapidly increasing complexity of computing systems is driving the movement towards autonomic systems that are capable of managing themselves without the need for human intervention. Without autonomic technologies, many conventional systems suffer reliability degradation and compromised security. Autonomic management techniques reverse this trend.

This project develops a prototype autonomic management system based on what we term the model-driven approach.

In developing the framework for our system, we not only attempted to weave together the best ideas from previous approaches, but also integrated novel ideas into the design.

One of the most significant differences in the model-driven approach is that network components (e.g. nodes, channels and traffic) are not monitored evenly/randomly. Rather they are monitored based upon the predicted reliability or security provided by the models. For example, the components that are predicted to have lower dependability at any given point in time are monitored more intensively than highly reliable components. Furthermore, the dependability is not static. As time passes a component that may initially be highly reliable could become more prone to failure. Therefore, the monitoring frequency of that component should be adapted accordingly over time. Failure correlations should also be integrated into the monitoring decision. For example, Component B may be highly reliable, but if Component A fails, Component B may then have a higher probability of failure, perhaps due to the heavier load moved from A, and thus requires more intensive monitoring.
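The idea of uneven, probability-based monitoring can be illustrated with a minimal sketch. It assumes the models supply a predicted failure probability per component and that the monitor has a fixed budget of probes per cycle; the names `allocate_probes`, `apply_correlation`, and the `correlation` table are hypothetical and are not taken from the project's implementation.

```python
def allocate_probes(failure_prob, budget):
    """Split `budget` probes per cycle across components in proportion
    to their predicted failure probability (assumed to come from the models)."""
    total = sum(failure_prob.values())
    if total == 0:
        # No component is expected to fail: fall back to even monitoring.
        share = budget / len(failure_prob)
        return {name: share for name in failure_prob}
    return {name: budget * p / total for name, p in failure_prob.items()}


def apply_correlation(failure_prob, failed, correlation):
    """Return updated predictions after component `failed` has failed.

    `correlation[(a, b)]` is an assumed conditional probability that
    b fails given that a has failed (e.g. because b inherits a's load).
    """
    updated = dict(failure_prob)
    updated.pop(failed, None)  # the failed component is handed to healing
    for (src, dst), cond_p in correlation.items():
        if src == failed and dst in updated:
            updated[dst] = max(updated[dst], cond_p)
    return updated


# Hypothetical predictions for three components.
predicted = {"A": 0.30, "B": 0.05, "C": 0.15}
before = allocate_probes(predicted, budget=100)

# Component A fails; B is assumed to become much less reliable as a result.
after_failure = apply_correlation(predicted, "A", {("A", "B"): 0.45})
after = allocate_probes(after_failure, budget=100)
```

In this sketch Component B receives the smallest share of probes while A is healthy and the largest share once A has failed, which mirrors the Component A/Component B example above. Re-running `allocate_probes` whenever the models update their predictions gives the adaptation over time.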

To initially test this approach, a software simulator was developed that utilized fault insertion testing to determine if uneven/probability-based monitoring and healing was superior to traditional "even/random" monitoring approaches. The results were clear: "uneven/probability-based" monitoring consistently showed that more nodes were operational at any given time, and the healing process was performed more quickly. The development of the automated fault/intrusion detection/healing mechanism is the first step in improving system reliability autonomously.

Following validation of the significant improvement provided by the "uneven/probability-based" monitoring approach, we determined what characteristics the system design should possess. First, it is important that the model be general so that it can be applied to as many network topologies and structures as possible. Second, the system should provide interfaces to identify different levels of models so that new reliability and security services can be built on top of, and integrated with, existing models. In other words, the system components should be pluggable. Third, the system model needs to possess the ability to adapt to component changes "on the fly" rather than require periodic reevaluation that is computationally expensive. Fourth, the structure should be hierarchical. In fact, our design develops two types of hierarchical architectures: a monitoring hierarchy and a modeling hierarchy, as depicted in Fig. 1 and Fig. 2 respectively.

Fig. 1 Monitoring Hierarchy


 

Hierarchical Approach 

Monitoring Hierarchy

In the monitoring hierarchy, the higher levels monitor the levels beneath them, as depicted in Fig. 1. Additionally, monitors in each level monitor other monitors on the same level, except at the machine level.

Hierarchical control allows monitoring to be utilized in different ways at different levels. Machine-level monitoring can be accomplished through an OS-resident approach, in which the OS is patched to provide the additional functionality. Alternatively, an aspect-oriented programming (AOP) approach may offer more potential because it can be implemented without altering the OS or the applications' source code. At the machine level it is important that the monitoring system not consume much computational or storage resources. It should run transparently as a background process without degrading the performance of the applications running on the machine.

Moving up the monitoring hierarchy, the next layer is the LAN level. At this level the individual machines would be monitored for aliveness. If a machine is found to be unable to perform its resident autonomic functions, this layer could revive the system and heal it through backdoor mechanisms. Self-protection would also be present at this layer. In fact, the self-protection that is present at the machine level could be viewed as the last line of defense, and the outermost layer as the first line of defense. It would probably be appropriate to utilize redundancy as we move farther away from the machine level, i.e. to employ multiple monitors that not only monitor the individual machines, but also watch other machines or monitors.
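One simple way to picture LAN-level aliveness monitoring with redundant monitors is a heartbeat check combined with a majority vote. The sketch below is only an illustration under those assumptions; the class `AlivenessMonitor`, the function `confirmed_down`, the timeout value, and the host names are all hypothetical, and time is passed in as plain numbers (seconds) so the logic is easy to follow.

```python
from collections import Counter


class AlivenessMonitor:
    """Tracks the last heartbeat seen from each machine (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence tolerated
        self.last_seen = {}         # machine name -> time of last heartbeat

    def heartbeat(self, machine, now):
        self.last_seen[machine] = now

    def unresponsive(self, now):
        """Machines whose last heartbeat is older than the timeout."""
        return sorted(m for m, t in self.last_seen.items()
                      if now - t > self.timeout)


def confirmed_down(monitors, now):
    """Machines that a majority of the redundant monitors report as silent."""
    votes = Counter(m for mon in monitors for m in mon.unresponsive(now))
    quorum = len(monitors) // 2 + 1
    return sorted(m for m, n in votes.items() if n >= quorum)


# Three redundant monitors watching the same two (hypothetical) machines.
monitors = [AlivenessMonitor(timeout=10) for _ in range(3)]
for mon in monitors:
    mon.heartbeat("host-a", now=0)
    mon.heartbeat("host-b", now=10)
# Only one monitor received a later heartbeat from host-a.
monitors[2].heartbeat("host-a", now=12)

down = confirmed_down(monitors, now=15)
```

Requiring a quorum keeps a single faulty or isolated monitor from triggering a revival on its own, which is one possible reading of the redundancy suggested above.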

At the LAN level, monitoring and control mechanisms can be configured to meet the specific requirements of the organization. For example, tolerance ranges would be configured more tightly for safety-critical organizations. Alarms and corrective measures may be designed to be more aggressive and proactive in these organizations. Furthermore, at this layer and other layers that are farther away from the machine, the resources consumed by the monitoring processes are not a constraint as they are at the machine level. This is because the hosts/servers/agents in these higher layers are dedicated solely to the management process (e.g. coordination, diagnosis, modeling, analysis, and healing). Monitors at the organization level may be more complex than the more generic monitoring processes of the higher layers, which might encompass multiple organizations and therefore offer less potential for customization.

 

Modeling Hierarchy

The modeling hierarchy depicted in Fig. 2 can be briefly described as follows. There are three layers in the hierarchy. Communication between the layers is provided by interfaces that allow pertinent information to be passed to the appropriate model modules resident in each layer.

 

Fig. 2 Modeling Hierarchy

Each layer is composed of various model modules. For example, in the component layer (which is the lowest level of the modeling hierarchy) the SW/HW Reliability module contains models that enable the autonomic system to determine the state of the software and hardware components. Stochastic control charts may be one method used to determine whether these components are in a normal or abnormal state.
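As a rough illustration of how a control chart separates normal from abnormal states, the sketch below uses a basic Shewhart-style chart with limits at the baseline mean plus or minus three standard deviations. This is a stand-in chosen for brevity, not necessarily the stochastic control chart the project uses; the function names, the metric, and the baseline values are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Lower and upper control limits: baseline mean +/- k sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread


def classify(value, limits):
    """Label a new observation as 'normal' or 'abnormal' against the limits."""
    lower, upper = limits
    return "normal" if lower <= value <= upper else "abnormal"


# Hypothetical baseline: response times (ms) observed while the component was healthy.
baseline = [100, 102, 98, 101, 99]
limits = control_limits(baseline)

state_ok = classify(103, limits)
state_bad = classify(110, limits)
```

A real deployment would recompute the limits as the baseline drifts; here they are fixed so the classification step stays visible.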

The Human Behavior models in this layer are responsible for ensuring system dependability in spite of human errors, which can never be eliminated. Three types of human operator errors that our Human Behavior models must address are: (1) Slips/Lapses - operators not doing what they intended to do, (2) Unintentional Mistakes - operators doing what they intended to do, but their action was the wrong action to perform, and (3) Intentional Mistakes - malicious actions carried out by hackers, intruders, or eavesdroppers.

The Virus/Spam Behavior models are concerned with system security and protection and are thus designed to improve system dependability.

The model modules resident in the communication layer are self-explanatory. Network Reliability models are responsible for ensuring reliable communication between nodes, files and services that may be distributed throughout the network. Network Security at the communication layer includes, for example, models for authentication via keys and certificates. Cryptography is also utilized to ensure secure data transfer and communications.

At the highest level of the Modeling Hierarchy is the System Layer. The most important model module in this layer is the Management module which stores the models for three distinct management areas: (1) Resource management, (2) Security management, and (3) Service management.

It should be noted that the model modules shown in Fig. 2 are not an all-inclusive list. Other modules can be added based on organization-specific requirements. This provides for additional customization offered by the model-driven approach.

 

Overall Architecture of Model-Driven Autonomic Management

Once the monitoring hierarchy and modeling hierarchy have been established, the next step is to determine how to integrate these ideas into an architecture that produces the prototype design for a dependable, secure, and comprehensive autonomic system that meets all four of the aforementioned design goal characteristics. The proposed architecture is depicted in Fig. 3.

 

 


Fig. 3 Autonomic System Architecture 

The modeling hierarchy is present in each network that is resident in the entire system (only one network is shown for illustration purposes). This allows for a new LAN, for example, to be "plugged in" to the existing architecture without the need for manual reconfiguration.

The sensor collects data deemed relevant for evaluating the state of the network or machines for which it is responsible. The sensor can initially be configured with default values, or parameters may be loaded from other domains that are already present in the system. However, the sensor is dynamic. This is accomplished by integrating a machine learning component into the sensing and monitoring mechanism.

Data retrieved and stored by the sensor is passed to the monitor where a decision regarding the system state is made. If the system is determined to be in an abnormal or uncertain state, the pertinent data is passed to a high-level controller that makes a determination on the nature of the problem at hand. The controller then chooses the configuration, healing, or protection component of the diagnostic unit, which is linked to a problem/prescription library. The library contains a list of known problems (configuration issues, fault/failure issues, and security/protection issues). Each problem is mapped to a list of proposed corrective "prescriptions". The prescription that is determined to be the best to solve the current problem is chosen and passed to an effector, which implements the cure by sending it to the component that needs to be healed.

Feedback loops from the effector to the problem library, the controller, and the monitor modules provide a machine learning mechanism for the system architecture, and thus provide for adaptability. These feedback loops allow the system to perform better with the passage of time.

First, the feedback to the Problem/Prescription library tells this module whether the prescription successfully solved the problem. If not, the library notes that the fix was unsuccessful and chooses another prescription to try.
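The interaction between the problem/prescription library, the effector, and this first feedback loop can be sketched as follows. The sketch assumes the library ranks prescriptions by a smoothed success rate learned from feedback; the class `PrescriptionLibrary`, the function `heal`, and the problem and prescription names are hypothetical, and the project's actual selection rule is not specified in this description.

```python
class PrescriptionLibrary:
    """Maps each known problem to candidate prescriptions and their track record."""

    def __init__(self, prescriptions):
        # problem -> {prescription: [successes, attempts]}
        self.stats = {
            problem: {p: [0, 0] for p in options}
            for problem, options in prescriptions.items()
        }

    def choose(self, problem, exclude=()):
        """Pick the untried prescription with the best smoothed success rate."""
        candidates = {p: s for p, s in self.stats[problem].items()
                      if p not in exclude}
        if not candidates:
            return None
        # Laplace smoothing so untested prescriptions start at 0.5.
        return max(candidates,
                   key=lambda p: (candidates[p][0] + 1) / (candidates[p][1] + 2))

    def feedback(self, problem, prescription, succeeded):
        """Record the outcome reported back by the effector."""
        record = self.stats[problem][prescription]
        record[1] += 1
        if succeeded:
            record[0] += 1


def heal(library, problem, effector):
    """Try prescriptions until one works; return it, or None if all fail."""
    tried = []
    while True:
        prescription = library.choose(problem, exclude=tried)
        if prescription is None:
            return None
        succeeded = effector(problem, prescription)
        library.feedback(problem, prescription, succeeded)
        if succeeded:
            return prescription
        tried.append(prescription)


# Hypothetical library entry and an effector stub where only a reboot helps.
library = PrescriptionLibrary({"service_down": ["restart_service", "reboot_machine"]})
applied = heal(library, "service_down", lambda prob, p: p == "reboot_machine")
next_choice = library.choose("service_down")
```

After one failed and one successful attempt, the library prefers the prescription that worked, which is the kind of improvement over time the feedback loop is meant to provide.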

Second, the feedback to the Controller is used to notify the Controller that it erroneously decided corrective action was necessary when in fact no problem existed. For example, the Controller may have incorrectly identified a user attempting to access the network as an "intruder" and thus triggered a protection mechanism. The feedback loop will tell the Controller that the user was "safe". Not only must this information be passed back to the Controller, but cooperative communication between the Controller and Modeling Hierarchy must also be present. This is required so that the models can be updated as necessary, and the Controller can make better decisions based upon the updated models.

Lastly, the feedback to the Monitor provides real-time updates to the parameters being monitored to determine the state of the system and adjust tolerance ranges if necessary. For example, if the tolerance range is too narrow for a specific parameter, the system will experience a higher frequency of false alarms than would typically be expected. Therefore, the effector provides this information back to the Monitor where adjustments can be made autonomously.
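A minimal sketch of this third feedback loop is given below. It assumes the effector reports how many of the recent alarms for a parameter turned out to be false, and that the monitor widens the tolerance range by a fixed fraction whenever the false-alarm rate exceeds a target; the function name, the `target_rate` and `step` parameters, and the example numbers are hypothetical.

```python
def adjust_tolerance(lower, upper, false_alarms, alarms,
                     target_rate=0.1, step=0.1):
    """Widen the tolerance range [lower, upper] when too many alarms were false.

    `target_rate` is the acceptable fraction of false alarms and `step`
    is the fraction by which the range width grows per adjustment.
    Both defaults are illustrative, not values from the project.
    """
    if alarms == 0:
        return lower, upper          # nothing to learn from yet
    rate = false_alarms / alarms
    if rate <= target_rate:
        return lower, upper          # range is already wide enough
    margin = (upper - lower) * step / 2
    return lower - margin, upper + margin


# Hypothetical CPU-load parameter: 5 of the last 10 alarms were false.
widened = adjust_tolerance(40.0, 60.0, false_alarms=5, alarms=10)
unchanged = adjust_tolerance(40.0, 60.0, false_alarms=0, alarms=10)
```

A fuller version would also tighten the range when real problems are being missed; that direction is left out because the text above only discusses the false-alarm case.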

It should be noted that the components of the system architecture are distributed and varied within and across network layers. They are dynamic and customizable.

Although the system is a work-in-progress that needs validation through testing, we believe that the model-driven approach will improve system dependability for many reasons, some of which are:
