Heartbeat Monitor v1.0


Status

This is a draft for discussion.

Objective

The Globus Heartbeat Monitor (HBM) is designed to provide a simple, highly reliable mechanism for monitoring the state of processes. The HBM is designed to detect and report the failure of processes that have identified themselves to the HBM. Originally designed for monitoring Globus system processes exclusively, the HBM design has been expanded to allow simultaneous monitoring of both Globus system processes and application processes associated with "user" computations.

In general, it is difficult to distinguish process failure from other failure events, such as network partitioning and host failure, on the basis of missing status reports alone. Strictly speaking, then, the HBM detects process failure only when the host and network connections are functioning properly; more generally, it monitors the availability of a process or host as evidenced by the received and missing heartbeats. The HBM also provides notification of process status exception events, so that recovery actions can be taken.

Requirements

Reliability and robustness were primary design goals for the Globus heartbeat monitor. For this reason, the heartbeat monitor is designed to depend neither on other Globus components (such as MDS) nor on any special fault-tolerance components.

Overview

The HBM consists of three types of components: the HBM Local Monitor (HBMLM), the HBM Client Library (HBMCL), and the HBM Data Collector (HBMDC).

There is one (Globus system) HBMLM running on each host, checking and reporting the status of the monitored system and application processes on that host.

The HBMCL is used to register each monitored client process with the (unique) HBMLM on the same host, and to unregister those processes as part of normal process termination. This registration process is necessary since we are using the HBMLM as an external monitor of the client processes; it has no way of knowing which processes are of interest unless and until it is told.

Each HBMLM periodically performs a review cycle in which it checks the status of the client processes it is monitoring, updates its local status information, and sends a report on each monitored process to one or more external agents (HBMDCs) specified at registration.

There can be any number of HBMDCs, typically one for tracking all of the monitored processes associated with the metacomputing environment, plus one for each distributed application. Each HBMDC receives the reports sent to it by the HBMLMs and incorporates those reports into its local repository. The HBMDC also infers the unavailability or failure of monitored components based on HBMLM reports that are expected but not received (time-out situations), and periodically adjusts the status of client processes accordingly. The status information in each repository is checkpointed regularly. In addition, the HBMDC can recognize specific exception status changes and generate appropriate notifications via callbacks.
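
The time-out inference performed by an HBMDC can be illustrated with a small sketch. This is not the HBMDC implementation: the function name infer_status and the two thresholds (counted in missed reporting intervals) are hypothetical, since the text does not state the real values. The status strings mirror the GLOBUS_HBM_PROCSTATUS_OVERDUE and GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NO_RPT values listed for the HBMDC checkpoint file.

```python
def infer_status(last_status, last_rcv_time, interval, now,
                 overdue_after=2, give_up_after=10):
    """Return the status a data collector might assign when reports stop arriving.

    overdue_after and give_up_after are hypothetical thresholds, counted in
    missed reporting intervals; the specification does not give real values.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if last_status.startswith("UNREGISTERED"):
        return last_status                  # terminal states are left alone
    missed = (now - last_rcv_time) / interval
    if missed >= give_up_after:
        return "UNREGISTERED_NO_RPT"        # reports missing for too long
    if missed >= overdue_after:
        return "OVERDUE"                    # reports late, client not yet given up
    return last_status
```

A collector would presumably run such a check periodically over every client in its repository, which is why the function takes the current time as an argument rather than reading the clock itself.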

The following diagram illustrates the relationships between client (HBC) processes, HBMLMs, and HBMDCs. Each host on which monitored HBC processes can run has one HBMLM. Two HBC processes, one a Globus/GUSTO system process and the other for application 1, that are running on Host A register with the HBMLM on that host. Similarly, two HBC processes on host B register with the HBMLM on that host. The HBMLMs monitor the registered HBC processes and periodically send reports to the appropriate HBMDCs. The HBMDCs that the HBMLMs report to may be on the same host as the HBMLM, or on a different host.

 

HBM Design Details

The HBM provides a number of capabilities to enhance the utility and robustness of the monitoring function. Registration of a monitored process can specify multiple HBMDCs to which status reports are to be sent. Multiple registrations can be used to achieve the same result. Also, at registration the HBMCL may specify a text string to be passed by the HBMLM to the HBMDC. This string is passed "as is" by the HBMDC to user-defined routines for processing, which normally would consist of setting up callback routines for client status exception handling. Re-registration of a process for reporting to a previously specified HBMDC results in substitution of the new message string for the one specified in the previous registration.

Unregistration is universal: if a monitored process is unregistered, it is reported as such to all HBMDCs it is being reported to. Selective unregistration of a monitored process is not supported.

Registration and unregistration of client processes with the HBMLMs are done using TCP. All heartbeat communication from HBMLMs to HBMDCs is done using UDP. UDP was chosen to avoid the additional processing and communications overhead associated with reliable protocols such as TCP and to allow the communications to be done in non-privileged mode.

The next three subsections describe each of the components of the HBM in greater detail, including an explanation of how to use them.

HBM Client Library (HBMCL)

The HBMCL can be used either as a library API that can be linked into the program for the monitored process, or as an independent program (globus-hbm-client-register) that accepts as one of its parameters the process id (PID) of the process to monitor. The HBMCL API consists of four procedures that are incorporated into the external registration program. Those procedures are:

The signatures for each of the above functions are given below. Following each signature is a short narrative of the functions performed by the procedure.

The client registration/unregistration program that uses the API (globus-hbm-client-register) takes the same parameters as the globus_hbm_client_register() API procedure. globus-hbm-client-register calls the API routines as appropriate to register or unregister the specified process. The command format for invoking globus-hbm-client-register is included below.

Thus, the life-cycle of a heartbeat client process (HBC) is as follows:

HBM Local Monitor (HBMLM)

The (unique) HBMLM (program globus-hbm-localmonitor) on each host waits for, and processes, registration and unregistration messages from HBMCLs. Also, the HBMLM periodically checks the status of the monitored processes, updates its internal repository appropriately, formats and sends heartbeat messages to the appropriate HBMDCs, and checkpoints the repository data. Note that there is normally one HBMLM process on each host that monitors all client processes on that host. An HBMLM process is active as a part of the Globus/GUSTO system, and should not be started by users as part of applications run on hosts participating in the Globus/GUSTO testbed. The full execution cycle of the HBMLM follows.

The HBMLM first reads in its parameters, then initializes the (TCP) port for receiving registration/unregistration messages and the (UDP) port for sending heartbeats. It then writes the parameter information to an external file for use by the HBMCL API. Next it loads the checkpoint data from the previous execution (if any), and then verifies the status of any registered HBC processes. (This is done on UNIX systems using ps; other systems are not currently supported.) The HBMLM next sends heartbeat report messages to the appropriate HBMDCs, after which it checkpoints. Finally, the HBMLM uses globus_poll_blocking() to control a (timed) select to wait for registration and unregistration messages.
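
The ps-based status check can be sketched as follows. This is only an illustration: the actual ps arguments used by the HBMLM are not given here, so the sketch assumes output of the form produced by `ps -e -o pid= -o time=` with procps-style TIME values ([[dd-]hh:]mm:ss). The helper names are hypothetical, and the status strings abbreviate the GLOBUS_HBM_PROCSTATUS_* values described in the report format.

```python
def parse_cpu_time(text):
    """Convert a ps TIME value such as '1-02:03:04' or '00:07' to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [float(p) for p in text.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)               # pad missing hours/minutes
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def parse_ps(output):
    """Map PID -> CPU seconds from 'pid time' lines (assumed ps output)."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            table[int(fields[0])] = parse_cpu_time(fields[1])
    return table

def classify(pid, previous_cpu, ps_table):
    """Classify a registered process from one review cycle to the next."""
    if pid not in ps_table:
        return "UNREGISTERED_ABEND"        # no longer alive on the host
    if ps_table[pid] > previous_cpu:
        return "ACTIVE"                    # alive and consuming CPU time
    return "BLOCKED"                       # alive, but no CPU time consumed
```

Comparing the CPU time against the value saved in the previous cycle is one plausible way to decide between "active" and "blocked"; the repository fields for CPU time and blocked time suggest something along these lines, but the exact rule is an assumption.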

When registration and unregistration messages are received, the internal table/repository data is updated appropriately, the process is reported to the appropriate HBMDCs, and the repository data is checkpointed. For registration messages, the monitored processes are added to the repository data if necessary and the HBMDC data for the monitored process is added (or updated if the process is already being reported to a specified HBMDC). For unregistration messages, the corresponding monitored process is flagged as unregistered, and counters are initialized for the number of times the monitored process has been reported to each HBMDC as unregistered. Unregistered processes are reported five times before they are purged from the repository. This is done to provide a higher level of confidence in the notification, given the use of the unreliable UDP protocol.
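
The "report five times, then purge" rule can be sketched with a toy in-memory repository. The class and function names are hypothetical and the send callback stands in for the UDP heartbeat; only the count of five comes from the text.

```python
UNREGISTER_REPORTS = 5   # unregistered processes are reported five times

class Registration:
    def __init__(self, pid, collectors):
        self.pid = pid
        self.unregistered = False
        # per-collector count of "unregistered" reports already sent
        self.unregister_reports = {dc: 0 for dc in collectors}

def unregister(registration):
    """Flag the process as unregistered and reset the per-collector counters."""
    registration.unregistered = True
    for dc in registration.unregister_reports:
        registration.unregister_reports[dc] = 0

def report_cycle(repository, send):
    """Send one report per (process, collector) and purge finished entries."""
    for pid in list(repository):
        reg = repository[pid]
        for dc, count in list(reg.unregister_reports.items()):
            if reg.unregistered and count >= UNREGISTER_REPORTS:
                continue                   # this collector has been told enough
            send(dc, pid, reg.unregistered)
            if reg.unregistered:
                reg.unregister_reports[dc] = count + 1
        if reg.unregistered and all(
                n >= UNREGISTER_REPORTS for n in reg.unregister_reports.values()):
            del repository[pid]            # every collector notified five times
```

Keeping a separate counter per collector matches the description of counters being initialized "to each HBMDC"; repeating the notification compensates for datagrams that UDP may silently drop.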

The formats of the HBMLM parameter file, the heartbeat report message, and the HBMLM repository file are all given later.
 
The main functions of the HBMLM are as follows:

HBM Data Collector (HBMDC) API

The HBMDC API is a library of functions that perform monitoring of HBCs and notification of exception events. The functions consist of a group of reentrant, threadsafe procedures that can be used to construct and maintain a number of data collector instances within a single process. These procedures can be incorporated into programs that use the API notification callback mechanisms to trigger responses to exceptional changes to the status of HBCs. The core procedures of the HBMDC API are:

These routines assume the existence of a user-coded procedure used for evaluating the client message strings of client processes and setting any appropriate callbacks based on client status events. A pointer to this procedure is provided as a parameter to globus_hbm_datacollector_create().

The signatures for each of the above functions are given below. Following each signature is a short narrative of the functions performed by the procedure.

A Globus/GUSTO HBMDC has been developed using the HBMDC API. This program (globus-hbm-datacollector) works in coordination with the HBMCL registration to provide e-mail notification when monitored Globus/GUSTO processes abend. The HBMCL includes the e-mail address of the responsible party in the client message field of the registration. The HBMLM passes this message to the HBMDC, which uses it to send the notifying e-mail if the HBC goes down. This program can be used as a model for developing other, application-specific, HBMDCs.

HBM Procedure Signatures and File Formats

The signatures of the HBM Procedures follow.
 

HBMCL Procedures and External Registration/Unregistration Program

HBMLM Program

The HBMLM program takes the following flagged parameters:

HBMDC Procedures

Following are the signatures for the core procedures:  

Summary of Protocol Messages For HBM

There are a number of messages exchanged between the HBC processes and their HBMLM, and between the HBMLMs and the appropriate HBMDCs. The fields in these messages are of the following types: character (1 byte), unsigned integer (32-bit, in network byte order), string (variable-length, null-terminated), and UTC time (Coordinated Universal Time, as a 32-bit unsigned integer in network byte order).
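
As an illustration of these field types, the following sketch encodes and decodes them with Python's struct module. The helper names are hypothetical, and treating UTC time as seconds since the Unix epoch is an assumption; the text only says it is a 32-bit unsigned integer.

```python
import struct

def pack_uint(value):
    """32-bit unsigned integer in network (big-endian) byte order."""
    return struct.pack("!I", value)

def pack_utc(seconds):
    """UTC time; assumed here to be seconds since the Unix epoch."""
    return struct.pack("!I", int(seconds))

def pack_string(text):
    """Variable-length string followed by a terminating null byte."""
    return text.encode("ascii") + b"\x00"

def unpack_uint(buf, offset):
    """Return (value, next_offset) for a 32-bit unsigned integer."""
    (value,) = struct.unpack_from("!I", buf, offset)
    return value, offset + 4

def unpack_string(buf, offset):
    """Return (text, next_offset) for a null-terminated string."""
    end = buf.index(b"\x00", offset)
    return buf[offset:end].decode("ascii"), end + 1
```

Because strings are variable-length, a message cannot be decoded with a single fixed struct format; each decoder returns the next offset so that fields can be read one after another.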

HBC Registration

For each HBC registration the HBMLM verifies that the IP Number for the registering HBC process (as provided as the source address for the message) is a valid IP Number for the host on which the HBMLM is running. The HBC processes register with the HBMLM for their host, providing the following information:
 
Field Contents
RegCmsgLength Length in bytes of the registration message.
Format: Unsigned integer.
RegCregCode Registration/unregistration Code
Format: Unsigned integer (4 bytes).

Values: 

    GLOBUS_HBM_MSGTYPE_REGISTER
RegCprocessPID PID of the HBC process.
Format: Unsigned integer.
RegCprocessName Name of the HBC process as returned by ps.
Format: String.
RegCreportName Name the HBMLM is to use when reporting this HBC to this HBMDC.
Format: String.
RegCDChbInterval Requested interval for generating heartbeats (in seconds).
Format: Unsigned integer.
RegCDCaddr Address of the HBMDC to which the HBMLM is to report the status of the HBC process (IP Number and port).
Format: sockaddr_in.
RegCDCmsg User message from the client to the Data Collector, can be used for designating callback events and responses.
Format: String.
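
A registration message with the fields listed above could be assembled roughly as follows. This is a sketch under several assumptions that the text does not settle: the numeric value of GLOBUS_HBM_MSGTYPE_REGISTER is a placeholder, the length field is assumed to count itself, and the sockaddr_in is assumed to be laid out as a 16-byte family/port/address/padding structure in network byte order. The function names are hypothetical.

```python
import socket
import struct

MSGTYPE_REGISTER = 1   # placeholder; the real constant's value is not given

def cstring(text):
    """Null-terminated string field."""
    return text.encode("ascii") + b"\x00"

def pack_sockaddr_in(host, port):
    """Assumed 16-byte layout: family, port, IPv4 address, 8 bytes of padding."""
    return struct.pack("!HH4s8x", socket.AF_INET, port, socket.inet_aton(host))

def build_registration(pid, process_name, report_name, interval,
                       dc_host, dc_port, dc_msg):
    """Assemble the fields in the order the table lists them."""
    body = (struct.pack("!II", MSGTYPE_REGISTER, pid)   # RegCregCode, RegCprocessPID
            + cstring(process_name)                     # RegCprocessName
            + cstring(report_name)                      # RegCreportName
            + struct.pack("!I", interval)               # RegCDChbInterval
            + pack_sockaddr_in(dc_host, dc_port)        # RegCDCaddr
            + cstring(dc_msg))                          # RegCDCmsg
    # RegCmsgLength: assumed to include the 4 bytes of the length field itself
    return struct.pack("!I", 4 + len(body)) + body
```

One such message would be sent over the TCP connection for each HBMDC the process is to be reported to.
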

After registering the HBC process, the HBMLM sends a simple acknowledgement with the following information:
 
Field Contents
RegAckCretCd Registration return code (GLOBUS_SUCCESS if successful, GLOBUS_FAILURE otherwise).
Format: Unsigned integer.

When the last registration message has been sent and acknowledged, the HBC sends a message to tell the HBMLM to either commit the registration or cancel it (based on the value of the require_all flag when globus_hbm_client_register() was called). The format of that message is:
 
Field Contents
RegCMsgLength Length in bytes of the commit/cancel message.
Format: Unsigned integer.
RegCommitCd Registration commit code.
Format: Unsigned integer. 

Values: 

    GLOBUS_HBM_MSGTYPE_REGISTER_COMMIT 
    GLOBUS_HBM_MSGTYPE_REGISTER_CANCEL

After processing the Commit/Cancel message and checkpointing (for Commit only), the HBMLM sends a simple acknowledgement with the following information and disconnects:
 
Field Contents
RegAckCretCd Registration return code (GLOBUS_SUCCESS if successful, GLOBUS_FAILURE otherwise).
Format: Unsigned integer.

After receiving this last acknowledgement, the HBC disconnects as well.

HBC Unregistration

For each HBC unregistration the HBMLM verifies that the IP Number for the unregistering HBC process (as provided as the source address for the message) is a valid IP Number for the host on which the HBMLM is running. To unregister, the HBC process sends an unregister message with the following information to the HBMLM with which it is registered:
 
Field Contents
UnregCmsgLength Length in bytes of the unregistration message.
Format: Unsigned integer.
UnregCregCode Registration/unregistration Code
Format: Unsigned integer. 

Values: 

    GLOBUS_HBM_MSGTYPE_UNREGISTER_NORMAL  
    GLOBUS_HBM_MSGTYPE_UNREGISTER_ABNORMAL
UnregCprocessPID PID of the HBC process.
Format: Unsigned integer.
UnregCprocessName Name of the HBC process.
Format: String.

After unregistering the HBC process, the HBMLM checkpoints and sends a simple acknowledgement with the following information:
 
Field Contents
RegAckCretCd Registration return code (GLOBUS_SUCCESS if successful, GLOBUS_FAILURE otherwise).
Format: Unsigned integer.

HBMLM Reports to HBMDC

After each monitoring cycle (forking a child that executes "ps" and returns the output for review via a pipe), each HBMLM reports the following information to the appropriate HBMDC(s):
 
Field Contents
HBMLM data:
RptLMmsgLength Length in bytes of the report message.
Format: Unsigned integer.
RptLMhostIPNum Primary IP Number for the host on which the HBMLM process is executing (the one used for communications when reporting). 
Format: Unsigned integer.
RptLMportNum Number of the (UDP) port used by the HBMLM process.
Format: Unsigned integer.
HBC data:
RptCprocessPID PID of HBC process. 
Format: Unsigned integer.
RptCprocessName Report name of the HBC process. 
Format: String.
RptCstatus Status of the HBC process as kept by the HBMLM. 
Format: Unsigned integer. 

Values: 

    GLOBUS_HBM_PROCSTATUS_ACTIVE (registered, "alive", and consuming CPU time). 
    GLOBUS_HBM_PROCSTATUS_BLOCKED (registered and "alive", but blocked, i.e., not consuming CPU time). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NORMAL (normal unregistration message received for this client). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABNORMAL (abnormal unregistration message received for this client). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABEND (no unregister message received, but process no longer alive on the host).
RptCregistrationTime Time at the HBMLM host when the HBC process was registered.
Format: UTC time.
RptCrptInterval Interval in seconds at which the HBMLM generates heartbeats for this client to this HBMDC.
Format: Unsigned integer.
RptCrptNum Sequence number of this heartbeat (the first heartbeat is number 1).
Format: Unsigned integer.
RptCblockedTime Time of the end of the latest review/report period in which the HBC process consumed CPU time.
Format: UTC time.
RptCcpuTime CPU time consumed by the process (host-specific units) as reported by ps.
Format: Unsigned integer.
RptCunregisterTime Time at which the HBMLM logged the HBC process as unregistered due to either the receipt of an unregistration message or because the process no longer exists on the host. 
Format: UTC time.
RptCnumUnregisterMsg Number of times this HBC process was reported as unregistered.
Format: Unsigned integer.
RptCDCmsgNum The sequence number of the message for the client (incremented only when the message changes).
Format: Unsigned integer.
RptCDCmsg Message for the client.
Format: String.
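
The report layout above lends itself to a table-driven decoder. The sketch below assumes one client per datagram, fields in exactly the order listed, UTC times carried as plain 32-bit unsigned integers, and a length field that counts the whole datagram; none of these is confirmed by the text. build_report exists only to produce a sample datagram for the parser.

```python
import struct

# (field name, kind) in table order; "u" = 32-bit unsigned integer or UTC
# time in network byte order, "s" = null-terminated string.
LAYOUT = [
    ("RptLMmsgLength", "u"), ("RptLMhostIPNum", "u"), ("RptLMportNum", "u"),
    ("RptCprocessPID", "u"), ("RptCprocessName", "s"), ("RptCstatus", "u"),
    ("RptCregistrationTime", "u"), ("RptCrptInterval", "u"), ("RptCrptNum", "u"),
    ("RptCblockedTime", "u"), ("RptCcpuTime", "u"), ("RptCunregisterTime", "u"),
    ("RptCnumUnregisterMsg", "u"), ("RptCDCmsgNum", "u"), ("RptCDCmsg", "s"),
]

def parse_report(datagram):
    """Decode one heartbeat report into a dict keyed by field name."""
    report, offset = {}, 0
    for name, kind in LAYOUT:
        if kind == "u":
            (report[name],) = struct.unpack_from("!I", datagram, offset)
            offset += 4
        else:
            end = datagram.index(b"\x00", offset)
            report[name] = datagram[offset:end].decode("ascii")
            offset = end + 1
    if report["RptLMmsgLength"] != len(datagram):
        raise ValueError("length field does not match datagram size")
    return report

def build_report(values):
    """Inverse of parse_report; computes the length field itself."""
    body = b""
    for name, kind in LAYOUT[1:]:
        if kind == "u":
            body += struct.pack("!I", values[name])
        else:
            body += values[name].encode("ascii") + b"\x00"
    return struct.pack("!I", 4 + len(body)) + body
```

Checking the length field against the datagram size is a cheap sanity test that is worth doing with UDP, where truncated or foreign datagrams can arrive on the port.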

Checkpoint file formats

Both the HBMLMs and the HBMDC(s) periodically checkpoint to files. Each HBMLM and HBMDC writes to a work checkpoint file, then renames it to the designated checkpoint filename. Records in the checkpoint files are separated by carriage return/line feeds, and fields are separated by semi-colons (except for the literal at the beginning of each record that gives the record type). The fields in the checkpoint files are of the following types: IP Number (as a string in "dot" notation as generated by inet_ntoa, typically with no leading zeros, e.g., 128.9.64.205 rather than 128.009.064.205), string (variable-length, terminated by the semi-colon field terminator or the carriage-return/line-feed record terminator), unsigned integer (as characters), and UTC time (Coordinated Universal Time displayed as YYYY/MM/DD hh:mm:ss GMT).
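
The write-then-rename scheme can be sketched as follows. The work-file suffix, the function names, and the single space between the record-type literal and the first field are assumptions made for illustration; the record and field separators and the time format follow the description above.

```python
import os
import time

def format_utc(seconds):
    """Render a timestamp as YYYY/MM/DD hh:mm:ss GMT."""
    return time.strftime("%Y/%m/%d %H:%M:%S GMT", time.gmtime(seconds))

def format_record(literal, fields):
    """Record-type literal, then semicolon-separated fields, then CR/LF.

    The space after the literal is an assumption.
    """
    return literal + " " + ";".join(str(f) for f in fields) + "\r\n"

def write_checkpoint(path, records):
    """Write all records to a work file, then rename it over the checkpoint."""
    work_path = path + ".work"             # hypothetical work-file name
    # newline="" keeps the explicit "\r\n" from being translated
    with open(work_path, "w", newline="") as f:
        for literal, fields in records:
            f.write(format_record(literal, fields))
        f.flush()
        os.fsync(f.fileno())               # make sure the data is on disk first
    os.replace(work_path, path)            # atomic rename on POSIX file systems
```

Writing to a work file first means that a crash during checkpointing leaves the previous checkpoint intact, so a restarted monitor never loads a half-written file.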

HBMLM checkpoint file

The HBMLM checkpoint file has the following data:
 
Field Contents
HBMLM record:
"LM Data:" String literal to designate the record type.
LMhostIPNum Primary IP Number for the host on which the HBMLM process is executing (the one used for communications when reporting). 
Format: IP Number.
LMhostName Fully qualified name of the host on which the HBMLM process is executing. 
Format: String.
LMportNumReg Number of the TCP port used by the HBMLM process for registrations/unregistrations.
Format: Unsigned integer.
LMportNumRpt Number of the UDP port used by the HBMLM process for sending heartbeats and receiving messages from Data Collectors (receiving not yet implemented).
Format: Unsigned integer.
LMreportInterval Default interval in seconds at which the HBMLM monitors and reports on registered HBC processes. 
Format: Unsigned integer.
LMClientsCt Number of HBC processes that the HBMLM is monitoring (and which are included in the checkpoint file). 
Format: Unsigned integer.
LMDCsCt Number of DC entries total for all clients (and which are included in the checkpoint file).
Format: Unsigned integer.
LMcheckpointTime Time at which the checkpoint was done.
Format: UTC time.
HBC record:
"CL Data:" String literal to designate the record type.
CprocessPID PID of HBC process. 
Source: HBC registration message. 

Format: Unsigned integer.

CprocessName Name of the HBC process. 
Source: HBC registration message/ps. 

Format: String.

CprocessStatus Process status as determined at the last evaluation. 
Source: derived from ps

Format: Unsigned integer. 

Values: 

    GLOBUS_HBM_PROCSTATUS_ACTIVE (registered, "alive", and consuming CPU time). 
    GLOBUS_HBM_PROCSTATUS_BLOCKED (registered and "alive", but blocked, i.e., not consuming CPU time). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NORMAL (normal unregistration message received for this client). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABNORMAL (abnormal unregistration message received for this client). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABEND (no unregister message received, but process no longer alive on the host).
CblockedTime Time of the end of the latest review/report period in which the HBC process consumed CPU time.
Format: UTC time.
CcpuTime CPU time consumed by the process (host-specific units) as reported by ps.
Format: Unsigned integer.
CDCsCt Number of Data Collector records/entries for this client.
Format: Unsigned integer.
HBMDC record:
"DC Data:" String literal to designate the record type.
DChostIPnum IP Number for the HBMDC to which the HBMLM is to report the status of the HBC process.
Source: HBC registration message. 

Format: IP Number.

DCportNum Port number for the HBMDC to which the HBMLM is to report the status of the HBC process.
Source: HBC registration message. 

Format: Unsigned integer.

DCprocessNameRpt The name used for this process when generating heartbeats to this Data Collector.
Format: String.
DCregistrationTime Time at the HBMLM host when the HBC process was registered.
Format: UTC time.
DCrptInterval Interval in seconds at which heartbeats for this Client are to be sent to this Data Collector.
Format: Unsigned integer.
DCrptNum Sequence number of the last heartbeat sent to this Data Collector for this Client.
Format: Unsigned integer.
DCrptTimeLast Time that the last heartbeat was sent to this Data Collector for this Client.
Format: UTC time.
DCrptTimeNext Time that the next heartbeat is to be sent to this Data Collector for this Client.
Format: UTC time.
DCunregisterStatus Unregister status of the HBC process with respect to this HBMDC.
Format: Unsigned integer. 

Values: 

    GLOBUS_HBM_UNREGISTERSTATUS_ACTIVE (signifies that the process is not unregistered). 
    GLOBUS_HBM_UNREGISTERSTATUS_NORMAL (normal unregistration message received for this client). 
    GLOBUS_HBM_UNREGISTERSTATUS_ABNORMAL (abnormal unregistration message received for this client). 
    GLOBUS_HBM_UNREGISTERSTATUS_ABEND (no unregister message received, but process no longer alive on the host).
DCunregisterTime Time at which the HBMLM logged the HBC process as unregistered due to either the receipt of an unregistration message or because the process no longer exists on the host. 
Format: UTC time.
DCnumUnregisterMsg Number of times this HBC process has been reported as unregistered to this HBMDC.
Format: Unsigned integer.
DCmsgNum The number of the message from the client to the Data Collector.
Format: Unsigned integer.
DCmsg The message from the client to the Data Collector.
Format: String.

For Globus processes, the values of the hostName, processName, and processPID fields define a unique key by which monitored processes can be identified.

HBMDC checkpoint file

The HBMDC checkpoint file has the following data:
 
Field Contents
HBMDC record:
DChostIPNum IP Number for the host on which the HBMDC process is executing (the one used for communications). 
Format: IP Number.
DChostName Name of the host on which the HBMDC process is executing. 
Format: String.
DCportNum Number of the (UDP) port used by the HBMDC process.
Format: Unsigned integer.
DCcheckpointTime Time at which the checkpoint was done.
Format: UTC time.
DCnumHBMLM Number of HBMLM records. 
Format: Unsigned integer.
DCnumHBC Number of HBC process records. 
Format: Unsigned integer.
HBMLM record(s):
LMhostIPNum Primary IP Number for the host on which the HBMLM process is executing. 
Format: IP Number.
LMportNum Number of the (UDP) port used by the HBMLM process.
Format: Unsigned integer.
LMlastReportRcvTime Time at which the last report was received from this HBMLM.
Format: UTC time.
LMnumClients Number of HBC processes that the HBMLM is reporting to this HBMDC. 
Format: Unsigned integer.
HBC record(s):
ChostIPNum Primary IP Number for the host on which the HBC process is executing. 
Format: IP Number.
ChostName Name of the host on which the HBC process is executing. 
Format: String.
CprocessName Name of the HBC process. 
Format: String.
CprocessPID PID of HBC process. 
Format: Unsigned integer.
CprocessStatus Process status as determined at the last evaluation. 
Format: Unsigned integer. 

Values: 

    GLOBUS_HBM_PROCSTATUS_ACTIVE (registered, "alive", and consuming CPU time). 
    GLOBUS_HBM_PROCSTATUS_BLOCKED (registered and "alive", but blocked, i.e., not consuming CPU time). 
    GLOBUS_HBM_PROCSTATUS_OVERDUE (client not unregistered, but heartbeat reports from HBMLM are late/missing). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NORMAL (normal unregistration message received for this client). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABNORMAL (abnormal unregistration message received for this client). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABEND (no unregister message received, but process no longer alive on the host). 
    GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NO_RPT (client considered unregistered because heartbeat reports are late/missing). 
CregistrationTime Time at the HBMLM host when the HBC process was registered.
Format: UTC time.
CrptInterval Interval in seconds at which heartbeats are to be generated by the HBMLM.
Format: Unsigned integer.
CblockedTime Time of the end of the latest review/report period in which the HBC process consumed CPU time.
Format: UTC time.
CcpuTime CPU time consumed by the process (host-specific units) as reported by ps.
Format: Unsigned integer.
ClastRptSeqNum Sequence Number of the last heartbeat sent for this process that was received.
Format: Unsigned integer.
ClastRptRcvTime Time at which the last heartbeat sent for this process that was received arrived.
Format: UTC time.
CunregisterTime Time at which the HBMLM logged the HBC process as unregistered due to either the receipt of an unregistration message or because the process no longer exists on the host. 
Format: UTC time.
CmsgNum Number of the client message.
Format: Unsigned integer.
Cmsg Client message.
Format: Null-terminated string.
 
 
 

Outstanding Issues