Heartbeat Monitor v1.0
Status
This is a draft for discussion.
Objective
The Globus Heartbeat Monitor (HBM) is designed to provide
a simple, highly reliable mechanism for monitoring the state of processes.
The HBM is designed to detect and report the failure of processes that
have identified themselves to the HBM. Originally designed for monitoring
Globus system processes exclusively, the HBM design has been expanded to
allow simultaneous monitoring of both Globus system processes and application
processes associated with "user" computations.
It is difficult in general on the basis of missing status
reports to distinguish process failure from other failure events, such
as network partitioning and host failure. Thus, strictly speaking the HBM
detects process failure when the host and network connections are functioning
properly, and also monitors the availability of a process or host as evidenced
by the received and missing heartbeats. The HBM also provides notification
of process status exception events, so that recovery actions can be taken.
Requirements
Reliability and robustness were primary design goals for
the Globus heartbeat monitor. For this reason, the heartbeat monitor is
designed to have no dependence on other Globus components (such as MDS),
nor any special fault-tolerance components.
Overview
The HBM consists of three types of components:
-
HBM Client Library (HBMCL),
-
HBM Local Monitor (HBMLM), and
-
HBM Data Collector (HBMDC).
There is one (Globus system) HBMLM running on each host,
checking and reporting the status of the monitored system and application
processes on that host. The HBMCL is used to register each monitored client
process with the (unique) HBMLM on the same host, and to unregister those
processes as part of normal process termination. This registration process
is necessary since we are using the HBMLM as an external monitor of the
client processes; it has no way of knowing which processes are of interest
unless and until it is told. Each HBMLM periodically performs a review
cycle in which it checks the status of the client processes it is monitoring,
updates its local status information, and sends a report on each monitored
process to one or more external agents (HBMDCs) specified at registration.
There can be any number of HBMDCs, typically one for tracking all of the
monitored processes associated with the metacomputing environment, plus
one for each distributed application. Each HBMDC receives the reports sent
to it by the HBMLMs and incorporates those reports into its local repository.
The HBMDC also infers the unavailability or failure of monitored components
based on HBMLM reports that are expected but not received (time-out situations),
and periodically adjusts the status of client processes accordingly. The
status information in each repository is checkpointed regularly. In addition,
the HBMDC can recognize specific exception status changes and generate
appropriate notifications via callbacks.
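The time-out inference described above can be sketched as a comparison between the time since the last received report and two thresholds. This is only an illustration: the status names, the function hb_classify(), and the idea of separate "late" and "missing" thresholds supplied by the caller are assumptions made for the sketch, not the HBMDC's actual internals.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical status values; the real HBMDC status codes are not shown. */
typedef enum { HB_ON_TIME, HB_LATE, HB_MISSING } hb_status_t;

/*
 * Classify a client from the time of its last received report.
 * late_secs and missing_secs are caller-chosen thresholds with
 * late_secs <= missing_secs.
 */
hb_status_t hb_classify(time_t now, time_t last_report,
                        unsigned int late_secs, unsigned int missing_secs)
{
    assert(late_secs <= missing_secs);
    double elapsed = difftime(now, last_report);
    if (elapsed >= (double)missing_secs) return HB_MISSING;
    if (elapsed >= (double)late_secs) return HB_LATE;
    return HB_ON_TIME;
}
```

A collector built this way would run the check for every client at each review interval, since a missing report produces no incoming event of its own.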
The following diagram illustrates the relationships between
client (HBC) processes, HBMLMs, and HBMDCs. Each host on which monitored
HBC processes can run has one HBMLM. Two HBC processes, one a Globus/GUSTO
system process and the other for application 1, that are running on Host
A register with the HBMLM on that host. Similarly, two HBC processes on
host B register with the HBMLM on that host. The HBMLMs monitor the registered
HBC processes and periodically send reports to the appropriate HBMDCs.
The HBMDCs that the HBMLMs report to may be on the same host as the HBMLM,
or on a different host.
HBM Design Details
The HBM provides a number of capabilities to enhance the
utility and robustness of the monitoring function. Registration of a monitored
process can specify multiple HBMDCs to which status reports are to be sent.
Multiple registrations can be used to achieve the same result. Also, at
registration the HBMCL may specify a text string to be passed by the HBMLM
to the HBMDC. This string is passed "as is" by the HBMDC to user-defined
routines for processing, which normally would consist of setting up callback
routines for client status exception handling. Re-registration of a process
for reporting to a previously specified HBMDC results in substitution of
the new message string for the one specified in the previous registration.
Unregistration is universal, i.e., if a monitored process is unregistered
then it is reported as such to all HBMDCs it is being reported to -- selective
unregistration of a monitored process is not supported. Registration and
unregistration of client processes with the HBMLMs is done using TCP. All
heartbeat communication from HBMLMs to HBMDCs is done using UDP. UDP was
chosen to avoid the additional processing and communications overhead associated
with reliable protocols such as TCP and to allow the communications to
be done in non-privileged mode. The next three subsections will describe
each of the components of the HBM in greater detail, including an explanation
of how to use them.
HBM Client Library (HBMCL)
The HBMCL can be used either as a library API that can be
linked into the program for the monitored process, or as an independent
program (globus-hbm-client-register) that accepts as one of its
parameters the process id (PID) of the process to monitor. The HBMCL API
consists of four procedures that are incorporated into the external registration
program. Those procedures are:
-
globus_module_activate(GLOBUS_HBM_CLIENT_MODULE),
to perform required activation and initialization;
-
globus_hbm_client_register(), for registering the
client process with the HBMLM;
-
globus_hbm_client_unregister_all(), for unregistering
the client process with the HBMLM; and
-
globus_module_deactivate(GLOBUS_HBM_CLIENT_MODULE),
to perform required deactivation and clean-up.
The signatures for each of the above functions are given below.
Following each signature is a short narrative of the functions performed
by the procedure.
The client registration/unregistration program that uses
the API (globus-hbm-client-register) takes the same parameters as
the globus_hbm_client_register() API procedure. globus-hbm-client-register
calls the API routines as appropriate to register or unregister the specified
process. The command format for invoking globus-hbm-client-register
is included below.
Thus, the life-cycle of a heartbeat client process (HBC)
is as follows:
-
HBC Startup and registration. On startup, an HBC process
must register (or be registered) with the HBMLM by means of either the
HBMCL API or the program globus-hbm-client-register.
-
HBC Execution. During execution the HBC is monitored
by the HBMLM, and requires no participation or activity on the part of
the HBC.
-
HBC Termination. If an HBC process terminates normally
(or abnormally, but the error condition is trapped), it should unregister
with the HBMLM by means of either the HBMCL API or the program globus-hbm-client-register.
If the HBC abnormally terminates without unregistering then the termination
is noted by the HBMLM, and the process is marked as abended. The termination
of the HBC is reported to the appropriate HBMDC(s) and the entry at the
HBMLM is deleted.
HBM Local Monitor (HBMLM)
The (unique) HBMLM (program globus-hbm-localmonitor)
on each host waits for, and processes, registration and unregistration
messages from HBMCLs. Also, the HBMLM periodically checks the status of
the monitored processes, updates its internal repository appropriately,
formats and sends heartbeat messages to the appropriate HBMDCs, and checkpoints
the repository data. Note that there is normally one HBMLM process on each
host that monitors all client processes on that host. An HBMLM process
is active as a part of the Globus/GUSTO system, and should not be started
by users as part of applications run on hosts participating in the Globus/GUSTO
testbed. The full execution cycle of the HBMLM follows.
The HBMLM first reads in its parameters, initializes the
(TCP) port for receiving registration/unregistration messages and the (UDP)
port for sending heartbeats. It then writes the parameter information
to an external file for use by the HBMCL API. Next it loads the checkpoint
data from the previous execution (if any), and then verifies the status
of any registered HBC processes. (This is done on UNIX systems using ps;
other systems are not currently supported.) The HBMLM next sends heartbeat
report messages to the appropriate HBMDCs, after which it checkpoints.
Finally, the HBMLM uses globus_poll_blocking() to control a (timed)
select to wait for registration and unregistration messages.
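One plausible way to derive the timeout for that timed select is to wait until the earliest next-heartbeat time among the monitored clients. The sketch below assumes the next-heartbeat times are available as a plain array; the function name hb_select_timeout() and that layout are hypothetical and do not describe the HBMLM's actual tables.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/*
 * Return the timeout for the timed select: the time remaining until the
 * earliest next-heartbeat time, or zero if a heartbeat is already due.
 * next_times holds one next-heartbeat time per monitored client.
 */
struct timeval hb_select_timeout(const time_t *next_times, size_t n,
                                 time_t now)
{
    assert(next_times != NULL && n > 0);
    time_t earliest = next_times[0];
    for (size_t i = 1; i < n; i++) {
        if (next_times[i] < earliest) earliest = next_times[i];
    }
    struct timeval tv = { 0, 0 };
    if (earliest > now) tv.tv_sec = earliest - now;
    return tv;
}
```

A zero timeout makes select return immediately, so an overdue review cycle is not delayed by the wait for registration traffic.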
When registration and unregistration messages are received,
the internal table/repository data is updated appropriately, the process
is reported to the appropriate HBMDCs, and the repository data is checkpointed.
For registration messages, the monitored processes are added to the repository
data if necessary and the HBMDC data for the monitored process is added
(or updated if the process is already being reported to a specified HBMDC).
For unregistration messages, the corresponding monitored process is flagged
as unregistered, and counters initialized for the number of times the monitored
process has been reported to each HBMDC as unregistered. Unregistered processes
are reported five times before they are purged from the repository. This
is done to provide a higher level of confidence in the notification, given
the use of the unreliable UDP protocol.
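The repeat-then-purge rule can be illustrated with a per-HBMDC counter. The structure and function names below are hypothetical; only the count of five comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UNREG_REPORTS_BEFORE_PURGE 5  /* count stated in the text above */

/* Hypothetical per-HBMDC bookkeeping for an unregistered client. */
typedef struct {
    unsigned int unreg_reports_sent;
} dc_subentry_t;

/*
 * Record that one more "unregistered" report was sent to this HBMDC.
 * Returns true once the report has been repeated often enough that the
 * entry may be purged.
 */
bool unreg_report_sent(dc_subentry_t *dc)
{
    assert(dc != NULL);
    if (dc->unreg_reports_sent < UNREG_REPORTS_BEFORE_PURGE)
        dc->unreg_reports_sent++;
    return dc->unreg_reports_sent >= UNREG_REPORTS_BEFORE_PURGE;
}
```

Repeating the report trades a small amount of extra traffic for a lower chance that a single lost datagram hides the unregistration from a data collector.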
The formats of the HBMLM parameter file, the heartbeat
report message, and the HBMLM repository file are all given later.
The main functions of the HBMLM are as follows:
-
Initialization. The HBMLM initially reads its parameter
file, first checking for hbmlm.conf.<hostname>, where <hostname>
is the name of the host on which it is running. If that file is not found,
then it looks for hbmlm.conf. If no parameter file is found then
it terminates abnormally, otherwise it reads in the parameters from the
file. (Error messages are written to a log file). If either of the ports
designated in the parameter file for receiving registration/unregistration
messages (using TCP) or for sending heartbeat report messages (using UDP)
is not available, then the HBMLM finds available substitution ports. After
the ports have been successfully allocated the HBMLM saves the (possibly
updated) parameter information in the file hbmlm.conf.<hostname>,
rewriting that file if necessary. The HBMLM then looks for an existing
HBMLM checkpoint file, and if it finds one that was created/updated after
the most recent reboot of the system, rebuilds an internal table/repository
of monitored client processes in memory from the checkpoint data. Included
in the checkpoint data for each (monitored) client process are the times
at which the next heartbeat is to be generated for each HBMDC. If a rebuild
is done, then all of these next heartbeat times (for all HBMCL/HBMDC combinations)
are set to the current time. After completing this initialization, the
HBMLM checks the status of the monitored processes, sends heartbeat report
messages, and checkpoints the repository data. Then it waits for registration
messages.
-
Process HBC registration messages. When it receives
a registration message (using TCP), the HBMLM creates a work entry for
the client and the specified Data Collector. It then sends an acknowledgement
in reply to the registration message and waits for another message. For
each subsequent registration message it builds another Data Collector sub-entry
and sends a success or failure reply as appropriate. If it receives a registration
commit message then it incorporates the work entry data into its repository
and sends an acknowledgement. If it receives a registration cancel message
or the TCP connection goes down then it discards the work data. It then
sends an acknowledgement to the commit or cancel message and closes the
TCP connection. Finally, an initial heartbeat report message for the HBC
is sent to each appropriate HBMDC. When a registration is received, the
next heartbeat time for the client entry is updated to the earliest next
heartbeat time for any of its HBMDC sub-entries. The next heartbeat times
for the data collector sub-entries are set as follows. For a new HBMDC
sub-entry, if the sub-entry is the first for the Client, then the next
heartbeat time is initially set to the earliest next heartbeat time for
any Client (if there are no other Clients, then the current time is used),
otherwise it is set to the next heartbeat time for the Client (earliest
next heartbeat time for any of its HBMDC sub-entries). Finally, for new
HBMDC sub-entries and old ones for which the heartbeat interval decreased,
the next heartbeat time is decremented by the (new) heartbeat interval
for the HBMDC until it is earlier than the current time.
-
Perform HBMLM Monitor/Report Cycle. If the HBMLM is
monitoring any HBCs, then it periodically verifies the status of the HBC
processes it is monitoring (using ps on UNIX systems), based on
the heartbeat intervals in the HBMDC sub-entries. This is done as follows.
When the next heartbeat time for any client arrives, the status for all
clients is checked. After updating the status of the monitored HBCs, the
HBMLM formats and sends heartbeat report messages (via UDP) on each HBC
to the appropriate HBMDCs, updates the corresponding next heartbeat times,
and checkpoints.
-
Checkpointing. The HBMLM checkpoints to a work file,
and then unlinks the checkpoint file name from the old file, links it to
the new file, and unlinks the work file name. This is done to help ensure
that the checkpoint file always reflects a complete and consistent view
of the client processes being monitored by the HBMLM.
-
Process HBC unregistration messages. When the HBMLM
receives an unregistration message it sets the status for the process to
the appropriate unregistration value and records the time at which the
message was received. If the HBMLM determines in the Monitor/Report cycle
that an HBC has abnormally terminated (in a manner that was not caught),
the HBMLM will set the status of the HBC process to reflect that fact,
and will set the unregistration time to the time at which the abend was
detected. Unregistered HBCs are reported 5 times to the appropriate HBMDC(s)
and then removed from the HBMLM table. (The multiple reporting is to allow
for the non-guaranteed delivery of UDP messages).
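The checkpointing sequence listed above (write a work file, unlink the old checkpoint name, link it to the new file, unlink the work name) can be sketched with the POSIX link() and unlink() calls. The function name checkpoint_write(), the file names, and the use of a text string as checkpoint data are assumptions for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write data to work_path, then make ckpt_path refer to the new file.
 * Returns 0 on success, -1 on failure.
 */
int checkpoint_write(const char *ckpt_path, const char *work_path,
                     const char *data)
{
    assert(ckpt_path != NULL && work_path != NULL && data != NULL);

    FILE *fp = fopen(work_path, "w");
    if (fp == NULL) return -1;
    size_t len = strlen(data);
    int ok = fwrite(data, 1, len, fp) == len;
    if (fclose(fp) != 0) ok = 0;
    if (!ok) { unlink(work_path); return -1; }

    /* Unlink the checkpoint name from the old file (absent on first run). */
    if (unlink(ckpt_path) != 0 && errno != ENOENT) return -1;
    /* Link the checkpoint name to the new, fully written file. */
    if (link(work_path, ckpt_path) != 0) return -1;
    /* Unlink the work name; the data stays reachable through ckpt_path. */
    return unlink(work_path);
}
```

Because the checkpoint name only ever points at a completely written file, a reader never sees a half-written checkpoint. There is a short window between the unlink and the link in which no checkpoint name exists; on POSIX systems a single rename() call would close that window, but that is a different technique from the one described here.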
HBM Data Collector (HBMDC) API
The HBMDC API is a library of functions that perform monitoring
of HBCs and notification of exception events. The functions consist of
a group of reentrant, threadsafe procedures that can be used to construct
and maintain a number of data collector instances within a single process.
These procedures can be incorporated into programs that use the API notification
callback mechanisms to trigger responses to exceptional changes to the
status of HBCs. The core procedures of the HBMDC API are:
-
globus_module_activate(GLOBUS_HBM_DATA_COLLECTOR_MODULE),
to perform required activation and initialization;
-
globus_hbm_datacollector_create(), to create and initialize
a data collector instance,
-
globus_hbm_datacollector_set_clientevent_callback(),
for setting callbacks based on client process events;
-
globus_hbm_datacollector_user_checkpoint(), for creating
an application-triggered checkpoint file;
-
globus_hbm_datacollector_clear_unregistered_clients(),
for removing unregistered client processes from the table;
-
globus_hbm_datacollector_destroy(), for closing the
heartbeat port and freeing memory associated with the data collector instance;
and
-
globus_module_deactivate(GLOBUS_HBM_DATA_COLLECTOR_MODULE),
to perform required deactivation and clean-up.
These routines assume the existence of a user-coded procedure
used for evaluating the client message strings of client processes and
setting any appropriate callbacks based on client status events. A pointer
to this procedure is provided as a parameter to globus_hbm_datacollector_create().
The signatures for each of the above functions are given
below. Following each signature is a short narrative of the functions performed
by the procedure.
A Globus/GUSTO HBMDC has been developed using the HBMDC
API. This program (globus-hbm-datacollector) works in coordination
with the HBMCL registration to provide e-mail notification when monitored
Globus/GUSTO processes abend. The HBMCL registration includes the e-mail address
of the responsible party in the client message field. The HBMLM passes this
field to the HBMDC, which uses it to send the notifying e-mail if the HBC
goes down. This program can be used as a model
for developing other, application-specific, HBMDCs.
HBM Procedure Signatures and File Formats
The signatures of the HBM Procedures follow.
HBMCL Procedures and External Registration/Unregistration
Program
Before the first call to a client API procedure (typically
globus_hbm_client_register()), a call must be made to globus_module_activate(GLOBUS_HBM_CLIENT_MODULE).
Similarly, after the last call to any HBM client API procedure [and not
before] (typically globus_hbm_client_unregister_all()), a call must
be made to globus_module_deactivate(GLOBUS_HBM_CLIENT_MODULE).
int globus_hbm_client_register(
int client_pid,
char *dc_spec_str,
globus_bool_t require_all,
char *lm_conffile,
globus_hbm_client_regerr_t *hbm_reg_return_ptr)
The parameters are used as follows:
client_pid:
PID of the client process to be registered.
dc_spec_str:
rsl string with the data collector information. This
string is of the form (HBMDCdata) or +(HBMDC1data)(HBMDC2data)...(HBMDCndata),
where the HBMDCidata is of the form &(keyword1=value1)(keyword2=value2)...(keywordk=valuek).
(Note that additional matching parentheses can be added for readability).
The valid keywords and corresponding value types are as follows:
host (required): the host on which the HBMDC is
running, specified either as a fully defined host name (e.g., globus.mcs.anl.gov),
or as a valid IP Number for the HBMDC in the standard format (e.g., 220.008.129.129).
portnum (required): the port number for the HBMDC
as a decimal integer, e.g., 1234.
interval (optional): the interval at which heartbeats
are to be generated in seconds, e.g., 10. If omitted then the HBMLM will
use the default value specified in the .conf file. If below the allowable
minimum (above the allowable maximum) the HBMLM will use the allowable
minimum (maximum).
rptname (optional): the name to use for the process
when generating heartbeats to the specified HBMDC.
message (optional): the message to be sent by
the HBMLM to the HBMDC in each heartbeat for the client, maximum length
of 256 characters. If omitted then an empty string will be used.
Each of the value fields must consist entirely of printable
characters. If it contains any character other than [a-z][A-Z][0-9]._@
then it must be delimited by single quotes ('), double quotes ("), or any
other character signalled by a caret (^). Two consecutive occurrences of
a delimiter inside a string are interpreted as a single occurrence of the
character in the string, e.g., "dog'cat" and 'dog''cat' are the same string.
require_all:
if this flag is GLOBUS_TRUE then no registrations
are done unless all of the data collector specifications validate properly.
If it is GLOBUS_FALSE, then registrations are done for those that
are ok, and the others are ignored. This functionality is implemented in
conjunction with the HBMLM, which accepts and validates the registrations
as they are received, but does not commit them until it gets a REGISTRATION_COMMIT
message. If it gets a REGISTRATION_CANCEL message, or the TCP connection
between the registration process and the HBMLM goes down before a REGISTRATION_COMMIT
message is received, then the registrations are discarded.
lm_conffile:
the (optional) local monitor configuration file name,
designated here <filename>. If <hostname> is the fully defined name
of the host on which the client and Local Monitor are running, then the
file <filename>.<hostname> is used as the source of the port
number for establishing a TCP connection with the hbmlm. If this option
is not used then first the current directory (./) and then GLOBUS_SYSCONFDIR
is checked for the file hbmlm.conf.<hostname>, and the first
found file is used as the source of the port number.
hbm_reg_return_ptr:
a pointer to a defined type consisting of a structure
for returning error information, defined as follows.
typedef struct {
int num_reg_ok;
int first_bad_dc;
int dc_error;
} globus_hbm_client_regerr_t;
The fields are set by globus_hbm_client_register() as follows:
num_reg_ok: The number of data collectors for
which the client was registered.
first_bad_dc: The number of the first data collector
specification (i for HBMDCidata) with an error.
dc_error: The number of the error recognized for
HBMDCidata.
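The value-delimiting rule for dc_spec_str described above can be sketched as follows. This is an illustrative helper, not part of the HBMCL: the function names rsl_quote_value() and dc_spec_single() are hypothetical, the choice of the double quote as delimiter is one of the options the text allows, and the single-collector layout follows a literal reading of the "(HBMDCdata)" form.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* True if c may appear in an undelimited value: [a-z][A-Z][0-9]._@ */
int rsl_plain_char(unsigned char c)
{
    return isalnum(c) || c == '.' || c == '_' || c == '@';
}

/*
 * Copy value into out, adding double-quote delimiters when required and
 * doubling any embedded double quote. Returns 0 on success, -1 if the
 * value has a non-printable character or out is too small.
 */
int rsl_quote_value(const char *value, char *out, size_t outsz)
{
    assert(value != NULL && out != NULL && outsz > 0);
    int need_quotes = 0;
    for (const char *p = value; *p; p++) {
        if (!isprint((unsigned char)*p)) return -1;
        if (!rsl_plain_char((unsigned char)*p)) need_quotes = 1;
    }
    size_t n = 0;
    if (need_quotes) { if (n + 1 >= outsz) return -1; out[n++] = '"'; }
    for (const char *p = value; *p; p++) {
        size_t need = (need_quotes && *p == '"') ? 2 : 1;
        if (n + need >= outsz) return -1;   /* keep room for the '\0' */
        if (need == 2) out[n++] = '"';      /* double an embedded quote */
        out[n++] = *p;
    }
    if (need_quotes) { if (n + 1 >= outsz) return -1; out[n++] = '"'; }
    out[n] = '\0';
    return 0;
}

/* Format a one-collector spec with only the two required keywords. */
int dc_spec_single(char *out, size_t outsz, const char *host,
                   unsigned int portnum)
{
    char qhost[256];
    if (rsl_quote_value(host, qhost, sizeof qhost) != 0) return -1;
    int n = snprintf(out, outsz, "(&(host=%s)(portnum=%u))", qhost, portnum);
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

Quoting only when a value contains a character outside the plain set keeps host names and port numbers readable while still allowing free-form message strings.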
The procedure returns an integer return code, GLOBUS_SUCCESS
if all registrations were completed successfully, otherwise GLOBUS_FAILURE.
The procedure globus_hbm_client_register() registers
the process with the local HBMLM based on the information provided. It
obtains the (local) host name and IP number, then gets the HBMLM port number
for registrations and unregistrations from the HBMLM parameter file. After
initial validations, a TCP connection is established with the HBMLM. The
registration of the client for one or more Data Collectors is then done
as follows:
If the require_all flag is set to GLOBUS_TRUE:
Each registration message is formatted and sent in turn.
After sending the registration message, a 4 byte integer
code response message is waited for.
If the response is GLOBUS_FAILURE, then a REGISTRATION_CANCEL
message is sent and the registration is not done.
If the response is GLOBUS_SUCCESS and there are
more Data Collectors to register for, then the next registration message
is formatted, sent, and processed as above.
If an invalid Data Collector specification is detected
by globus_hbm_client_register() then a REGISTRATION_CANCEL message
is sent.
If the last registration message has been acknowledged
by GLOBUS_SUCCESS and there are no more Data Collector specifications
for this registration, then a REGISTRATION_COMMIT message is sent.
After sending the REGISTRATION_COMMIT or REGISTRATION_CANCEL
message, it waits for a GLOBUS_SUCCESS or GLOBUS_FAILURE.
Upon receiving it, it closes the TCP connection, sets the return values
as appropriate, and returns.
If the require_all flag is set to GLOBUS_FALSE:
Processing is the same as if it was set to GLOBUS_TRUE,
with the exception that a GLOBUS_FAILURE response to a registration
message does not result in a REGISTRATION_CANCEL message being sent. Rather,
a registration message is sent for each of the (validated) Data Collectors
specified, and a REGISTRATION_COMMIT message is sent unless all of the
registration responses were GLOBUS_FAILURE, in which case a REGISTRATION_CANCEL
message will be sent.
If at any time a response to a message is not received within
30 seconds, a timeout will occur, and it will be assumed that registration
cannot be completed. The TCP connection will then be closed.
The procedure returns GLOBUS_SUCCESS if registration
was successful for all Data Collectors, otherwise it returns GLOBUS_FAILURE.
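The effect of the require_all flag on the closing message can be summarized in a small decision function. The sketch models only the decision, not the TCP exchange; the enum, the function name reg_decide(), and the representation of HBMLM responses as an array of booleans are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { REG_COMMIT, REG_CANCEL } reg_final_t;  /* hypothetical names */

/*
 * responses[i] is true if the HBMLM answered GLOBUS_SUCCESS to the i-th
 * registration message. Returns which closing message would be sent and
 * stores in *sent how many registration messages were sent before it.
 */
reg_final_t reg_decide(const bool *responses, size_t n,
                       bool require_all, size_t *sent)
{
    assert(responses != NULL && sent != NULL && n > 0);
    size_t ok = 0;
    for (*sent = 0; *sent < n; ) {
        bool success = responses[(*sent)++];
        if (success) {
            ok++;
        } else if (require_all) {
            return REG_CANCEL;   /* first failure cancels everything */
        }
    }
    /* Commit unless every registration was refused. */
    return ok > 0 ? REG_COMMIT : REG_CANCEL;
}
```

With responses of success, failure, success, the sketch cancels after the second message when require_all is set, and commits after all three when it is not.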
int globus_hbm_client_unregister_all(
int cl_pid,
unsigned int cl_unregister_mode)
The parameters are used as follows:
cl_pid:
the PID of the monitored process to be unregistered,
used when formatting the unregistration message that is sent to the HBMLM.
cl_unregister_mode:
should be one of GLOBUS_HBM_MSGTYPE_UNREGISTER_NORMAL
or GLOBUS_HBM_MSGTYPE_UNREGISTER_ABNORMAL, to designate the unregistration
type, and used when formatting the unregistration message for the HBMLM.
If not one of these values, then GLOBUS_HBM_MSGTYPE_UNREGISTER_ABNORMAL
is used when formatting the unregistration message.
The procedure returns an integer return code (GLOBUS_SUCCESS
for successful, GLOBUS_FAILURE otherwise).
The procedure globus_hbm_client_unregister_all()
unregisters the specified process with the HBMLM. It formats the unregistration
message, establishes a TCP connection with the HBMLM, sends the unregistration
message, and waits for a response. On completion it returns the appropriate
return code. globus_hbm_client_unregister_all() re-reads the HBMLM
parameter file in case the information it gets there (HBMLM port number)
changes between registration and unregistration.
The external client registration/unregistration program
globus-hbm-client-register takes flagged parameters that correspond
directly with the parameters of the API routines globus_hbm_client_register()
and globus_hbm_client_unregister_all(). These parameters cannot
be combined (i.e., --reg and --pid cannot be combined into
--regpid). The program maps its parameters directly to the parameters
for the API procedures where appropriate. Extra parameters are ignored.
The parameter flags and fields are:
[--reg | -u | --un | --ua]
:
Exactly one of these must be used, with meaning register
| unregister (normal) | unregister normal | unregister abnormal.
[--pid <pid>] :
<pid> is the pid of the process to monitor.
[--dcspec <dcspec>] :
<dcspec> is an rsl dc spec string passed as
dc_spec_str as described above. Note that this string would usually
be delimited by double quotes (") or single quotes ('), in which case the
used delimiter should not be used in the rsl string.
[--dchost <hostname> | --dchost <host ipnum>]
:
used when --dcspec is not given, to specify the host
name or ip number for the data collector.
[--port <portnum>] :
used when --dcspec is not given, to specify the port
number for the data collector.
[--rptname <process report name>] :
optionally used when --dcspec is not given, to specify
the name to be used when reporting the client process to the data collector.
[--int <interval>] :
optionally used when --dcspec is not given, to specify
the requested interval (in seconds) for reporting the client process to
the data collector.
[--msg <message>] :
optionally used when --dcspec is not given, to specify
the message to be sent when reporting the client process to the data collector.
[--reqall] :
an optional flag that if present results in the require_all
parameter to globus_hbm_client_register() being set to GLOBUS_TRUE,
otherwise it is set to GLOBUS_FALSE.
[--lmconf <filename>] :
an optional flag for specifying as <filename>
the local monitor .conf file passed as lm_conffile.
These flags allow the user to use the HBM via the external
register program with the same flexibility and power as the API, and also
provide a one-to-one correspondence between the parameters of the API registration
procedure and the registration program.
HBMLM Program
The HBMLM program takes the following flagged parameters:
[--conf <filename>] :
an optional flag for specifying as <filename>
the name of the local monitor .conf file. The hbmlm first looks for <filename>.<hostname>,
and if that is not found it looks for <filename>. After the parameters
are read in and validated, and the port number for accepting TCP registration/unregistration
connections set, they are rewritten to <filename>.<hostname>,
which is read by the client API to find the registration port number. If
the --conf flag is not used then the hbmlm will first check in the
current directory (./) for the file globus-hbm-localmonitor.conf.<hostname>,
and next for globus-hbm-localmonitor.conf. If neither of those files
is present it will look for the same file names in GLOBUS_SYSCONF_DIR.
The valid parameter fields are:
LocalMonitorPortNumReg: the (TCP) port to be used
for receiving registrations/unregistrations.
LocalMonitorPortNumRpt: the (UDP) port to be used
for sending heartbeats to data collectors.
DefaultReportInterval: the default interval (in
seconds) at which client processes are to be monitored and heartbeats generated.
CheckpointDirName: the name of the directory in
which the checkpoint file will be saved.
[--log <filename>] :
an optional flag used to change the base used for the
log file name to <filename>. If it is present, then .log.<hostname>
will be concatenated to the end of it. If it is not present, then the hbmlm
will first check for an environment variable (GLOBUS_HBMLM_LOGFILE)
for the name to use for the logfile, and if it is not there it will write
the file in the directory specified as GLOBUS_SYSCONF_DIR/../var,
with the name globus-hbm-localmonitor.log.<hostname>, where <hostname>
is the fully defined name of the host on which it is running.
[--chkpt <filename>]:
name to use as the base when writing the checkpoint file,
will be concatenated with the host name, so that the checkpoint file will
be written to "<filename>.<hostname>". If not present, then
the checkpoint file will be written in directory GLOBUS_SYSCONF_DIR/../var
as globus-hbm-localmonitor.chkpt.<hostname>.
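The default search order for the local monitor .conf file (used when --conf is absent) can be sketched as a function that produces the candidate paths in the order they would be tried. The function name hbmlm_conf_candidate() is hypothetical, and the sysconfdir argument stands for whatever directory GLOBUS_SYSCONF_DIR names on a given installation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define HBMLM_CONF_BASE "globus-hbm-localmonitor.conf"
#define HBMLM_CONF_CANDIDATES 4

/*
 * Write the idx-th candidate path (0 is tried first) into out:
 *   0: ./<base>.<hostname>    1: ./<base>
 *   2: <sysconfdir>/<base>.<hostname>    3: <sysconfdir>/<base>
 * Returns 0 on success, -1 if idx is out of range or out is too small.
 */
int hbmlm_conf_candidate(int idx, const char *hostname,
                         const char *sysconfdir, char *out, size_t outsz)
{
    assert(hostname != NULL && sysconfdir != NULL && out != NULL);
    if (idx < 0 || idx >= HBMLM_CONF_CANDIDATES) return -1;
    const char *dir = (idx < 2) ? "." : sysconfdir;
    int n;
    if (idx % 2 == 0)
        n = snprintf(out, outsz, "%s/%s.%s", dir, HBMLM_CONF_BASE, hostname);
    else
        n = snprintf(out, outsz, "%s/%s", dir, HBMLM_CONF_BASE);
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

A caller would loop over the four indices and stop at the first path that can be opened, which gives host-specific files precedence over shared ones within each directory.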
HBMDC Procedures
Following are the signatures for the core procedures:
int globus_hbm_datacollector_create(
u_short portnum,
int eval_interval_secs,
char *ckpt_filename_restore_str,
char *ckpt_filename_save_str,
FILE *log_file,
void *proc_client_reg_callback_ptr(
globus_hbm_datacollector_handle_t
*dc_handle2_ptr,
globus_hbm_client_callbackdata_t
*client_callbackdata,
void *user_data2_ptr),
void *user_data_ptr,
globus_hbm_datacollector_handle_t
*dc_handle_ptr)
The parameters are used as follows:
portnum:
the number of the port to use for receiving heartbeats.
eval_interval_secs:
the interval in seconds at which the data collector is
to review the client data for late/missing messages, generate corresponding
callbacks, and checkpoint.
ckpt_filename_restore_str:
if not GLOBUS_NULL, then this fully defined file
name is opened, the (old) checkpoint data in the file is read and used
to reconstruct the data collector repository.
ckpt_filename_save_str:
if GLOBUS_NULL, then no automatic checkpointing
is done, otherwise checkpointing is done to the specified filename.
log_file:
if not GLOBUS_NULL, then error messages are written
(appended) to this file, otherwise the messages are not written.
proc_client_reg_callback_ptr(
globus_hbm_datacollector_handle_t
*dc_handle2_ptr, globus_hbm_client_callbackdata_t
*client_callbackdata,
void *user_data2_ptr):
pointer to the procedure the data collector will use for
registration callbacks when a first heartbeat message is received about
each client process (or a message with a different client handling message
to the data collector). When called, the first parameter is a pointer to
the handle to this data collector, the second is a pointer to a struct
with identifying data about the client process and a pointer to the client
message string, and the third is a pointer to user data (the user_data_ptr
parameter to globus_hbm_datacollector_create()). The defined type globus_hbm_client_callbackdata_t
is as follows:
typedef struct globus_hbm_client_callbackdata_s {
u_long cl_host_ipnum;
u_int cl_pid;
char *cl_procname_str;
u_int cl_eventmask;
u_int cl_msgnum;
char *cl_msg_str;
} globus_hbm_client_callbackdata_t;
*proc_client_reg_callback_ptr() is called each time
a client report is received for a process with a new message (i.e., first
report for the client process, or report with a different message than
the previously received report). If specified as GLOBUS_NULL, then
all client messages are maintained in the repository as normal, but message
handling callbacks are not done. The parameter values used when the call
is made are the handle of the memory for this Data Collector (address of
its memory space), the client process data in client_callbackdata
(including the IP number of the host running on, its PID on that host,
the reported process name, the event triggering the callback [in this case
it will always be GLOBUS_HBM_EVENT_REGISTRATION], the number of
the message, and the client message), and the user_data passed in
as a parameter to globus_hbm_datacollector_create().
user_data_ptr:
user data to be passed to the message handler.
dc_handle_ptr:
a field where a pointer to the handle for the (new) Data
Collector can be saved.
The procedure returns an integer return code, set as
appropriate to GLOBUS_SUCCESS or GLOBUS_FAILURE. If successful,
then the address of the (new) memory space for the created data collector
is returned in dc_handle_ptr, otherwise it is set to GLOBUS_NULL.
The procedure globus_hbm_datacollector_create()
is used to create a new instance of a data collector within the program/process
space. It creates a new handle to a memory space for the process, initializes
it, and sets up the (UDP) port for listening for heartbeats.
globus_hbm_datacollector_create() first opens the
port designated by the parameter *port_fd_ptr for receiving heartbeats.
(If this is set to PORT_ANY, then *port_fd_ptr is set on successful
completion based on the port that was actually obtained.) If a checkpoint
file for restoring from was specified then the HBMDC opens it (if it exists)
and reads in the data from it, reconstructing the internal client host
and process tables. As the entry for each HBC is loaded, it undergoes the
same processing as if a (first) report message had been received for it,
i.e., a call is made to proc_client_reg_callback_ptr() for the process.
After the whole checkpoint file has been processed, the return code is
set and the handle is returned. If an error occurred that prevents the
HBMDC function from being performed properly then the memory space for
the invocation is de-allocated, a GLOBUS_NULL pointer is returned,
and the returned value is GLOBUS_FAILURE rather than GLOBUS_SUCCESS.
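The registration callback described above can be sketched as follows. This is only an illustration under stated assumptions: the handle type is a placeholder (its contents are not described in this document), u_long and u_int are written out as plain unsigned types, and my_reg_state_t and my_client_reg_callback are hypothetical names for the user data and the user-supplied procedure.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Placeholder for the real handle type; its contents are an assumption. */
typedef struct globus_hbm_datacollector_handle_s {
    int unused;
} globus_hbm_datacollector_handle_t;

/* Same fields as the struct above, with u_long/u_int written out. */
typedef struct globus_hbm_client_callbackdata_s {
    unsigned long cl_host_ipnum;
    unsigned int  cl_pid;
    char         *cl_procname_str;
    unsigned int  cl_eventmask;
    unsigned int  cl_msgnum;
    char         *cl_msg_str;
} globus_hbm_client_callbackdata_t;

/* Hypothetical user data: counts callbacks and keeps the last summary. */
typedef struct {
    unsigned int registrations_seen;
    char         last_summary[128];
} my_reg_state_t;

/* Hypothetical registration callback with the three parameters described
 * above: data collector handle, client data, and user data. */
void my_client_reg_callback(
    globus_hbm_datacollector_handle_t *dc_handle2_ptr,
    globus_hbm_client_callbackdata_t  *client_callbackdata,
    void                              *user_data2_ptr)
{
    my_reg_state_t *state = (my_reg_state_t *)user_data2_ptr;

    (void)dc_handle2_ptr;               /* not needed in this sketch */
    assert(client_callbackdata != NULL && state != NULL);

    state->registrations_seen++;
    snprintf(state->last_summary, sizeof state->last_summary,
             "pid=%u name=%s msg#%u=%s",
             client_callbackdata->cl_pid,
             client_callbackdata->cl_procname_str,
             client_callbackdata->cl_msgnum,
             client_callbackdata->cl_msg_str);
}
```

A procedure of this shape would be passed as proc_client_reg_callback_ptr, with a pointer to the state struct passed as user_data_ptr.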
int globus_hbm_datacollector_set_clientevent_callback(
hbmdc_handle_t *dc_handle_ptr,
globus_hbm_client_callbackdata_t
*client_callbackdata_ptr,
unsigned int cl_hb_time_late_secs,
unsigned int cl_hb_time_missing_secs,
void
*user_data_ptr,
void (*proc_event_callback_ptr)(
globus_hbm_datacollector_handle_t
*dc_handle2_ptr,
globus_hbm_client_callbackdata_t
*client_callbackdata,
void *user_data2_ptr))
The parameters are used as follows:
dc_handle_ptr:
This is a pointer to the handle for the data collector
instance, as used when the call was made to globus_hbm_datacollector_create().
client_callbackdata_ptr:
This field is used to identify the client process for
which the callback is being set and the events for which it is being set.
The subfields which are used are cl_host_ipnum, cl_pid, cl_procname_str,
and cl_eventmask. The same struct is used as is passed to *proc_client_reg_callback_ptr()
to allow it to pass back the same pointer when setting callbacks. The cl_eventmask
subfield is used to specify the events (changes of client status to specific
values) under which the specified event handler is to be invoked. These
events can be specified as the logical OR of the defined events GLOBUS_HBM_EVENT_*:
GLOBUS_HBM_EVENT_REGISTRATION
GLOBUS_HBM_EVENT_ACTIVE_AFTER_HEARTBEAT_LATE_MISSING
GLOBUS_HBM_EVENT_ACTIVE_AFTER_SHUTDOWN
GLOBUS_HBM_EVENT_HEARTBEAT_LATE
GLOBUS_HBM_EVENT_HEARTBEAT_MISSING
GLOBUS_HBM_EVENT_SHUTDOWN_NORMAL
GLOBUS_HBM_EVENT_SHUTDOWN_ABNORMAL
GLOBUS_HBM_EVENT_SHUTDOWN_DIED
GLOBUS_HBM_EVENT_SHUTDOWN_NO_HEARTBEAT.
cl_hb_time_late_secs:
This field is ignored unless the event GLOBUS_HBM_EVENT_HEARTBEAT_LATE
is set, in which case it specifies the number of seconds which a heartbeat
must be late to generate that event.
cl_hb_time_missing_secs:
This field is ignored unless the event GLOBUS_HBM_EVENT_HEARTBEAT_MISSING
is set, in which case it specifies the number of seconds which a heartbeat
must be late to generate that event. In general, it is expected that the event
GLOBUS_HBM_EVENT_HEARTBEAT_MISSING will be considered more serious
than GLOBUS_HBM_EVENT_HEARTBEAT_LATE (and that cl_hb_time_missing_secs
will be greater than cl_hb_time_late_secs), but the interpretation
is up to the user.
user_data_ptr:
This (possibly GLOBUS_NULL) pointer is used for
specifying user data in the callback.
proc_event_handler_ptr():
this is a pointer to the procedure for handling the specified
events. The parameter types are the same as for *proc_client_reg_callback_ptr().
The only difference is that the field client_callbackdata.cl_eventmask
should be set to the bitwise OR of all of the events for which the callback
function should be used, and the fields client_callbackdata.cl_msgnum
and client_callbackdata.cl_msg_str are ignored.
*proc_event_handler_ptr() is called when an event
specified for the identified client process in a previous call of proc_client_reg_callback()
occurs. The parameters are those required to identify the client process
and the event which occurred. If this field is GLOBUS_NULL then
any callback functions previously set for the specified events are cancelled.
The procedure returns an integer to designate successful
(GLOBUS_SUCCESS) or unsuccessful (GLOBUS_FAILURE) completion.
The procedure globus_hbm_datacollector_set_clientevent_callback()
sets, clears, or changes the callback routines in effect for when the specified
events occur for the designated client process.
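The event mask and the two lateness thresholds described above can be illustrated with a small sketch. The bit values below are hypothetical (the real values are not given in this document), and overdue_events() is an illustrative helper, not part of the HBM API. It also assumes that a heartbeat overdue by at least the threshold raises the event, and that both events may be raised together.

```c
#include <assert.h>

/* Hypothetical bit values; the real header may define them differently. */
#define GLOBUS_HBM_EVENT_HEARTBEAT_LATE     0x08u
#define GLOBUS_HBM_EVENT_HEARTBEAT_MISSING  0x10u

/*
 * Return the subset of the requested events (eventmask) that a heartbeat
 * overdue by secs_overdue seconds would raise, given the thresholds
 * cl_hb_time_late_secs and cl_hb_time_missing_secs.
 */
unsigned int overdue_events(unsigned int eventmask,
                            unsigned int secs_overdue,
                            unsigned int cl_hb_time_late_secs,
                            unsigned int cl_hb_time_missing_secs)
{
    unsigned int raised = 0;

    /* A threshold is ignored unless its event is set in the mask. */
    if ((eventmask & GLOBUS_HBM_EVENT_HEARTBEAT_LATE) &&
        secs_overdue >= cl_hb_time_late_secs)
        raised |= GLOBUS_HBM_EVENT_HEARTBEAT_LATE;

    if ((eventmask & GLOBUS_HBM_EVENT_HEARTBEAT_MISSING) &&
        secs_overdue >= cl_hb_time_missing_secs)
        raised |= GLOBUS_HBM_EVENT_HEARTBEAT_MISSING;

    return raised;
}
```

With a late threshold of 10 seconds and a missing threshold of 30 seconds, a heartbeat that is 10 seconds overdue raises only the late event in this sketch.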
int globus_hbm_datacollector_user_checkpoint(
hbmdc_handle *dc_handle_ptr,
int
*ckpt_file_fd)
The parameters are used as follows:
dc_handle_ptr:
pointer to the handle for the data collector instance,
as used when the call was made to globus_hbm_datacollector_create().
ckpt_file_fd:
integer file descriptor number for the file to which
the checkpoint data is to be written. If -1 then the checkpoint will be
done to the filename specified in the call to globus_hbm_datacollector_create().
If no file was specified in either call then the checkpoint is not done
(return code is 1).
The procedure returns an integer to designate successful
(GLOBUS_SUCCESS) or unsuccessful (GLOBUS_FAILURE) completion.
The procedure globus_hbm_datacollector_user_checkpoint()
checkpoints the monitored host and client process data. If a file descriptor
value >= 0 is designated in the call to globus_hbm_datacollector_user_checkpoint()
then the checkpoint is done using that file descriptor value. If the file
descriptor value is -1, but a checkpoint filename was designated in the
call to globus_hbm_datacollector_create(), then the checkpoint is
done to that file. If the file descriptor value is -1 and no checkpoint
file was designated in the call to globus_hbm_datacollector_create(),
then no checkpoint is performed. If a checkpoint is to be performed to
file <filename>, then the file is initially written to <filename>.wk,
and when complete, the name of the file is changed to <filename>.
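The write-then-rename step described above can be sketched as follows. This is a minimal illustration, assuming the checkpoint data is already available as one string; write_checkpoint_atomically() is a hypothetical helper name, not the HBM implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write data to "<filename>.wk", then rename it to "<filename>".
 * Returns 0 on success, -1 on failure. A reader of <filename> therefore
 * sees either the previous checkpoint or the complete new one.
 */
int write_checkpoint_atomically(const char *filename, const char *data)
{
    char  work_name[1024];
    FILE *fp;
    int   n;

    assert(filename != NULL && data != NULL);

    n = snprintf(work_name, sizeof work_name, "%s.wk", filename);
    if (n < 0 || (size_t)n >= sizeof work_name)
        return -1;                      /* name would be truncated */

    fp = fopen(work_name, "wb");
    if (fp == NULL)
        return -1;
    if (fputs(data, fp) == EOF) {
        fclose(fp);
        remove(work_name);
        return -1;
    }
    if (fclose(fp) != 0) {
        remove(work_name);
        return -1;
    }

    /* Replace the designated checkpoint file with the finished work file. */
    if (rename(work_name, filename) != 0) {
        remove(work_name);
        return -1;
    }
    return 0;
}
```

On POSIX systems rename() replaces an existing target, which is what makes the work-file approach safe against a crash in the middle of a checkpoint.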
int globus_hbm_datacollector_clear_unregistered_clients(
globus_hbm_datacollector_handle_t
*dc_handle_ptr)
The parameters are used as follows:
dc_handle_ptr:
pointer to the handle for the data collector instance,
as used when the call was made to globus_hbm_datacollector_create().
The procedure returns an integer to designate successful
(GLOBUS_SUCCESS) or unsuccessful (GLOBUS_FAILURE) completion.
The procedure globus_hbm_datacollector_clear_unregistered_clients()
removes all monitored client processes with a status of unregistered from
the monitored client table. (Hosts for which no monitored clients remain
are also removed.)
int globus_hbm_datacollector_destroy(
globus_hbm_datacollector_handle_t
*dc_handle_ptr,
int
force_mode,
int
*num_live_clients_ptr)
The parameters are used as follows:
dc_handle_ptr:
pointer to the handle for the data collector instance,
as used when the call was made to globus_hbm_datacollector_create().
force_mode:
If GLOBUS_HBM_DATACOLLECTOR_FORCE_DESTROY_NO, then
the HBMDC instance will terminate and free its associated handle only if
there are no remaining monitored clients that are not in unregistered status.
If GLOBUS_HBM_DATACOLLECTOR_FORCE_DESTROY_YES, then
the HBMDC instance will terminate and free its associated handle even
if there remain monitored clients that are not in unregistered status.
num_live_clients_ptr:
This field is used to return the number of remaining
clients that were not in an unregistered status.
The procedure returns an integer to designate successful
(GLOBUS_SUCCESS) or unsuccessful (GLOBUS_FAILURE) completion.
The procedure globus_hbm_datacollector_destroy()
checks the number of monitored client processes not in an unregistered
status, and sets *num_live_clients_ptr to that value. Then, if appropriate
based on the force_mode and number of monitored clients not in an
unregistered status, it will close the heartbeat port and free all memory
associated with the data collector instance. It returns an integer value,
GLOBUS_SUCCESS if the Data Collector instance was destroyed, GLOBUS_FAILURE
if it was not.
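The decision made by globus_hbm_datacollector_destroy() can be summarized in a short sketch. The numeric values of the constants are assumptions, and destroy_decision() is a hypothetical helper that models only the decision, not the closing of the port or the freeing of memory.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values; the real headers may define them differently. */
#define GLOBUS_SUCCESS   0
#define GLOBUS_FAILURE (-1)
#define GLOBUS_HBM_DATACOLLECTOR_FORCE_DESTROY_NO   0
#define GLOBUS_HBM_DATACOLLECTOR_FORCE_DESTROY_YES  1

/*
 * live_clients is the number of monitored clients not in an unregistered
 * status. The count is always reported through num_live_clients_ptr;
 * the instance is destroyed only if no live clients remain or the caller
 * forces destruction.
 */
int destroy_decision(int live_clients, int force_mode,
                     int *num_live_clients_ptr)
{
    assert(num_live_clients_ptr != NULL && live_clients >= 0);

    *num_live_clients_ptr = live_clients;

    if (live_clients > 0 &&
        force_mode != GLOBUS_HBM_DATACOLLECTOR_FORCE_DESTROY_YES)
        return GLOBUS_FAILURE;          /* instance is left running */

    /* The real procedure would close the heartbeat port and free the
     * memory associated with the instance here. */
    return GLOBUS_SUCCESS;
}
```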
The Globus data collector program globus_hbm_datacollector
takes flagged parameters that specify the names to use for the parameter (.conf),
log (.log), and checkpoint (.chkpt) files:
[--conf <filename>] :
an optional flag for specifying as <filename>
the name of the data collector .conf file. The valid parameters for the
datacollector are:
hbmlm first looks for <filename>.<hostname>,
and if that is not found it looks for <filename>. After the parameters
are read in and validated, and the port number for accepting TCP registration/unregistration
connections set, they are rewritten to <filename>.<hostname>,
which is read by the client API to find the registration port number. If
the --conf flag is not used then the hbmlm will first check in the
current directory (./) for the file globus-hbm-datacollector.conf.<hostname>,
and next for globus-hbm-datacollector.conf. If neither of those
files is present it will look for the same file names in GLOBUS_SYSCONF_DIR.
[--log <filename>] :
an optional flag used to change the base used for the
log file name to <filename>. If it is present, then .log.<hostname>
will be concatenated to the end of it. If it is not present, then the hbmlm
will write the file in the directory specified as GLOBUS_SYSCONF_DIR/../var,
with the name globus-hbm-datacollector.log.<hostname>, where
<hostname> is the fully defined name of the host on which it
is running.
[--chkpt <filename>] :
an optional flag for specifying as <filename>
the name to use as the base when writing the checkpoint file. It will be concatenated
with the host name, so that the checkpoint file will be written to "<filename>.<hostname>".
If not present, then the checkpoint file will be written in directory GLOBUS_SYSCONF_DIR/../var
as globus-hbm-datacollector.chkpt.<hostname>.
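The per-host file names described for the --log and --chkpt flags can be built as in the sketch below. build_host_file_name() is a hypothetical helper; the host name is passed in rather than looked up, and the default locations under GLOBUS_SYSCONF_DIR are not modeled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<base><infix>.<hostname>" into out.
 * Use infix ".log" for the log file and "" for the checkpoint file.
 * Returns 0 on success, -1 if the name does not fit in out.
 */
int build_host_file_name(char *out, size_t out_size, const char *base,
                         const char *infix, const char *hostname)
{
    int n;

    assert(out != NULL && base != NULL && infix != NULL && hostname != NULL);

    n = snprintf(out, out_size, "%s%s.%s", base, infix, hostname);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

For example, a base of "dc" with infix ".log" on host "host.example.org" gives "dc.log.host.example.org".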
Summary of Protocol Messages For HBM
There are a number of messages exchanged between the HBC
processes and their HBMLM, and between the HBMLMs and the appropriate HBMDCs.
The fields in these messages are of the following types: character (1 byte),
unsigned integers (32-bit in network format), strings (variable-length
null-terminated), UTC time (Universal Time Code as 32-bit unsigned integer
in network format).
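The field types listed above can be encoded with a few small helpers. This is a sketch assuming POSIX htonl()/ntohl(); put_u32(), get_u32(), and put_str() are hypothetical names, and the caller is assumed to provide a large enough buffer.

```c
#include <arpa/inet.h>   /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit unsigned integer in network byte order; returns bytes written. */
size_t put_u32(unsigned char *buf, uint32_t value)
{
    uint32_t net = htonl(value);

    assert(buf != NULL);
    memcpy(buf, &net, sizeof net);
    return sizeof net;
}

/* Read a 32-bit unsigned integer stored in network byte order. */
uint32_t get_u32(const unsigned char *buf)
{
    uint32_t net;

    assert(buf != NULL);
    memcpy(&net, buf, sizeof net);
    return ntohl(net);
}

/* Store a variable-length string including its terminating null byte. */
size_t put_str(unsigned char *buf, const char *s)
{
    size_t len;

    assert(buf != NULL && s != NULL);
    len = strlen(s) + 1;
    memcpy(buf, s, len);
    return len;
}
```

A UTC time field would use the same 32-bit helper as an unsigned integer.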
HBC Registration
For each HBC registration the HBMLM verifies that the IP
Number for the registering HBC process (as provided as the source address
for the message) is a valid IP Number for the host on which the HBMLM is
running. The HBC processes register with the HBMLM for their host, providing
the following information:
Field | Contents
RegCmsgLength | Length in bytes of the registration message. Format: Unsigned integer.
RegCregCode | Registration/unregistration Code. Format: Unsigned integer (4 byte). Values: GLOBUS_HBM_MSGTYPE_REGISTER.
RegCprocessPID | PID of the HBC process. Format: Unsigned integer.
RegCprocessName | Name of the HBC process as returned by ps. Format: String.
RegCreportName | Name the HBMLM is to use when reporting this HBC to this HBMDC. Format: String.
RegCDChbInterval | Requested interval for generating heartbeats (in seconds). Format: Unsigned integer.
RegCDCaddr | Address of the HBMDC to which the HBMLM is to report the status of the HBC process (IP Number and port). Format: sockaddr_in.
RegCDCmsg | User message from the client to the Data Collector, can be used for designating callback events and responses. Format: String.
After registering the HBC process the HBMLM sends a simple
acknowledgement with the following information:
Field | Contents
RegAckCretCd | Registration return code (GLOBUS_SUCCESS if successful, GLOBUS_FAILURE otherwise). Format: Unsigned integer.
When the last registration message has been sent and acknowledged,
the HBC sends a message to tell the HBMLM to either commit the registration
or cancel it (based on the value of the require_all flag when hbm_client_register()
was called). The format of that message is:
Field | Contents
RegCMsgLength | Length in bytes of the commit/cancel message. Format: Unsigned integer.
RegCommitCd | Registration commit code. Format: Unsigned integer. Values: GLOBUS_HBM_MSGTYPE_REGISTER_COMMIT, GLOBUS_HBM_MSGTYPE_REGISTER_CANCEL.
After processing the Commit/Cancel message and checkpointing
(for Commit only) the HBMLM sends a simple acknowledgement with the following
information and disconnects:
Field | Contents
RegAckCretCd | Registration return code (GLOBUS_SUCCESS if successful, GLOBUS_FAILURE otherwise). Format: Unsigned integer.
After receiving this last acknowledgement the HBC disconnects
as well.
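The choice between commit and cancel at the end of the registration exchange can be sketched as below. The exact rule is not spelled out in this section, so the sketch assumes that require_all means "cancel unless every registration was acknowledged with GLOBUS_SUCCESS". The numeric values and the helper name choose_commit_code() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values; the real headers may define them differently. */
#define GLOBUS_SUCCESS   0
#define GLOBUS_FAILURE (-1)
#define GLOBUS_HBM_MSGTYPE_REGISTER_COMMIT  10u
#define GLOBUS_HBM_MSGTYPE_REGISTER_CANCEL  11u

/*
 * ack_codes holds the RegAckCretCd value received for each registration
 * message. Returns the code to send in the RegCommitCd field.
 */
unsigned int choose_commit_code(const int *ack_codes, size_t n_acks,
                                int require_all)
{
    size_t i;
    size_t n_ok = 0;

    assert(ack_codes != NULL || n_acks == 0);

    for (i = 0; i < n_acks; i++)
        if (ack_codes[i] == GLOBUS_SUCCESS)
            n_ok++;

    if (n_ok == 0)
        return GLOBUS_HBM_MSGTYPE_REGISTER_CANCEL;  /* nothing to commit */
    if (n_ok < n_acks && require_all)
        return GLOBUS_HBM_MSGTYPE_REGISTER_CANCEL;  /* partial, all required */
    return GLOBUS_HBM_MSGTYPE_REGISTER_COMMIT;
}
```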
HBC Unregistration
For each HBC unregistration the HBMLM verifies that the IP
Number for the unregistering HBC process (as provided as the source address
for the message) is a valid IP Number for the host on which the HBMLM is
running. To unregister the HBC process sends an unregister message with
the following information to the HBMLM with which it is registered:
Field | Contents
UnregCmsgLength | Length in bytes of the unregistration message. Format: Unsigned integer.
UnregCregCode | Registration/unregistration Code. Format: Unsigned integer. Values: GLOBUS_HBM_MSGTYPE_UNREGISTER_NORMAL, GLOBUS_HBM_MSGTYPE_UNREGISTER_ABNORMAL.
UnregCprocessPID | PID of the HBC process. Format: Unsigned integer.
UnregCprocessName | Name of the HBC process. Format: String.
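As an illustration of the table above, the unregistration message could be assembled as in the following sketch. It assumes that the length field counts the whole message including itself and that the fields are packed in table order with no padding; neither is stated in this document. The message type value and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical value; the real header may define it differently. */
#define GLOBUS_HBM_MSGTYPE_UNREGISTER_NORMAL 3u

/* Store a 32-bit unsigned integer in network (big-endian) byte order. */
static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/*
 * Build UnregCmsgLength, UnregCregCode, UnregCprocessPID, and
 * UnregCprocessName into buf. Returns the message length in bytes,
 * or 0 if buf is too small.
 */
size_t build_unregister_msg(unsigned char *buf, size_t buf_size,
                            uint32_t reg_code, uint32_t pid,
                            const char *procname)
{
    size_t name_len;
    size_t total;

    assert(buf != NULL && procname != NULL);

    name_len = strlen(procname) + 1;    /* include the null terminator */
    total = 12 + name_len;              /* three 32-bit fields + string */
    if (total > buf_size)
        return 0;

    put_be32(buf, (uint32_t)total);
    put_be32(buf + 4, reg_code);
    put_be32(buf + 8, pid);
    memcpy(buf + 12, procname, name_len);
    return total;
}
```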
After unregistering the HBC process the HBMLM checkpoints
and sends a simple acknowledgement with the following information:
Field | Contents
RegAckCretCd | Registration return code (GLOBUS_SUCCESS if successful, GLOBUS_FAILURE otherwise). Format: Unsigned integer.
HBMLM Reports to HBMDC
After each monitoring cycle (forking a child that executes
"ps" and returns the output for review via a pipe), each HBMLM reports
the following information to the appropriate HBMDC(s):
Field | Contents
HBMLM data: |
RptLMmsgLength | Length in bytes of the report message. Format: Unsigned integer.
RptLMhostIPNum | Primary IP Number for the host on which the HBMLM process is executing (the one used for communications when reporting). Format: Unsigned integer.
RptLMportNum | Number of the (UDP) port used by the HBMLM process. Format: Unsigned integer.
HBC data: |
RptCprocessPID | PID of HBC process. Format: Unsigned integer.
RptCprocessName | Report name of the HBC process. Format: String.
RptCstatus | Status of the HBC process as kept by the HBMLM. Format: Unsigned integer. Values: GLOBUS_HBM_PROCSTATUS_ACTIVE (registered, "alive", and consuming CPU time); GLOBUS_HBM_PROCSTATUS_BLOCKED (registered and "alive", but blocked, i.e., not consuming CPU time); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NORMAL (normal unregistration message received for this client); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABNORMAL (abnormal unregistration message received for this client); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABEND (no unregister message received, but process no longer alive on the host).
RptCregistrationTime | Time at the HBMLM host when the HBC process was registered. Format: UTC time.
RptCrptInterval | Interval in seconds at which the HBMLM generates heartbeats for this client to this HBMDC. Format: Unsigned integer.
RptCrptNum | Sequence number of this heartbeat (the first heartbeat is number 1). Format: Unsigned integer.
RptCblockedTime | Time of the end of the latest review/report period in which the HBC process consumed CPU time. Format: UTC time.
RptCcpuTime | CPU time consumed by the process (host-specific units) as reported by ps. Format: Unsigned integer.
RptCunregisterTime | Time at which the HBMLM logged the HBC process as unregistered due to either the receipt of an unregistration message or because the process no longer exists on the host. Format: UTC time.
RptCnumUnregisterMsg | Number of times this HBC process was reported as unregistered. Format: Unsigned integer.
RptCDCmsgNum | The sequence number of the message for the client (incremented only when the message changes). Format: Unsigned integer.
RptCDCmsg | Message for the client. Format: String.
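Because each report carries a sequence number (RptCrptNum, starting at 1), a receiver can estimate how many reports were lost in transit. The sketch below is only an illustration of that idea; missed_reports() is a hypothetical helper, and treating a non-increasing sequence number as "no loss" is an assumption.

```c
#include <assert.h>

/*
 * last_seq_seen is the RptCrptNum of the last report received for a
 * client (0 if none has been received yet); new_seq is the RptCrptNum
 * of the report just received. Returns the number of reports skipped.
 */
unsigned int missed_reports(unsigned int last_seq_seen, unsigned int new_seq)
{
    /* Duplicate, reordered, or restarted numbering: count nothing. */
    if (new_seq <= last_seq_seen)
        return 0;
    return new_seq - last_seq_seen - 1;
}
```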
Checkpoint file formats
Both the HBMLMs and the HBMDC(s) periodically checkpoint
to a file. Each HBMLM and HBMDC writes to a work checkpoint file, then
renames it to the designated checkpoint filename. Records in the checkpoint
files are separated by carriage return/line feeds, and fields are separated
by semi-colons (except for the literal at the beginning of each record
that gives the record type). The fields in the checkpoint files are of
the following types: IP Number (as a string in "dot" notation as generated
by inet_ntoa (typically no leading zeros, e.g., 128.9.64.205 rather than
128.009.064.205)), string (variable-length, terminated by the semi-colon
field terminator or carriage-return/line-feed record terminator), unsigned
integers (as characters), and UTC time (Universal Time Code displayed as
YYYY/MM/DD hh:mm:ss GMT).
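The record layout described above can be sketched with two small helpers: one that renders a time as YYYY/MM/DD hh:mm:ss GMT, and one that joins a record-type literal and its fields. Both function names are hypothetical, and the sketch assumes the first field follows the literal directly, since the literal is described as the exception to the semi-colon separator.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Render t as "YYYY/MM/DD hh:mm:ss GMT". Returns 0 on success, -1 on failure. */
int format_ckpt_time(time_t t, char *out, size_t out_size)
{
    struct tm *tm_utc = gmtime(&t);     /* note: gmtime is not thread-safe */

    if (tm_utc == NULL)
        return -1;
    if (strftime(out, out_size, "%Y/%m/%d %H:%M:%S GMT", tm_utc) == 0)
        return -1;
    return 0;
}

/*
 * Write "<literal><field0>;<field1>;...<CR><LF>" into out.
 * Returns 0 on success, -1 if the record does not fit.
 */
int format_ckpt_record(char *out, size_t out_size, const char *literal,
                       const char *const *fields, size_t n_fields)
{
    size_t used;
    size_t i;
    int    n;

    assert(out != NULL && literal != NULL);

    n = snprintf(out, out_size, "%s", literal);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    used = (size_t)n;

    for (i = 0; i < n_fields; i++) {
        n = snprintf(out + used, out_size - used, "%s%s",
                     i == 0 ? "" : ";", fields[i]);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, out_size - used, "\r\n");
    if (n < 0 || (size_t)n >= out_size - used)
        return -1;
    return 0;
}
```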
HBMLM checkpoint file
The HBMLM checkpoint file has the following data:
Field | Contents
HBMLM record: |
"LM Data:" | String literal to designate the record type.
LMhostIPNum | Primary IP Number for the host on which the HBMLM process is executing (the one used for communications when reporting). Format: IP Number.
LMhostName | Fully defined name of the host on which the HBMLM process is executing. Format: String.
LMportNumReg | Number of the TCP port used by the HBMLM process for registrations/unregistrations. Format: Unsigned integer.
LMportNumRpt | Number of the TCP port used by the HBMLM process for sending heartbeats and receiving messages from Data Collectors (receiving not yet implemented). Format: Unsigned integer.
LMreportInterval | Default interval in seconds at which the HBMLM monitors and reports on registered HBC processes. Format: Unsigned integer.
LMClientsCt | Number of HBC processes that the HBMLM is monitoring (and which are included in the checkpoint file). Format: Unsigned integer.
LMDCsCt | Number of DC entries total for all clients (and which are included in the checkpoint file). Format: Unsigned integer.
LMcheckpointTime | Time at which the checkpoint was done. Format: UTC time.
HBC record: |
"CL Data:" | String literal to designate the record type.
CprocessPID | PID of HBC process. Source: HBC registration message. Format: Unsigned integer.
CprocessName | Name of the HBC process. Source: HBC registration message/ps. Format: String.
CprocessStatus | Process status as determined at the last evaluation. Source: derived from ps. Format: Unsigned integer. Values: GLOBUS_HBM_PROCSTATUS_ACTIVE (registered, "alive", and consuming CPU time); GLOBUS_HBM_PROCSTATUS_BLOCKED (registered and "alive", but blocked, i.e., not consuming CPU time); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NORMAL (normal unregistration message received for this client); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABNORMAL (abnormal unregistration message received for this client); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABEND (no unregister message received, but process no longer alive on the host).
CblockedTime | Time of the end of the latest review/report period in which the HBC process consumed CPU time. Format: UTC time.
CcpuTime | CPU time consumed by the process (host-specific units) as reported by ps. Format: Unsigned integer.
CDCsCt | Number of Data Collector records/entries for this client. Format: Unsigned integer.
HBMDC record: |
"DC Data:" | String literal to designate the record type.
DChostIPnum | IP Number for the HBMDC to which the HBMLM is to report the status of the HBC process. Source: HBC registration message. Format: IP Number.
DCportNum | Port number for the HBMDC to which the HBMLM is to report the status of the HBC process. Source: HBC registration message. Format: Unsigned integer.
DCprocessNameRpt | The name used for this process when generating heartbeats to this Data Collector. Format: String.
DCregistrationTime | Time at the HBMLM host when the HBC process was registered. Format: UTC time.
DCrptInterval | Interval in seconds at which heartbeats for this Client are to be sent to this Data Collector. Format: Unsigned integer.
DCrptNum | Sequence number of the last heartbeat sent to this Data Collector for this Client. Format: Unsigned integer.
DCrptTimeLast | Time that the last heartbeat was sent to this Data Collector for this Client. Format: UTC time.
DCrptTimeNext | Time that the next heartbeat is to be sent to this Data Collector for this Client. Format: UTC time.
DCunregisterStatus | Unregister status of the HBC process with respect to this HBMDC. Format: Unsigned integer. Values: GLOBUS_HBM_UNREGISTERSTATUS_ACTIVE (signifies that the process is not unregistered); GLOBUS_HBM_UNREGISTERSTATUS_NORMAL (normal unregistration message received for this client); GLOBUS_HBM_UNREGISTERSTATUS_ABNORMAL (abnormal unregistration message received for this client); GLOBUS_HBM_UNREGISTERSTATUS_ABEND (no unregister message received, but process no longer alive on the host).
DCunregisterTime | Time at which the HBMLM logged the HBC process as unregistered due to either the receipt of an unregistration message or because the process no longer exists on the host. Format: UTC time.
DCnumUnregisterMsg | Number of times this HBC process has been reported as unregistered to this HBMDC. Format: Unsigned integer.
DCmsgNum | The number of the message from the client to the Data Collector. Format: Unsigned integer.
DCmsg | The message from the client to the Data Collector. Format: String.
For Globus processes, the values of the hostName, processName,
and processPID fields define a unique key by which monitored processes
can be identified.
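The unique key mentioned above (hostName, processName, processPID) can be expressed as a small comparison routine. The struct and function names below are hypothetical and only illustrate how the three values together identify a monitored process.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical key type built from the three identifying fields. */
typedef struct {
    const char  *host_name;
    const char  *process_name;
    unsigned int process_pid;
} hbm_proc_key_t;

/* Return 1 if both keys identify the same monitored process, else 0. */
int hbm_proc_key_equal(const hbm_proc_key_t *a, const hbm_proc_key_t *b)
{
    assert(a != NULL && b != NULL);

    return a->process_pid == b->process_pid &&
           strcmp(a->host_name, b->host_name) == 0 &&
           strcmp(a->process_name, b->process_name) == 0;
}
```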
HBMDC checkpoint file
The HBMDC checkpoint file has the following data:
Field | Contents
HBMDC record: |
DChostIPNum | IP Number for the host on which the HBMDC process is executing (the one used for communications). Format: IP Number.
DChostName | Name of the host on which the HBMDC process is executing. Format: String.
DCportNum | Number of the (UDP) port used by the HBMDC process. Format: Unsigned integer.
DCcheckpointTime | Time at which the checkpoint was done. Format: UTC time.
DCnumHBMLM | Number of HBMLM records. Format: Unsigned integer.
DCnumHBC | Number of HBC process records. Format: Unsigned integer.
HBMLM record(s): |
LMhostIPNum | Primary IP Number for the host on which the HBMLM process is executing. Format: IP Number.
LMportNum | Number of the (UDP) port used by the HBMLM process. Format: Unsigned integer.
LMlastReportRcvTime | Time at which the last report was received from this HBMLM. Format: UTC time.
LMnumClients | Number of HBC processes that the HBMLM is reporting to this HBMDC. Format: Unsigned integer.
HBC record(s): |
ChostIPNum | Primary IP Number for the host on which the HBC process is executing. Format: IP Number.
ChostName | Name of the host on which the HBC process is executing. Format: String.
CprocessName | Name of the HBC process. Format: String.
CprocessPID | PID of HBC process. Format: Unsigned integer.
CprocessStatus | Process status as determined at the last evaluation. Format: Unsigned integer. Values: GLOBUS_HBM_PROCSTATUS_ACTIVE (registered, "alive", and consuming CPU time); GLOBUS_HBM_PROCSTATUS_BLOCKED (registered and "alive", but blocked, i.e., not consuming CPU time); GLOBUS_HBM_PROCSTATUS_OVERDUE (client not unregistered, but heartbeat reports from HBMLM are late/missing); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NORMAL (normal unregistration message received for this client); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABNORMAL (abnormal unregistration message received for this client); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_ABEND (no unregister message received, but process no longer alive on the host); GLOBUS_HBM_PROCSTATUS_UNREGISTERED_NO_RPT (client considered unregistered because heartbeat reports are late/missing).
CregistrationTime | Time at the HBMLM host when the HBC process was registered. Format: UTC time.
CrptInterval | Interval in seconds at which heartbeats are to be generated by the HBMLM. Format: Unsigned integer.
CblockedTime | Time of the end of the latest review/report in which the HBC process consumed CPU time. Format: UTC time.
CcpuTime | CPU time consumed by the process (host-specific units) as reported by ps. Format: Unsigned integer.
ClastRptSeqNum | Sequence Number of the last heartbeat sent for this process that was received. Format: Unsigned integer.
ClastRptRcvTime | Time at which the last heartbeat sent for this process that was received arrived. Format: UTC time.
CunregisterTime | Time at which the HBMLM logged the HBC process as unregistered due to either the receipt of an unregistration message or because the process no longer exists on the host. Format: UTC time.
CmsgNum | Number of the client message. Format: Unsigned integer.
Cmsg | Client message. Format: Null-terminated string.
Outstanding Issues
-
In future releases the HBMDC may wish to regulate the incoming
HBMLM report message traffic volume and/or its workload by dynamically
adjusting the interval of individual HBMLM processes. Protocols and messages
to allow this will be investigated.