Excerpt from DRAFT POSIX Std. 1003.25
Disclaimer: This section from the Draft POSIX Standard is provided for reference only. This is the current revision of the draft as of September 20, 2001. This is not guaranteed to be the latest revision. If you have any comments or questions, please send e-mail to lkessler@users.sourceforge.net.
Annex B (informative): Rationale and Notes
B.20.2 Logging of Kernel Events
B.20.3 Event Log Structure; Persistence of Records
B.20.4 Remote Logs; Portable Logs
B.20.5 Integrity of Event Data
B.20.13 Open an Event Log for Read Access
B.20.14 Read from an Event Log
B.20.15 Notify Process of Availability of System Log Data
B.20.16 Reposition the Read Pointer
B.20.17 Compare Event Record Severities
B.20.18 Queries = Event Filters
B.20.19 String Equivalents of Event Attributes
The standard calls for a single system-wide log, to which all event records are written. Funneling all event records intact through a single logical stream makes it easier for an implementation to monitor and analyze events in a system-wide context, in order to determine where faults may exist. This capability is critical to the consensus model of using the event stream as a conduit for achieving fault-tolerance within the system.
The standard provides for logging of raw binary data, to enable efficient construction and processing of log event records. With purely textual data, hardware sense data and other binary failure data cannot be adequately supported, and analysis options are limited. The standard also supports the logging of textual data as a special case, and so in this respect is compatible with the functionality commonly found in syslog implementations.
The POSIX standard specifies an application’s interface to the operating system. Therefore, this event logging standard does not attempt to specify an API for logging events generated by the operating system kernel. There was strong consensus, however, that the event logging system should accommodate events generated by the kernel, that kernel events should be logged to the system log along with application-generated events, and that the format of kernel-generated events should conform to this event logging standard.
The structure and organization of event logs is left to the implementer's discretion, subject only to the constraints of the specified interfaces. In particular, although the standard requires that the posix_log_read() function yield a posix_log_entry structure that contains the event record’s attributes, there is no requirement that the attributes be stored in the log in that form.
The standard establishes an open/read/close style of interface for reading of event records, in order to support processing of archival log files and log files from other systems. This same interface is used to read the system log.
Applications that read the system log should be prepared for new events to be appended to the log at any time. Other than that, it was felt that, at least between maintenance activities (see Section 20.5), such applications should have a stable view of the event log: once read, event records should not disappear from the application’s view of the log.
The standard requires that new events be added at the end of the system log. This implies that a sequential read of the log will yield the records in chronological order according to when they were written to the log. However, a record’s timestamp is assigned at the time of the call to posix_log_write(). (It was felt that the timestamp should reflect the time of the event as nearly as possible.) The working group recognized that, for a variety of reasons, it may be difficult for some implementations to guarantee that events are written to the log in exactly the order in which they were initiated – especially considering that kernel events may circumvent posix_log_write() entirely on their way to the log. Therefore, there is no requirement that events appear in the log in order of timestamp.
The implementation is free to support other methods of accessing event logs, so long as the aforementioned sequential read is supported according to the standard.
The standard does not provide for direct access to logs on remote systems, nor does it specify how event information is transferred from one system to another. The working group expects that implementers will provide extensions to the standard to meet this need. The format of an event record is binary rather than textual, so portability (if any) of an event log between different architectures is left to the implementation(s).
The working group briefly considered the idea of defining an architecture-independent event-log format. This would enable a conforming program running on one system to display or otherwise interrogate an event log written on another system – even one with a completely different architecture. However, several issues became immediately apparent. For example,
The working group concluded that portability of event logs is beyond the scope of the current standard.
There was much discussion of possible guarantees that should be required of the implementation regarding the availability of log data once a call to posix_log_write() is initiated. The working group envisioned the following timeline for the lifetime of an event that is logged via the posix_log_write() function. The timeline would be essentially the same for events logged by the kernel. As discussed later, some of the steps in this timeline may not occur in the indicated order, if at all.
Regarding this timeline, the standard states:
Beyond this, any guarantees about the integrity and/or availability of log data are up to the implementation.
It was felt that posix_log_write() should complete as quickly as possible, to minimize event-logging overhead in the calling process. Therefore, it was felt that completion of posix_log_write() should not have to wait for completion of the write to long-term storage (step 7) or delivery of associated notifications (step 8). It was felt that any screening done by posix_log_write() (step 2) should be very quick. Additional screening (step 6) could be deferred until after completion of posix_log_write().
Note that even if posix_log_write() succeeds, the associated event record may never become available for reading. Here’s why:
a. The implementation is permitted to drop the record at step 6.
b. The event record may be lost from temporary storage before it is written to long-term storage – for example, if the incoming event rate is so high that temporary storage overflows.
c. The write to long-term storage may fail – for example, if there is no more room in the log file’s filesystem.
d. A log maintenance activity may delete the record from the log before the record is ever read.
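Case (b) above can be illustrated with a small model of temporary storage. The following is only a sketch under stated assumptions: the names demo_temp_store, demo_temp_put(), and demo_temp_take(), the fixed capacity, and the policy of dropping the newest record on overflow are all hypothetical and are not taken from the standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TEMP_SLOTS 4  /* hypothetical capacity of temporary storage */

struct demo_temp_store {
    uint32_t slots[DEMO_TEMP_SLOTS]; /* record IDs awaiting long-term storage */
    size_t   head;     /* index of the oldest buffered record */
    size_t   count;    /* number of records currently buffered */
    size_t   dropped;  /* records lost to overflow */
};

/* Buffers a record; on overflow the record is counted as dropped.
 * Returns 1 if buffered, 0 if lost. */
int demo_temp_put(struct demo_temp_store *ts, uint32_t recid)
{
    assert(ts != NULL);
    if (ts->count == DEMO_TEMP_SLOTS) {
        ts->dropped++;
        return 0;
    }
    ts->slots[(ts->head + ts->count) % DEMO_TEMP_SLOTS] = recid;
    ts->count++;
    return 1;
}

/* Removes the oldest record so it can be written to long-term storage.
 * Returns 1 and stores its ID in *recid, or 0 if nothing is buffered. */
int demo_temp_take(struct demo_temp_store *ts, uint32_t *recid)
{
    assert(ts != NULL && recid != NULL);
    if (ts->count == 0)
        return 0;
    *recid = ts->slots[ts->head];
    ts->head = (ts->head + 1) % DEMO_TEMP_SLOTS;
    ts->count--;
    return 1;
}
```

In this model, if six records arrive before any is taken, the last two are lost even though the writer may already have been told that its call succeeded. An implementation could equally choose to overwrite the oldest record instead; the sketch does not imply a preference.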
The standard requires that, within an event log, records have ascending record IDs. In particular, kernel events are not allowed to have a different record-ID sequence from that of application-generated events in the same log.
It is not required that there be no gaps in the sequence of record IDs. Therefore, the implementation is free to assign a record ID to a record that may later be dropped or deleted. It was felt that the range of possible record IDs is huge enough to allow a certain amount of extravagance in this area.
The record ID must be assigned before notifications associated with this event are delivered (since the notification may include the record ID, and also because record ID is a query criterion), and before the record is first sought (via posix_log_seek()) or read.
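The two ordering rules discussed in this rationale (record IDs must ascend, possibly with gaps, while timestamps need not be in order) can be expressed as a pair of checks. This is a minimal sketch: struct demo_entry is a hypothetical stand-in holding only the two attributes of interest, and the integer types chosen for them are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two attributes of interest; the real
 * posix_log_entry layout is defined by the implementation. */
struct demo_entry {
    uint32_t recid;    /* record ID: ascending within a log */
    int64_t  time_ns;  /* timestamp taken when the write call was made */
};

/* Returns 1 if record IDs strictly ascend (gaps are fine), else 0. */
int demo_recids_ascend(const struct demo_entry *entries, size_t n)
{
    assert(entries != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        if (entries[i].recid <= entries[i - 1].recid)
            return 0;
    }
    return 1;
}

/* Counts adjacent pairs whose timestamps run backwards. A nonzero result
 * is permitted: log order is not required to be timestamp order. */
size_t demo_time_inversions(const struct demo_entry *entries, size_t n)
{
    assert(entries != NULL || n == 0);
    size_t inversions = 0;
    for (size_t i = 1; i < n; i++) {
        if (entries[i].time_ns < entries[i - 1].time_ns)
            inversions++;
    }
    return inversions;
}
```

A log with record IDs 1, 2, 5, 6 passes the first check despite the gap between 2 and 5. An application that needs timestamp order would have to sort the records itself.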
Events that occur at the time of an unexpected system shutdown are among the most interesting and useful to log. Although the standard is silent on this subject, the working group felt that events logged to temporary storage (but not to long-term storage) just before a system shutdown should be written, if possible, to the system log when the system is rebooted.
The standard does not specify how or when the system log is to be archived, compacted, or otherwise modified for maintenance purposes. However, it was recognized that such activities are commonplace, and it was felt that applications should not have to abandon access to the system log when these activities occur. Section 20.5 describes a mechanism for notifying applications when such a maintenance activity starts and ends, so that the applications can resume access to the log once the activity is completed.
The system log is “always available to accept event records.” It was felt that the implementation should not reject calls to posix_log_write() simply because they happen to occur during log maintenance activities. The implementation should either buffer these new records or suspend completion of posix_log_write() calls until maintenance is complete.
Performance of the event logging system was a concern primarily in the following areas:
As discussed in Section B.20.15, timely (as opposed to efficient) delivery of notifications was not deemed to be a vital performance consideration.
The syslog event-logging system, used with many UNIX systems, was considered as a basis for POSIX event logging. For the most part it was rejected, for the following reasons:
Other event-logging implementations exist that have overcome these shortcomings. Some of these implementations support the syslog() function for backward compatibility, but feed the syslog()-generated messages into a more flexible logging system.
In view of the widespread use of syslog, it was considered important to be able to implement syslog’s primary features using the POSIX event-logging interface. For example, the posix_log_printf() function supports syslog()’s printf-like formatting capability for text-based event records. The posix_log_memtostr() function enables a simple, strictly conforming program to produce a textual version of a POSIX event log. Functions like posix_log_seek() and posix_log_query_match() make it relatively easy to classify events and/or focus only on events of interest. The posix_log_notify_add() feature enables the creation of programs that watch for new events and take appropriate actions as they occur.
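The printf-like formatting mentioned above might look roughly as follows. This sketch does not reproduce the signature of posix_log_printf(), which is not given in this excerpt; demo_log_printf(), struct demo_text_record, and DEMO_TEXT_MAX are hypothetical names, and the record is kept in memory rather than written to a log.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define DEMO_TEXT_MAX 64  /* hypothetical limit on formatted text */

/* Hypothetical stand-in for a text-based event record. */
struct demo_text_record {
    int  severity;
    char text[DEMO_TEXT_MAX];
};

/* Formats a message printf-style into a text record, in the spirit of
 * syslog(). Returns 0 on success, -1 on a formatting error. Output that
 * does not fit is cut short by vsnprintf() and stays null-terminated. */
int demo_log_printf(struct demo_text_record *rec, int severity,
                    const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(rec != NULL && fmt != NULL);
    va_start(ap, fmt);
    n = vsnprintf(rec->text, sizeof rec->text, fmt, ap);
    va_end(ap);
    if (n < 0)
        return -1;
    rec->severity = severity;
    return 0;
}
```

A call such as demo_log_printf(&rec, 3, "disk %s: %d errors", "sda", 7) leaves the text "disk sda: 7 errors" in the record.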
The standard’s set of severity levels is taken directly from syslog, and compatibility with syslog here was felt to be important. It was generally felt that eight levels of severity should be enough for anybody, although the implementation is free to define additional severities.
The standard also accepts syslog’s set of facilities, again largely for compatibility with syslog. (While different implementations of syslog tend to use the same set of severity levels, there is much less agreement on the set of facilities. The set specified in the standard is intended to be a common subset.) It is fully expected that implementations and/or applications will define additional facilities. The set inherited from syslog was not felt to be adequate, but there was general consensus as to the futility of trying to define a complete set that would be widely accepted.
This section of the rationale discusses Section 20.2 of the normative text, with the following exceptions:
The data type posix_log_facility_t is an opaque type that is not an array type. (Array types were disallowed for this and some other types because of the complexities associated with passing arrays as function arguments in C.) Existing implementations typically use either character strings or integers as facility codes. The posix_log_factostr() and posix_log_strtofac() functions were included in recognition of the fact that a facility’s code may not be the same as its name.
Integer codes have the advantage of compactness and simplicity, but have the disadvantage that different systems may assign different numbers to the same facility. For example, the Volume Manager may be facility number 55 on system A, but number 59 on system B. This could create problems when analyzing system A’s event log on system B. One approach to this problem is to make the facility’s code a function of its name – for example, a hash code or checksum. It was felt that the solution to this problem is implementation- or installation-dependent, and is therefore beyond the scope of the standard.
Note that although posix_log_facility_t cannot be an array type, it can be a struct whose only member is an array – for example:
typedef struct {
char fac_name[20];
} posix_log_facility_t;
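The practical benefit of the struct wrapper is that a facility can be assigned, returned, and passed by value like any other object, which a bare array cannot. The sketch below repeats the example typedef so that it is self-contained; demo_make_facility() and demo_facility_equal() are hypothetical helpers, not functions defined by the standard.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    char fac_name[20];
} posix_log_facility_t;

/* Hypothetical helper: builds a facility value from a name, truncating
 * names that do not fit. The struct is returned by value. */
posix_log_facility_t demo_make_facility(const char *name)
{
    posix_log_facility_t fac;

    assert(name != NULL);
    memset(&fac, 0, sizeof fac);
    strncpy(fac.fac_name, name, sizeof fac.fac_name - 1);
    return fac;
}

/* Hypothetical helper: takes both facilities by value and compares names. */
int demo_facility_equal(posix_log_facility_t a, posix_log_facility_t b)
{
    return strncmp(a.fac_name, b.fac_name, sizeof a.fac_name) == 0;
}
```

Because the array is a struct member, an assignment such as b = a copies all 20 bytes, and neither helper needs a separate length argument.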
The standard is silent about the format of the “variable-data” portion of an event record, with one exception: the POSIX_LOG_STRING format code is provided for the common case where the variable-data portion is a character string. A variety of other formats were discussed, but none were considered suitable for inclusion in the standard.
The posix_log_entry struct contains those attributes that are included in every event record. There was much discussion as to what sorts of attributes should be included in this standard set. The attributes that were chosen meet most or all of the following criteria:
The record ID (log_recid member) is intended to uniquely identify a particular instance of a particular kind of event. It can be used to locate the event record within the log (e.g., using posix_log_seek()), or to indicate the record’s order in the log relative to other records.
The size and format attributes (log_size and log_format) specify the size and format of the variable-length data portion.
The facility and event type (log_facility and log_event_type) are intended to uniquely identify a type of event. Different facilities can use the same event-type code for completely different types of events. The event type can have a variety of uses:
The user ID (log_uid), process ID (log_pid), and time stamp (log_time) were widely viewed as useful or even essential.
The group ID (log_gid), process group ID (log_pgrp), thread ID (log_thread), and processor ID (log_processor) were viewed more critically; but it was felt that these attributes could be very helpful in diagnosing certain types of problems. In any case, they are compact and easy for the implementation to capture.
The log_processor member was originally of type int, and was called log_cpu. However, with the increasing variety of multiprocessor and/or clustered architectures, and the increasing tendency to partition processor pools and other resources into multiple virtual systems, it was felt that the implementation should be free to express a processor's ID as something other than an integer. For similar reasons, some objected to the term “cpu” as shorthand for “processor.” (For many multiprocessor systems, no one processor is “central.”)
The log_flags member was introduced to support the POSIX_LOG_TRUNCATE flag, and to accommodate other implementation- or application-defined flags. Other flags were considered for inclusion in the standard, but rejected or moved to other parts of the event record. (For example, POSIX_LOG_STRING started out as an event type, was later made a flag, and was later made a format when the log_format member was introduced.)
There were two schools of thought on the handling of a call to posix_log_write() where the length of the variable data exceeds {POSIX_LOG_ENTRY_MAXLEN}. Some thought that the call should fail, to avoid logging corrupted (truncated) data. Some thought that the call should succeed with truncated data, to avoid losing the entire record. The standard’s definition of the {POSIX_LOG_TRUNCATE} flag allows this issue to be decided by the implementation. There was little support for allowing different behaviors for different applications on the same implementation.
In an event record that has a format of {POSIX_LOG_STRING} and has the {POSIX_LOG_TRUNCATE} flag set, the character string is still guaranteed to be null-terminated. It was felt that the additional burden this placed on the implementation of posix_log_write() was minor compared to the benefit to application programs.
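The truncation rule for string data can be sketched as follows, for an implementation that chooses to truncate rather than fail. DEMO_ENTRY_MAXLEN and DEMO_FLAG_TRUNCATE are hypothetical stand-ins for {POSIX_LOG_ENTRY_MAXLEN} and {POSIX_LOG_TRUNCATE}, with made-up values, and demo_store_string() is not part of the standard's interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_ENTRY_MAXLEN  16    /* stand-in for {POSIX_LOG_ENTRY_MAXLEN} */
#define DEMO_FLAG_TRUNCATE 0x1u  /* stand-in for {POSIX_LOG_TRUNCATE} */

/* Copies a string into the variable-data buffer, which must hold
 * DEMO_ENTRY_MAXLEN bytes. If the string plus its terminator does not
 * fit, the data is cut short, the truncate flag is set in *flags, and
 * the last byte is still a null terminator. Returns the number of bytes
 * stored, including the terminator. */
size_t demo_store_string(char *buf, const char *str, unsigned *flags)
{
    assert(buf != NULL && str != NULL && flags != NULL);
    size_t need = strlen(str) + 1;
    if (need <= DEMO_ENTRY_MAXLEN) {
        memcpy(buf, str, need);
        return need;
    }
    memcpy(buf, str, DEMO_ENTRY_MAXLEN - 1);
    buf[DEMO_ENTRY_MAXLEN - 1] = '\0';
    *flags |= DEMO_FLAG_TRUNCATE;
    return DEMO_ENTRY_MAXLEN;
}
```

The extra work on the writing side is a single length comparison and one byte store, which is the "minor burden" the text refers to. In return, a reader can pass the data to string functions without first checking the flag.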
There is no function to write a record to the system log (or any log) by specifying a posix_log_entry struct and optional data buffer. As a result, there is no way for a strictly conforming program to copy all or part of an event log to another log. This capability, if desired, was viewed as a function of the underlying implementation’s log administration duties.
A log descriptor (posix_logd_t) could very well incorporate a file descriptor. It might even be a file descriptor. There was general consensus, however, against the notion that a log descriptor has to be a file descriptor. Hence the opaque posix_logd_t type, and a per-process limit on log descriptors ({POSIX_LOG_OPEN_MAX}) that is distinct from the per-process limit on file descriptors.
In early drafts of the standard, the posix_log_open() function included a posix_log_query_t argument. The intent was that a sequential read of the log using the resulting log descriptor would yield only the records that match the query object. This idea was eventually rejected in favor of the current, more flexible, version of posix_log_seek(), which could be used to provide the same effect.
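The seek-then-read pattern that replaced the query argument can be modelled in a few lines. Everything here is a simplified, hypothetical model: the log is an in-memory array, a query is represented by a predicate function rather than a posix_log_query_t, and demo_seek_forward() does not reproduce the signature of posix_log_seek().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory model of a log and its read pointer. */
struct demo_record { int severity; int event_type; };

struct demo_log {
    const struct demo_record *recs;
    size_t n;    /* number of records in the log */
    size_t pos;  /* read pointer: index of the next record to read */
};

/* Stand-in for a query object: a predicate over a record. */
typedef int (*demo_query)(const struct demo_record *);

/* Moves the read pointer forward to the next matching record.
 * Returns 1 if one was found, 0 if the end of the log was reached. */
int demo_seek_forward(struct demo_log *lg, demo_query match)
{
    assert(lg != NULL && match != NULL);
    while (lg->pos < lg->n && !match(&lg->recs[lg->pos]))
        lg->pos++;
    return lg->pos < lg->n;
}

/* Seek, then read: yields only matching records, which is the effect the
 * rejected query argument to the open call was meant to provide. */
const struct demo_record *demo_read_matching(struct demo_log *lg,
                                             demo_query match)
{
    if (!demo_seek_forward(lg, match))
        return NULL;
    return &lg->recs[lg->pos++];
}

/* Example query: severities 0 through 3 (syslog's emergency to error). */
int demo_is_error_or_worse(const struct demo_record *r)
{
    return r->severity <= 3;
}
```

Keeping the query out of the open call means one descriptor can be used with different queries over time, or with none at all for a plain sequential read.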
The working group discussed permitting a NULL value for the entry parameter of posix_log_read(), for use when the read is used only to skip to the next event record. This was not considered particularly useful, and in any case the implementation must read at least the event’s log_size attribute in order to determine where to find the next record.
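The point about the size attribute can be seen in a sketch of how a reader might step over a record. The storage layout used here (a fixed header followed immediately by the variable data) is purely an assumption for illustration, since the standard leaves the stored form of a record to the implementation. The names struct demo_header and demo_next_record() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stored header; the standard does not dictate how records
 * are laid out in the log. */
struct demo_header {
    uint32_t recid;  /* record ID */
    uint32_t size;   /* length in bytes of the variable data that follows */
};

/* Given the offset of a record in a log image of `len` bytes, returns the
 * offset of the next record, or `len` if no further record can be
 * reached. Only the header is examined; the variable data is skipped,
 * not copied. */
size_t demo_next_record(const unsigned char *image, size_t len, size_t off)
{
    struct demo_header hdr;

    assert(image != NULL && off <= len);
    if (len - off < sizeof hdr)
        return len;                        /* no complete header remains */
    memcpy(&hdr, image + off, sizeof hdr); /* avoids unaligned access */
    if (hdr.size > len - off - sizeof hdr)
        return len;                        /* data runs past the end */
    return off + sizeof hdr + hdr.size;
}
```

Even in this skip-only path the header has to be fetched, so accepting a NULL entry pointer would save the caller a buffer but would save the implementation very little work.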
Due to existing practice, there was a need for a notification interface for event logging. Notification eliminates the need for an application to poll the log for entries of interest. Also, since notification is available, there is no need for a blocking form of posix_log_read().
There was discussion of two fundamentally different reasons for posting notifications: